0:00:14 Hello, my name is Raghuveer. 0:00:17 I'm a PhD student in the Signal Analysis and Interpretation Lab at the University of Southern California, Los Angeles. 0:00:23 Today I'll be presenting our work, "An empirical analysis of information encoded in disentangled neural speaker representations." 0:00:32 And here are the people who have collaborated with me on this work.
0:00:38 So first, I'll introduce what I refer to as speaker embeddings in the rest of the talk. 0:00:44 Speaker embeddings are low-dimensional representations of speech that are discriminative of speaker identity. 0:00:52 They have several applications, such as voice biometrics, where the task is to verify whether two samples came from the same or different speakers. 0:01:01 They also have applications in speaker-adapted ASR models. 0:01:06 They can also be used in speaker diarization, where the task is to determine who spoke when in multi-party conversations. 0:01:14 This can be of particular use in meeting analysis and many other applications.
0:01:19 Good speaker embeddings should satisfy two properties: first, they should be discriminative of speaker factors; second, they should be invariant to other factors.
0:01:30 So what are the factors of information that could be encoded in a speaker embedding? 0:01:34 For ease of analysis, we broadly categorize them as follows. 0:01:39 First are the speaker factors; these are related to the speaker's identity, for example gender, age, et cetera. 0:01:47 Next are the content factors; these are acquired during speech production by the speaker, for example the emotional state expressed in the speech signal, the sentiment (whether it is a positive or a negative one), the language being spoken, and, most importantly, the lexical content in the signal. 0:02:07 Last are the channel factors; these are factors acquired when the signal is captured by the microphone. 0:02:12 These could be the room acoustics, the microphone non-linearities, ambient acoustic noise, and also artifacts related to the compression of the signal.
0:02:26 As I mentioned previously, good speaker embeddings are supposed to be invariant to nuisance factors; these are the factors that are unrelated to the speaker's identity. 0:02:34 Such invariance is useful for robust speaker recognition in the presence of, for example, background acoustic noise. 0:02:42 It is also useful for detecting a speaker's identity irrespective of the emotional state of the speaker, and independent of what the speaker says. 0:02:52 This is particularly useful in text-independent speaker verification applications.
0:02:58 So with that as the motivation, the goal of our work is twofold. 0:03:03 First is to quantify the amount of nuisance information in speaker embeddings. 0:03:08 Second is to investigate to what extent unsupervised learning can help to remove that nuisance information.
0:03:18 Most existing studies perform analysis based on only one or two datasets, so a comprehensive analysis is lacking. 0:03:27 Also, most of these works do not consider the dependence between the individual variables in the dataset. 0:03:32 For example, in one of the datasets we address, the lexical content and the speaker identity are entangled: some sentences are spoken by only a few of the speakers. 0:03:42 Therefore, it could be possible to predict the speakers based on the lexical content alone.
0:03:47 We aim to mitigate these limitations of previous work by making the following contributions. 0:03:53 Firstly, we use multiple datasets to comprehensively analyze the information encoded in neural speaker representations. 0:04:00 Secondly, we analyze the effect of disentangling speaker factors from nuisance factors on the encoded information.
0:04:11 Let me briefly detail what we mean by disentanglement in the rest of the talk. 0:04:17 We define disentanglement broadly as the task of separating out information streams from a speech signal. 0:04:24 As a toy example, the input speech signal from a speaker, who is happy that they just bought a car they like, contains information related to various factors. 0:04:36 It contains information about the speaker's identity, including their gender and age. 0:04:42 The information pertaining to their emotional state is also encoded. 0:04:46 More importantly, the language identity and the lexical content are also present in the signal. 0:04:52 The goal of an ideal embedding extractor is to separate all these information streams. 0:04:59 In the context of speaker embeddings, which are supposed to capture speaker identity information, all other factors, such as the emotional state and the lexical content, are considered nuisance factors. 0:05:11 It is these factors which we propose to remove from the speaker embeddings, to make them more robust.
0:05:18 Now I'll explain the methodology behind our disentangled speaker embedding extraction. 0:05:23 This is the model we use. 0:05:24 As input, we can use any speech representation, such as spectrograms, or even speaker embeddings from pre-trained models, such as x-vectors. 0:05:34 Using an unsupervised disentanglement technique, adapted from a method that was previously proposed in the computer vision domain, we try to separate the speaker-related information from the nuisance information. 0:05:47 Please note that this method was previously proposed in our earlier work, and you can find more details in that paper; however, for completeness, I'll explain it here in brief.
0:06:01 Our architecture comprises two parts: the main model, which is shown in the green blocks here, and the adversarial models, shown in blue. 0:06:12 The input x is first processed by an encoder, which splits it into two embeddings, h1 and h2, as shown in the figure. 0:06:19 The embedding h1 is fed into the predictor, which predicts the speaker labels y-hat. 0:06:25 The embedding h2 is concatenated with a noisy version of h1, which is denoted by h1-prime here. 0:06:32 h1-prime is obtained by feeding h1 to a dropout module, which randomly removes certain elements of h1. 0:06:40 h2, along with the noisy h1 (that is, h1-prime), is concatenated and fed into a decoder, which tries to reconstruct the original input x. 0:06:54 The motivation behind using the dropout is to make sure that h1 is an unreliable source of information for the reconstruction task. 0:07:03 Training in this manner makes sure that the information required for reconstruction is not stored in h1, and only the information required for the speaker embedding is stored there.
0:07:16 In addition, we also use two disentangler models, Dis1 and Dis2. 0:07:21 These models are jointly trained with the main model, which tries to make them perform poorly at predicting h1 from h2, and h2 from h1. 0:07:29 The goal is to ensure that h1 and h2 are not predictable from each other, which makes sure that they do not contain similar information. 0:07:38 This way, we can train for disentanglement without supervision of the nuisance conditions.
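To make the structure concrete, here is a minimal PyTorch sketch of the two-branch model as described; the class name, layer sizes, and dropout rate are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal PyTorch sketch of the two-branch disentanglement model described above.
# Layer sizes and module names are illustrative assumptions.
import torch
import torch.nn as nn

class DisentangledEmbedder(nn.Module):
    def __init__(self, in_dim=512, h1_dim=128, h2_dim=128, n_speakers=7200):
        super().__init__()
        # Encoder produces a single vector that is split into h1 and h2.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, h1_dim + h2_dim))
        self.predictor = nn.Linear(h1_dim, n_speakers)  # speaker labels from h1
        self.dropout = nn.Dropout(p=0.5)                # corrupts h1 into h1'
        # Decoder reconstructs x from [h1', h2].
        self.decoder = nn.Sequential(
            nn.Linear(h1_dim + h2_dim, 512), nn.ReLU(),
            nn.Linear(512, in_dim))
        # Adversarial disentanglers: predict each embedding from the other.
        self.dis1 = nn.Linear(h2_dim, h1_dim)           # h2 -> h1
        self.dis2 = nn.Linear(h1_dim, h2_dim)           # h1 -> h2
        self.h1_dim = h1_dim

    def forward(self, x):
        h = self.encoder(x)
        h1, h2 = h[:, :self.h1_dim], h[:, self.h1_dim:]
        y_hat = self.predictor(h1)                      # speaker prediction
        h1_prime = self.dropout(h1)                     # noisy version of h1
        x_hat = self.decoder(torch.cat([h1_prime, h2], dim=1))
        h1_from_h2 = self.dis1(h2)                      # adversarial predictions
        h2_from_h1 = self.dis2(h1)
        return y_hat, x_hat, h1, h2, h1_from_h2, h2_from_h1
```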
0:07:44 The loss functions that we use are presented here. 0:07:49 The main model produces two losses: one is the standard cross-entropy loss from the predictor, which predicts speakers, and the second is the mean squared error reconstruction loss from the decoder. 0:07:59 The adversarial models use mean squared error losses. 0:08:04 The overall loss function is shown here. 0:08:07 We try to minimize this loss with respect to the main model, while the adversarial branch works against it by maximizing the disentanglers' losses.
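Written out, the combined min-max objective takes roughly the following form, where alpha and beta are assumed trade-off weights and the symbols are illustrative, not necessarily the paper's notation:

\min_{\text{main}} \; \max_{\text{dis}} \;\; \mathcal{L}_{\mathrm{CE}}(\hat{y}, y) \;+\; \alpha \,\lVert x - \hat{x} \rVert_2^2 \;-\; \beta \left( \lVert h_1 - \tilde{h}_1 \rVert_2^2 + \lVert h_2 - \tilde{h}_2 \rVert_2^2 \right)

Here \hat{y} is the predicted speaker label, \hat{x} the reconstruction, and \tilde{h}_1, \tilde{h}_2 the disentanglers' predictions of each embedding from the other. The disentanglers maximize the overall objective, which amounts to minimizing their own prediction errors, while the main model minimizes it, which pushes those prediction errors up: exactly the adversarial game described above.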
0:08:14 This training process is adapted from previous work which, as I mentioned before, used this technique on images, for a digit recognition task. 0:08:24 Upon successful training, the embedding h1 is expected to capture speaker-discriminative information, and the embedding h2 is expected to capture the nuisance information. 0:08:34 Note that we have not used any labels for the nuisance factors, such as noise type, channel conditions, et cetera.
0:08:44 For training the models, we use the standard VoxCeleb training corpus, which consists of in-the-wild interviews with celebrities. 0:08:51 We add additive noise and reverberation, which is standard practice for data augmentation in this domain. 0:08:56 This results in 2.4 million utterances from around 7,200 speakers.
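As an aside, the additive-noise part of such augmentation can be sketched in a few lines of numpy; the helper name and SNR range below are assumptions for illustration, not the exact recipe used here.

```python
# Minimal sketch of additive-noise augmentation at a target SNR (numpy).
# The SNR range and helper name are illustrative assumptions.
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` so the result has the requested SNR in dB."""
    # Tile or trim the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that p_speech / p_noise_scaled = 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: corrupt an utterance with noise at a random SNR in [0, 20] dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a 1 s, 16 kHz utterance
babble = rng.standard_normal(8000)    # stand-in for a noise recording
augmented = add_noise_at_snr(speech, babble, snr_db=rng.uniform(0, 20))
```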
0:09:01 As mentioned before, we can use either spectrograms as inputs, or we can also use speaker embeddings from pre-trained models, which is what we do in this work. 0:09:11 So we use x-vectors, extracted from a publicly available pre-trained model, as input. 0:09:17 x-vectors, as most of you already know, are speaker embeddings obtained from a neural network that is trained to classify speakers on a large dataset artificially augmented with noise and reverberation. 0:09:29 This model has been shown to provide state-of-the-art performance on multiple tasks that require speaker discriminability.
0:09:39 We use multiple datasets in our evaluations, as mentioned here. 0:09:43 By evaluating some factors, for example emotion, on multiple datasets, we can also control for the issue of dataset bias creeping into the model. 0:09:55 Following others in the literature, we make the assumption that better classification performance on top of the speaker embeddings for a given factor implies that there is more information present in the embedding with respect to that factor.
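This probing protocol can be sketched as follows: train a simple classifier on the frozen embeddings and read its held-out accuracy as a proxy for how much information about the factor the embedding carries. The use of logistic regression here is an illustrative assumption, not necessarily the probe used in the paper.

```python
# Sketch of a probing classifier: accuracy of predicting a factor (e.g. emotion)
# from frozen embeddings serves as a proxy for the information they encode.
# Logistic regression is an illustrative choice; the actual probe may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(embeddings, factor_labels, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, factor_labels, test_size=0.3, random_state=seed,
        stratify=factor_labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # higher accuracy -> more factor information

# Example with random stand-in data: 1000 embeddings of dimension 512,
# probed for a 4-class factor such as emotion.
rng = np.random.default_rng(0)
acc = probe_accuracy(rng.standard_normal((1000, 512)), rng.integers(0, 4, 1000))
```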
0:10:11 As a baseline, we use x-vectors, since our model takes x-vectors as input; we can consider our speaker embeddings as a refinement of x-vectors, where speaker-relevant information is retained and nuisance information is removed. 0:10:26 We also reduce the dimension of the x-vectors by using PCA, to match the dimension of the embeddings in our models.
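The PCA baseline is straightforward; a minimal scikit-learn sketch, assuming 512-dimensional x-vectors reduced to an assumed embedding size of 128:

```python
# Sketch of the PCA baseline: project x-vectors down to the embedding
# dimension of our model. Both dimensions are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
xvectors = rng.standard_normal((10000, 512))  # stand-in for extracted x-vectors

pca = PCA(n_components=128).fit(xvectors)     # match our embedding size
xvectors_reduced = pca.transform(xvectors)    # used as the reduced baseline
```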
0:10:37 So on to the results. 0:10:41 The first set of results shows the accuracy of predicting the speaker factors using x-vectors, shown in blue, and using our embeddings. 0:10:50 In this case, higher is better. 0:10:53 The first two graphs here show speaker classification accuracy, and the other two show gender prediction accuracy. 0:11:00 We find that, in general, both x-vectors and our embeddings perform comparably in classifying speakers and genders; we see a slight degradation using our embeddings, but the differences are minimal. 0:11:14 One other observation is that on IEMOCAP we find poor performance from both x-vectors and our model. 0:11:21 We conjecture that this could be due to speaker overlap, and also that this dataset is not ideally suited for the speaker recognition task, since its purpose was emotion recognition.
0:11:36 Now to the more interesting results. 0:11:39 Here we show the results of predicting the content factors using x-vectors and our speaker embeddings; in this case, since these are treated as nuisance factors, lower is better. 0:11:49 We find that in all the cases our model reduces the nuisance information; in particular, emotion and lexical information are reduced to a greater extent. 0:12:02 Here, the lexical accuracy is the accuracy of predicting the sentence spoken, given the speaker embedding of that sentence. 0:12:10 Apart from the emotion and the lexical content, we also see a reduction in information pertaining to sentiment, which is closely related to emotion, and also language.
0:12:25 On this slide, we report the results of predicting the channel factors using x-vectors and our speaker embeddings. 0:12:31 Again, in this case, lower is better. 0:12:33 In particular, we focus on three factors: the room, the microphone distance (or the microphone location), and the noise type. 0:12:44 We find that in predicting the location of the microphone used, and the type of noise present, x-vectors have a much higher accuracy than our embeddings. 0:12:50 This means that we are able to successfully reduce some amount of this nuisance information from the x-vectors. 0:13:00 However, we notice that in predicting the room in which the recording was made, our nuisance-agnostic embedding is not very effective. 0:13:12 This needs further investigation.
0:13:18 Next, we show the results of task-based evaluation, where we evaluate the models on the speaker verification task and compare the detection error tradeoff (DET) curves, which plot the false positive rate against the false negative rate, with both axes scaled non-linearly. 0:13:38 A curve that lies closer to the axes, that is, closer to the origin, indicates a better model. 0:13:44 The black dotted line shows the x-vector model, and all the other lines show our models, with and without LDA-based dimensionality reduction.
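For reference, the non-linear scaling of a DET plot is the normal-deviate (probit) transform applied to both error rates. A minimal matplotlib/scipy sketch, with purely synthetic stand-in scores, could look like this:

```python
# Minimal sketch of a DET curve: false positive vs. false negative rate,
# with both axes on the normal-deviate (probit) scale. Scores are stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def det_points(target_scores, nontarget_scores, n_thresh=200):
    # Sweep thresholds over the score quantiles; accept a trial if score >= t.
    all_scores = np.concatenate([target_scores, nontarget_scores])
    thresholds = np.quantile(all_scores, np.linspace(0, 1, n_thresh))
    fnr = np.array([(target_scores < t).mean() for t in thresholds])
    fpr = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return fpr, fnr

rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 5000)    # stand-in same-speaker trial scores
non = rng.normal(0.0, 1.0, 50000)   # stand-in different-speaker trial scores

fpr, fnr = det_points(tgt, non)
eps = 1e-6  # avoid the probit of exactly 0 or 1
plt.plot(norm.ppf(fpr.clip(eps, 1 - eps)), norm.ppf(fnr.clip(eps, 1 - eps)))
plt.xlabel("False positive rate (probit scale)")
plt.ylabel("False negative rate (probit scale)")
plt.show()
```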
0:13:57 We found statistically significant differences only in the graphs where the numbers are mentioned. 0:14:05 Most notably, in the challenging scenarios, with babble and television noise in the background, our models perform better than x-vectors. 0:14:13 Also, in the distant microphone condition, our models perform significantly better than x-vectors. 0:14:21 We also found that the model that is trained with the metadata performs slightly better compared to the model that is trained without the additional conditions. 0:14:31 This actually conformed to our expectations.
0:14:38 So finally, I'd like to quickly present a discussion based on our experiments, which hopefully will provide useful pointers for future research in this domain. 0:14:48 First, we find that speaker embeddings do capture a variety of information pertaining to nuisance factors, and this can sometimes be detrimental to robustness. 0:14:58 We also found that just introducing a bottleneck on the dimension of the speaker embedding, by using PCA, does not remove this information. 0:15:09 This points to the need to explicitly model the nuisance factors. 0:15:13 Using the unsupervised adversarial invariance technique, which is the technique used in our model, we can reduce the nuisance information in the speaker embeddings. 0:15:25 The key advantage is that labels of the nuisance factors are not required for this method. 0:15:31 We also found that adversarial disentanglement retains gender information. 0:15:36 This actually suggests that speaker gender, as captured by neural embeddings, is a crucial part of identity. 0:15:43 This is quite intuitive from a human perception point of view; essentially, what this shows is that the neural embeddings encode cues that are salient to human perception. 0:15:54 Finally, the disentangled speaker representations show better verification performance in the presence of a variety of channel conditions, particularly with babble and television noise, which are considered very challenging for this task.
0:16:10 Going forward, we would like to explore methods to further improve the disentanglement. 0:16:15 So far, as I mentioned, we have not used any nuisance labels; we would like to see if, when we use this technique with labeled data available, we can achieve better disentanglement.
0:16:29 So that brings me to the end of my talk.
0:16:36 Finally, I would like to acknowledge the sources of support for this work. 0:16:42 And with that, I come to the end of my presentation. 0:16:45 Please feel free to send me an email with any questions or suggestions you might have. 0:16:49 Thank you.