0:00:13 Hello, I'm [inaudible] from [inaudible] University, and this is joint work with my adviser, John [inaudible]. The title of our work is "Log-Spectral Enhancement Using Speaker-Dependent Priors for Speaker Verification."
0:00:29 The key idea behind this work is how we can use Bayesian parameter estimation techniques to improve the robustness of speaker verification systems to noise and mismatch. The reason we want a Bayesian technique is that a Bayesian approach allows us to have a principled way of accounting for parameter uncertainty in the noise estimation task.
0:00:57 As in most pattern recognition systems, the front end is a key component: it is used to extract the parameters of interest from the raw signal. In this case we have speech corrupted by noise, and we want to extract features of interest that we then use in our classification algorithm. The noise makes our parameter estimates erroneous, and the severity of the noise determines how much of an effect it has on those estimates. If we can account for this uncertainty in a Bayesian estimate, we can probably enhance our speaker verification system.
0:01:50 Here we can see the two main causes of performance degradation: noise, which we have just discussed, and mismatch. In a speaker verification system we need a model for each speaker's distribution, and the acoustic environment in which the training data was collected may not be the same environment in which the system is used. This results in mismatch, and hence performance degradation.
0:02:22 So the aim of our work, as the title suggests, is to do enhancement using speaker-dependent priors in the log-spectral domain. The key idea is that we want to couple two systems which we feel are closely matched: the speech enhancement system and the recognition system. The intuition is that in speech enhancement you are enhancing features using speaker-dependent priors; if you have a better idea of who is speaking, and a good prior in that domain, then you can do a better job of enhancing the signal, and with the enhanced signal you can do a better job of recognition. So there is an interplay between these two systems, and we capture this interplay as message passing along the nodes of a graphical model. This will fall out naturally in our formulation.
0:03:27 Just a brief outline of the rest of the talk: I will briefly go over speaker verification for any members of the audience who may need it, then go into Bayesian inference, then into variational Bayesian inference, which is the framework we work in, then discuss our model, and then present the experimental results.
0:03:59 In verification, the task is as follows: given an utterance and a claimed identity, perform a hypothesis test. Given the speech segment X, is the speech from speaker S or not? What we do is model our target speakers using speaker-specific GMMs, and then use a universal background model (UBM) to model the alternative hypothesis. This is the baseline GMM-UBM system, which is the starting point for most verification systems; more advanced ones exist, but this is the most basic, and this is where we will try our enhancement in the log-spectral domain to see if we can obtain improvements.
0:04:58 For the classification decision, we compute a score, the log-likelihood ratio, and then compare it to a threshold to decide which hypothesis is correct. We can plot a DET curve as a performance metric, and we can also compute the equal error rate to characterize the trade-off between missed detections and false alarms.
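As a concrete illustration of this scoring and evaluation step, here is a minimal sketch with diagonal-covariance GMMs written directly in NumPy. The parameters and helper names are toy placeholders, not the system from the talk (which would MAP-adapt the target model from the UBM):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D) frames; weights: (K,); means, variances: (K, D)."""
    diff = X[:, None, :] - means[None, :, :]              # (T, K, D)
    log_comp = -0.5 * (np.sum(diff**2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # log-sum-exp over components, weighted by the mixture weights
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1)

def llr_score(X, target, ubm):
    """Average log-likelihood ratio: target GMM vs. UBM."""
    return np.mean(gmm_loglik(X, *target) - gmm_loglik(X, *ubm))

def equal_error_rate(true_scores, impostor_scores):
    """EER: sweep thresholds and find where miss rate meets false-alarm rate."""
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))
    miss = np.array([np.mean(true_scores < t) for t in thresholds])
    fa = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])
```

Accepting the claim when `llr_score` exceeds a threshold is the decision rule described above; sweeping that threshold traces out the DET curve.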
0:05:31 That covers the speaker verification part; now a little bit of Bayesian inference. There are two main approaches to parameter estimation: you can go the maximum likelihood route or the Bayesian inference route. Here we have data X, represented in this figure, generated by a model governed by a parameter theta. In the maximum likelihood paradigm, we assume this parameter is an unknown constant; the quantity of interest is the likelihood, and we estimate theta with the maximum likelihood criterion. In the Bayesian paradigm, on the other hand, we assume theta is a random variable governed by a prior, and this is where the robustness to parameter uncertainty comes in: the fact that we have a prior over the parameter of interest. The key quantity in this case is the posterior, which by Bayes' rule is proportional to the product of the likelihood and the prior.
0:06:49 The issue is how we obtain our estimates. Bayesian estimates are obtained by minimizing an expected cost. For instance, if the cost is the squared norm of the difference between the estimate and the true value, as in this expression here, it is well known that the resulting estimate, the minimum mean square error (MMSE) estimate, is just the posterior mean.
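For a scalar Gaussian model the posterior mean has a closed form, which makes a convenient sanity check of this statement; the numbers and the helper name below are illustrative only:

```python
import numpy as np

# Prior: theta ~ N(mu0, var0); likelihood: x_i | theta ~ N(theta, var).
# The posterior is Gaussian, so the MMSE estimate is its mean:
#   E[theta | x] = (mu0/var0 + sum(x)/var) / (1/var0 + n/var)
def mmse_gaussian(x, mu0, var0, var):
    n = len(x)
    precision = 1.0 / var0 + n / var
    return (mu0 / var0 + np.sum(x) / var) / precision

x = np.array([2.0, 1.5, 2.5])                 # observations
est = mmse_gaussian(x, mu0=0.0, var0=1.0, var=1.0)
# est == 1.5: pulled from the sample mean (2.0) toward the prior mean (0.0)
```

The shrinkage toward the prior mean is exactly the "accounting for parameter uncertainty" benefit mentioned at the start of the talk.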
0:07:23 Note that while this is easy to write down, in most practical cases, including the one we consider here, it is almost impossible to compute this posterior expectation exactly. So what do we do?
0:07:43 If the problem lies in the intractability of the posterior, then we can apply approximate Bayesian techniques; for instance, we can use VB, or variational Bayes, where we approximate the true posterior by one constrained to lie in a tractable family. We need a mapping from the intractable family to a tractable family of distributions, and we need a metric so that we know which member of the tractable family is the closest approximation to the true posterior. We measure this with the KL divergence: we obtain the approximation that minimizes the KL divergence between our approximation and the true posterior.
0:08:42 In cases where our parameter set consists of N parameters, we can ensure tractability by assuming that the posterior factorizes into a product of per-parameter factors, as shown in this expression.
0:09:02 The question then boils down to computing the forms of this approximate posterior, factor by factor, and then updating their sufficient statistics. It can be shown that the expression for the approximate form of each factor is obtained by taking the expectation, with respect to the other factors, of the logarithm of the joint distribution of the observations and the parameters of interest.
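As a concrete illustration of these coupled factor updates, here is the textbook mean-field VB example (a Gaussian with unknown mean and precision, conjugate Normal-Gamma prior), not the talk's log-spectral model; the hyperparameter defaults are arbitrary:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field VB for x_i ~ N(mu, 1/tau), factorizing q(mu, tau) = q(mu)q(tau).
    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n)."""
    n, xbar = len(x), np.mean(x)
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)    # fixed by conjugacy
    a_n = a0 + 0.5 * (n + 1)                       # fixed by conjugacy
    e_tau = a0 / b0                                # initial guess for E[tau]
    for _ in range(n_iter):
        # q(mu) update: its precision depends on the current E[tau]
        lam_n = (lam0 + n) * e_tau
        e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n  # moments of q(mu)
        # q(tau) update: expectation of the quadratic terms under q(mu)
        b_n = b0 + 0.5 * (np.sum(x**2) - 2 * e_mu * np.sum(x) + n * e_mu2
                          + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0**2))
        e_tau = a_n / b_n                          # sufficient statistic fed back
    return mu_n, lam_n, a_n, b_n
```

Each factor's update takes an expectation of the log joint under the other factor, which is exactly the cyclic "message passing" coupling the talk describes; the talk's model swaps in the log-spectral likelihood and speaker-dependent GMM priors.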
0:09:47 Now let's get back to our speaker verification context, and in particular let's discuss the probabilistic model. Here we work in the log-spectral domain. We assume that our observed signal y(t) is the clean signal corrupted by additive noise. Taking the DFT, we can compute the log spectrum as shown, and it can be shown that there is a nice approximate relationship between the log spectrum of the observed signal, the clean log spectrum, and the log spectrum of the noise. This is our likelihood: in the Bayesian paradigm we have the likelihood and the prior, and this relationship gives us the likelihood.
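Under the usual assumption that the cross-terms between speech and noise are negligible, this additive-noise relationship in the log-spectral domain is a log-sum of exponentials; a quick numerical check with toy values:

```python
import numpy as np

# If |Y|^2 ~ |X|^2 + |N|^2 in the power domain, then in the log-spectral
# domain y ~ log(exp(x) + exp(n)) = x + log(1 + exp(n - x)).
def noisy_log_spectrum(x, n):
    """Approximate observed log spectrum from the clean (x) and noise (n)
    log spectra; logaddexp keeps the computation numerically stable."""
    return np.logaddexp(x, n)

x = np.array([0.0, 2.0, 5.0])    # clean log spectrum (toy values)
n = np.array([0.0, -1.0, -5.0])  # noise log spectrum (toy values)
y = noisy_log_spectrum(x, n)
# When speech dominates (x >> n), y is close to x; when the two are
# comparable, the observation is biased upward by up to log(2).
```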
0:10:49 Now we need to write out our joint distribution and how it factorizes, because this will help us when we compute the approximate posterior: recall that the expression for each of the optimal factors depends on an expectation of the logarithm of the joint distribution. So this is how the joint distribution factorizes in this context: we have the observed log spectrum, the clean log spectrum, an indicator variable (which I will explain shortly) that we introduce, and the noise. Here we have the likelihood term, and here the prior over the speech log spectrum, which we assume is speaker dependent.
0:11:49so what happens is
0:11:52yes the speaker dependent ubm so in a speaker I D context this would
0:11:57in mean that we we'll and models for
0:12:00each speaker
0:12:01not id context but in know a verification context what we do is we approximate that
0:12:06that would be that you snap not
0:12:08mean in this but if kitchen context we assume that we can
0:12:12model the light bright you'll speakers as as
0:12:15just the target speaker and the ubm so this is what happens is that the library dynamic
0:12:20for each at that your testing you your when you like
0:12:24and we have a what
0:12:25i it is that this indicator the variable
0:12:28uh that was you
0:12:32who peeking
0:12:33oh in other what where they'd the target the ubm and which mixture the component is active
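The indicator posterior is the familiar GMM responsibility computation, extended over the dynamic two-model library (claimed target plus UBM); a sketch with toy diagonal Gaussians and an assumed uniform prior over the two models:

```python
import numpy as np

def indicator_posterior(x, models):
    """Posterior over (model, mixture component) for one frame x.
    models: list of (weights, means, variances) tuples, here the claimed
    target's GMM and the UBM, weighted equally a priori (an assumption)."""
    log_post = []
    for weights, means, variances in models:
        diff = x[None, :] - means                          # (K, D)
        log_n = -0.5 * (np.sum(diff**2 / variances, axis=1)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
        log_post.append(np.log(weights / len(models)) + log_n)
    log_post = np.concatenate(log_post)
    log_post -= np.logaddexp.reduce(log_post)              # normalize
    return np.exp(log_post)                                # sums to 1
```

In the full VB scheme this expectation is taken jointly with the updates for the clean speech and noise factors rather than from the raw frame alone.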
0:12:39 This slide just shows the forms of the factors that we compute, and we can see that they are well-known forms. The VB algorithm boils down to iteratively updating the sufficient statistics, in our case the mean and the covariance, which are functions of the observations and the prior, and then cycling through the updates until some convergence criterion is reached. The nice thing is that once you obtain an estimate of the clean posterior, you can easily derive MFCCs from it for verification.
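That last step, turning an enhanced log spectrum into MFCCs, is just a discrete cosine transform of log mel-filterbank energies. A bare-bones sketch (the mel filterbank and any enhancement are omitted; the input is assumed to already be log mel energies, and the helper name is illustrative):

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_ceps=13):
    """DCT-II of log mel-filterbank energies -> cepstral coefficients.
    log_mel: (T, M) frames of log mel energies; returns (T, n_ceps)."""
    M = log_mel.shape[1]
    k = np.arange(n_ceps)[:, None]              # cepstral index
    m = np.arange(M)[None, :]                   # filterbank channel index
    basis = np.cos(np.pi * k * (m + 0.5) / M)   # unnormalized DCT-II basis
    return log_mel @ basis.T
```

In the talk's pipeline, `log_mel` would come from the posterior mean of the clean log spectrum, so the enhancement simply plugs in ahead of a standard MFCC front end.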
0:13:24 Now some experimental results. We used three datasets: initially TIMIT, then the MIT mobile device speaker verification corpus, and then we also tried it out on the NIST SRE 2004 corpora. The initial results here are for TIMIT. We trained a UBM using training data from a subset of the 630 speakers in TIMIT, and then we corrupted the speech with additive white Gaussian noise; I will present results for more realistic noise later. We used two test utterances per speaker, so for the 630 speakers we can generate 1260 true trials, and then we select a random subset of ten speakers as impostors and compute scores for each trial. We also compared against our implementation of FDIC, a feature-domain intersession compensation technique, which entails learning a projection matrix to project the features into a session-independent subspace so that verification would be more robust. The details are in the paper, so I won't go through them.
0:14:59 Here is a brief table of some results for the TIMIT case with additive white Gaussian noise, sweeping through several SNRs. From the raw data we compute MFCCs: the top line shows what you obtain if you use the MFCCs without applying anything, just the raw features; the second line shows what you obtain if you compute MFCCs after enhancing the log spectra with the VB technique. Our implementation of FDIC was able to work in the low-SNR cases; in the high-SNR case it seemed to break down in our implementation, which we are still investigating.
0:15:53 This is a DET plot for one of the SNR conditions for TIMIT, and we see that the equal error rate dropped by about half in that case, and across the SNRs we investigated. We also looked at other types of noise; what we have here is factory noise, obtained from the NOISEX-92 dataset, and the results are similar, only at different SNRs because of the type of noise. You can see that the improvement here is not as large, but that is because this is an almost clean condition.
0:16:40 Then we applied this to the MIT corpus. We wanted to show what happens when we have mismatch: models trained on data obtained in an office and tested with data from a noisy street intersection. We do observe the mismatch: the equal error rate jumps to over twenty percent when the test data is from the noisy intersection while the models were trained on office data, and when we apply the VB technique, it reduces to twenty-four percent.
0:17:29 For the final experiments, we used the SRE 2004 corpora. As for the details: we used a UBM with 512 mixture components and nineteen-dimensional MFCCs with cepstral mean normalization. The upshot is that we only obtained modest gains when we applied our method: the baseline system had an equal error rate of 13.8, and we were only able to get it to 13.4. This may be due to the fact that the formulation assumes models trained on clean speech, and this limits the gain compared to what we obtained with TIMIT and the MIT dataset. And that's it. Thank you.
0:18:26 [Session chair] We have time for one quick question.
0:18:38 [Audience member] I have a question: did you try using a standard type of speech enhancement, such as Wiener filtering, to obtain enhanced speech, and then using the enhanced speech to do speaker verification?
0:18:55 [Presenter] No, we did not, but we tried using a [inaudible] front end; we were able to get gains in a speaker ID context, but not in this context. That is something we should do.
0:19:10 [Session chair] Okay, thank you. Let's thank the speaker.