Thank you. My name is [inaudible]; I am from [inaudible] University, and this is joint work with my adviser. The title of our work is "Log-Spectral Enhancement Using Speaker-Dependent Priors for Speaker Verification." The key idea behind it is to use Bayesian parameter estimation techniques to improve the robustness of speaker verification systems to noise and mismatch. The reason we want a Bayesian technique is that a Bayesian approach gives us a principled way of accounting for parameter uncertainty in the noise estimation task. As in most pattern recognition systems, the front end is a key component: it is used to extract the parameters of interest from the raw signal. In our case the speech is corrupted by noise, and we want to extract features of interest to use in a classification algorithm. The noise makes our parameter estimates erroneous, and the severity of the noise determines how large an effect there is on those estimates. If we can instead cast this as a Bayesian estimation problem, we can probably enhance our speaker verification system. The two main causes of performance degradation are noise, which we have just discussed, and mismatch: in a speaker verification system we need models of the speaker distributions, and the acoustic environment of the training data may not be the same environment in which the system is used. This results in mismatch and hence performance degradation for the task at hand. So the aim of our work, as the title says, is enhancement using speaker-dependent priors in the log-spectral domain.
The key idea is that we want to link two systems which we feel are closely matched: the speech enhancement system and the recognition system. The intuition is that in speech enhancement you are enhancing features, and with speaker-dependent priors, if you have a better idea of who is speaking and you have good priors for that speaker, then you can do a better job of enhancing; and with the enhanced signal you can do a better job of recognition. So there is an interplay between these two systems, and we capture this interplay as message passing among the nodes of a graphical model; this falls out naturally in our formulation. Here is a brief outline of the rest of the talk: I will briefly go over speaker verification for any members of the audience who may need it, then go into Bayesian inference, and from there into variational Bayesian inference, which is the framework we work in; then I will discuss our model and go over the experimental results. In speaker verification, the task is this: you are given an utterance and a claimed identity, and the task is a hypothesis test: is the given speech segment X speech from speaker S or not? What we do is model our target speakers using speaker-specific GMMs, and then we use a universal background model (UBM) for the alternative hypothesis. This is the usual baseline GMM-UBM system, which is the starting point for most verification systems; it is the most basic one, and it is where we will try our enhancement in the log-spectral domain to see if we can get an improvement.
To make the classification decision, we compute a score, the log-likelihood ratio, and then threshold it to decide which hypothesis is correct. We can plot performance metrics such as DET curves, and we can also compute the equal error rate to determine the trade-off between missed detections and false alarms. That is the speaker verification part. Now a little bit on Bayesian inference. There are two main approaches to parameter estimation: you can go the maximum likelihood route or the Bayesian inference route. Here we have data X, represented in this figure, generated by a model governed by a parameter theta. In the maximum likelihood paradigm we assume that this parameter is an unknown constant; the key quantity of interest is the likelihood, and we estimate theta according to the maximum likelihood criterion. In the Bayesian paradigm, on the other hand, we assume that theta is a random variable governed by a prior, and this is where the robustness to parameter uncertainty comes in: we have a prior over the parameter of interest. The key quantity in this case is the posterior, which, given the data, is proportional to the product of the likelihood and the prior. The issue then is how we obtain estimates: Bayesian estimates minimize an expected cost. For instance, if the cost is the squared norm of the difference between the estimate and the true value, as in the expression here, it is well known that the resulting estimate, the minimum mean-square error estimate, is just the posterior mean.
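The scoring and evaluation steps just described, the GMM log-likelihood ratio and the equal error rate, can be sketched in a few lines. This is a minimal illustration under assumptions of my own (diagonal-covariance GMMs, a simple threshold sweep, and the function names), not the system used in the talk:

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood of X (T, D) under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_comp = -0.5 * np.sum(diff ** 2 / variances
                             + np.log(2 * np.pi * variances), axis=2)
    log_comp += np.log(weights)                                    # (T, M)
    m = log_comp.max(axis=1, keepdims=True)                        # log-sum-exp
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))

def llr_score(X, target_gmm, ubm):
    """Average log-likelihood ratio: target model vs. universal background model."""
    return float(np.mean(gmm_loglik(X, *target_gmm) - gmm_loglik(X, *ubm)))

def equal_error_rate(true_scores, impostor_scores):
    """Sweep thresholds; EER is where miss rate and false-alarm rate cross."""
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))
    miss = np.array([(true_scores < t).mean() for t in thresholds])
    fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(miss - fa)))
    return 0.5 * (miss[i] + fa[i])
```

In a GMM-UBM baseline, `target_gmm` would typically come from adapting the UBM to the claimed speaker's enrollment data; here any (weights, means, variances) triple works.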
Note that the posterior mean is easy to write down, but in most practical cases, including the one we consider here, it is almost impossible to compute exactly. So what do we do? If the problem lies in the intractability of the posterior, then we can apply approximate Bayesian techniques. For instance, we can use VB, or variational Bayes, where we approximate the true posterior by one constrained to belong to a tractable family of distributions. We need a metric so that we know which member of the tractable family is the closest approximation to the true posterior, and we take the approximation that minimizes the KL divergence between our approximation and the true posterior. In cases where the parameter set consists of a number of parameters, as in ours, we can ensure tractability by assuming that the posterior factorizes as shown in this expression; this is the mean-field assumption. So the problem boils down to computing the forms of each of the approximate posterior factors and then iteratively updating their sufficient statistics. It can be shown that the expression for the approximate form of each factor is obtained by taking an expectation, with respect to the other factors, of the logarithm of the joint distribution of the observations and the parameters of interest.
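To make that mean-field recipe concrete, here is a toy instance small enough to follow by hand: a Gaussian with unknown mean and precision under a conjugate prior, where q(mu)q(tau) is updated exactly as described, by taking expectations of the log joint with respect to the other factor. This is a textbook example chosen for illustration, not the log-spectral model of the talk:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Mean-field VB for mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0),
    observations x_i ~ N(mu, 1/tau). Returns q(mu)=N(m, 1/lam), q(tau)=Gamma(a, b)."""
    n, xbar = len(x), float(np.mean(x))
    e_tau = a0 / b0                      # initial guess for E[tau]
    for _ in range(iters):
        # Update q(mu): expectation of the log joint w.r.t. q(tau)
        lam = (lam0 + n) * e_tau
        m = (lam0 * mu0 + n * xbar) / (lam0 + n)
        # Update q(tau): expectation of the log joint w.r.t. q(mu)
        a = a0 + (n + 1) / 2.0
        e_mu, e_mu2 = m, m ** 2 + 1.0 / lam
        b = b0 + 0.5 * (np.sum(x ** 2) - 2 * e_mu * np.sum(x) + n * e_mu2
                        + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2))
        e_tau = a / b
        # Cycle through the factors until convergence, as in the talk
    return m, lam, a, b
```

Each factor's sufficient statistics (mean and precision of q(mu); shape and rate of q(tau)) are functions of the data and the prior, and the two updates are cycled, which is exactly the structure the talk's updates take in the log-spectral model.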
Now let's get back to our speaker verification context, and in particular discuss the probabilistic model. We work in the log-spectral domain, and we assume that our observed signal y(t) is the clean signal corrupted by additive noise. If we take the DFT and look at the log spectrum, it can be shown that there is a nice approximate relationship between the log spectrum of the observed signal, the clean log spectrum, and the log spectrum of the noise. This is just our likelihood: in the Bayesian paradigm we have the likelihood and the prior, and this relationship is our likelihood. Next we need to write out the joint distribution and how it factorizes, because this will help us when we compute the approximate posteriors; recall that the expression for each of the optimal factors depends on an expectation of the logarithm of the joint distribution. This is how the joint distribution factorizes in our context: we have the clean log spectrum, an indicator variable, which I will explain in a moment, and the noise; so here you have the likelihood and the priors. The prior over the clean-speech log spectrum we assume to be speaker dependent. In a speaker ID context this would mean maintaining models for each speaker; in a verification context, we approximate the library of speakers by just the target speaker and the UBM, so the library is dynamic for each utterance you test. The indicator variable tells you who is speaking, in other words whether the target or the UBM is active, and which mixture component is active. This slide shows the forms of the factors that we compute; they have well-known forms, and the VB updates amount to updating the sufficient statistics, in our case the mean and the covariance, which are functions of the observations and the prior, and cycling through until some convergence criterion is met. What is convenient is that once you obtain an estimate of the clean posterior, you can easily derive MFCCs from it for verification.
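The approximate log-spectral relationship mentioned above, that the observed log power spectrum is roughly log(exp(clean) + exp(noise)), i.e. the clean log spectrum plus a log(1 + exp(noise - clean)) correction, can be checked numerically on toy signals. The white-noise "signals" and the noise level below are arbitrary choices for illustration; the approximation is exact only when the cross term between speech and noise spectra vanishes:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=4096)            # stand-in for a clean speech frame
n = 0.3 * rng.normal(size=4096)      # additive noise, arbitrary level
X, N = np.fft.rfft(x), np.fft.rfft(n)

log_y = np.log(np.abs(X + N) ** 2)   # observed log power spectrum
log_x = np.log(np.abs(X) ** 2)       # clean log power spectrum
log_n = np.log(np.abs(N) ** 2)       # noise log power spectrum

# The likelihood uses: log_y ~= log_x + log(1 + exp(log_n - log_x)).
approx = log_x + np.log1p(np.exp(log_n - log_x))

# This is identically log(|X|^2 + |N|^2); the per-bin error is the neglected
# cross term 2*Re(X * conj(N)), which averages out to zero across bins.
median_gap = np.median(np.abs(log_y - approx))
```

In the model this relation holds per frequency bin and is what couples the clean-speech prior, the noise prior, and the observation in the joint distribution.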
Now some experimental results. We used three datasets: initially we used TIMIT, then the MIT mobile device speaker verification corpus, and we also tried the method on the NIST SRE 2004 corpus. The initial results here are for TIMIT. We trained a UBM using training data from a subset of the 630 TIMIT speakers, and then we corrupted the speech using additive white Gaussian noise; I will present results for more realistic noise later. We used two test utterances per speaker, so from the 630 speakers we can generate 1260 true trials, and we select a random subset of ten speakers as impostors and compute scores for each trial. We also compared against our implementation of FDIC on this corpus, a feature-domain intersession compensation technique, which entails learning a projection matrix that projects the features into a session-independent subspace, with the aim of making verification more robust; the details are in the paper, so I will not go through them here. Here is a brief table of results for the TIMIT case with additive white Gaussian noise. We sweep through several SNRs. From the raw data we compute MFCCs, and the top line shows what we obtain if we use the MFCCs without first applying anything, just the raw features. The next line shows what we obtain if we compute the MFCCs after enhancing the log spectra using the VB technique.
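The feature-extraction step mentioned a moment ago, deriving MFCCs once the enhanced log spectrum is in hand, is a standard pipeline: a mel filterbank applied to the power spectrum, a log, then a DCT. A minimal sketch, where the filter count, FFT size, sample rate, and function names are assumed example values rather than the talk's configuration:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (common construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                     # rising edge
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                     # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_from_log_power(log_power, fb, n_ceps=13):
    """Map an (enhanced) log power spectrum -> mel energies -> DCT-II cepstra."""
    mel_energy = np.log(fb @ np.exp(log_power) + 1e-10)
    n = len(mel_energy)
    k = np.arange(n)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n))
    return dct @ mel_energy
```

The point made in the talk is that because the enhancement output lives in the log-spectral domain, it can be fed straight into this pipeline with no change to the verification back end.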
Our implementation of FDIC was able to work in the low-SNR cases, but in the high-SNR cases it seemed to break down; we are still investigating this. This is a DET plot for one of the SNR conditions for TIMIT, and we see that the equal error rate dropped by about half; that was the case across the SNRs we investigated. We also looked at other types of noise; here we had factory noise, obtained from the NOISEX-92 dataset, and the results are similar, with the figures differing at the various SNRs because of the type of noise. You can see that in the nearly clean condition the improvement is not as large, but that is because there is little noise to remove. Then we applied this to the MIT dataset, where we wanted to show what happens when we have mismatch: we trained models on data obtained in an office and tested with data recorded at a noisy street intersection. We observe the mismatch clearly: the error rate jumps when the test data is from the intersection while the models were trained on office data, and when we apply the VB technique it comes down to twenty-four percent. For the SRE experiments, we used the SRE 2004 corpus; the UBM configuration is given on the slide, with nineteen-dimensional MFCCs and standard cepstral mean normalization. The upshot is that we obtained only a modest gain when we applied our method: the baseline system had an equal error rate of 13.8 percent, and we were only able to get it down to 13.4. This may be due to the fact that the formulation assumes models trained on clean speech, which is not the case for SRE, and this limits the gain
compared to what we obtained on the TIMIT and the MIT datasets. And that's it; thank you. [Session chair] We have time for one quick question. [Audience member] One quick concern I have as a question: did you try to use a standard type of speech enhancement, such as Wiener filtering, to obtain the enhanced speech and then use the enhanced speech to do speaker verification? [Presenter] No, we did not, but we tried a related approach [inaudible]; we found it worked in a speaker ID context but not in this context. That is something we should do. [Audience member] Okay, thank you. [Session chair] Let's thank the speaker once again.