0:00:13 | yeah, as you've heard, my name is Ciira wa Maina

0:00:16 | and I'm from Drexel University, and this is joint work with my adviser

0:00:20 | John MacLaren Walsh

0:00:21 | and the title of our work is log-spectral enhancement using speaker dependent priors for speaker verification

0:00:29 | and the key idea behind this work is how we can use Bayesian parameter estimation techniques to improve the robustness of our speaker verification systems to noise and mismatch

0:00:42 | and the reason why we want to use a Bayesian technique is that a Bayesian approach allows us to have a principled way of accounting for parameter uncertainty in the noise estimation task

0:00:57 | and, you know, as in most pattern recognition systems, feature extraction is a key component, where the idea is to extract the parameters of interest from your raw signal; in this case we have speech which is corrupted by noise, and we want to extract features of interest that we use in our pattern classification algorithm

0:01:20 | the noise makes our parameter estimates erroneous in some cases, and depending on the severity of the noise, this also governs how much of an effect you have on the parameter estimates

0:01:37 | now, if we can account for this by computing a Bayesian estimate, we can probably enhance our speaker verification system, and here we can see that

0:01:49 | the two main causes of performance degradation are noise, which we have just discussed, and mismatch: in a speaker verification system we need a model for our speaker distribution, and the acoustic environment in which you obtained the training data may not be the same environment in which you are using the system, and this results in mismatch and hence performance degradation

0:02:22 | so the aim of our work, as the title suggests, is enhancement using speaker dependent priors in the log spectral domain, and the key idea is that we want to link two systems which we feel are closely matched: the speech enhancement system and the recognition system

0:02:41 | the intuition behind it is that if you are doing speech enhancement then you are enhancing features, and if you use speaker dependent priors, that is, if you have a better idea of who is speaking and a good prior in that domain, then you can do a better job of enhancing; and with the enhanced signal you can do a better job of recognition

0:03:07 | so there is an interplay between these two systems, and the way we do this is we view this interplay as message passing along the nodes of a graphical model; this message passing will fall out of our formulation

0:03:27 | just a brief outline of what the rest of the talk will be like: I'll briefly go over speaker verification for any members of the audience who may need it, then go into Bayesian inference and how it leads into variational Bayesian inference, which is the framework we work in, then discuss our model, and then go into the experimental results

0:03:57 | so in verification the task is: you are given an utterance and a claimed identity, and the task is a hypothesis test: given the speech segment X, is the speech from speaker S or not?

0:04:15 | what we do is model our target speakers using speaker-specific GMMs, and then we use a universal background model for the alternative hypothesis; this is the usual baseline system, the GMM-UBM system, which is the starting point of most verification systems; there are enhancements, but this is the most basic one, and this is where we will try our enhancement in the log spectral domain to see if we can get an improvement

0:04:56 | so for classification, deciding which hypothesis to accept, you compute a score, the log-likelihood ratio, and then with a threshold decide which hypothesis is correct

0:05:15 | and we can plot DET curves as well as other performance metrics, and we can also compute the equal error rate to determine the trade-off between missed detections and false alarms
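The GMM-UBM scoring rule described here can be sketched in a few lines; this is an illustrative snippet, not the authors' code, using toy single-component diagonal-covariance models and a hypothetical threshold of zero:

```python
import numpy as np

def log_gauss(x, mean, var):
    """Per-frame log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def gmm_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    comp = np.array([np.log(w) + log_gauss(frames, m, v)
                     for w, m, v in zip(weights, means, variances)])
    mx = comp.max(axis=0)
    return mx + np.log(np.exp(comp - mx).sum(axis=0))   # log-sum-exp over components

def verification_score(frames, target, ubm):
    """Average log-likelihood ratio of the target model vs the UBM."""
    return float(np.mean(gmm_loglik(frames, *target) - gmm_loglik(frames, *ubm)))

# toy single-component "GMMs" in two dimensions (all numbers hypothetical)
target = ([1.0], [np.array([2.0, 2.0])], [np.array([1.0, 1.0])])
ubm    = ([1.0], [np.array([0.0, 0.0])], [np.array([1.0, 1.0])])
frames = np.full((10, 2), 2.0)       # feature frames sitting on the target mean

score  = verification_score(frames, target, ubm)
accept = score > 0.0                 # accept the claim if the score clears the threshold
```

Sweeping that threshold over a set of true and impostor trials is what traces out the DET curve and the equal error rate mentioned above.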

0:05:31 | so that is the speaker verification part; now just a little bit of Bayesian inference

0:05:41 | we can say that there are two main approaches to parameter estimation: you can go the maximum likelihood route or the Bayesian inference route

0:05:50 | here we see that if you have data X, represented in this figure by X, with a generative model governed by a parameter theta, then in the maximum likelihood paradigm we assume that this parameter is an unknown constant; the key quantity of interest is the likelihood, and we can estimate theta based on the maximum likelihood criterion

0:06:16 | in the Bayesian paradigm, on the other hand, we assume that theta is a random variable governed by a prior, and this is where the robustness to parameter uncertainty comes in: the fact that we have a prior over the parameter of interest

0:06:37 | and the key quantity in this case is the posterior, which, given the data, is proportional to the product of the likelihood and the prior

0:06:46 | now the issue is how we obtain estimates: we obtain Bayesian estimates that minimize expected costs

0:06:57 | for instance, if the cost is the squared norm of the difference between the estimate and the true value, as in this expression here, it is well known that the resulting estimate, the minimum mean square error estimate, is just the posterior mean

0:07:23 | note that this is easy to write down, but what happens in most practical cases, and even in the one we consider here, is that it is almost impossible to perform this computation, so now what do we do?
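The minimum mean square error estimate being discussed is a standard identity; writing theta for the parameter and X for the data, it is:

```latex
\hat{\theta}_{\mathrm{MMSE}}
  \;=\; \arg\min_{\hat{\theta}} \, \mathbb{E}\!\left[ \lVert \hat{\theta}-\theta \rVert^{2} \,\middle|\, X \right]
  \;=\; \mathbb{E}\!\left[\theta \mid X\right]
  \;=\; \int \theta \, p(\theta \mid X)\, d\theta,
\qquad
p(\theta \mid X) \;\propto\; p(X \mid \theta)\, p(\theta).
```

The final integral over the posterior is what is typically intractable in practice.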

0:07:37 | uh |

0:07:43 | the problem lies in the intractability of the posterior, and if the problem lies in the intractability of the posterior, then we can apply approximate Bayesian techniques; for instance we can use VB, or variational Bayes, where we approximate our true posterior by one that is constrained to be tractable

0:08:09 | now, we need a mapping from an intractable family to a tractable family of distributions, and we need a metric so that we know what is the closest approximation to the true posterior in the tractable family; we obtain the approximation that minimizes the KL divergence between our approximation and the true posterior

0:08:42 | in cases where our parameter set consists of a number of parameters, as in this case, we can ensure tractability by assuming that the posterior factorizes into a product of factors, as shown in this expression

0:09:01 | so the question now boils down to computing the forms of the approximate posterior factors and then iteratively updating their sufficient statistics

0:09:21 | it can be shown that the expression for the approximate form of each distribution is computed by taking an expectation of the logarithm of the joint distribution between the observations and the parameters of interest
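The factorization and the optimal-factor expression referred to here are the standard mean-field results; in the usual notation, with parameters theta = (theta_1, ..., theta_M) and data X:

```latex
q(\theta) \;=\; \prod_{i=1}^{M} q_i(\theta_i),
\qquad
\log q_i^{*}(\theta_i) \;=\; \mathbb{E}_{\,q_j,\; j \neq i}\!\left[ \log p(X, \theta) \right] + \mathrm{const}.
```

Cycling through these updates coordinate-wise never increases the KL divergence from q to the true posterior, which is what guarantees the convergence of the iterations described later in the talk.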

0:09:44 | so now let's get back to our speaker verification context, and in particular let's discuss the probabilistic model

0:09:56 | so here we work in the log spectral domain, and what we assume is that our observed signal y(t) is the clean signal corrupted by additive noise

0:10:10 | and if we take the DFT we can compute the log spectrum as shown, the log magnitude squared of the DFT, and then it can be shown that there is a nice approximate relationship between the log spectrum of the observed signal, the clean log spectrum, and the log spectrum of the noise; this gives us our likelihood
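The approximate relationship alluded to here is usually written, per frequency bin, as y = x + log(1 + e^(n - x)), which follows from adding the speech and noise power spectra and ignoring the cross term; a quick numerical check of that identity, with toy values:

```python
import numpy as np

def noisy_log_spectrum(x, n):
    """Log power spectrum of noisy speech from clean (x) and noise (n) log spectra.

    Follows from |Y|^2 ~ |X|^2 + |N|^2 per DFT bin (cross term ignored),
    i.e. e^y = e^x + e^n, rewritten in a numerically stable form
    (equivalently np.logaddexp(x, n)).
    """
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)
    return x + np.log1p(np.exp(n - x))

x = np.array([0.0, 2.0, -1.0])   # clean log spectrum (toy values)
n = np.array([0.0, -3.0, 1.0])   # noise log spectrum (toy values)
y = noisy_log_spectrum(x, n)     # satisfies exp(y) == exp(x) + exp(n)
```

Note that the nonlinearity of this relationship in x and n is exactly why the resulting posterior is intractable and a variational approximation is needed.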

0:10:39 | recall that in the Bayesian paradigm we have the likelihood and the prior; so this is our likelihood, and now we need to write out our joint distribution and how it factorizes, because this will help us when we come to compute the approximate distributions; as you recall, the expression for each of the optimum factors depends on an expectation of the logarithm of the joint distribution

0:11:14 | so this is how the joint distribution factorizes in this context: you have the observed log spectrum, the clean log spectrum, a weight variable, which I'll explain later, that we introduce to improve tractability, an indicator variable, and the noise; so here you have the likelihood term, and the prior over the clean speech log spectrum, which we assume is speaker dependent

0:11:49 | so what happens is, this is the speaker dependent prior; in a speaker ID context this would mean that we learn models for each speaker, but in a verification context we approximate that, since it would not be feasible; in this verification context we assume that we can model the library of speakers as just the target speaker and the UBM, so the library is dynamic for each utterance you are testing, depending on who the claimant is

0:12:25 | and what we have here is this indicator variable that tells you who is speaking, in other words whether it is the target or the UBM, and which mixture component is active

0:12:39 | so this just shows you the forms of the factors that we compute, and we can see that they are well-known forms, and the VB update boils down to iteratively updating the sufficient statistics, in our case the mean and the covariance, which are a function of the observations and the prior, and then cycling through until some convergence is attained
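This "cycle the sufficient statistics until convergence" scheme is easiest to see on the textbook mean-field example, a Gaussian with unknown mean mu and precision tau under conjugate priors; this is an illustration of the update style only, not the talk's speech model, and all prior parameters below are arbitrary weak values:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, iters=50):
    """Mean-field VB for x_i ~ N(mu, 1/tau) with q(mu, tau) = q(mu) q(tau).

    Priors: mu ~ N(mu0, 1/(lam0 * tau)), tau ~ Gamma(a0, b0).
    Each pass updates one factor's sufficient statistics from the
    current moments of the other, then cycles.
    """
    x = np.asarray(x, dtype=float)
    N, xbar = x.size, x.mean()
    e_tau = 1.0                                   # initial guess for E[tau]
    for _ in range(iters):
        # update q(mu) = N(mu_n, 1/lam_n) given E[tau]
        mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_n = (lam0 + N) * e_tau
        # update q(tau) = Gamma(a_n, b_n) given moments of q(mu)
        e_sq = np.sum((x - mu_n) ** 2) + N / lam_n
        a_n = a0 + 0.5 * (N + 1)
        b_n = b0 + 0.5 * (e_sq + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        e_tau = a_n / b_n                         # statistic fed back into q(mu)
    return mu_n, lam_n, a_n, b_n

# data with mean 5 and variance 1 exactly
x = 5.0 + np.tile([-1.0, 1.0], 50)
mu_n, lam_n, a_n, b_n = vb_gaussian(x)
```

After convergence, the posterior mean of mu sits near the sample mean and E[tau] = a_n / b_n near the inverse sample variance, mirroring how the talk's updates settle on the clean-speech posterior.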

0:13:10 | and what is good is that once you obtain the clean posterior, an estimate of the clean posterior, we can derive MFCCs easily from it for verification
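Deriving cepstral features from enhanced log spectra is just a discrete cosine transform of the log mel filterbank outputs; a minimal sketch, where the mel filterbank itself is assumed given and the coefficient count is arbitrary:

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_ceps=13):
    """MFCCs as the type-II DCT of log mel filterbank energies.

    log_mel: (n_filters,) vector, e.g. an enhanced log spectrum
             mapped through a mel filterbank.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    n_filters = log_mel.shape[0]
    k = np.arange(n_ceps)[:, None]               # cepstral index
    m = np.arange(n_filters)[None, :]            # filter index
    basis = np.cos(np.pi * k * (m + 0.5) / n_filters)
    return basis @ log_mel

# a flat log spectrum puts all of its energy in c0
ceps = mfcc_from_log_mel(np.ones(26), n_ceps=4)
```

In practice a normalization convention (e.g. orthonormal DCT) and liftering would be applied on top, but those are scaling choices that do not change the structure.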

0:13:24 | so, just some experimental results; what we did is we used three datasets: initially we used TIMIT, then the MIT mobile device speaker verification corpus, and then we also tried it out on the SRE 2004 corpora

0:13:42 | so the initial results here are for TIMIT; we trained a UBM using training data from a subset of the six hundred and thirty speakers in the corpus, then we corrupted the speech using additive white Gaussian noise (I present results for realistic noise later), and then we used two test utterances per speaker

0:14:11 | so what happens is that, for the six hundred and thirty speakers, we can generate twelve hundred and sixty true trials, and then we select a random subset of ten speakers as impostors, and then we compute the scores for each trial

0:14:28 | and we also compared against FDIC, a feature domain intersession compensation technique, which entails applying a projection matrix to project the features into a session-independent subspace, with the hope that verification would be more robust; the details are in the paper, so I will not go through them

0:14:58 | and just a brief table of some results for the TIMIT case: when we add additive white Gaussian noise we sweep through some SNRs, and then from the raw data we compute MFCCs

0:15:15 | the top line shows you what we obtain if we just compute the MFCCs without first applying anything, just the raw features, and then the second line shows what we obtain if we compute MFCCs after we have enhanced the log spectra using the VB technique

0:15:35 | our implementation of FDIC was able to work in the low SNR cases; in the high SNR case it seems to have broken down in our implementation, which we can investigate

0:15:53 | this is a DET plot for the twenty dB case for TIMIT, and we see that the equal error rate dropped by half, and that was the case across the SNRs we investigated

0:16:07 | we also looked at other types of noise; what we have here is factory noise, obtained from the NOISEX-92 dataset, and the results are similar, only the figures are different, at different SNRs, because of the type of noise

0:16:27 | but we see that the gain in this domain is not as good, and this is because it is a very nearly clean condition to begin with

0:16:40 | now, when we applied this to the MIT dataset, we wanted to show the difference, what happens when we have mismatch: we trained with data obtained in an office and tested with data from a noisy street intersection

0:17:09 | and we observe the mismatch: the error rate jumps up to over twenty percent when the test data is from the intersection while the models were trained on office data, and when we apply the VB technique it reduces to twenty-four percent

0:17:28 | for the SRE experiments we used the SRE 2004 corpora; to give some details, we used a UBM with five hundred and twelve mixture components and nineteen-dimensional MFCCs with cepstral mean normalization

0:17:45 | but what happened is that we only obtained a modest gain when we applied our method; the baseline system had an equal error rate of thirteen point eight, and we were only able to get it to thirteen point four

0:18:00 | this may be due to the fact that the formulation assumes models trained on clean speech, and this is probably the reason for the modest gain when compared to what we got with TIMIT and the MIT dataset

0:18:20 | and that's it; thank you

0:18:26 | [session chair] we have time for one quick question

0:18:38 | [audience member] I have a question: did you try to use a standard type of speech enhancement, such as Wiener filtering, to obtain the enhanced speech, and then use the enhanced speech to do speaker verification?

0:18:55 | [presenter] no, we did not, but what we tried was using the Ephraim-Malah enhancement, and we were using that in a speaker ID context, not in this context; that is something we should do

0:19:10 | [audience member] okay, yes, thank you

0:19:12 | [session chair closes the talk]