0:00:15 | okay this is going to be a somewhat

0:00:17 | technical talk

0:00:19 | so the topic of the paper

0:00:22 | is uncertainty modeling in

0:00:25 | text-dependent speaker recognition

0:00:27 | one of the issues that i'm concerned with here is

0:00:34 | what to do about

0:00:38 | the problem of a speaker recognition context where you have very little data and the

0:00:43 | features you extract

0:00:46 | are necessarily going to be noisy in the statistical sense

0:00:50 | okay this comes straight to the fore in text-dependent speaker recognition where you

0:00:55 | have maybe just two seconds of data

0:00:59 | it's also an important problem in text-independent speaker recognition

0:01:05 | because of the need to be able to

0:01:09 | set a uniform threshold even in cases where your test utterances are of variable duration

0:01:17 | it would be interesting to see what happens in the forthcoming nist evaluation

0:01:23 | with that particular problem

0:01:26 | some progress has been made on this with

0:01:29 | subspace methods with i-vectors you try to quantify

0:01:34 | the statistical noise in the i-vector extraction process

0:01:39 | and feed that into the plda model

0:01:45 | but |

0:01:46 | i've taken that possibility off the table for the purposes of this talk and said look

0:01:51 | subspace methods

0:01:53 | in general are not going to work in text-dependent speaker recognition because

0:01:59 | of the data distribution

0:02:01 | okay so |

0:02:02 | what i attempted to do was to tackle this problem

0:02:06 | of modeling speaker variability when one is not able to characterize it by subspace

0:02:14 | methods

0:02:17 | i realised while preparing the presentation that the paper in the proceedings

0:02:21 | is very dense it's rather difficult to read

0:02:25 | but |

0:02:26 | the idea is a bit tricky but it's fairly simple okay so i have made an

0:02:32 | effort in the slides just to communicate the core idea and if you are interested

0:02:37 | then i recommend that you

0:02:40 | that you look at the slides rather than the paper i posted the slides

0:02:44 | on my web page

0:02:48 | so the

0:02:50 | the test set for this task we took the rsr2015 part three that's the

0:02:54 | random digit

0:02:56 | portion

0:02:58 | of the rsr data i'll just mention two things about this

0:03:02 | because of the design we have five random digits at test time

0:03:10 | while all ten digits were repeated three times at enrollment time in random order you

0:03:15 | only see half of the digits at test time

0:03:18 | and it actually turns out that under

0:03:22 | those conditions

0:03:24 | gmm methods have an advantage okay because you can use all of the

0:03:31 | all of your enrollment data no matter what the test utterance is whereas if you

0:03:36 | pre-segment the data into digits you are actually constraining yourself to using the enrollment

0:03:43 | data that corresponds to the digits that actually occur

0:03:46 | in the test utterance

0:03:53 | one of the things i should mention is that this paper is about the back end

0:03:56 | we used a standard sixty-dimensional plp front end which is maybe not ideal

0:04:01 | i think that'll come up

0:04:03 | in the next talk you can get much better results on female speakers if you

0:04:08 | use a low-dimensional front end

0:04:10 | which i think the authors of the next talk were the first to discover

0:04:15 | so |

0:04:18 | the model that i was using here is a model

0:04:22 | which uses low-dimensional hidden variables to characterize channel effects

0:04:29 | but does not

0:04:33 | attempt to characterize speakers using subspace methods

0:04:40 | and the z vector that characterizes speakers would be used as a feature vector

0:04:47 | for

0:04:50 | speaker recognition

0:04:52 | and the problem i wanted to address was to design a back end that would

0:04:57 | take account of the fact

0:04:59 | that the number of observations

0:05:03 | available to estimate the components of this vector is very small

0:05:09 | in general you have less than

0:05:12 | one frame per mixture component if you have a two-second utterance and a ubm

0:05:17 | with five hundred and twelve gaussians a quick calculation shows that you have extremely sparse

0:05:23 | data
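the back-of-the-envelope calculation alluded to here can be sketched as follows; the 100-frames-per-second rate is an assumption (the typical 10 ms hop), not something stated in the talk:

```python
# Rough count of frames per UBM component for a short utterance, as described
# above. The frame rate is an assumed value (10 ms hop), purely illustrative.
FRAME_RATE = 100            # frames per second (assumed 10 ms hop)
utterance_seconds = 2.0
n_gaussians = 512

total_frames = utterance_seconds * FRAME_RATE       # 200 frames in total
frames_per_component = total_frames / n_gaussians   # well under one frame each
```

with 200 frames spread over 512 gaussians you get roughly 0.4 frames per mixture component on average, which is the sparse-data regime the talk is about.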

0:05:26 | so there are two back ends that i present

0:05:31 | one the joint density back end uses point estimates of the features that are

0:05:38 | extracted at enrollment time and at test time and it

0:05:43 | models the correlation between the two

0:05:46 | okay to construct a likelihood ratio and the innovation in this paper the hidden supervector

0:05:52 | back end treats

0:05:54 | those two feature vectors as hidden variables as in the original formulation of jfa

0:06:03 | so the key ingredient is to supply a prior distribution on the

0:06:11 | correlations between those hidden variables

0:06:17 | how much time do i have left i'm sorry

0:06:21 | how much

0:06:23 | okay good

0:06:24 | okay so |

0:06:26 | so

0:06:28 | let me just digress for a minute

0:06:30 | the way uncertainty modeling is usually tackled in text-independent speaker recognition is

0:06:39 | that you try to characterize the uncertainty in a point estimate of an i-vector using |

0:06:44 | a posterior covariance matrix |

0:06:47 | that is calculated using zero order statistics

0:06:50 | and you do this on the enrollment side and on the test side independently |

0:06:56 | if you think about this you realise that this isn't quite the right way to

0:07:00 | do it

0:07:01 | okay the reason is that if you are hypothesizing a target trial then what

0:07:07 | you see on the test side has to be highly correlated with what you see

0:07:10 | on the enrollment side

0:07:13 | they are not statistically independent

0:07:16 | and there has to be a benefit

0:07:18 | from

0:07:19 | using those correlations to quantify the uncertainty in the feature that comes out of the

0:07:26 | test utterance there is something called the law of total variance that says on average

0:07:32 | when you condition one random variable on another you will reduce the variance
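the law of total variance mentioned here is easy to check numerically; the correlation value and the bivariate-gaussian setup below are toy assumptions, not numbers from the talk:

```python
import numpy as np

# Numerical illustration of the law of total variance: conditioning the test
# side on a correlated enrollment side reduces the variance on average.
rng = np.random.default_rng(0)
rho = 0.8                                    # assumed enrollment/test correlation
cov = [[1.0, rho], [rho, 1.0]]
enroll, test = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T

marginal_var = np.var(test)                  # close to 1.0
# For a bivariate Gaussian, E[test | enroll] = rho * enroll, so the residual
# variance estimates the conditional variance 1 - rho**2.
residual = test - rho * enroll
conditional_var = np.var(residual)           # close to 0.36, much smaller

assert conditional_var < marginal_var        # conditioning shrinks the variance
```

this is exactly the effect the independent-sides recipe throws away: it never conditions the test side on the enrollment side.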

0:07:37 | okay so

0:07:39 | the critical thing that i introduced in this paper is this correlation

0:07:46 | between the enrollment and the test

0:07:50 | okay so here's the mechanics of how the joint density back end works it's

0:07:54 | pretty straightforward the features are point estimates

0:07:58 | this was inspired by sandro cumani's work at the last odyssey

0:08:04 | he implemented this at the level of i-vectors there's nothing to stop you from

0:08:10 | doing it at the level of supervectors as well

0:08:13 | you obviously can't train

0:08:17 | correlation matrices of supervector dimension but you can implement this idea at the level of |

0:08:22 | individual mixture components

0:08:25 | so that gives you a trainable

0:08:27 | back end for text-dependent speaker recognition even if you can't use subspace methods
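the per-component trick described here can be sketched in a few lines; the array shapes, the function name, and the toy data are my assumptions, not the paper's actual implementation:

```python
import numpy as np

# Sketch of why the per-component trick makes the joint density back end
# trainable: instead of one correlation matrix of supervector dimension, you
# estimate a small joint covariance per mixture component from paired
# enrollment/test features of training target trials.
def train_component_covariances(enroll_feats, test_feats):
    """enroll_feats, test_feats: (n_trials, n_components, feat_dim) arrays of
    paired target-trial features. Returns one (2*feat_dim, 2*feat_dim) joint
    covariance per mixture component."""
    covs = []
    for c in range(enroll_feats.shape[1]):
        # Stack the enrollment and test features of each trial side by side.
        pairs = np.hstack([enroll_feats[:, c, :], test_feats[:, c, :]])
        covs.append(np.cov(pairs, rowvar=False))
    return covs
```

on synthetic target trials that share a speaker factor, the off-diagonal (enrollment/test) block of each small covariance comes out clearly positive, which is the correlation the back end exploits.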

0:08:36 | so that's the

0:08:37 | that's our best back end and it's the one that i used as a benchmark

0:08:41 | for our experiments

0:08:46 | so |

0:08:46 | the supervector back end is the hidden version of this

0:08:51 | okay that says

0:08:53 | you're in a position to observe baum-welch statistics but you're not in a position to observe

0:08:58 | these z factors you have to make inferences

0:09:02 | about the posterior distribution of those features and base

0:09:06 | your likelihood ratio on that calculation

0:09:11 | now it turns out that these

0:09:14 | probability calculations are formally just mathematically equivalent to calculations with a funny i-vector

0:09:23 | extractor that has just two gaussians in it

0:09:25 | you take a mixture component from the ubm

0:09:28 | you say you observed that mixture component once on the enrollment side once on the

0:09:32 | test side so you have two

0:09:34 | hidden gaussians

0:09:36 | okay you have a variable number of observations on the enrollment side and a variable

0:09:40 | number of observations on the test side that's the type of situation that we model

0:09:45 | with an i-vector extractor so here

0:09:49 | there is an i-vector extractor

0:09:51 | but it's only being used to do probability calculations it's not going to be used

0:09:57 | to

0:09:59 | to extract features

0:10:03 | now

0:10:04 | one thing about this i-vector extractor is that you're not going to use it to impose

0:10:08 | subspace constraints because it just has the two gaussians right

0:10:12 | you don't need to say that those two gaussians

0:10:16 | lie in a low-dimensional subspace of the

0:10:21 | of the supervector space

0:10:24 | so you might as well just take the total variability matrix to be the identity

0:10:28 | matrix and shift all of the burden

0:10:31 | of modeling the data

0:10:32 | onto the prior distribution

0:10:36 | in i-vector modeling we always take a standard normal prior

0:10:42 | zero mean identity covariance matrix that's because there is in fact nothing in general to

0:10:49 | be gained by using a non-standard prior okay you can always compensate for a non-standard

0:10:55 | prior

0:10:56 | by fiddling with the

0:10:58 | with the total variability matrix here

0:11:01 | we take the total variability matrix to be the identity but we have to

0:11:05 | train the prior

0:11:07 | and |

0:11:09 | that involves doing well you can do the posterior calculations if you look at those

0:11:15 | formulas you see they look just like the standard ones except i now have a

0:11:20 | mean and a precision matrix which

0:11:23 | would be zero and the identity matrix in the case of the

0:11:28 | standard

0:11:30 | standard normal prior
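the posterior formulas referred to here can be sketched for the case where the total variability matrix is the identity; the function name, interface, and toy inputs are my assumptions, not the paper's notation:

```python
import numpy as np

# Sketch of the Gaussian posterior for the hidden variable with a trainable
# prior: identical in shape to the standard i-vector posterior, except the
# prior mean and precision are free parameters instead of 0 and I.
def hidden_variable_posterior(n_stat, f_stat, sigma_inv, prior_mean, prior_prec):
    """n_stat: zero-order statistic (count), f_stat: first-order statistic
    vector, sigma_inv: inverse observation covariance. With the total
    variability matrix fixed to the identity, the posterior is Gaussian with
    the precision and mean below."""
    post_prec = prior_prec + n_stat * sigma_inv
    post_mean = np.linalg.solve(
        post_prec, prior_prec @ prior_mean + sigma_inv @ f_stat)
    return post_mean, post_prec
```

as a sanity check: with no observations (zero statistics) the posterior just reproduces the prior, and with a zero-mean, identity-precision prior the formulas reduce to the familiar standard-normal i-vector case.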

0:11:31 | and you can do minimum divergence estimation okay which is in effect a way

0:11:37 | of training the prior if you think about the way minimum divergence estimation works

0:11:42 | in fact what you're doing is estimating a prior

0:11:47 | okay and then we say well there's no gain in using a non-standard prior

0:11:53 | so we standardize the prior and modify the total variability matrix instead so here

0:11:59 | we just

0:12:00 | estimate the prior i put estimate in inverted commas estimating a prior is

0:12:06 | not something

0:12:08 | a bayesian would do but we do it all the time and it works

0:12:12 | so how would you train this assuming you have organised your training

0:12:16 | data into target trials

0:12:18 | you would collect

0:12:21 | the baum-welch statistics for each trial for each mixture component in the ubm you would have

0:12:27 | an observation on the enrollment side and an observation or multiple observations on the test side

0:12:33 | and you just implement this minimum divergence estimation procedure

0:12:38 | to get a prior distribution that tells you what correlations to expect

0:12:44 | between the enrollment data and the test data

0:12:47 | in the case of a target trial
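the "estimate the prior" step can be sketched as the usual minimum divergence update; the interface and the interpretation of its inputs are my assumptions, a schematic version rather than the paper's exact recipe:

```python
import numpy as np

# Sketch of minimum divergence estimation read as "estimating the prior":
# given the posterior mean and covariance of the stacked hidden variable for
# every training target trial, the updated prior is the aggregated posterior.
def minimum_divergence_prior(post_means, post_covs):
    """post_means: list of posterior mean vectors, one per target trial.
    post_covs: matching list of posterior covariance matrices."""
    post_means = np.asarray(post_means)
    prior_mean = post_means.mean(axis=0)
    centered = post_means - prior_mean
    # Average posterior covariance plus the scatter of the posterior means;
    # the cross-blocks of this matrix carry the enrollment/test correlations.
    prior_cov = (np.mean(post_covs, axis=0)
                 + (centered.T @ centered) / len(post_means))
    return prior_mean, prior_cov
```

the off-diagonal block of the resulting covariance is precisely the enrollment/test correlation that the target-trial prior encodes.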

0:12:53 | if you want to handle non-target trials you just impose a statistical independence

0:12:59 | assumption you just zero out all the correlations

0:13:04 | okay so the way you would use this machinery to calculate a likelihood ratio

0:13:10 | is that

0:13:11 | given enrollment data and test data

0:13:14 | you would calculate the evidence this is just the likelihood of the data

0:13:19 | that you get when you integrate out the hidden variables

0:13:24 | it's not usually done but i think everybody who has an implementation of i-vectors

0:13:29 | should always calculate the evidence

0:13:31 | okay because it tells you

0:13:33 | it's a very it's a very good diagnostic

0:13:36 | that tells you whether your implementation is correct you have to evaluate an integral it's

0:13:41 | a gaussian integral

0:13:43 | the answer can be expressed in closed form in terms of the baum-welch statistics

0:13:47 | as in the paper

0:13:49 | so in order to use this for speaker recognition you evaluate the evidence in

0:13:54 | two different ways one with the prior for target trials and one with the prior for

0:13:59 | non-target trials you take the ratio of the two and that gives you your likelihood

0:14:03 | ratio

0:14:04 | for speaker recognition
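the evidence-ratio scoring just described can be sketched in the fully gaussian case; the names, shapes, and the simple obs = hidden + noise observation model are illustrative assumptions, not the paper's closed-form expressions:

```python
import numpy as np

# Sketch of evidence-ratio scoring: with everything Gaussian, integrating out
# the hidden supervectors leaves a closed-form marginal likelihood, evaluated
# once under the target-trial prior and once under the non-target prior.
def log_evidence(stacked_obs, prior_mean, prior_cov, noise_cov):
    # Marginal of obs = hidden + noise, with hidden ~ N(prior_mean, prior_cov).
    cov = prior_cov + noise_cov
    diff = stacked_obs - prior_mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(diff) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def llr(stacked_obs, tar_prior, non_prior, noise_cov):
    """tar_prior/non_prior: (mean, covariance) pairs; the non-target prior is
    the target prior with the enrollment/test cross-correlations zeroed."""
    return (log_evidence(stacked_obs, *tar_prior, noise_cov)
            - log_evidence(stacked_obs, *non_prior, noise_cov))
```

with a toy correlated target prior, matching enrollment/test observations score positive and mismatched ones score negative, as a likelihood ratio should.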

0:14:07 | so the mechanics of getting this to work depend critically on how you prepare

0:14:13 | the baum-welch statistics

0:14:15 | that summarize the enrollment data and the test data

0:14:21 | the first thing you need to do

0:14:23 | is stack the enrollment utterances

0:14:26 | okay each of those is potentially contaminated by channel effects so you take the raw

0:14:31 | baum-welch statistics and

0:14:32 | filter out the channel effects

0:14:35 | just using the jfa model

0:14:39 | so in that way you get a set of synthetic baum-welch statistics which

0:14:43 | characterizes the speaker you just pool the baum-welch statistics together after you have

0:14:48 | filtered out the channel effects you do that on the enrollment side you do the

0:14:52 | same thing on the test side and you end up in any trial having to

0:14:57 | compare

0:14:58 | one set of baum-welch statistics with another using this hidden supervector back end
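the compensate-then-pool step can be sketched as below; the array layout, the function name, and the use of a point estimate of the channel factor are my assumptions (the point estimate is consistent with the answer given in the question period):

```python
import numpy as np

# Sketch of the preparation step: subtract the estimated channel offset from
# each utterance's first-order Baum-Welch statistics, then pool the
# compensated statistics over one side of the trial.
def compensate_and_pool(zero_stats, first_stats, U, x_channels):
    """zero_stats: (n_utts, n_comp); first_stats: (n_utts, n_comp, dim);
    U: (n_comp, dim, rank) JFA channel loadings; x_channels: (n_utts, rank)
    posterior means of the channel factors."""
    pooled_n = zero_stats.sum(axis=0)
    pooled_f = np.zeros_like(first_stats[0])
    for utt in range(len(zero_stats)):
        # Per-component mean shift U @ x predicted by the channel factor.
        offset = np.einsum('cdr,r->cd', U, x_channels[utt])
        pooled_f += first_stats[utt] - zero_stats[utt][:, None] * offset
    return pooled_n, pooled_f
```

the same routine is run on the enrollment side and on the test side, and the two pooled statistic sets are what the hidden supervector back end compares.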

0:15:06 | and here's a here's a new wrinkle that really makes this work

0:15:13 | we know that

0:15:16 | the sort of achilles heel of jfa models is the gaussian assumption

0:15:22 | okay and the reason why we do length normalization

0:15:28 | in between extracting i-vectors and feeding them to plda is in

0:15:34 | order to fix up the gaussian assumptions in plda

0:15:40 | we have to do a similar

0:15:43 | trick here but the normalization is a bit tricky because

0:15:46 | you have to normalize baum-welch statistics you're not normalizing a vector

0:15:50 | obviously for the first order statistics

0:15:53 | the magnitude is going to depend on the zero order statistics so it's not

0:15:57 | immediately obvious what to do

0:16:00 | so the recipe that i used comes from going back

0:16:08 | to the jfa model and seeing how the jfa model is trained these z

0:16:12 | vectors we treat them as hidden variables

0:16:16 | okay that come with

0:16:18 | both a point estimate and an uncertainty a posterior covariance matrix that tells you

0:16:25 | well how much you can learn about the

0:16:30 | underlying hidden vector from the observations

0:16:34 | and the thing that

0:16:36 | turns out to be convenient to normalize is the expected norm of

0:16:41 | that hidden variable

0:16:45 | instead of making the norm equal to one you make the expected norm one
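the expected-norm identity behind this normalization is E[||z||^2] = ||mu||^2 + trace(Sigma) for a gaussian posterior; the sketch below uses that identity with toy numbers of my choosing:

```python
import numpy as np

# Sketch of the expected-norm normalization: for a hidden variable with
# posterior mean mu and posterior covariance Sigma,
#   E[||z||^2] = ||mu||^2 + trace(Sigma),
# and we rescale so that this expectation, not the norm of the point
# estimate, equals one. With sparse data the trace term dominates.
def expected_norm_scale(post_mean, post_cov):
    expected_sq_norm = post_mean @ post_mean + np.trace(post_cov)
    return 1.0 / np.sqrt(expected_sq_norm)

mu = np.array([0.1, 0.2])      # toy posterior mean (small, data are sparse)
Sigma = 0.9 * np.eye(2)        # toy posterior covariance: large uncertainty
s = expected_norm_scale(mu, Sigma)
# After scaling by s, E[||s * z||^2] = s**2 * E[||z||^2] = 1.
```

note that in this toy setting trace(Sigma) is far larger than ||mu||^2, which mirrors the talk's point that the covariance term is the dominant one.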

0:16:50 | okay so

0:16:52 | a curious thing is that the second term on the right hand side the

0:16:55 | trace of the posterior covariance matrix is actually the dominant term

0:17:00 | that is because the uncertainty is huge

0:17:04 | and there is an experiment in the paper that shows that you had better not neglect that term

0:17:11 | and the role of the relevance factor in the in the experiments that i reported

0:17:17 | in the paper

0:17:19 | as you fiddle with the relevance factor you're actually fiddling with the relative

0:17:23 | magnitude of this term here so you do have to experiment with

0:17:28 | all possible relevance factors in order to get this thing working properly

0:17:38 | okay so here are some results using what i call global feature vectors that's where we

0:17:46 | don't bother to pre-segment the data

0:17:49 | into digits

0:17:51 | okay

0:17:51 | remember i said at the beginning

0:17:53 | that there was an advantage on this task to not

0:17:57 | segment

0:17:58 | okay in other words just to ignore the left-to-right structure

0:18:01 | that you're given in the problem

0:18:04 | so there is a gmm-ubm benchmark

0:18:08 | a joint density benchmark and two versions of the hidden supervector back end one without

0:18:14 | the length normalization applied to the baum-welch statistics and the other with it

0:18:20 | so you see that the length normalization really is the key

0:18:25 | to getting this thing to work

0:18:30 | i should mention that you see there is a sizeable reduction in error rate

0:18:34 | there on the female side from eight percent to six percent

0:18:38 | that appears to have to do with the with the front end

0:18:42 | we fixed a standard front end for these experiments

0:18:46 | but it appears that if you use

0:18:50 | lower dimensional feature vectors for female speakers you get better results and i think that's

0:18:55 | the explanation of that

0:18:59 | so there is there's actually a fairly big improvement okay if you go from one

0:19:04 | twenty eight gaussians to five twelve even though the uncertainty in the case of five

0:19:08 | twelve is necessarily going to be larger

0:19:10 | it was this

0:19:12 | this phenomenon that originally motivated us to look at the uncertainty modelling problem

0:19:20 | you can also implement this

0:19:23 | if you pre-segment the data into digits and extract what i call local z vectors

0:19:28 | in the paper

0:19:30 | and it works in that case as well

0:19:33 | there is a trick that we use here namely so-called component fusion

0:19:40 | you can break the likelihood ratio up into contributions from the individual gaussians and

0:19:46 | weight them where the weights are

0:19:50 | calculated using logistic regression

0:19:54 | that helps quite a lot

0:19:57 | it requires however that you have a development set

0:20:03 | in order just to choose the fusion weights
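the component fusion trick can be sketched as plain logistic regression over per-gaussian score contributions; the training loop, names, and toy data are my assumptions (the paper uses a regularized variant):

```python
import numpy as np

# Sketch of component fusion: treat the per-Gaussian contributions to the
# log-likelihood ratio as features and learn fusion weights by logistic
# regression on a development set. Plain batch gradient ascent, no regularizer.
def train_fusion_weights(per_comp_llrs, labels, n_iters=500, lr=0.1):
    """per_comp_llrs: (n_trials, n_components) per-Gaussian LLR contributions;
    labels: 1.0 for target trials, 0.0 for non-target trials."""
    w = np.zeros(per_comp_llrs.shape[1])
    for _ in range(n_iters):
        scores = per_comp_llrs @ w
        p = 1.0 / (1.0 + np.exp(-scores))               # sigmoid
        # Gradient of the mean log-likelihood of the labels.
        w += lr * per_comp_llrs.T @ (labels - p) / len(labels)
    return w

def fused_llr(per_comp_llrs, w):
    return per_comp_llrs @ w
```

on toy data where one component carries the signal and another is pure noise, the learned weight on the informative component dominates, which is the behavior the trick relies on.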

0:20:07 | so in fact you see in the paper that

0:20:11 | the results with these local z vectors although we did obtain an improvement on the

0:20:17 | evaluation set it was not as big an improvement as we obtained

0:20:22 | on the development set

0:20:24 | you need data if you're going to use a regularized logistic regression

0:20:29 | so we found a way around that

0:20:35 | instead of pre-segmenting the data into individual digits

0:20:41 | we used a speech recognition system

0:20:44 | in order to collect the baum-welch statistics

0:20:47 | i mean

0:20:48 | in text-independent speaker recognition if you can use

0:20:54 | a senone-discriminant neural network to collect baum-welch statistics the obvious thing to do in

0:20:59 | text-dependent speaker recognition where you know

0:21:03 | the phonetic transcription is just to use a speech recognizer to collect the

0:21:09 | to collect the baum-welch statistics

0:21:11 | and because

0:21:13 | individual senones are very unlikely to occur more than once

0:21:17 | in a digit string you are implicitly imposing a left-to-right structure

0:21:22 | so you don't have to do it explicitly

0:21:26 | and that works just as well

0:21:31 | okay so there are some fusion results

0:21:35 | for the two approaches with and without

0:21:39 | paying attention to the left-to-right structure and if you fuse them you

0:21:43 | you do get better results

0:21:46 | okay so just to summarize one thing that i didn't dwell on

0:21:51 | is that this thing can be implemented very efficiently

0:21:54 | okay you can

0:21:56 | you can basically set things up in such a way that the linear algebra that needs

0:22:00 | to be performed at runtime just involves

0:22:03 | diagonal matrices

0:22:05 | so it's nothing like the i-vector back end that i

0:22:11 | presented at

0:22:13 | the last interspeech conference which

0:22:16 | was just a trial run for this and wasn't intended to be a realistic solution to the problem

0:22:23 | that involves essentially

0:22:26 | extracting an i-vector per trial which is not something you would normally do

0:22:32 | but this is very computationally reasonable so it is feasible in practice

0:22:38 | okay that's all i have to say thank you

0:22:48 | okay we have time for

0:22:50 | questions

0:23:00 | you are normalising the channel effects in the baum-welch statistics did you also normalize for

0:23:06 | any of the phoneme variability there as well

0:23:10 | well as i said this is future work i intend to do something about it but

0:23:19 | i

0:23:19 | i think this is a problem we should pay attention to

0:23:23 | okay so we've made some preliminary work on it and it's something

0:23:28 | we will pursue

0:23:33 | phonetic variability is real but in

0:23:37 | text-dependent speaker recognition it's not so much an issue as it is in

0:23:42 | text-independent where it's really going to come from

0:23:47 | a neural network that's trying to discriminate

0:23:57 | okay a last question are you using a point estimate of the channel

0:24:03 | variables

0:24:06 | that's right that's what the jfa recipe calls for i think it could be somehow

0:24:13 | refined

0:24:15 | it's what the jfa recipe calls for even though the channel

0:24:21 | variables are treated as hidden variables that have

0:24:25 | a posterior expectation and a posterior covariance matrix

0:24:28 | if you look at

0:24:31 | the role that

0:24:32 | they play

0:24:34 | if you're merely interested in filtering out the channel effects it turns out that all

0:24:38 | you need is the

0:24:41 | is the posterior expectation

0:24:43 | this is just what the model says

0:24:44 | so

0:24:46 | the model is very simple you just

0:24:50 | follow what the model says

0:24:55 | okay time to stop thank you