0:00:15 Okay, so my talk is about a generative pairwise model for speaker recognition. As some of you may know, I have been working for quite some time on discriminative models for i-vector classification, and in particular on discriminative models that directly classify i-vector trials, that is, pairs of i-vectors, as belonging to the same-speaker or different-speaker class. These discriminative models were first introduced as a way to discriminatively train PLDA parameters, and later we gave an interpretation of them as discriminative training of the parameters of a second-order Taylor expansion of the log-likelihood ratio. So I have been working mostly in trial space, and here the idea was to go back from the discriminative framework to the generative one while remaining in trial space. The question was whether it would be possible to train a generative model in trial space, and how well it would behave. It turns out that it is very easy to do in practice, and it works pretty well, I would say more or less on par with the other state-of-the-art models.

So in this talk I will show how we define this model, which is a very simple model employing two Gaussian distributions to model trials; then I will show the relationship of this model with PLDA and with the discriminative PLDA / pairwise SVM approach; and then I will show how this model can be very easily extended to handle more complicated distributions. In particular I will work with heavy-tailed distributions, following the work by Patrick Kenny on heavy-tailed PLDA.

So, to define the trial space: we take two i-vectors and we stack them together, and that is our definition of a trial. Here I have a couple of pictures showing what would happen if we were working with one-dimensional i-vectors. On the left I have the one-dimensional i-vectors, which are the black
dots, and on the right, taking all cross-pairs of i-vectors, you can see that there is a well-defined region where trials built from i-vectors of the same speaker lie, and it is quite well separated from the region where pairs coming from different speakers lie. So, while with discriminative training we try to discriminatively train the surfaces that separate these regions, here I am going to build a generative model that describes these two sets of points.

The easiest generative model we can think of is this: we have a two-class problem, a binary problem, and we can assume that the trials are observations that can be modeled by Gaussian distributions. So we have one Gaussian distribution describing the trials belonging to the same-speaker class and one describing the trials belonging to the different-speaker class, each with its own parameters; for symmetry reasons, we will assume that the mean of the two distributions is the same. Regarding the symmetry of the trial: if we take a pair of i-vectors, we can stack them in two ways, enrollment first or test first, but we do not want to give any particular order to the vectors, so we want a generative model that treats both versions of the trial in the same way. This imposes some constraints on the covariance matrices, which are described here: these two blocks have to be the same, as well as these two, and likewise for the other distribution. In practice, when working with all pairs from a single i-vector dataset, we do not even need to impose these constraints, because they arise naturally during training.

So, how can we train this model? With the simplest thing we can think of: maximum likelihood, assuming that the i-vector trials are independent. Of course, i-vector trials are not independent, because they are all pairs built from a single i-vector set.
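As a rough illustration of the model just described, here is a minimal sketch (with assumed toy one-dimensional "i-vectors", not real data) of building all trial pairs and fitting the two Gaussians by maximum likelihood; it also checks that the swap-symmetry constraint on the covariances arises naturally when all pairs are used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one-dimensional "i-vectors" for a few speakers.
ivecs = {s: rng.normal(loc=2.0 * s, scale=0.5, size=10) for s in range(5)}

# Build trial vectors by stacking all ordered pairs of i-vectors.
same, diff = [], []
for a in ivecs:
    for b in ivecs:
        for x in ivecs[a]:
            for y in ivecs[b]:
                (same if a == b else diff).append([x, y])
same, diff = np.array(same), np.array(diff)

# Maximum-likelihood estimates: a shared mean, one covariance per class.
mu = np.vstack([same, diff]).mean(axis=0)
C_same = (same - mu).T @ (same - mu) / len(same)
C_diff = (diff - mu).T @ (diff - mu) / len(diff)

# Because the set of pairs is closed under swapping enrollment and test,
# the symmetry constraints hold without being imposed explicitly.
assert np.isclose(C_same[0, 0], C_same[1, 1])
assert np.isclose(C_diff[0, 0], C_diff[1, 1])
```

The cross-covariance of the same-speaker class comes out much larger than that of the different-speaker class, which is exactly what separates the two regions in the pictures.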
However, in practice this does not really affect our results, even though the assumption is clearly inaccurate. This is a representation of what would happen if we were working in a one-dimensional i-vector space. Assuming that the mean is zero for the two distributions, which is essentially what we would recover if we center the i-vectors, we end up with a log-likelihood ratio which is just the ratio between two Gaussian distributions, and which is quadratic in the i-vector trial. Here you can see plots for two different synthetic i-vector sets, showing the level sets of the log-likelihood ratio as a function of the trial, and you can notice that we are essentially separating, with quadratic surfaces, the same-speaker region, which is this diagonal, from the rest of the points.

This behaves nicely; I will show you the results in a moment, but first I want to show the relationship between this model and the other state-of-the-art approaches, like PLDA and discriminative PLDA. This is the classical PLDA approach, in the simplified version where the full-rank channel factors are merged together with the residual noise, and we have a subspace for the speaker. If we take this model and try to jointly model the distribution of a pair of i-vectors, we can consider separately the case when the two i-vectors are from the same speaker and when they are from different speakers. In the first case, the latent variable for the speaker is shared, so we have only one speaker variable and we get this expression for the joint distribution of the trial, while in the case of a different-speaker trial we have a different speaker latent variable for each of the two i-vectors. Now, in standard PLDA all these variables are Gaussian distributed, so we can integrate over the speaker latent variables.
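In case it helps, the log-likelihood ratio just described, the ratio of two Gaussians with a shared (here zero) mean, can be sketched as follows; all parameter values are assumed toy numbers, not trained ones:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)                    # shared mean after centering
C_same = np.array([[1.0, 0.8],
                   [0.8, 1.0]])     # strong correlation within a speaker (toy)
C_diff = np.eye(2)                  # independence across speakers (toy)

def llr(trial):
    """Log-likelihood ratio of the two-Gaussian trial model."""
    return (multivariate_normal.logpdf(trial, mu, C_same)
            - multivariate_normal.logpdf(trial, mu, C_diff))

# A pair of similar i-vectors scores high, a discordant pair scores low;
# the level sets of this score are quadratic surfaces in trial space.
print(llr([1.0, 1.0]), llr([1.0, -1.0]))
```

Because both densities are Gaussian with a shared mean, the score is a quadratic function of the stacked trial vector, which is why the separating surfaces in the plots are quadratic.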
If we integrate them out, we end up with distributions for same-speaker pairs and for different-speaker pairs which are again Gaussian, and which have this form. Again we see that the two classes share the mean, and the two covariance matrices have a structure very similar to what I was showing before. So what PLDA is telling us here is that it estimates a model which is coherent with our two-Gaussian model assumptions, and the essential difference from our model is just the objective function that is optimized: PLDA optimizes the i-vector likelihood, while our two-Gaussian model optimizes the trial likelihood. Again, when we compute the log-likelihood ratio we end up with separation surfaces very similar to those of our two-Gaussian model in this one-dimensional i-vector space, and we will see that this is also reflected in the real i-vector space, since the two models perform pretty much the same.

Moving on to the relationship with the discriminative approach: this is the scoring function we used for the pairwise SVM, the function whose output enters the SVM loss, and it is formally equivalent to the log-likelihood ratio we have seen for our two-Gaussian model. Of course, it is also equivalent to the PLDA scoring function, since it was originally derived from that approach. So we can think of the SVM as a way to discriminatively train this matrix, which, in terms of the two-Gaussian model, is nothing else than the difference between the precision matrices of the two distributions. So again we have a model with the same kind of separation surfaces, and again the only difference is the objective function we are optimizing.
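To make that equivalence concrete: under the two-Gaussian model with a shared zero mean, the log-likelihood ratio reduces to a quadratic form whose matrix is the difference of the two precision (inverse covariance) matrices, which is the object the pairwise SVM trains directly. A small sketch with assumed toy parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
C_same = np.array([[1.0, 0.8], [0.8, 1.0]])   # toy same-speaker covariance
C_diff = np.eye(2)                            # toy different-speaker covariance

# Precision-matrix difference and the log-determinant constant.
Lam = np.linalg.inv(C_diff) - np.linalg.inv(C_same)
const = 0.5 * (np.linalg.slogdet(C_diff)[1] - np.linalg.slogdet(C_same)[1])

def llr(t):
    return (multivariate_normal.logpdf(t, mu, C_same)
            - multivariate_normal.logpdf(t, mu, C_diff))

def quad_score(t):
    """Same score written as the quadratic form the pairwise SVM trains."""
    t = np.asarray(t)
    return 0.5 * t @ Lam @ t + const

t = np.array([0.7, -0.3])
assert np.isclose(llr(t), quad_score(t))  # the two forms agree exactly
```

The SVM discriminatively trains `Lam` (and a bias), while the generative model obtains the same functional form from the two estimated covariances.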
So, to show some results for this first part: this was done on the NIST 2010 telephone condition, and I am essentially comparing PLDA with this two-Gaussian model. The first line is PLDA without dimensionality reduction, which is also known as the two-covariance model; essentially it means that I am taking a full-rank speaker space. In both cases I am doing length normalization. These two lines are the results of PLDA with a full-rank speaker space and of the two-Gaussian model trained by maximum likelihood in trial space, and as you can see they perform pretty much the same. Of course, the two-covariance model is faster to train, while the two-Gaussian model is even faster at test time; at test they have the same computational requirements.

The problem arises when we move to PLDA with a low-rank speaker subspace; in this case I was using a one-hundred-twenty-dimensional speaker subspace, while the i-vectors were four hundred-dimensional. We cannot directly apply this kind of dimensionality reduction to the two-Gaussian model, so we replaced it with a dimensionality reduction done by LDA projection, and that is good enough. So here we have PLDA with a reduced speaker subspace and the two-covariance model where the dimensionality reduction is done by LDA, and they perform, I would say, the same. Then, in this reduced one-hundred-twenty-dimensional i-vector space, we trained our Gaussian model on trials, and again it performs pretty much the same as the PLDA model. For comparison, these are the results we had with the discriminative model; the difference between all these models is that the discriminative model did not require length normalization.

So this means that we can build a generative model in trial space; it is very easy to do, and it works very well. So let's see if we can make things a little more complicated, and what happens then to training and testing.
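For the dimensionality-reduction step mentioned above, a standard LDA projection can be computed from the generalized eigenvectors of the between-class and within-class scatter matrices. Below is a minimal sketch on assumed synthetic data (real i-vectors would be around 400-dimensional, reduced to 120; the sizes here are toy values):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
dim, n_spk, n_per = 6, 8, 20          # toy sizes, not the real 400 -> 120
spk_means = 2.0 * rng.normal(size=(n_spk, dim))
X = np.vstack([m + rng.normal(size=(n_per, dim)) for m in spk_means])
labels = np.repeat(np.arange(n_spk), n_per)

# Between-class (Sb) and within-class (Sw) scatter matrices.
mu = X.mean(axis=0)
Sw = np.zeros((dim, dim))
Sb = np.zeros((dim, dim))
for s in range(n_spk):
    Xs = X[labels == s]
    ms = Xs.mean(axis=0)
    Sw += (Xs - ms).T @ (Xs - ms)
    Sb += len(Xs) * np.outer(ms - mu, ms - mu)

# Generalized eigenproblem Sb v = lambda Sw v; keep leading directions.
vals, vecs = eigh(Sb, Sw)
W = vecs[:, ::-1][:, :3]              # top-3 LDA directions (toy choice)
X_red = (X - mu) @ W                  # reduced i-vectors for trial modeling
```

The two-Gaussian trial model is then trained on pairs of these reduced i-vectors, playing the role that the low-rank speaker subspace plays inside PLDA.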
To complicate things, we did something similar to what Patrick Kenny did with heavy-tailed PLDA: we said, let's replace the Gaussian distributions with t distributions and see what happens. It turns out that training can still be done using an EM algorithm, although it is not as fast; it becomes more or less as computationally expensive as the discriminative approach. But the good thing is that at test time we can still use closed-form integration, and our log-likelihood ratio becomes simply the ratio between two Student's t distributions. So at testing time this model is as fast as PLDA or the two-Gaussian model I showed before. As with heavy-tailed PLDA, we do not need length normalization if we use these heavy-tailed distributions. Of course, the separation surfaces are slightly more complex, because we no longer have quadratic separation surfaces, but these kinds of shapes instead. As for the results, we managed to get more or less the same results as the Gaussian model, but without length normalization, which is, I would say, in line with the findings about heavy-tailed PLDA. Again, the difference with respect to heavy-tailed PLDA is that this model is more expensive in training, but in testing it is as fast as all the others.

To summarize what we get here: we can use a very simple Gaussian classifier in trial space, which can be trained very easily; despite the fact that we make an incorrect assumption about trial independence, it still works very well; and it turns out that this model is quite easy to extend to handle more complicated distributions. While with PLDA, for example, just moving to heavy-tailed distributions makes the model very difficult to train and test, here we can use, for example, Student's t distributions with almost no hassle.
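The heavy-tailed variant keeps the same closed-form scoring: the log-likelihood ratio becomes the log-ratio of two multivariate Student's t densities. A sketch with assumed toy parameters, using SciPy's `multivariate_t` (available in SciPy 1.6 and later):

```python
import numpy as np
from scipy.stats import multivariate_t

mu = np.zeros(2)
S_same = np.array([[1.0, 0.8],
                   [0.8, 1.0]])   # toy within-speaker shape matrix
S_diff = np.eye(2)                # toy across-speaker shape matrix
df = 5.0                          # low degrees of freedom -> heavy tails

def llr_t(trial):
    """Log-likelihood ratio between two Student's t densities."""
    return (multivariate_t.logpdf(trial, loc=mu, shape=S_same, df=df)
            - multivariate_t.logpdf(trial, loc=mu, shape=S_diff, df=df))

# Unlike the Gaussian case, this score is no longer quadratic in the
# trial, so the separation surfaces are more complex shapes.
print(llr_t([1.0, 1.0]), llr_t([1.0, -1.0]))
```

Training the shape matrices and degrees of freedom is where the EM algorithm comes in; only the scoring step is shown here.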
From here, we hope to find better ways to model the trial distribution in trial space that still allow fast scoring without incurring too big problems in training.

[Question about the degrees of freedom used for the t distributions.] I don't remember exactly, but it was something like five or six, something like that. I remember the value we had at the workshop, but there was a bug; this is after it was fixed. [Question about the data.] Yes, telephone speech rather than microphone, just telephone. Well, I did try something on microphone data; the results were slightly worse than PLDA, but not that different. Anyway, I didn't try the heavy-tailed version on it yet; I think it might run into problems without length normalization, that was my expectation, but I didn't really try it on microphone data.

[Comment from the audience.] I have a comment which may be relevant. I also used an EM algorithm to estimate the heavy-tailed parameters; for example, in the paper I presented on Monday I was using a t distribution in score space, with an EM algorithm to estimate the parameters. I would generate synthetic data where I knew what the degrees of freedom were, and then I tried to recover them using the EM algorithm, and that was very frustrating: I just could not recover the same degrees of freedom. Then I switched from using an EM algorithm to direct optimization of the likelihood, I think with BFGS, and that was much better at recovering the degrees of freedom.

[Reply.] Okay. For the synthetic examples here, I was generating data with the heavy-tailed distribution and I was getting more or less the same estimates. But I had a similar problem when I was trying to do something similar to what you did for calibration, with non-Gaussian, skewed distributions and those kinds of things, and I realized that EM was
not that good; I was doing it numerically and it was working, but still. So maybe I was lucky with the t distributions. I think if the data is really heavy-tailed it works, but if it's not, it doesn't: probably if the degrees of freedom are low you can recover them, but if they are around ten or twenty then you can't recover them anymore. [Moderator.] Any other question? Okay, let's thank the speaker again.