0:00:15 | okay so my talk is about a generative pairwise model for speaker recognition

0:00:21 | as some of you may know i've been working quite a lot with

0:00:25 | discriminative models

0:00:28 | for i-vector classification and in particular i've been working

0:00:33 | mostly with discriminative models able to directly classify i-vector

0:00:39 | trials as belonging to the same-speaker or different-speaker classes

0:00:44 | these discriminative models were first introduced as a way to discriminatively train plda parameters

0:00:51 | and then

0:00:53 | we got some interpretation of these models as discriminative

0:00:59 | training of the parameters of a second-order taylor expansion of the log-likelihood ratio

0:01:05 | so i've been working mostly in trial space and here the idea was to go back

0:01:11 | from discriminative to generative while remaining in trial space so the question was

0:01:17 | whether it would be possible to train a generative model in trial space

0:01:21 | and how well it would behave

0:01:23 | it turns out that it's very easy to do

0:01:25 | in practice and it works pretty well

0:01:28 | i would say more or less like all the other state-of-the-art

0:01:31 | models

0:01:33 | so in this talk i will show you how

0:01:36 | we define this model which is a very simple model that

0:01:41 | employs two gaussian distributions to model trials and then i will show the relationship of

0:01:46 | this model with plda and

0:01:48 | with the discriminative plda the pairwise svm approach

0:01:51 | and then i will also show how this model can be very easily extended to

0:01:55 | handle more

0:01:57 | complicated distributions in particular i will work with

0:02:00 | heavy-tailed distributions following the work from patrick kenny about heavy-tailed

0:02:04 | plda

0:02:06 | so first the trial space

0:02:08 | to define a trial we take two i-vectors

0:02:12 | we stack them together and we get our definition of a trial

0:02:16 | here i have a couple of pictures which show what would happen if we were

0:02:20 | working with one-dimensional i-vectors so on the

0:02:24 | left here i have one-dimensional i-vectors which are the black dots and

0:02:30 | on the right taking all cross pairs of i-vectors

0:02:34 | you can see that there is a well-defined region where

0:02:38 | i-vectors belonging to the same speaker are

0:02:40 | and

0:02:41 | it is quite well separated from the region where the

0:02:46 | pairs coming from different speakers are

0:02:49 | so with the discriminative training we try to discriminatively train surfaces to separate

0:02:56 | these regions

0:02:57 | and now i'm going to try to build a generative model to describe

0:03:01 | these two sets of points

0:03:05 | so the easiest generative model we can think of okay we have a two-class

0:03:10 | problem so it's a binary problem we can assume that

0:03:14 | the trials are observations and that they can be modeled by gaussian distributions

0:03:19 | so we would have a gaussian distribution describing

0:03:22 | the

0:03:23 | trials which belong to the same-speaker class

0:03:26 | and one for the trials which belong to the different-speaker class

0:03:29 | each of them would have its own parameters

0:03:32 | and for symmetry reasons we will assume that the mean of the two

0:03:36 | distributions is the same

0:03:39 | so

0:03:41 | reasoning about

0:03:42 | the symmetry of the trial that is

0:03:45 | if we take a pair of i-vectors we can stack them in two ways we

0:03:48 | can take enrollment and test first or vice versa but we don't want to give

0:03:52 | any

0:03:54 | any particular order to the vectors so we want a

0:03:58 | generative model which treats

0:04:00 | both versions of the trial in the same way

0:04:03 | this imposes some constraints on the covariance matrices which are

0:04:08 | described here that is

0:04:12 | we have that these two blocks must be the same as well

0:04:15 | as these two and the same for the other distribution

0:04:18 | in practice when working with all pairs from a single i-vector dataset we

0:04:24 | don't even need to impose this constraint because it arises naturally during training

0:04:29 | so how can we train these models

0:04:31 | we

0:04:32 | did the simplest thing we can think of we did it by maximum likelihood

0:04:36 | assuming that i-vector trials are independent

0:04:39 | of course i-vector trials are not independent because they are all pairs that we

0:04:44 | build from a single i-vector set

0:04:47 | however in practice this does not really affect our results even though the assumption is

0:04:53 | very inaccurate
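The all-pairs maximum-likelihood training just described can be sketched in a few lines. This is an illustrative sketch, not the speaker's code; the function and variable names are invented. It stacks every ordered pair of i-vectors into a trial, splits trials into same-speaker and different-speaker sets, and fits one Gaussian per class with a shared mean, treating trials as independent exactly as the talk admits is inaccurate.

```python
import numpy as np

def fit_two_gaussian_trial_model(ivecs, labels):
    # Build every ordered pair (i, j), i != j, as a stacked trial; using
    # both orderings makes the covariance symmetry arise naturally.
    same, diff = [], []
    n = len(ivecs)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            trial = np.concatenate([ivecs[i], ivecs[j]])
            (same if labels[i] == labels[j] else diff).append(trial)
    same, diff = np.asarray(same), np.asarray(diff)
    # Shared mean for the two classes (the symmetry assumption in the talk),
    # then maximum-likelihood covariances, pretending trials are independent.
    mu = np.vstack([same, diff]).mean(axis=0)
    cov_same = (same - mu).T @ (same - mu) / len(same)
    cov_diff = (diff - mu).T @ (diff - mu) / len(diff)
    return mu, cov_same, cov_diff
```

Because both orderings of each pair are included, the fitted covariances come out with the swap-symmetric block structure without imposing it explicitly, matching the remark above.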

0:04:56 | so this is a representation of what would happen

0:05:01 | if we were working in a one-dimensional space so i'm assuming that the mean is zero

0:05:06 | for the two distributions which is

0:05:08 | essentially what we would recover if we center the i-vectors we would end up with

0:05:12 | a log-likelihood ratio which is just the ratio between two gaussian distributions and which

0:05:17 | is quadratic in the i-vectors in the i-vector trial space
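The scoring rule described here is simply a difference of two Gaussian log-densities over the stacked trial, and is therefore quadratic in the trial. A minimal sketch (my illustration, assuming `scipy` is available; names are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

def trial_llr(trial, mu, cov_same, cov_diff):
    # Log-likelihood ratio between the same-speaker and different-speaker
    # Gaussians over the stacked trial; a quadratic function of the trial.
    return (multivariate_normal(mu, cov_same).logpdf(trial)
            - multivariate_normal(mu, cov_diff).logpdf(trial))
```

With a positively correlated same-speaker covariance, trials near the diagonal score higher than trials off it, which is exactly the quadratic separation the plots show.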

0:05:23 | here you can see two plots for two different sets of synthetic

0:05:26 | one-dimensional i-vectors where you can see the

0:05:30 | level sets of the log-likelihood ratio as a function of the trial

0:05:34 | and you can notice that essentially we are separating with quadratic surfaces the same-speaker

0:05:40 | area which is

0:05:43 | this diagonal from the rest of the

0:05:46 | of the

0:05:48 | points

0:05:49 | so

0:05:51 | this behaves nicely i will show you the results in a moment but first i want

0:05:55 | to show you the relationship between this model and

0:05:59 | the other state-of-the-art approaches like plda and the discriminative plda

0:06:04 | so this is the classical plda approach the simplified version where we have full

0:06:09 | rank

0:06:10 | channel factors merged together with the residual noise

0:06:14 | and we have a subspace for the speaker the speaker space

0:06:20 | so if we take this model and try to jointly model the distribution of a pair of

0:06:24 | i-vectors we

0:06:26 | can consider separately the case when the two i-vectors are from the same speaker

0:06:31 | and

0:06:32 | when they are from different speakers in the first

0:06:35 | case we would have that the speaker variable

0:06:38 | that is the

0:06:39 | latent variable for the speaker would be shared so we would have only one speaker

0:06:44 | and we would have this expression for the joint distribution

0:06:46 | of the trial

0:06:48 | while in the case of a different-speaker trial we would have one different speaker latent

0:06:52 | variable for each of the two i-vectors

0:06:56 | now with standard plda all these variables are gaussian distributed so we

0:07:01 | can integrate over the speaker

0:07:04 | latent variables and if we integrate

0:07:06 | we end up with a distribution for same-speaker pairs and different-speaker pairs which

0:07:11 | is again gaussian

0:07:12 | and which has this form so again we see that the two distributions share the mean

0:07:16 | and the two covariance matrices

0:07:18 | have a structure very similar to what i was showing before
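To make the structural claim concrete, here is a sketch (my notation, assuming the simplified PLDA model x = m + V y + e with speaker loading V and full-rank residual covariance W) of the trial covariances obtained after integrating out the speaker variable: the cross-covariance block is V Vᵀ for a same-speaker pair and zero for a different-speaker pair, with the same diagonal blocks in both cases.

```python
import numpy as np

def plda_trial_covariances(V, W):
    # Between-speaker covariance and total covariance of a single i-vector.
    B = V @ V.T
    T = B + W
    # Same-speaker pair: the shared latent speaker couples the two i-vectors.
    cov_same = np.block([[T, B], [B, T]])
    # Different-speaker pair: independent speakers, block-diagonal covariance.
    Z = np.zeros_like(B)
    cov_diff = np.block([[T, Z], [Z, T]])
    return cov_same, cov_diff
```

Both matrices have the swap-symmetric block structure of the two-Gaussian model, which is the point being made: PLDA induces a model of exactly that family, trained with a different objective.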

0:07:23 | so in practice what plda is telling us here is that plda is estimating

0:07:28 | a

0:07:30 | model which is coherent with our assumptions with our two-

0:07:35 | gaussian model assumptions

0:07:36 | and it essentially differs from our model just in the

0:07:40 | objective function that is optimized here we are optimizing the i-vector likelihood while

0:07:45 | in our two-gaussian model we are optimizing the trial likelihood

0:07:51 | so again performing the same analysis as before

0:07:55 | we find

0:07:56 | that when we compute the log-likelihood ratio we end up with separation surfaces

0:08:00 | very similar to those of our two-gaussian model in this one-dimensional i-vector space

0:08:05 | and we will see that

0:08:07 | this is also reflected in the real i-vector space in the sense that the two models perform pretty

0:08:12 | much the same

0:08:15 | so moving to the

0:08:18 | relationship with the discriminative approach

0:08:22 | this is the scoring function we used for the pairwise svm

0:08:27 | so this was the scoring function which

0:08:31 | corresponds to which is the scoring function we used to compute the loss of

0:08:35 | the svm

0:08:37 | and this scoring function is actually formally equivalent to the

0:08:43 | log-likelihood ratio scoring function we've seen for our two-gaussian model

0:08:47 | and of course this is also equivalent to the plda scoring function as it

0:08:51 | was first derived from

0:08:53 | that approach

0:08:55 | so we can think about the svm as a way to discriminatively train this

0:09:01 | matrix

0:09:03 | which if we think about it in the two-gaussian model is nothing else than

0:09:07 | the difference between the precision matrices of the two distributions

0:09:11 | so again we have a model which has the

0:09:14 | same kind of separation surfaces

0:09:17 | and again the only difference is the objective function we are optimizing
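The equivalence claimed here can be checked numerically: scoring with the quadratic form whose matrix is the difference of the two precision matrices reproduces the two-Gaussian log-likelihood ratio up to an additive constant. Again a sketch with invented names, not the paper's implementation:

```python
import numpy as np

def precision_difference_score(trial, mu, cov_same, cov_diff):
    # The matrix the pairwise SVM trains discriminatively is, in the
    # two-Gaussian view, the difference of the two precision matrices;
    # this quadratic form equals the LLR up to an additive constant.
    P = np.linalg.inv(cov_same) - np.linalg.inv(cov_diff)
    d = np.asarray(trial, float) - mu
    return -0.5 * d @ P @ d
```

The constant (from the log-determinants of the two covariances) does not depend on the trial, so score differences between trials match LLR differences exactly.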

0:09:23 | so let's see some results for this first part

0:09:27 | okay this was done on the nist 2010 telephone condition

0:09:33 | and i'm comparing essentially plda with this

0:09:35 | two-gaussian model

0:09:37 | so the first line is plda without dimensionality reduction which is also

0:09:42 | known as the two-covariance model and

0:09:45 | essentially here it means that i'm taking a full-rank

0:09:49 | speaker space

0:09:50 | in both cases i'm doing length normalization and these two lines are the results

0:09:55 | of the plda with full-rank speaker space and the two-gaussian model trained by

0:10:00 | maximum likelihood in the trial space

0:10:03 | and as you can see they perform pretty much the same

0:10:07 | while of course the two-covariance model is fast to train and

0:10:12 | this two-gaussian model is even faster and at test time they have

0:10:17 | the same computational requirements

0:10:21 | the problem is when we move to plda with

0:10:26 | a

0:10:28 | low-rank speaker subspace in this case i was using a 120-dimensional speaker

0:10:33 | subspace while the i-vectors were 400-dimensional

0:10:36 | we cannot directly apply this

0:10:39 | dimensionality reduction to the two-gaussian model so

0:10:44 | we

0:10:45 | replaced it with dimensionality reduction done by lda projection
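A minimal sketch of the LDA projection used as the dimensionality-reduction step (my own implementation for illustration, not the one from the paper): solve the generalized eigenproblem on between- and within-class scatter and keep the leading directions.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, dim):
    # Within-class and between-class scatter matrices.
    X = np.asarray(X, float)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in set(labels):
        Xc = X[np.asarray(labels) == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # Generalized eigenproblem Sb v = l Sw v; eigh returns ascending
    # eigenvalues, so keep the last `dim` eigenvectors (a small ridge on
    # Sw guards against singularity).
    w, v = eigh(Sb, Sw + 1e-6 * np.eye(X.shape[1]))
    return v[:, ::-1][:, :dim]
```

Projecting the i-vectors with this matrix before fitting the two-Gaussian trial model plays the role of PLDA's low-rank speaker subspace, as described above.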

0:10:50 | and that's good enough so here we have plda with the reduced speaker

0:10:54 | subspace and the two-covariance model where

0:10:57 | the dimensionality reduction is done by lda and they perform

0:11:01 | i would say the same

0:11:03 | and then in this

0:11:04 | reduced 120-dimensional i-vector space we trained our

0:11:09 | gaussian model on trials and it performs again pretty much the same as the

0:11:13 | plda model

0:11:15 | for comparison these are the results we had with the discriminative model

0:11:19 | the difference between all these models is that the discriminative model didn't require

0:11:24 | length normalization

0:11:26 | so this means that we can

0:11:29 | build our generative model in trial space it's very easy to do actually and it

0:11:33 | works very well so let's see if we can

0:11:37 | make things a little more complicated and how hard training and testing become so

0:11:43 | to complicate things

0:11:45 | we

0:11:46 | did something similar to what patrick kenny did with heavy-tailed

0:11:51 | plda we said okay let's replace

0:11:53 | the gaussian distributions with

0:11:55 | t distributions and see what happens

0:11:58 | so it turns out that training can still be done using an em algorithm

0:12:03 | although it's not that fast it becomes more or less as computationally expensive as the

0:12:09 | discriminative approach

0:12:10 | but the good thing is that in test we can perform closed

0:12:13 | sorry

0:12:14 | we can use closed-form integration

0:12:17 | and so our log-likelihood ratio becomes simply the ratio between two student's t

0:12:22 | distributions

0:12:23 | so at testing time this thing is as fast as

0:12:28 | plda or the two-gaussian model i've shown before
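Given the fitted parameters, the heavy-tailed score is again a closed-form density ratio. A sketch using `scipy.stats.multivariate_t` (available in SciPy 1.6 and later; the parameter names here are my own, not the paper's):

```python
from scipy.stats import multivariate_t

def heavy_tailed_llr(trial, mu, shape_same, df_same, shape_diff, df_diff):
    # Log-likelihood ratio between two multivariate Student's t
    # distributions over the stacked trial: the closed-form test-time
    # score of the heavy-tailed variant.
    return (multivariate_t(mu, shape_same, df=df_same).logpdf(trial)
            - multivariate_t(mu, shape_diff, df=df_diff).logpdf(trial))
```

Evaluating two log-densities per trial is why scoring stays as cheap as in the Gaussian case, even though training the t parameters is much more expensive.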

0:12:33 | and also

0:12:35 | as i said yes as with heavy-tailed plda we don't need

0:12:39 | length normalization if we use these heavy-tailed distributions

0:12:45 | of course the separation surfaces are slightly more complex because we don't have anymore

0:12:51 | quadratic separation surfaces but

0:12:53 | we have this kind of

0:12:56 | surfaces

0:12:58 | and as for the results

0:13:00 | what happens is that we managed to get more or less the same results as

0:13:04 | the gaussian model but without

0:13:06 | the length normalization which is

0:13:09 | i would say in line with

0:13:10 | the findings about heavy-tailed

0:13:12 | plda

0:13:14 | and again for this model

0:13:16 | what's different with respect to heavy-tailed plda is that this model is more expensive

0:13:20 | in training but in testing it's as fast as all the others

0:13:25 | so

0:13:28 | to summarize what we get here

0:13:31 | we get that we can use a very simple gaussian classifier in the trial

0:13:38 | space which can be very easily trained and

0:13:40 | despite the

0:13:41 | fact that we make

0:13:44 | incorrect assumptions about trial independence it still works very well

0:13:49 | and it turns out that this model is quite easy to extend to handle

0:13:53 | more complicated distributions

0:13:55 | so while with plda for example just switching to the heavy-tailed

0:13:59 | distribution makes it

0:14:01 | very difficult to train and test the model here we can use

0:14:06 | for example the student's t distributions without

0:14:09 | almost any hassle

0:14:12 | so from here we hope to be able to find some better ways to model

0:14:16 | the trial distribution in trial space which will still allow us

0:14:21 | to have

0:14:23 | fast solutions for scoring without incurring

0:14:25 | too big

0:14:27 | problems in training

0:14:30 | and that's all thanks

0:14:45 | the first question |

0:14:47 | the degrees of freedom what was it in that case

0:14:50 | yes |

0:14:53 | ah i don't remember exactly but it was something like five or six

0:14:58 | maybe something like that

0:15:01 | i remember we had another value in the first version but that

0:15:04 | had a bug

0:15:06 | which was

0:15:08 | then fixed

0:15:21 | this was telephone speech yes telephone speech rather than microphone

0:15:25 | well just telephone i didn't try microphone well i tried something on microphone data

0:15:31 | it works slightly worse than plda but it's not that different anyway i didn't

0:15:36 | try the heavy-tailed version yet

0:15:39 | i think it might run into problems without length normalization

0:15:45 | that was my expectation

0:15:47 | i didn't really try to run it on the microphone data

0:16:00 | i have a comment

0:16:02 | which may be related to that

0:16:05 | i also used an em algorithm to estimate the heavy-tailed parameters

0:16:11 | and

0:16:12 | for example in the paper that i presented on monday i was using a t

0:16:17 | distribution in score space

0:16:19 | and i used an em algorithm to

0:16:23 | to help me estimate the parameters and i found

0:16:26 | that

0:16:27 | i would generate synthetic data where i knew what the degrees of freedom would be

0:16:34 | and then i tried to recover that

0:16:37 | using the em algorithm and that was very frustrating i just

0:16:41 | could not recover

0:16:44 | the same

0:16:45 | degrees of freedom and then i switched from using an em algorithm to using direct

0:16:51 | optimisation i think it was bfgs

0:16:54 | of

0:16:55 | the likelihood

0:16:57 | and that was much better at recovering the degrees of freedom
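The recovery experiment described in this exchange can be reproduced in a few lines. This is my sketch, not the commenter's code; I use a bounded one-dimensional search rather than BFGS proper, since only the degrees of freedom is being estimated here (location and scale are assumed known).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import t as student_t

def fit_df_ml(x, loc=0.0, scale=1.0):
    # Direct maximization of the Student's t log-likelihood over the
    # degrees of freedom, as an alternative to EM.
    nll = lambda df: -student_t.logpdf(x, df, loc=loc, scale=scale).sum()
    return minimize_scalar(nll, bounds=(0.1, 200.0), method='bounded').x
```

Consistent with the remark that follows: for low degrees of freedom the likelihood is informative enough to recover the value, while for large ones the t density is nearly Gaussian and the estimate becomes unstable.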

0:17:03 | okay well for these synthetic models here i was generating them

0:17:08 | with the heavy-tailed distribution and i was getting more or less the same estimates

0:17:13 | but i had a similar problem when i was trying to

0:17:17 | do something similar to what you did for calibration with

0:17:21 | non-gaussian distributions like skewed distributions and those kinds of things and i realised

0:17:25 | that

0:17:26 | em there was not working well i was doing it numerically and that was working but

0:17:32 | so maybe i was lucky with the t distributions

0:17:36 | i think to qualify that if it's really heavy-tailed then it works but if it's not

0:17:41 | then it doesn't so probably if the degrees of freedom is low you can recover it but if it's

0:17:46 | like around ten or twenty then you can't recover it anymore

0:17:53 | one more question

0:17:57 | okay let's thank the speaker again