0:00:15 Okay, so my talk will be about generative pairwise models for speaker recognition. As some of you may know, I have been working for quite a while with discriminative models for i-vector classification, and in particular I have been working mostly with discriminative models able to directly classify pairs of i-vectors, that is, i-vector trials, as belonging to the same-speaker or different-speaker classes.
0:00:44 These discriminative models were first introduced as a way to discriminatively train PLDA parameters, and then we gave some interpretation of these models as discriminative training of the parameters of a second-order Taylor expansion of the log-likelihood ratio.
0:01:05 So I have been working mostly in trial space. Here the idea was to go back from discriminative to generative, but remaining in trial space, so the question was whether it would be possible to train a generative model in trial space, and how well it would behave. It turns out that it is very easy to do in practice, and it works pretty well, I would say more or less like all the other state-of-the-art models.
0:01:33 So in this talk I will show you how we define this model, which is a very simple model that employs two Gaussian distributions to model trials; then I will show the relationship of this model with PLDA and with the discriminative pairwise SVM approach; and then I will also show how this model can be very easily extended to handle more complicated distributions. In particular I will work with heavy-tailed distributions, following the work from Patrick Kenny on heavy-tailed PLDA.
0:02:06 So, trial space. To define a trial we take two i-vectors, we stack them together, and we get our definition of a trial.
0:02:16 Here I have a couple of pictures which show what would happen if we were working with one-dimensional i-vectors. On the left I have one-dimensional i-vectors, which are the black dots, and on the right, taking all cross pairs of i-vectors, we can see that there is a well-defined region where the pairs of i-vectors belonging to the same speaker are, and which is quite well separated from the region where the pairs coming from different speakers are. So with the discriminative training we tried to discriminatively train separation surfaces to separate these regions, and now I am going to try to build a generative model to describe these two sets of points.
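As a small illustration of this trial construction, here is how all cross-pairs could be stacked from a set of i-vectors (a sketch in NumPy; the function and variable names are mine, not from the talk):

    import numpy as np

    def build_trials(X, spk):
        # X: (N, D) i-vectors; spk: (N,) speaker labels.
        # Returns all N*N stacked pairs and a same-speaker flag per trial.
        N = X.shape[0]
        i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
        trials = np.hstack([X[i.ravel()], X[j.ravel()]])   # [phi_i ; phi_j]
        same = spk[i.ravel()] == spk[j.ravel()]
        return trials, same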
0:03:05 So, the easiest generative model we can think of: we have a two-class problem, so it is a binary problem, and we can assume that the trials are observations that can be modeled by Gaussian distributions. We would have one Gaussian distribution describing the trials which belong to the same-speaker class and one for the trials which belong to the different-speaker class. Each of them would have its own parameters, and for symmetry reasons we will assume that the mean of the two distributions is the same.
0:03:39 So, reasoning about the symmetry of the trial: if we take a pair of i-vectors we can stack them in two ways, we can take enrollment and test first, or vice versa, but we do not want to give any particular order to the vectors, so we want a generative model which treats both versions of the trial in the same way. This imposes some constraints on the covariance matrices, which are described here: these two blocks would have to be the same, as well as these two, and the same holds for the other distribution. In practice, when working with all the pairs from a single i-vector dataset, we do not even need to impose this constraint, because it arises naturally during training.
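In symbols, the constraint being described amounts to a tied mean and a symmetric block structure for each class (the block notation here is mine, chosen for illustration; phi_1 and phi_2 are the two stacked i-vectors):

    \varphi = \begin{bmatrix}\phi_1\\ \phi_2\end{bmatrix},\qquad
    \varphi \mid \mathcal{H}_s \sim \mathcal{N}\!\left(
      \begin{bmatrix}\mu\\ \mu\end{bmatrix},
      \begin{bmatrix}A_s & B_s\\ B_s & A_s\end{bmatrix}\right),\qquad
    \varphi \mid \mathcal{H}_d \sim \mathcal{N}\!\left(
      \begin{bmatrix}\mu\\ \mu\end{bmatrix},
      \begin{bmatrix}A_d & B_d\\ B_d & A_d\end{bmatrix}\right),
    \qquad B_s = B_s^{T},\; B_d = B_d^{T}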
0:04:29 So how can we train this? Just the simplest thing we can think of: we did it by maximum likelihood, assuming that i-vector trials are independent. Of course i-vector trials are not independent, because they are all the pairs that we can build from a single i-vector set; however, in practice this does not really affect our results, even though the assumption is not very accurate.
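Under that (knowingly inaccurate) independence assumption, maximum-likelihood training reduces to empirical moments of the two sets of trials. A minimal sketch, assuming the trial matrix and same-speaker mask from the earlier snippet (names are mine):

    def train_two_gaussian(trials, same):
        # Tied mean; for all cross-pairs this is just the i-vector mean stacked twice.
        mu = trials.mean(axis=0)
        Zs = trials[same] - mu
        Zd = trials[~same] - mu
        C_same = Zs.T @ Zs / Zs.shape[0]   # same-speaker covariance
        C_diff = Zd.T @ Zd / Zd.shape[0]   # different-speaker covariance
        return mu, C_same, C_diff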
0:04:56 So this is a representation of what would happen if we were working in a one-dimensional space. I am assuming that the mean is zero for the two distributions, which is essentially what we would recover if we center the i-vectors. We end up with a log-likelihood ratio which is just the ratio between two Gaussian distributions, which is a quadratic form in the i-vector trial space. You can see two plots for two different sets of synthetic i-vectors, where you can see the level sets of the log-likelihood ratio as a function of the trial, and you can notice that essentially we are separating, with quadratic surfaces, the same-speaker area, which is this diagonal, from the rest of the points.
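With centered i-vectors, the score of the two-Gaussian trial model is the log-ratio of the two densities, which takes the quadratic form below, where Sigma_s and Sigma_d denote the full same-speaker and different-speaker trial covariances from above:

    s(\varphi)
      = \log \frac{\mathcal{N}(\varphi;\, 0,\, \Sigma_s)}{\mathcal{N}(\varphi;\, 0,\, \Sigma_d)}
      = \tfrac{1}{2}\,\varphi^{T}\!\left(\Sigma_d^{-1}-\Sigma_s^{-1}\right)\varphi
        \;+\; \tfrac{1}{2}\log\frac{|\Sigma_d|}{|\Sigma_s|}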
0:05:49 So this behaves nicely; I will show you the results in a moment, but first I want to show you the relationship between this model and the other state-of-the-art approaches, like PLDA and the discriminative PLDA. This is the classical PLDA approach, in the simplified version where we have full-rank channel factors merged together with the residual noise, and we have a subspace for the speaker space.
0:06:20 If we take this model and try to jointly model the distribution of a pair of i-vectors, we can consider separately the case when the two i-vectors are from the same speaker and when they are from different speakers. In the first case the latent variable for the speaker would be shared, so we would have only one speaker variable, and we would have this expression for the joint distribution of the trial, while in the case of a different-speaker trial we would have one speaker latent variable for each of the two i-vectors.
0:06:56 Now, with standard PLDA all these distributions are Gaussian, so we can integrate over the speaker latent variables, and if we do we end up with distributions for same-speaker pairs and different-speaker pairs which are again Gaussian and which have this form. Again we see that the two distributions share the mean, and have covariance matrices with a structure very similar to what I was showing before.
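For reference, with an illustrative PLDA parameterisation phi = mu + U y + epsilon, where y ~ N(0, I) and epsilon ~ N(0, Sigma_e), integrating out the speaker variables gives the two Gaussian trial distributions:

    \varphi \mid \mathcal{H}_s \sim \mathcal{N}\!\left(
      \begin{bmatrix}\mu\\ \mu\end{bmatrix},
      \begin{bmatrix} UU^{T}+\Sigma_{e} & UU^{T}\\ UU^{T} & UU^{T}+\Sigma_{e}\end{bmatrix}\right),
    \qquad
    \varphi \mid \mathcal{H}_d \sim \mathcal{N}\!\left(
      \begin{bmatrix}\mu\\ \mu\end{bmatrix},
      \begin{bmatrix} UU^{T}+\Sigma_{e} & 0\\ 0 & UU^{T}+\Sigma_{e}\end{bmatrix}\right)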
0:07:23 So in practice what this is telling us is that PLDA is estimating a model which is consistent with our two-Gaussian model assumptions, and it essentially differs from our model just in the objective function that is optimised: here we are optimising for i-vector likelihood, while in our two-Gaussian model we are optimising for trial likelihood.
0:07:51 So again, going back to the one-dimensional example, when we compute the log-likelihood ratio we end up with separation surfaces which are very similar to those of our two-Gaussian model in this one-dimensional i-vector space, and we will see that this also holds in the real i-vector space, in the sense that the two models perform pretty much the same.
0:08:15 So, moving to the relationship with the discriminative approach: this is the scoring function we used for the pairwise SVM, that is, the scoring function used to compute the loss of the SVM, and this scoring function is actually formally equivalent to the log-likelihood ratio we have seen for our two-Gaussian model. Of course it is also equivalent to the PLDA scoring function, as it was originally derived from that approach.
0:08:55 Or we can think of the SVM as a way to discriminatively train this matrix, which, if we think about it in terms of the two-Gaussian model, is nothing else than the difference between the precision matrices of the two distributions. So again we have a model with the same kind of separation surfaces, and again the only difference is the objective function we are optimising.
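In the notation used above for the two-Gaussian model (with centered i-vectors, as in the earlier formula), the quadratic term of that score is governed by:

    W \;=\; \tfrac{1}{2}\left(\Sigma_d^{-1} - \Sigma_s^{-1}\right),
    \qquad
    s(\varphi) \;=\; \varphi^{T} W \varphi \;+\; \text{const.}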
0:09:23 So, to see some results for this first part: this was done on the NIST 2010 telephone condition, and I am comparing essentially PLDA with this two-Gaussian model. The first line is PLDA without dimensionality reduction, which is also known as the two-covariance model; essentially it means that I am taking a full-rank speaker space. In both cases I am doing length normalization, and these two lines are the results of PLDA with a full-rank speaker space and of the two-Gaussian model trained by maximum likelihood in trial space. As you can see they perform pretty much the same; of course the two-covariance model is faster to train than this Gaussian model, while at test time they have the same computational requirements.
0:10:21 The problem is when we move to PLDA with a low-rank speaker subspace; in this case I used a 120-dimensional speaker subspace, while the i-vectors were 400-dimensional. We cannot directly apply this kind of dimensionality reduction to the two-Gaussian model, so we replaced it with dimensionality reduction done by LDA projection, and that is good enough: here we have PLDA with a reduced-rank speaker subspace and the two-covariance model where the dimensionality reduction is done by LDA, and they perform, I would say, the same. Then, in this reduced 120-dimensional i-vector space, we trained our Gaussian model on trials, and again it performs pretty much the same as the PLDA model.
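A minimal sketch of that preprocessing step, assuming scikit-learn for the LDA projection and toy data in place of real i-vectors (all names and values here are illustrative):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Toy stand-ins: 400-dim "i-vectors" for 200 speakers, 10 segments each.
    rng = np.random.default_rng(0)
    spk = np.repeat(np.arange(200), 10)
    X = rng.normal(size=(2000, 400)) + 0.05 * spk[:, None]

    # LDA projection to 120 dimensions (the size used in the talk), then length norm.
    lda = LinearDiscriminantAnalysis(n_components=120)
    X_red = lda.fit_transform(X, spk)
    X_red /= np.linalg.norm(X_red, axis=1, keepdims=True)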
0:11:15 For comparison, these are the results we had with the discriminative model; the difference with respect to all these models is that the discriminative model did not require length normalization. So this means that we can build a generative model in trial space, it is very easy to do, and it works very well. So let's see if we can make things a little more complicated, and see how hard training and testing become. To complicate things we did something similar to what Patrick Kenny did with his heavy-tailed PLDA: we said, okay, let's replace the Gaussian distributions with t distributions and see what happens.
0:11:58 It turns out that training can still be done using an EM algorithm, although it is not that fast: it becomes more or less as computationally expensive as the discriminative approach. But the good thing is that at test time we can use closed-form integration, and our log-likelihood ratio becomes simply the ratio between two Student's t distributions, so at testing time this thing is as fast as PLDA or the two-Gaussian model I have shown before.
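A rough sketch of that closed-form scoring step, assuming the two trial distributions have already been fitted (the means, shape matrices and degrees of freedom below are placeholders, and scipy's multivariate t density is used for convenience):

    import numpy as np
    from scipy.stats import multivariate_t

    def heavy_tailed_llr(trials, mu, shape_s, df_s, shape_d, df_d):
        # trials: (T, 2*D) stacked pairs; mu: shared mean;
        # shape_* / df_*: fitted shape matrices and degrees of freedom.
        same = multivariate_t(loc=mu, shape=shape_s, df=df_s)
        diff = multivariate_t(loc=mu, shape=shape_d, df=df_d)
        return same.logpdf(trials) - diff.logpdf(trials)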
0:12:33 I should also say that, as with heavy-tailed PLDA, we do not need length normalization if we use these heavy-tailed distributions. Of course the separation surfaces are slightly more complex, because we no longer have quadratic separation surfaces, but we get this kind of shape.
0:12:58 As for the results, what happens is that we managed to get more or less the same results as the Gaussian model, but without length normalization, which is, I would say, aligned with the findings about heavy-tailed PLDA. Again, what is different with respect to heavy-tailed PLDA is that this model is more expensive in training, but in testing it is as fast as all the others.
0:13:25 So, to summarise what we get here: we can use a very simple Gaussian classifier in trial space, which can be trained very easily and which, despite the incorrect assumptions we make about trial independence, still works very well. It also turns out that this model is quite easy to extend to handle more complicated distributions: while with PLDA, for example, just switching to heavy-tailed distributions makes the model very difficult to train and to test, here we can use, for example, Student's t distributions with almost no hassle. From here we hope to be able to find better ways to model the trial distribution in trial space which still allow us to have fast solutions for scoring without incurring too big problems in training.
0:14:30 And that was it, thank you.
0:14:45 The first question: what were the degrees of freedom that you estimated in that case?
0:14:50 Yes, I don't remember exactly, but it was something like five or six, maybe ten, something like that. I remember the values we had at the workshop, but there was a bug then, which was fixed afterwards.
0:15:21 This was telephone speech, yes? Telephone speech rather than microphone?
0:15:25 Well, just telephone; I didn't really try microphone. Well, I tried something on microphone data, and it was slightly worse than PLDA, but not that different. Anyway, I didn't try the heavy-tailed version yet; I think it might run into problems without length normalization. That was my expectation, but I didn't really try it, maybe I should run it on the microphone data.
0:16:00 I have a comment which may be of interest. I also used an EM algorithm to estimate the heavy-tailed parameters; for example, in the paper that I presented on Monday I was using a t distribution in score space, and I had an EM algorithm to help me estimate the parameters. I found that, when I would generate synthetic data where I knew what the degrees of freedom should be and then tried to recover them using the EM algorithm, it was very frustrating: I just could not recover the same degrees of freedom. Then I switched from the EM algorithm to direct optimisation of the likelihood, I think with BFGS, and that was much better at recovering the degrees of freedom.
0:17:03 Okay, well, for these synthetic models here I was generating the data with the heavy-tailed distribution, and I was getting more or less the same estimates. But I had a similar problem when I was trying to do something similar to what you did for calibration, with non-Gaussian, skewed distributions and those kinds of things, and I realised that EM was not that good; I was doing it numerically and it was working. So maybe I was lucky with the t distributions. I think if the data is really heavy-tailed then it works, but if it is not, it doesn't; so probably if the degrees of freedom are low you can recover them, but if they are around ten or twenty then you cannot recover them anymore.
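For what it is worth, the direct-optimisation route mentioned here can be sketched for a univariate t in score space as follows (a toy illustration under my own assumptions, not the code used in either paper):

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize

    # Synthetic "scores" drawn from a t distribution with known degrees of freedom.
    rng = np.random.default_rng(0)
    scores = stats.t.rvs(df=5.0, loc=0.0, scale=1.0, size=5000, random_state=rng)

    # Negative log-likelihood, parameterised by log(df), loc, log(scale) to keep df, scale > 0.
    def nll(params, x):
        log_df, loc, log_scale = params
        return -np.sum(stats.t.logpdf(x, df=np.exp(log_df), loc=loc, scale=np.exp(log_scale)))

    res = minimize(nll, x0=np.array([np.log(10.0), 0.0, 0.0]), args=(scores,), method="BFGS")
    print("recovered df:", np.exp(res.x[0]))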
0:17:53 Any other questions? Okay, let's thank the speaker again.