Speech Transcript - Generative pairwise models for speaker recognition

0:00:15	okay so my talk is about generative pairwise models for speaker recognition
0:00:21	as some of you may know i have been working quite a lot with
0:00:25	discriminative models
0:00:28	for i-vector classification and in particular i've been working
0:00:33	mostly with discriminative models able to directly classify pairs of i-vectors that is i-vector
0:00:39	trials directly as belonging to the same speaker or different speaker classes
0:00:44	these discriminative models were first introduced as a way to discriminatively train plda parameters
0:00:51	and then
0:00:53	we gave some interpretation of these models as discriminative
0:00:59	training of the model parameters for a second order taylor expansion of a log-likelihood ratio
0:01:05	so i've been working mostly in trial space here the idea was to go back
0:01:11	from discriminative to generative but remaining in trial space so the question was
0:01:17	whether it would be possible to train a generative model in trial space
0:01:21	and how well would it behave
0:01:23	it turns out that it's very easy to do it
0:01:25	in practice and it works pretty well
0:01:28	i would say more or less like all the other state-of-the-art
0:01:31	models
0:01:33	so in this talk i will show you how
0:01:36	we define this model which is a very easy model which
0:01:41	employs two gaussian distributions to model trials and then i will show the relationship of
0:01:46	this model with plda and
0:01:48	the discriminative plda and pairwise svm approaches
0:01:51	and then i will also show how this model can be very easily extended to
0:01:55	handle more
0:01:57	complicated distributions in particular i will work with
0:02:00	heavy-tailed distributions following the work of patrick kenny on heavy-tailed
0:02:04	plda
0:02:06	so the trial space
0:02:08	so actually to define a trial we take two i-vectors
0:02:12	we stack them together and we get our definition of trial
0:02:16	here i have a couple of pictures which show what would happen if we were
0:02:20	working with one dimensional i-vectors so on the
0:02:24	left here i have one dimensional i-vectors which are the black dots and
0:02:30	on the right then taking all cross pairs of i-vectors
0:02:34	we can see that there is a well defined region where
0:02:38	i-vectors belonging to the same speaker are
0:02:40	and
0:02:41	which is quite well separated from the region
0:02:46	where pairs coming from different speakers are
0:02:49	so with the discriminative training we tried to discriminatively train some surfaces to separate
0:02:56	these regions
0:02:57	and now i'm going to try to build a generative model to describe
0:03:01	these two sets of points
0:03:05	so the easiest generative model we can think of okay we have a two class
0:03:10	problem so it's a binary problem we can assume that
0:03:14	the trials are our patterns and that they can be modeled by gaussian distributions
0:03:19	so we would have a gaussian distribution describing
0:03:22	the
0:03:23	trials which belong to the same speaker class
0:03:26	and one describing the trials which belong to the different speaker class
0:03:29	each of them would have its own parameters
0:03:32	and for symmetry reasons we will assume that the mean of the two
0:03:36	distributions is the same
0:03:39	so
0:03:41	reasoning about
0:03:42	the symmetry of the trial that is
0:03:45	if we take a pair of i-vectors we can stack them in two ways we
0:03:48	can take enrollment first and then test or vice versa but we don't want to give
0:03:52	any
0:03:54	any particular order to the i-vectors so we want
0:03:58	a generative model which treats
0:04:00	both versions of the trial in the same way
0:04:03	this imposes some constraints on our covariance matrices which are
0:04:08	described here that is
0:04:12	we have these two matrices which should be the same as well
0:04:15	as these two and the same for the other distribution
0:04:18	in practice when working with all pairs from a single i-vector dataset we
0:04:24	don't even need to impose this constraint because it arises naturally during the training
0:04:29	so how can we train this model
0:04:31	we used
0:04:32	just the simplest thing we can think of we did it by maximum likelihood and
0:04:36	we did it assuming that i-vector trials are independent
0:04:39	of course i-vector trials are not independent because they are all the pairs that we can
0:04:44	build from a single i-vector set
0:04:47	however in practice this does not really affect our results even though the assumption is
0:04:53	very inaccurate
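
The training recipe described above can be sketched for the one dimensional case used in the pictures. This is only an illustration under stated assumptions: the synthetic data, the variances, and the helper names `make_trials` and `ml_covariance` are hypothetical and not taken from the talk. It stacks every ordered cross pair of i-vectors into a trial and estimates one zero-mean covariance per class by maximum likelihood, treating trials as independent.

```python
import random

random.seed(0)

# Synthetic 1-D "i-vectors": 20 speakers, 5 sessions each (illustrative values).
speakers, ivectors = [], []
for spk in range(20):
    centre = random.gauss(0.0, 1.0)                       # speaker factor
    for _ in range(5):
        ivectors.append(centre + random.gauss(0.0, 0.3))  # session noise
        speakers.append(spk)

# Centre the i-vectors so that both trial distributions share a zero mean.
mean = sum(ivectors) / len(ivectors)
ivectors = [x - mean for x in ivectors]


def make_trials(ivectors, speakers):
    """Stack every ordered cross pair (i != j) into a 2-D trial."""
    same, diff = [], []
    for i, (xi, si) in enumerate(zip(ivectors, speakers)):
        for j, (xj, sj) in enumerate(zip(ivectors, speakers)):
            if i != j:
                (same if si == sj else diff).append((xi, xj))
    return same, diff


def ml_covariance(trials):
    """Zero-mean ML covariance estimate, treating trials as independent."""
    n = len(trials)
    s11 = sum(a * a for a, _ in trials) / n
    s22 = sum(b * b for _, b in trials) / n
    s12 = sum(a * b for a, b in trials) / n
    return [[s11, s12], [s12, s22]]


same_trials, diff_trials = make_trials(ivectors, speakers)
cov_same = ml_covariance(same_trials)
cov_diff = ml_covariance(diff_trials)
```

Because both orderings of each pair are included, the two diagonal entries of each estimated covariance come out equal without imposing the symmetry constraint explicitly, which matches the remark that the constraint arises naturally during training.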
0:04:56	so this is a representation of what would happen
0:05:01	if we were working in a one dimensional space so assuming that the mean is zero
0:05:06	for the two distributions which is
0:05:08	essentially what we would recover if we center the i-vectors we would end up with
0:05:12	a log-likelihood ratio which is just the ratio between two gaussian distributions which
0:05:17	is a quadratic form in the i-vector trial space
0:05:23	you can see two plots for two different sets of
0:05:26	synthetic i-vectors where you can see the
0:05:30	level sets of the log-likelihood ratio as a function of the trial
0:05:34	and you can notice that essentially we are separating with quadratic surfaces the same speaker
0:05:40	area which is the
0:05:43	diagonal from the rest
0:05:46	of the
0:05:48	points
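
A minimal sketch of this log-likelihood ratio for one dimensional i-vectors (two dimensional trials) follows. The covariance values are illustrative assumptions, not numbers from the talk, and `inv2`, `quad` and `gaussian_llr` are hypothetical helper names.

```python
import math


def inv2(m):
    """Inverse and determinant of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def quad(m, t):
    """Quadratic form t^T m t for a 2-D trial t."""
    return (m[0][0] * t[0] * t[0]
            + (m[0][1] + m[1][0]) * t[0] * t[1]
            + m[1][1] * t[1] * t[1])


def gaussian_llr(trial, cov_same, cov_diff):
    """Log ratio of two zero-mean Gaussian densities evaluated at the trial."""
    prec_same, det_same = inv2(cov_same)
    prec_diff, det_diff = inv2(cov_diff)
    return (-0.5 * (quad(prec_same, trial) - quad(prec_diff, trial))
            - 0.5 * (math.log(det_same) - math.log(det_diff)))


# Assumed covariances: strongly correlated pair for same speaker trials,
# uncorrelated pair for different speaker trials.
cov_same = [[1.1, 1.0], [1.0, 1.1]]
cov_diff = [[1.1, 0.0], [0.0, 1.1]]

on_diagonal = gaussian_llr((1.0, 1.0), cov_same, cov_diff)    # similar i-vectors
off_diagonal = gaussian_llr((1.0, -1.0), cov_same, cov_diff)  # dissimilar i-vectors
```

With these values the score is positive near the diagonal and negative away from it, and swapping the two i-vectors of a trial leaves the score unchanged.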
0:05:49	so
0:05:51	this works nicely i will show you the results in a moment but first i want
0:05:55	to show you the relationship between this model and
0:05:59	the other state-of-the-art approaches like plda and discriminative plda
0:06:04	so this is the classical plda approach the simplified version where we have full
0:06:09	rank
0:06:10	channel factors merged together with the residual noise
0:06:14	and we have a subspace for the speaker space
0:06:20	so if we take this model and try to jointly model the distribution of a pair of
0:06:24	i-vectors we
0:06:26	can consider separately the case when the two i-vectors are from the same speaker
0:06:31	and
0:06:32	when they are from different speakers in the first
0:06:35	case we would have that the speaker variable
0:06:38	for the
0:06:39	latent variable for the speaker would be shared so we would have only one speaker
0:06:44	and we would have this expression
0:06:46	for the trial
0:06:48	while in the case of a different speaker trial we would have a different speaker latent
0:06:52	variable for each of the two i-vectors
0:06:56	now with standard plda all these variables are gaussian distributed so we
0:07:01	can integrate over the speaker
0:07:04	latent variables and if we integrate
0:07:06	we would end up with a distribution for same-speaker pairs and different speaker pairs which
0:07:11	is again gaussian
0:07:12	and which has this form so again we see that it has a shared mean
0:07:16	and two covariance matrices which
0:07:18	look very similar which have a very similar structure to what i was showing before
0:07:23	so in practice what this is telling us it's telling
0:07:28	us that plda is estimating
0:07:30	a model which is coherent with our two
0:07:35	gaussian model assumptions
0:07:36	and essentially differs from our model just in the
0:07:40	objective function that is optimized here we are optimising for i-vector likelihood while
0:07:45	in our two gaussian model we are optimising for trial likelihood
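
The slide expressions are not part of the transcript, so the following is a reconstruction under assumed notation: m is the i-vector mean, V the speaker subspace, y the speaker latent variable and Σ_w the full rank residual covariance. Integrating out the speaker variable in simplified plda gives, for a stacked pair of i-vectors:

```latex
\phi = m + V y + \varepsilon, \qquad
y \sim \mathcal{N}(0, I), \qquad
\varepsilon \sim \mathcal{N}(0, \Sigma_w)

% same speaker: y is shared by the two i-vectors
\begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} m \\ m \end{bmatrix},
\begin{bmatrix}
V V^\top + \Sigma_w & V V^\top \\
V V^\top & V V^\top + \Sigma_w
\end{bmatrix}
\right)

% different speakers: independent y_1 and y_2
\begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} m \\ m \end{bmatrix},
\begin{bmatrix}
V V^\top + \Sigma_w & 0 \\
0 & V V^\top + \Sigma_w
\end{bmatrix}
\right)
```

Both classes share the mean, and each covariance has equal diagonal blocks and equal off-diagonal blocks, which is the same symmetric structure imposed on the two gaussian model in trial space.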
0:07:51	so again
0:07:56	when we compute the log-likelihood ratio we end up with very similar separation surfaces
0:08:00	to those of our two gaussian model in this one dimensional i-vector space
0:08:05	and we will see that
0:08:07	this is also reflected in the real i-vector space since the two models perform pretty
0:08:12	much the same
0:08:15	so going to the
0:08:18	relationship with the discriminative approach
0:08:22	this is the scoring function we used for the pairwise svm
0:08:27	so we assumed this was the scoring function which
0:08:31	is the scoring function we used to compute the loss of
0:08:35	the svm
0:08:37	and this scoring function is actually formally equivalent to the
0:08:43	log-likelihood ratio function we've seen for our two gaussian model
0:08:47	and of course this is also equivalent to the plda scoring function as it
0:08:51	was first derived from
0:08:53	that approach
0:08:55	so overall we can think about the svm as a way to discriminatively train this
0:09:01	matrix which
0:09:03	which if we think about it in the two gaussian model is nothing else than
0:09:07	the difference between the precision matrices of the two distributions
0:09:11	so again we have a model which has also the
0:09:14	same kind of separation surfaces
0:09:17	and again the only difference is the objective function we are optimising
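
The scoring function on the slide is not in the transcript. A hedged reconstruction, with assumed symbols Λ, Γ, c and k for the score parameters and Σ_S, Σ_D for the same speaker and different speaker trial covariances, shows why the quadratic score and the two gaussian log-likelihood ratio have the same form:

```latex
% two gaussian model with shared mean \mu = [m; m] and trial t = [\phi_1; \phi_2]
\mathrm{llr}(t) = -\tfrac{1}{2} (t - \mu)^\top
  \left( \Sigma_S^{-1} - \Sigma_D^{-1} \right) (t - \mu)
  - \tfrac{1}{2} \log \lvert \Sigma_S \rvert
  + \tfrac{1}{2} \log \lvert \Sigma_D \rvert

% under the symmetry constraint the scaled precision difference has the block form
-\tfrac{1}{2} \left( \Sigma_S^{-1} - \Sigma_D^{-1} \right)
  = \begin{bmatrix} \Gamma & \Lambda \\ \Lambda & \Gamma \end{bmatrix},
  \qquad \Lambda = \Lambda^\top, \quad \Gamma = \Gamma^\top

% expanding gives a symmetric quadratic score in the two i-vectors
s(\phi_1, \phi_2) = \phi_1^\top \Lambda \phi_2 + \phi_2^\top \Lambda \phi_1
  + \phi_1^\top \Gamma \phi_1 + \phi_2^\top \Gamma \phi_2
  + (\phi_1 + \phi_2)^\top c + k,
  \qquad c = -2 (\Gamma + \Lambda) m
```

Here k collects the constant terms. Under this reading, the svm trains Λ, Γ, c and k directly, while the generative models obtain them from the estimated covariances.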
0:09:23	so to see some results about this first part
0:09:27	okay this was done on the nist two thousand and ten telephone condition
0:09:33	and i'm comparing essentially plda with this
0:09:35	two gaussian model
0:09:37	so the first line is plda without dimensionality reduction which is also
0:09:42	known as the two covariance model and
0:09:45	essentially here it means that i'm taking a full rank
0:09:49	speaker space
0:09:50	in both cases i'm doing length normalization and these two lines are the results
0:09:55	of plda with full rank speaker space and the two gaussian model trained by
0:10:00	maximum likelihood in the trial space
0:10:03	and as you can see they perform pretty much the same
0:10:07	and while of course the two covariance model is fast to train
0:10:12	this two gaussian model is even faster and in test they have the same
0:10:17	computational requirements
0:10:21	the problem is when we move to plda with
0:10:26	a low rank speaker subspace
0:10:28	in this case i was using a one hundred and twenty dimensional speaker
0:10:33	subspace while the i-vectors were four hundred dimensional
0:10:36	we cannot directly apply this
0:10:39	dimensionality reduction to the two gaussian model so
0:10:44	we
0:10:45	replaced it by a dimensionality reduction done by lda projection
0:10:50	and that's good enough so here we have plda with the reduced speaker
0:10:54	subspace and the two covariance model where the
0:10:57	dimensionality reduction is done by lda they perform
0:11:01	i would say the same
0:11:03	and then in this
0:11:04	reduced one hundred and twenty dimensional i-vector space we trained our
0:11:09	gaussian model on trials and it performs again pretty much the same as the
0:11:13	plda model
0:11:15	for comparison these are the results we had with the discriminative model
0:11:19	the difference with all these models is that the discriminative model didn't require
0:11:24	length normalization
0:11:26	so this means that we can
0:11:29	do a generative model in trial space it's very easy to do actually and it
0:11:33	works very well so let's see if we can
0:11:37	make things a little more complicated and how hard training and testing become so
0:11:43	to complicate things we
0:11:45	took
0:11:46	we did something similar to what patrick kenny did with his heavy-tailed
0:11:51	plda we said okay let's replace
0:11:53	our gaussian distributions with
0:11:55	t distributions and see what happens
0:11:58	so it turns out that training can still be done using an em algorithm
0:12:03	although it's not that fast it becomes more or less as computationally expensive as the
0:12:09	discriminative approach
0:12:10	but the good thing is that in test we can perform close
0:12:13	sorry
0:12:14	we can use closed-form integration
0:12:17	and so our log-likelihood ratio becomes simply the ratio between two student's t
0:12:22	distributions
0:12:23	so at testing time this thing is as fast as
0:12:28	plda or the two gaussian model i've shown before
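
A minimal scoring sketch for the heavy-tailed variant, again for one dimensional i-vectors. It assumes zero-mean bivariate student's t distributions whose scale matrices and degrees of freedom are already trained; the numbers below are illustrative, and `student_t_logpdf` and `student_t_llr` are hypothetical names. The em training step mentioned in the talk is not shown.

```python
import math


def inv2(m):
    """Inverse and determinant of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def quad(m, t):
    """Quadratic form t^T m t for a 2-D trial t."""
    return (m[0][0] * t[0] * t[0]
            + (m[0][1] + m[1][0]) * t[0] * t[1]
            + m[1][1] * t[1] * t[1])


def student_t_logpdf(trial, scale, dof):
    """Log density of a zero-mean bivariate Student's t distribution."""
    precision, det = inv2(scale)
    dim = 2
    return (math.lgamma((dof + dim) / 2) - math.lgamma(dof / 2)
            - 0.5 * dim * math.log(dof * math.pi)
            - 0.5 * math.log(det)
            - 0.5 * (dof + dim) * math.log1p(quad(precision, trial) / dof))


def student_t_llr(trial, scale_same, scale_diff, dof_same, dof_diff):
    """Closed-form score: log ratio of the two Student's t densities."""
    return (student_t_logpdf(trial, scale_same, dof_same)
            - student_t_logpdf(trial, scale_diff, dof_diff))


# Assumed parameters, for illustration only.
scale_same = [[1.1, 1.0], [1.0, 1.1]]
scale_diff = [[1.1, 0.0], [0.0, 1.1]]

t_on_diagonal = student_t_llr((1.0, 1.0), scale_same, scale_diff, 5.0, 5.0)
t_off_diagonal = student_t_llr((1.0, -1.0), scale_same, scale_diff, 5.0, 5.0)
```

Scoring a trial costs two quadratic forms and two logarithms, so it is comparable to the gaussian case. The quadratic form now sits inside a logarithm, which is why the separation surfaces are no longer quadratic.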
0:12:33	also
0:12:35	yes okay as with heavy-tailed plda we don't need
0:12:39	length normalization if we use these heavy-tailed distributions
0:12:45	of course the separation surfaces are slightly more complex because we don't have anymore
0:12:51	quadratic separation surfaces but
0:12:53	we have this kind of
0:12:56	things
0:12:58	and for the results
0:13:00	what happens is that we managed to get more or less the same results as
0:13:04	the gaussian model without the need
0:13:06	for length normalization which is
0:13:09	i would say in line with
0:13:10	the findings about
0:13:12	plda
0:13:14	so again
0:13:16	what's different with respect to plda is that this model is more expensive
0:13:20	in training but in testing it is as fast as all the others
0:13:25	so
0:13:28	to summarise what we get here
0:13:31	we get that we can use a very simple gaussian classifier in the trial
0:13:38	space which can be very easily trained and
0:13:40	despite the
0:13:41	fact that we
0:13:44	make incorrect assumptions about trial independence it still works very well
0:13:49	and it turns out that this model is quite easy to extend to handle
0:13:53	more complicated distributions
0:13:55	so while with plda for example just moving to the heavy-tailed
0:13:59	distribution it becomes
0:14:01	very difficult to train the model and test the model here we can use
0:14:06	for example the student's t distributions without
0:14:09	almost any hassle
0:14:12	so from here we hope to be able to find some better way to model
0:14:16	the trial distribution in trial space which will still allow us
0:14:21	to have
0:14:23	a fast solution for scoring without incurring
0:14:25	too big
0:14:27	problems for training
0:14:30	and that's it
0:14:45	the first question
0:14:47	the degrees of freedom in the heavy-tailed case
0:14:50	yes
0:14:53	i don't remember exactly but it was something like five or six
0:14:58	maybe something like that
0:15:01	i remember the are we had that all you in the war should but they
0:15:04	had a bug then
0:15:06	when it was work
0:15:08	then a fixed
0:15:21	speech yes telephone speech rather than microphone
0:15:25	well just telephone i didn't try microphone well i tried something on microphone it
0:15:31	worked slightly worse than plda but it's not that different anyway i didn't
0:15:36	try the heavy-tailed version yet
0:15:39	i think it might run into problems without length normalization
0:15:45	that was my expert
0:15:47	i didn't really try to train one on the microphone data
0:16:00	i have a comment
0:16:02	which may be standard and had to
0:16:05	i saw that you used an em algorithm to estimate the heavy-tailed parameters
0:16:11	and
0:16:12	for example in the paper that i presented on monday i was using a t
0:16:17	distribution in score space
0:16:19	and i used an em algorithm to
0:16:23	help me to estimate the parameters and i found
0:16:26	that
0:16:27	i would generate synthetic data where i knew what the degrees-of-freedom would be
0:16:34	and then i tried to recover that
0:16:37	using an em algorithm and that was very frustrating i just
0:16:41	could not recover
0:16:44	the same
0:16:45	degrees-of-freedom and then i switched from using an em algorithm to using direct
0:16:51	optimisation i think it was bfgs
0:16:54	of the likelihood
0:16:57	and that was much better at recovering the degrees-of-freedom
0:17:03	okay well for these synthetic models here i was generating them
0:17:08	with the heavy-tailed distribution and i was getting more or less the same estimates
0:17:13	for this but i had a similar problem when i was trying to
0:17:17	do something similar to what you did for calibration with
0:17:21	non gaussian distributions like skewed distributions and those kind of things and i realised
0:17:25	that
0:17:26	em there was not that good i was doing it numerically and it was working but
0:17:32	so maybe i was lucky with the t distributions
0:17:36	i think if the data is really heavy-tailed then it works but if it's not
0:17:41	then it doesn't so probably if the degrees-of-freedom is low you can recover it but if it's
0:17:46	like around ten or twenty then you can't recover it anymore
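
The synthetic experiment discussed in this exchange can be sketched as follows. The assumptions are stated plainly: the data are one dimensional, location and scale are taken as known (zero and one), and a simple grid search over the likelihood stands in for the bfgs optimiser mentioned by the questioner. `sample_t`, `t_loglik` and `fit_dof` are hypothetical names.

```python
import math
import random

random.seed(1)


def sample_t(dof):
    """Draw one standard Student's t variate as normal / sqrt(chi2 / dof)."""
    chi2 = random.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` degrees
    return random.gauss(0.0, 1.0) / math.sqrt(chi2 / dof)


def t_loglik(data, dof):
    """Log-likelihood of the data under a standard Student's t distribution."""
    const = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))
    return sum(const - 0.5 * (dof + 1) * math.log1p(x * x / dof) for x in data)


def fit_dof(data, lo=1.0, hi=100.0, steps=100):
    """Directly maximise the likelihood over a log-spaced grid of dof values."""
    grid = [lo * (hi / lo) ** (k / (steps - 1)) for k in range(steps)]
    return max(grid, key=lambda dof: t_loglik(data, dof))


true_dof = 3.0
data = [sample_t(true_dof) for _ in range(2000)]
estimated_dof = fit_dof(data)
```

For a low true value such as 3 the likelihood is sharply peaked and the estimate lands close to it. For values around ten or twenty the likelihood is much flatter in the degrees of freedom, which is consistent with the difficulty reported in the discussion.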
0:17:53	one question
0:17:57	let's thank the speaker again

Generative pairwise models for speaker recognition

Speaker Modeling II

Sandro Cumani and Pietro Laface