All right, so I am going to present something that we have been working on during the last workshop at Johns Hopkins: trying to explore whether there is any useful information in the GMM weights, because the i-vector, as you probably know, only tries to adapt the means. As you are probably all aware by now, the i-vector is related to adapting the means, and it has been applied very successfully for speaker, language, dialect, and many other applications.
The story behind adapting only the means goes back to GMM MAP adaptation with the UBM, the universal background model, as a basis, where typically only the means are adapted. So we wanted to revisit whether, beyond what the i-vector captures, there is useful information in the weights, or even in the variances; Patrick already tried the variances for JFA.
So in this work we tried to do something with the weights. A number of techniques have already been proposed for the weights, and we tried to build a new one called non-negative factor analysis. This was actually Hassan's work; he was a student from Belgium who was visiting me at MIT. We first tried it for language ID, where we actually had some success with it.
The reason was that for language ID you have a UBM, and its Gaussians supposedly correspond roughly to phonemes, so if for some language a phoneme does not appear, the corresponding counts can be almost zero; in other words, the weights of those Gaussians can carry useful information. That is what we found, and that is what motivated us to check whether, for speakers as well, there is information that can be used from the GMM weights; this is ultimately the topic of this work. We also compared this non-negative factor analysis, NFA, with an already existing technique proposed by the Brno group (BUT), the subspace multinomial model, so this presentation is basically a comparison between the two for the case of GMM weight adaptation.
So, for adapting the GMM means a lot of techniques have already been applied: maximum a posteriori, maximum likelihood linear regression, and eigenvoices, which were the starting point of the newer technology, namely JFA and i-vectors. There have also been a number of weight adaptation techniques, for example maximum likelihood, non-negative matrix factorization, and the subspace multinomial model, and the new one we propose is non-negative factor analysis.
The idea behind it is the i-vector concept; I don't want to bore you with this. You say that for a given utterance there is a UBM, which is a prior of what all the sounds look like, and the i-vector tries to model the shift from this UBM to a given recording, which can be modeled by a low-dimensional matrix representation; the coordinates of the recording in that space are what we call the i-vector. We tried to use the same concept, which is normally applied to the means, for the weights. The only difference we were facing is that the weights should all be positive and should sum to one; I will explain later how we handle that.
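In symbols, here is a minimal sketch of the two models, in my own notation rather than the notation on the slides:

    M^{(u)} = m + T x^{(u)}   % i-vector model: shift of the UBM mean supervector for utterance u
    w^{(u)} = b + L r^{(u)}, \quad \sum_c w_c^{(u)} = 1, \; w_c^{(u)} \ge 0   % weight analogue used later (the NFA model)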
So, in order to do the weights: the first thing is that when you have a UBM, a universal background model, and a sequence of features, you can compute some counts, which are the posterior probabilities of occupation of each Gaussian given each frame, summed over the frames, as given here in the equation. The objective function for the weights is of this kind: it is related to the Kullback-Leibler divergence, so it essentially tries to maximize the cross term between the counts and the weights that you want to model. And if you take these counts and normalize them by the length of your utterance, the number of frames, you get the maximum likelihood estimate of the weights, which is easy to do.
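Just to make this step concrete, here is a minimal sketch under my own assumptions (a UBM trained as a scikit-learn GaussianMixture; ubm and features are placeholder names) of the zero-order counts and the maximum likelihood weight estimate; the quantity maximized downstream is the sum over Gaussians of n_c log w_c:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def ml_weights(ubm: GaussianMixture, features: np.ndarray) -> np.ndarray:
        """features: (T, D) frames of one utterance; returns the ML weight estimate (C,)."""
        gammas = ubm.predict_proba(features)   # (T, C) posterior occupation of each Gaussian per frame
        counts = gammas.sum(axis=0)            # zero-order statistics n_c
        return counts / features.shape[0]      # normalize by the utterance length T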
For example, the first technique, which unfortunately we could not compare with for this paper, is non-negative matrix factorization. You take the weights and you say that this weight matrix can be split into two non-negative matrices: the first one is the basis of your space, and the second one gives the coordinates in that space, and this decomposition can be found by optimizing the auxiliary function.
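To illustrate that idea, here is a hedged sketch using scikit-learn's generic NMF with a KL objective, rather than the exact auxiliary function from the weight-adaptation literature; all the data below is a random placeholder:

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    W = rng.random((1000, 2048))              # placeholder: N utterances x C Gaussian weights
    W /= W.sum(axis=1, keepdims=True)         # each row sums to one
    nmf = NMF(n_components=100, beta_loss='kullback-leibler',
              solver='mu', init='random', max_iter=200)
    H = nmf.fit_transform(W)                  # (N, 100) non-negative coordinates per utterance
    B = nmf.components_                       # (100, C) non-negative basis vectors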
Okay, so that is the technique we did not have time to do a comparison with. What we did do is compare with the subspace multinomial model, because that is the one it most resembles, so we tried to compare the two. The idea behind the subspace multinomial model is that you have the counts, and you try to find a multinomial distribution that fits this distribution. It is defined by a subspace, similar to the i-vector space, on top of the UBM weights, and it is normalized so that the weights sum to one. They have several papers on how to do the optimization, with an approximate Hessian solution.
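As I understand it, the parameterization can be sketched like this (the subspace matrix S and the latent vector r are my placeholder names): the per-utterance weights are a softmax of the UBM log-weights shifted along a low-rank subspace, so positivity and sum-to-one hold by construction.

    import numpy as np

    def smm_weights(ubm_weights: np.ndarray, S: np.ndarray, r: np.ndarray) -> np.ndarray:
        logits = np.log(ubm_weights) + S @ r  # shift in the log domain
        logits -= logits.max()                # numerical stability
        w = np.exp(logits)
        return w / w.sum()                    # normalization enforces the constraints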
So, for example, for the SMM, suppose you have two Gaussians, and each point here is the maximum likelihood estimate of the weights for a given recording. For this example the points were actually generated from the subspace multinomial distribution, so we generated them from this model, because we believe that in a high-dimensional space the vectors should be distributed like that rather than spread everywhere: if you take a lot of data and train only two Gaussians, the data will be all over the place, but not in a high-dimensional space. So I tried to simulate that, to simulate a high-dimensional GMM with two Gaussians, which is also what other people have done; we generated data from this model and then show the difference between this model and the non-negative factor analysis.
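Here is a toy version of that simulation, entirely my own construction for illustration: draw latent coordinates, map them through an SMM-style softmax over two Gaussians, and sample frame counts from the resulting multinomial, so the per-recording ML weight estimates concentrate the way a high-dimensional model would.

    import numpy as np

    rng = np.random.default_rng(0)
    ubm_w = np.array([0.5, 0.5])
    S = rng.normal(size=(2, 1))               # two Gaussians, one latent dimension
    points = []
    for _ in range(500):                      # 500 simulated recordings
        r = rng.normal(size=1)
        logits = np.log(ubm_w) + S @ r
        w = np.exp(logits - logits.max())
        w /= w.sum()
        counts = rng.multinomial(1000, w)     # 1000 "frames" for this recording
        points.append(counts / counts.sum())  # ML weight estimate for this recording
    points = np.asarray(points)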
So what does the non-negative factor analysis say? It is the same idea as the i-vector: we suppose we have a UBM, and for each recording the weights can be explained by a shift, in a subspace learned from the data, applied to the UBM weights. Just as with the i-vector, this subspace L can be low rank, and r is the new "i-vector" in this new space. The only problem we were facing is that the weights for each recording should always be positive and should sum to one. So we developed a kind of EM-like algorithm: we first collect some statistics and use gradient ascent to estimate the r's, and then, to obtain L, we do a projected gradient ascent, where the projection enforces the constraints that the weights always sum to one and are always positive. That is essentially what we did; if you want more explanation, I don't have time for it here, but you can find it in the paper. Remember, these are the zero-order counts and this is the auxiliary function for the GMM weight case; this is our weight model, and we would like to estimate these two parameters subject to the constraint that the weights sum to one. So what we did is simply impose, via a vector of ones, that the adapted weights sum to one, and also that they all stay positive.
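Roughly, the r update looks like the sketch below; this is only an illustrative sketch, not the exact update in the paper, and the fixed learning rate and the simple clip-and-renormalize projection are simplifications:

    import numpy as np

    def project_to_simplex(w: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        w = np.clip(w, eps, None)             # positivity
        return w / w.sum()                    # sum to one

    def nfa_update_r(counts, ubm_w, L, r, lr=1e-3, iters=5):
        """counts: zero-order statistics of one utterance; returns an updated latent vector r."""
        for _ in range(iters):
            w = project_to_simplex(ubm_w + L @ r)
            grad = L.T @ (counts / w)         # gradient of sum_c n_c log w_c with respect to r
            r = r + lr * grad
        return r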
Okay, so these are the two constraints that allow us to keep the weights summing to one and positive. Now, if you compare the non-negative factor analysis with the subspace multinomial model and look at what each model is doing, then for this case the SMM is definitely fitting the data well, because the data was generated from it, while the NFA only gives an approximation of the data. That has a benefit and a disadvantage: the SMM has a tendency to overfit, because it models the distribution of the training data really well, but when you go to the LID task it sometimes does not generalize well. So what they did is add a regularization term, which you have to tune, to try to control this overfitting. In our case we do not suffer too much from that: we do not fit the training data very well, but we approximate it and sometimes generalize better than the SMM. Sometimes; it depends on the application, to be honest. We compared them on several applications, and sometimes one is a bit better, sometimes the opposite. Anyway, the difference is this: the SMM can fit the training data really well but can have an overfitting problem that needs to be controlled with regularization, while the NFA approximates the data and sometimes generalizes better.
So these are the experiments. We first trained an i-vector system on all the data that we have, and we tested on the telephone condition of NIST SRE 2010. We have a UBM of 2048 Gaussians; these are the more technical details: we extract i-vectors and we use the usual LDA, length normalization, PLDA scheme. Then we have an i-vector for the means and a vector for the weights, from the SMM and from the NFA, and we tried fusion to see how we can combine them. Simple score fusion did not help at all, so we forgot about it and did i-vector-level fusion, which seems to be a little bit better, but not by much for speaker, which was a little disappointing; for language ID, on the other hand, it actually helps a lot.
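To be concrete, by i-vector fusion, as opposed to score fusion, I mean something like the following sketch: stack the mean-based i-vector and the weight-subspace vector of each utterance and feed the stacked vector to the same LDA, length normalization, PLDA backend. Normalizing each part before stacking is just one reasonable choice, not something prescribed here.

    import numpy as np

    def fuse_vectors(ivec: np.ndarray, wvec: np.ndarray) -> np.ndarray:
        ivec = ivec / np.linalg.norm(ivec)    # scale each part so neither dominates
        wvec = wvec / np.linalg.norm(wvec)
        return np.concatenate([ivec, wvec])   # fused vector goes to LDA / length norm / PLDA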
I also wanted to check, for example, how the dimensionality of this new weight adaptation behaves compared to the i-vectors. So I trained the non-negative factor analysis with 500, 1000, and 1500 dimensions; remember that the starting UBM had 2048 Gaussians. LDA is then used to reduce the dimensionality before length normalization, and you can see that the difference is not really big when you vary the LDA dimension, and even if you compare between 500 and 1000 dimensions the difference is not really big either. We were a little bit surprised, especially for the NFA, and we have seen the same behaviour for the SMM as well. Sometimes, though, the SMM needs to be more low-dimensional, while the non-negative factor analysis tends to prefer a higher dimension compared to the other one.
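The backend step being varied here is just an LDA projection followed by length normalization before PLDA; a minimal sketch with scikit-learn, with random placeholder data and an assumed target dimension of 200:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 500))          # placeholder: weight vectors, N x 500
    labels = rng.integers(0, 300, size=2000)  # placeholder speaker labels
    lda = LinearDiscriminantAnalysis(n_components=200)
    X_red = lda.fit_transform(X, labels)
    X_red /= np.linalg.norm(X_red, axis=1, keepdims=True)  # length normalization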
Here, for example, we compare the best result that we obtained with the non-negative factor analysis against the best one from the subspace multinomial model, for the core condition, male and female, and for the eight-conversation condition. You can see that there is really not too much difference: sometimes the NFA is a little better, sometimes a little worse than the SMM. But you can also see that, for the eight-conversation condition, you can get a very nice result even without using the GMM means at all, just the weights.
Now, if you compare with the i-vectors: as I said, we also have the maximum likelihood estimate of the weights, so we took the ML weights, applied the log, and fed that to LDA. Maybe that is not the best way to do it, maybe you can do something more clever, but it turned out that the ML weights with the log were worse compared to the SMM and the NFA, for all the conditions: eight conversations, male, female, and the core condition as well.
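That baseline is essentially just this (a minimal sketch; the flooring constant is an arbitrary choice to avoid the log of zero):

    import numpy as np

    def log_ml_weights(counts: np.ndarray, floor: float = 1e-6) -> np.ndarray:
        w_ml = counts / counts.sum()            # maximum likelihood weight estimate
        return np.log(np.maximum(w_ml, floor))  # log features fed to the LDA backend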
Now we remove the maximum likelihood from the loop and put the i-vectors here, and you can see that the i-vector system is usually about twice as good: you can roughly take the error of the weight-based systems and divide it by two to get the i-vector numbers. So the i-vectors are definitely much better than the weights.
The gap is not that large, though, if you go to the eight-conversation condition, which is actually pretty cool because the error rate there is very low. So even in the long condition, when you have a lot of recordings from the speaker, the weights can give you almost as much useful information as the i-vector can. That was the surprising part for us.
Here we have the EER and the minimum DCF, at the 2008 and the 2010 operating points, and the baseline is the i-vector system, for female and male. Then we fuse the i-vectors with the weights, using the i-vector-level fusion. This is the NFA: when we add the NFA we win a little bit here and a little bit there, but not too much. For female, for example, when we fuse with the SMM we get a gain at the new DCF operating point and even in the EER, so for female the SMM was the best one to fuse with; for male you can see that the NFA was better almost everywhere, but not really in the new minimum DCF. So the fusion was not really exciting, to be honest; it gave a small improvement, nothing like what we have seen for language ID.
Now, since the i-vector dimensionality is tied to the dimensionality of the supervector, we cannot really keep increasing the UBM size there; for the GMM weights, however, the dimensionality is simply the number of Gaussians you have. So we tried to increase and decrease the UBM size and see what happens; we did that only for the non-negative factor analysis here. You can see that if you increase the number of Gaussians in the UBM you get a very nice improvement, for both male and female, especially, maybe, in the new minDCF. So, since the weights, unlike the i-vector, are not tied to the size of the supervector, you can increase the number of Gaussians in the UBM, and you could even think about using a speech recognizer and its senones if you want.
So what we did here is take the baseline again: we took the i-vectors and tried to fuse them with the weight vectors obtained from the different UBM sizes. You can see that the conclusions do not really carry over: even though you get better results with the larger number of Gaussians, the fusion, for example for female, did not help too much; to be honest it was actually worse, and for male it was a little bit worse as well. So the point is that getting better results from the weights alone does not mean the fusion will be better than using only the i-vectors.
So, as a conclusion: here we tried to use the weights, and to think about whether it is worth finding a better way of using and updating the weights as well, not only the means, which is what the i-vector is doing. We have seen some slight improvements when we combine them; maybe we need to find a better way to combine them, for example something similar to what subspace GMMs are doing for speech recognition, I don't know; we are looking forward to working on that and hopefully we will make some progress. I also tried doing it iteratively: you estimate the weights, you update the GMM weights of the UBM, and then you extract the Baum-Welch statistics and the i-vectors again. It did not help for speaker, to be honest; I tried it and it gave the same results, no improvement, nothing. I have not tried it for language ID, only for speaker. Thank you.
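For what it is worth, one way to read that iterative variant as a per-utterance loop is sketched below; whether the update is done per utterance or globally, and how many passes to run, are assumptions here, not details given in the talk:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def adapt_weights_iteratively(ubm: GaussianMixture, features: np.ndarray, n_iter: int = 3):
        for _ in range(n_iter):
            gammas = ubm.predict_proba(features)  # occupation counts with the current weights
            counts = gammas.sum(axis=0)
            ubm.weights_ = counts / counts.sum()  # write the re-estimated weights back
        return ubm, counts                        # the counts would feed the i-vector extractor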
I will take a bit of time so that you can understand my question. You know, we have worked a lot on the weights in our group, mainly with our usual approach, and we are also looking at the weights with a new approach now, and Michel also has some results. It has seemed to me since the beginning that the weights are a very interesting, very nice source of information, but in fact it is binary information. If you come back to UBM-GMM and to Doug's results, when he proposed top-Gaussian scoring you put a one on the top Gaussian and a zero on all the others, and the loss of performance was quite small. After that, if you go to later results that did things very close to what you presented, in the end the best solution was to use rank-based normalization, and rank-based is very close to putting a one on some Gaussians and a zero on all the others in the weight counts. And if you look at more recent results, using just the zero-or-one information from the weights, it seems we are still able to find what we need. So, according to me, the way the weights represent information is binary, this information is there or it is not, and it is not continuous information like what you are trying to model.
That is a good point, because when Hassan started working with me on the non-negative factor analysis, my first question, my first thought, was exactly about that: I wanted to put sparsity into the weights, and this model is not able to do that with what we are doing now. So I agree with you: top-one or top-five scoring is like imposing sparsity on the weights, setting all of them to zero except, say, the top five. But with this model we are not doing that. That was actually my first comment as well, how we could make it sparse, which is exactly what you are saying.
What if you extract the i-vectors adaptively? You adapt the UBM before you extract, so that for each frame there are very few Gaussians active. That is what happens; I don't know whether that is a solution to your problem, but you will get sparsity that way.

Okay, thanks.
This kind of follows up on Patrick's question. You are doing sequential estimation for the L's and the r's; how many iterations do you go through to get that?

Around ten of the EM-like outer iterations, and inside each one there is a gradient ascent, something like five iterations for r and three for L.

I am asking because it would be interesting to see the rate of convergence you actually hit, and I know it is extra work. In your evaluations, I believe you evaluate once you believe you have converged; did you run any earlier systems? Let's say that before hitting five you try three or four, just to see where you actually are; maybe certain dimensions of the vector become active earlier, and there might be some insight in seeing that.

I tried that, but not in this context, not with this setup. It is a bit sensitive: sometimes when you iterate more, say fifteen iterations, you can see the results start to degrade after a certain point, and usually somewhere between five and eight iterations you are already saturated.
Yes, we need to control that a little bit. If you let it go, you will see that the SMM can actually be better, especially for sparsity; the SMM is much better there because it will really fit the data, while the NFA will not do that, because it is an approximation. That is my issue with the NFA. The SMM would definitely get you some sparsity if you know how to control it, because otherwise you might overfit. But you probably know this better than me, since you were the one doing this, isn't that right?
Actually, when we did this work we tried different optimization algorithms, and with the approximate Hessian it converged quite well. Also, like in the earlier question, we saw that even with a few iterations you already get quite good results, and when you keep iterating you get some degradation, so it looks like it starts overfitting the model.

So I guess yours is similar, too.