Speech Transcript - Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition

0:00:14	okay so i'm hagai and the title of my talk is compensating inter-dataset variability
0:00:19	in plda hyper-parameters for robust speaker recognition
0:00:24	i hope that the presentation is much simpler than this long title
0:00:30	okay so the problem setup is quite similar to what we have already
0:00:35	heard today but with some twist
0:00:37	so as we already heard speaker recognition state-of-the-art can be quite
0:00:42	accurate
0:00:44	when trained on a lot of matched data
0:00:48	but in practice many times we don't
0:00:50	have the luxury to do that so we have a lot of data from
0:00:55	a source domain
0:00:56	namely nist data
0:00:58	and we may have limited or no data at all from our target domain which
0:01:04	may be very different
0:01:07	in many senses from the
0:01:10	source domain and this work addresses the setup where we don't have any data
0:01:14	at all
0:01:16	from the target domain okay so it's a bit different from what we heard
0:01:19	so far
0:01:21	and of course there are related applications the first
0:01:26	one is when you do have some small labeled data set to adapt your model
0:01:31	but the problem is that that's not always the case for many applications you don't
0:01:35	have data at all
0:01:36	or you may have a very limited amount of data and
0:01:40	you want to be able to adapt properly using such scarce data
0:01:46	now the second related work is when you have lots of
0:01:51	unlabeled data to adapt the models and that's what we already heard today
0:01:57	but for many applications for example for text-dependent and in other cases
0:02:03	you don't have the luxury to have a lot of unlabeled data
0:02:07	and also one of the problems with all the methods that
0:02:11	we already heard today is that most of them are based on clustering
0:02:16	and clustering may be a tricky thing to do especially when you don't have
0:02:22	the luxury
0:02:24	to see the results of your clustering algorithm and you cannot
0:02:28	evaluate it if you don't really have labeled data from your target domain
0:02:34	so we don't really like to use clustering
0:02:39	okay so
0:02:41	so this work was done it started during the recent jhu
0:02:46	speaker verification workshop
0:02:49	and this is a variant of the domain adaptation challenge
0:02:52	i call it the domain robustness challenge
0:02:55	so the twist here is that we do not use adaptation data at all
0:02:59	that is we don't use the mixer data at all even for centering
0:03:05	or whitening the data
0:03:08	and again we can see here some baseline results we see that
0:03:11	if we train our system on mixer we get an equal error rate of two point
0:03:16	four if we train only on switchboard without any use of mixer
0:03:20	for centering
0:03:22	we get eight point two
0:03:24	and the challenge is to try to bridge this gap
0:03:29	okay so we are all familiar with plda modeling just for notation we parameterize
0:03:36	the plda model by three hyper-parameters by mu which is the center or the
0:03:41	mean of the distribution of all i-vectors b which is the
0:03:46	between-speaker covariance matrix or across-class covariance matrix and w which is
0:03:53	the within-speaker covariance matrix
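
For reference, one common way to write a model with exactly these three hyper-parameters is the two-covariance form below. The talk only names mu, b and w; the split of an i-vector \phi into a speaker term s and a residual term c, and those symbol names, are assumptions made here for illustration.

```latex
\phi = \mu + s + c, \qquad s \sim \mathcal{N}(0, B), \qquad c \sim \mathcal{N}(0, W)
```
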
0:03:57	okay so just before describing the method i want to present some interesting
0:04:02	experiments
0:04:04	on the data
0:04:06	so if we estimate a plda model from switchboard and we do
0:04:12	the centering also on switchboard then we get an equal error rate of
0:04:16	eight point two
0:04:17	and using mixer to center the data helps but not that much
0:04:24	okay if we just use the nist
0:04:27	ten data to center
0:04:29	the evaluation data then we get a very large improvement so from here
0:04:35	we can see that centering is a big issue
0:04:39	now however if we do the same experiment on mixer we see here that centering
0:04:43	with nist ten data doesn't help that much
0:04:48	so
0:04:49	the conclusions are quite mixed
0:04:51	so basically what we can say is that centering can account
0:04:57	for some of the mismatch but it's quite complicated
0:05:01	and it's not really clear when centering will hurt or not
0:05:06	because we can see that centering with mixer is not good enough for the switchboard
0:05:10	build but it is good enough for the mixer build
0:05:15	okay so the proposed method which is partly unrelated to the experiments i
0:05:21	just showed is the following so the basic idea here is that we hypothesize
0:05:26	that some
0:05:28	directions in the i-vector space are more important and actually account for most
0:05:33	of the dataset mismatch
0:05:35	this is the underlying assumption
0:05:38	and we want to find these directions and remove this subspace
0:05:42	using a projection which we call p
0:05:45	and we call this method inter-dataset variability compensation or
0:05:49	idvc
0:05:50	and we actually presented this in the recent icassp but what we
0:05:55	did there is we just focused on the center hyper-parameter and tried to
0:06:00	estimate everything from the center hyper-parameter
0:06:03	and what we do here is we also focus on the other
0:06:06	hyper-parameters b and w so basically what we want to do is
0:06:11	find a projection that minimizes the variability of these hyper-parameters when
0:06:17	trained over different datasets
0:06:20	that's the basic idea
0:06:22	okay so how can we find this projection p
0:06:25	well given a set of datasets we represent each one
0:06:29	by a vector in the hyper-parameter space namely mu b and w
0:06:35	so we can think of for example switchboard as one point in
0:06:39	this space and mixer as one point or we can also look at different
0:06:42	components of switchboard and mixer different years different deliveries and each one can
0:06:48	be represented as a point in this space
0:06:52	and the dataset mismatch problem is actually illustrated by the
0:06:57	fact that the points
0:06:58	have some variation they are not the same point if we could have a very
0:07:02	robust system then all the points would collapse to a single point and that's
0:07:07	the goal
0:07:09	so the idea is to try to find some projection like
0:07:13	is done in the
0:07:15	nap approach but here the projection is on the i-vector space
0:07:19	that would effectively reduce the variability in this plda space
0:07:24	and what we do in this work is we do it independently for each one
0:07:28	of the hyper-parameters mu b and w so for each one we
0:07:33	just compute this projection or subspace and then we just combine them all
0:07:39	into a single one
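
The talk does not say how the per-hyper-parameter subspaces are merged. The sketch below shows one plausible option, assuming each subspace is given as a matrix of basis columns in the same i-vector coordinates: stack the bases, orthonormalize them, and remove their joint span. The function name `combine_subspaces` and the tolerance are hypothetical.

```python
import numpy as np

def combine_subspaces(bases, tol=1e-10):
    """Merge several nuisance subspaces into one removal projection.

    bases: list of (dim, k_j) arrays whose columns span each subspace.
    Returns a (dim, dim) matrix that removes the joint span (an assumption,
    not necessarily the combination rule used in the talk).
    """
    stacked = np.hstack(bases)                      # (dim, sum of k_j)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    basis = u[:, s > tol * s.max()]                 # orthonormal basis of the joint span
    return np.eye(stacked.shape[0]) - basis @ basis.T
```

Overlapping directions are counted once, so the number of removed dimensions can be smaller than the sum of the individual subspace sizes.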
0:07:42	okay
0:07:43	so the question is how can we do this if
0:07:46	we don't have the target data
0:07:48	well what we actually do is we use the data that we do have and
0:07:52	we hope that what we find generalizes to unseen data
0:07:57	of course if we have unseen data we can also apply it
0:08:00	on the unseen data
0:08:02	so what we do is we observe our development data and
0:08:07	the main point here is that contrary to what we generally
0:08:13	believed in the past our data is not homogeneous if it's homogeneous then
0:08:17	this would not work
0:08:20	so we observe the fact that the data is not homogeneous and we
0:08:24	try to partition it into distinct subsets and
0:08:30	the goal is that each subset is supposed to be quite
0:08:34	homogeneous
0:08:35	and the different subsets should be as far from
0:08:39	each other as possible so in practice what we did is we just observed
0:08:43	that the switchboard data consists of six different deliveries we
0:08:48	didn't even look at the labels
0:08:49	we just said okay we have six deliveries so let's make a partition
0:08:54	into six subsets we also tried to partition according to gender
0:08:59	and we also tried to see what happens if we have only two
0:09:01	partitions but here we selected them intentionally one to
0:09:07	be the landline data and one to be the cellular data
0:09:11	so we have these three different partitions
0:09:15	so now the method is the following first we estimate the projection
0:09:20	p
0:09:22	what we do is we take our development data and divide it into distinct subsets
0:09:27	now
0:09:29	in general this doesn't have to be the actual development data that we
0:09:33	intend to train the system on we can just try to collect other sources of data
0:09:37	which we believe will represent many types of
0:09:42	mismatch and we can try to apply it on
0:09:45	the collection of all the datasets that we managed to get our hands on
0:09:49	so once we have these subsets we can estimate the plda model for each
0:09:55	one
0:09:56	and then we find the projection as will be shown in the
0:10:00	next slides
0:10:01	now once we have found this projection p we just apply it on
0:10:05	all the i-vectors as a preprocessing step we do it before we do everything else before
0:10:11	centering whitening length normalization and everything
0:10:18	and then we just retrain our plda model
0:10:21	so the hope is that this projection just cleans up
0:10:24	in some sense
0:10:26	some of the dataset mismatch from the data
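
The ordering described above (projection first, then centering, whitening and length normalization) can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name `preprocess_ivectors` and its arguments are hypothetical, and the center and whitening matrix are assumed to have been estimated on already-projected training i-vectors.

```python
import numpy as np

def preprocess_ivectors(ivectors, projection, center, whitener):
    """Apply the IDVC projection before the usual i-vector preprocessing.

    ivectors:   (n, dim) array, one i-vector per row
    projection: (dim, dim) IDVC projection p
    center:     (dim,) mean estimated after projection (assumption)
    whitener:   (dim, dim) whitening transform estimated after projection (assumption)
    """
    x = ivectors @ projection.T          # 1. remove the dataset-sensitive subspace
    x = x - center                       # 2. centering
    x = x @ whitener.T                   # 3. whitening
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms                     # 4. length normalization
```

A PLDA model would then be retrained on the output of this function, as the talk describes.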
0:10:31	okay so first how can we do it for the mu hyper-parameter it's very
0:10:36	simple we just take the collection of centers that we gather from the different
0:10:41	datasets and we apply pca on that and construct the projection from the top
0:10:47	eigenvectors
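
A minimal sketch of this step, assuming the per-subset centers are stacked as rows of an array: PCA is done via an SVD of the mean-removed centers, and the projection removes the top k principal directions. The function name `idvc_mu_projection` and the choice of SVD for the PCA are illustrative assumptions.

```python
import numpy as np

def idvc_mu_projection(centers, k):
    """Build a projection that removes the top-k PCA directions of the centers.

    centers: (n_subsets, dim) array of per-subset i-vector means
    k:       number of directions to remove (at most n_subsets - 1 are informative)
    """
    centers = np.asarray(centers, dtype=float)
    deviations = centers - centers.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(deviations, full_matrices=False)
    v = vt[:k].T                                   # (dim, k), orthonormal columns
    return np.eye(centers.shape[1]) - v @ v.T
```

With k equal to the number of subsets minus one, all variation between the training centers is removed; smaller k keeps more of the i-vector space.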
0:10:49	now for the matrices b and w
0:10:53	it's a bit different and we show how it's done for w
0:10:56	and the same can be done for b
0:10:59	so basically given a set of covariance matrices w sub i we have one for
0:11:04	each dataset
0:11:05	we also define the mean covariance w bar
0:11:11	and now let us define a unit vector v which is a direction in the
0:11:15	i-vector space
0:11:17	now what we can do is we can compute the variance
0:11:23	of a given covariance matrix w sub i along this direction or the
0:11:27	projection of the covariance on this direction
0:11:30	this is
0:11:32	v transpose w sub i v
0:11:34	now the goal that we define here is that we want to find such
0:11:37	directions v
0:11:39	that maximize the variance of this quantity normalized by v transpose w bar
0:11:46	v so we normalize the variance along each direction by the
0:11:52	average variance
0:11:53	and we want to find directions that maximize this because if you find a direction
0:11:58	that maximizes this quantity it means that different plda models for different datasets
0:12:03	behave very differently along this direction in the i-vector space and
0:12:08	this we want to remove or maybe model in the future
0:12:12	but at the moment we want to remove it
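
Written out, the criterion described above could look as follows. Taking w bar to be the arithmetic mean over n datasets and taking the variance over the dataset index i are assumptions; the talk only says "mean covariance" and "variance of this quantity".

```latex
\bar{W} = \frac{1}{n} \sum_{i=1}^{n} W_i, \qquad
v^{*} = \arg\max_{\|v\| = 1} \; \operatorname{Var}_{i}\!\left[ \frac{v^{\top} W_i \, v}{v^{\top} \bar{W} \, v} \right]
```
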
0:12:15	okay so the algorithm to find this is quite straightforward we first whiten
0:12:20	the i-vector space
0:12:23	with respect to w bar
0:12:26	and then we just compute the sum of the squares of w sub
0:12:31	i and again find the top eigenvectors and
0:12:36	we construct the projection p to remove these eigenvectors
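
The three steps just described (whiten with respect to w bar, sum the squared whitened covariances, remove the top eigenvectors) can be sketched as below. This follows the spoken description only; the function names, the arithmetic mean for w bar, the symmetric inverse square root used for whitening, and returning the projection in whitened coordinates are all assumptions.

```python
import numpy as np

def inv_sqrt_spd(m):
    """Symmetric inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs / np.sqrt(vals)) @ vecs.T

def idvc_cov_projection(covs, k):
    """Find k directions along which the covariances differ most across datasets.

    covs: (n_subsets, dim, dim) array of per-subset covariance matrices w_i
    Returns (whitener, projection); i-vectors would be mapped as
    projection @ whitener @ x under the assumptions stated above.
    """
    covs = np.asarray(covs, dtype=float)
    mean_cov = covs.mean(axis=0)                    # w bar (arithmetic mean, assumed)
    whitener = inv_sqrt_spd(mean_cov)
    whitened = whitener @ covs @ whitener           # each w_i in the whitened space
    sum_sq = (whitened @ whitened).sum(axis=0)      # sum of squares of whitened w_i
    _, vecs = np.linalg.eigh(sum_sq)                # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :k]                      # top-k eigenvectors
    projection = np.eye(covs.shape[1]) - top @ top.T
    return whitener, projection
```

The same function could be called with the per-subset b matrices, or with total covariance matrices when speaker labels are unavailable.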
0:12:41	the proof that this is actually the right thing to do
0:12:45	i won't go over it because we have lunch
0:12:49	anyway
0:12:50	but it's in the paper and it's very simple
0:12:53	it's immediate
0:12:54	okay so now just one thing that may
0:13:00	be quite important what happens if we want to use other data sources to
0:13:05	model the mismatch but maybe for those data sources we
0:13:10	don't have speaker labels
0:13:12	so to estimate the mu hyper-parameter it's quite easy we don't need
0:13:17	speaker labels at all but for w and b we do need them so
0:13:21	what we can do in those cases is we can just replace
0:13:26	w and b with t the total covariance matrix and of course we
0:13:29	can estimate it without speaker labels
0:13:32	and what we can see here is that for typical datasets where
0:13:37	there is a large number of speakers t can be approximated
0:13:42	by w plus b
0:13:44	okay so
0:13:46	so if that is the case it means that if we have high
0:13:49	inter-dataset variability in some directions for t
0:13:54	then the same direction v would actually be optimal
0:13:59	or close to optimal also for either w or b and
0:14:04	vice versa so instead of finding this subspace for w and for b
0:14:08	we can just find it for t and it will be practically almost as good
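
The relation between t, w and b can be checked numerically. The sketch below uses count-weighted sample scatter matrices, for which the total scatter equals within plus between exactly; the PLDA hyper-parameters in the talk are defined slightly differently, which is why the speaker calls it an approximation. The function name `scatter_decomposition` is hypothetical.

```python
import numpy as np

def scatter_decomposition(x, labels):
    """Return (total, within, between) sample scatter matrices, each divided by n.

    x:      (n, dim) array of i-vectors
    labels: (n,) speaker labels; note that `total` does not use them
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n, dim = x.shape
    mu = x.mean(axis=0)
    total = (x - mu).T @ (x - mu) / n               # needs no speaker labels
    within = np.zeros((dim, dim))
    between = np.zeros((dim, dim))
    for spk in np.unique(labels):
        xs = x[labels == spk]
        ms = xs.mean(axis=0)
        within += (xs - ms).T @ (xs - ms) / n
        between += len(xs) * np.outer(ms - mu, ms - mu) / n
    return total, within, between
```

Since the total matrix is computed from the data alone, it can stand in for w and b when only unlabeled data is available for a subset.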
0:14:15	okay so now results
0:14:17	first a few results using only the mu
0:14:22	hyper-parameter the center hyper-parameter we can see here in the blue curve
0:14:28	the results using this approach for a plda system trained on switchboard
0:14:34	we see here that we started with eight point two at first we
0:14:37	have a slight degradation because we remove the gender we
0:14:43	find out that the first dimension that we remove is
0:14:49	actually the gender information
0:14:52	but then we start to get gains
0:14:55	and we get an equal error rate of three point eight
0:14:58	we also get nice improvements for the dcf what we see here in
0:15:02	the red curve and the black curve is what happens if we
0:15:06	apply the same idvc method for the mixer-based build
0:15:13	so
0:15:13	the red and the black curves are when we train a
0:15:18	plda model on mixer and still we want to apply idvc to see
0:15:22	maybe we're getting some gains or at least we're not losing anything so what we
0:15:25	can say in general is that
0:15:28	nothing much really happens here when we apply idvc on the
0:15:33	mixer build at least we're not losing
0:15:39	okay so now the same thing is being done but for the
0:15:43	w hyper-parameter and for the b hyper-parameter
0:15:46	we can see here in blue
0:15:49	and light blue what happens when we train a system on switchboard and apply
0:15:54	idvc
0:15:55	on either one of these hyper-parameters
0:15:58	we see that we get very large improvements even larger than we get for the
0:16:01	center hyper-parameter
0:16:04	and it's up to around i don't know one hundred dimensions and then
0:16:09	we start to get degradation
0:16:11	so if we remove too much
0:16:13	too many dimensions then we start getting degradation
0:16:16	and
0:16:18	for the center hyper-parameter we cannot remove too much because if we have for
0:16:21	example only twelve subsets then we can remove only eleven dimensions
0:16:27	but for w and b we can actually remove up to four hundred
0:16:31	so it's a bit different
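
The limit of eleven dimensions for twelve subsets follows from the rank of the mean-removed centers: n points span at most n minus one directions around their mean. A small check of this, assuming 400-dimensional i-vectors (suggested by the "four hundred" above but not stated explicitly) and random stand-in centers:

```python
import numpy as np

def max_removable_dims(centers):
    """Number of informative PCA directions in a set of subset centers."""
    centers = np.asarray(centers, dtype=float)
    deviations = centers - centers.mean(axis=0)
    return int(np.linalg.matrix_rank(deviations))

# Twelve synthetic centers in a 400-dimensional space (dimension is an assumption).
rng = np.random.default_rng(3)
example_centers = rng.normal(size=(12, 400))
print(max_removable_dims(example_centers))
```

Covariance matrices carry a full set of eigenvectors, so the w and b variants are not restricted in this way.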
0:16:34	okay now again in black and red we see what happens when
0:16:39	we apply the same idvc on the mixer-based plda system
0:16:44	we see here again that we get a very slight not significant improvement and
0:16:48	then we start getting degradation
0:16:50	and what we see here that is quite interesting is that after a dimension of around let's
0:16:55	say one hundred and fifty
0:16:57	all the systems actually behave the same so my interpretation is that
0:17:03	we actually managed to remove most or all of the dataset mismatch but we also
0:17:07	removed some of the good information from the system and therefore we
0:17:12	get degradation but
0:17:14	the systems actually behave roughly the same
0:17:18	after we remove a dimension of one hundred fifty
0:17:23	okay so now what happens when we combine everything together we started
0:17:28	from two point four for the mixer build and eight point two for the switchboard
0:17:32	build and for different partitions we get slightly different results between an equal error rate of
0:17:38	three and three point three
0:17:40	and if we just use the simplistic partition
0:17:44	into only two subsets and we use only the hyper-parameters
0:17:50	we can estimate without speaker labels mu and t we
0:17:55	get three point five so
0:17:58	the conclusion is that it actually works also without speaker labels
0:18:04	so to conclude we have shown that idvc can effectively reduce
0:18:09	the influence of dataset variability
0:18:12	at least for this particular setup for the domain robustness challenge
0:18:17	and we actually managed to recover roughly ninety percent of the
0:18:23	error
0:18:24	compared to the totally switchboard build
0:18:31	and also this idvc system works well even when trained on two subsets only
0:18:37	and without speaker labels
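
The "roughly ninety percent" figure can be reproduced from the equal error rates quoted in the talk (8.2 for the switchboard build, 2.4 for the mixer build, 3.0 to 3.3 with idvc, 3.5 without speaker labels). Reading it as the fraction of the gap between the two builds that is closed is an interpretation, not something the speaker spells out.

```latex
\frac{8.2 - 3.0}{8.2 - 2.4} = \frac{5.2}{5.8} \approx 0.90, \qquad
\frac{8.2 - 3.3}{8.2 - 2.4} = \frac{4.9}{5.8} \approx 0.84, \qquad
\frac{8.2 - 3.5}{8.2 - 2.4} = \frac{4.7}{5.8} \approx 0.81
```
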
0:18:39	okay
0:18:40	thank you
0:18:55	i wonder if you happen to know what would happen if you simply
0:18:58	projected away the leading say one hundred eigenvectors from the
0:19:06	total covariance of the training corpus
0:19:09	without bothering to train individual w and b matrices
0:19:15	so just taking the top eigenvectors of the covariance matrix
0:19:20	that's a method we call
0:19:25	nap total variability subspace removal we have
0:19:31	this in the icassp paper we've done that you also get gains but
0:19:35	not this much
0:19:37	not as large
0:19:39	if i understand correctly you're saying that the
0:19:42	benefit comes from trying to
0:19:45	optimize the within-class covariance matrices across the datasets
0:19:52	trying to make those look the same
0:19:55	so the w matrices
0:19:58	okay so basically yes apparently
0:20:03	the hypothesis is in some sense
0:20:08	reasonable that some directions in the i-vector space are more sensitive to mismatch
0:20:13	dataset mismatch and some are not
0:20:16	and
0:20:17	it's hard to observe it from the data unless you do something like we did
0:20:28	did you look at simply just doing per-dataset whitening
0:20:32	so you didn't do the whitening you did centering but we did also
0:20:36	whitening it didn't change the performance
0:20:39	right
0:20:40	well that's contrary to other sites right
0:20:44	i think the whitening is generally useful
0:20:46	so one question would be if you did whitening per
0:20:50	dataset
0:20:51	would you get the same effect in a soft form versus the
0:20:55	projecting away
0:20:57	i tried i didn't try it in many ways i tried a very quick
0:21:00	experiment in it's like wrote down the total yes i don't know if it's maybe
0:21:04	something i have to do more carefully but
0:21:07	digital
0:21:15	just a question if you do the projection on the data to train plda are your within
0:21:19	and between matrices becoming singular or something
0:21:22	that's like the fifth time i hear this question
0:21:26	so basically it's the same as when you apply lda for example before you
0:21:31	go on to build a plda
0:21:36	so either you
0:21:41	can actually move to a lower dimension remove these dimensions and do everything in the
0:21:46	lower dimension
0:21:47	or you can do some tricks to fix it
0:21:53	like add some
0:21:55	some quantities to the
0:21:58	to the covariance matrices
0:22:03	i noticed in your paper that you
0:22:06	contrasted against source normalization and that was presented in the icassp paper
0:22:11	unfortunately i can't access it here i was trying to look it up as we went along
0:22:15	and we've also taken source normalization and extended it a bit further
0:22:20	the reason i bring this up is
0:22:23	in this context the datasets are speaker-disjoint across different datasets what about the
0:22:28	context where perhaps i have a system trained on telephone speech for a
0:22:33	certain number
0:22:34	of speakers and then you suddenly acquire data from those
0:22:37	same speakers in a different channel
0:22:40	what's going to happen in terms of i mean in terms of the
0:22:45	testing side of things you acquire more data for those speakers
0:22:50	from a microphone channel perhaps
0:22:52	and then you also acquire microphone data from a different set of speakers for training or adapting
0:22:57	the system
0:22:58	it appears that at this point the within-class variations are estimated independently on each
0:23:04	dataset
0:23:06	so
0:23:07	does that mean that the difference between those datasets is going to be suppressed
0:23:14	or is it actually maintained
0:23:16	under this framework
0:23:19	i think it's not very sensitive
0:23:23	because it looks at very broad hyper-parameters so it doesn't really matter if it's
0:23:27	the same speaker or not
0:23:29	okay maybe we can discuss it offline a bit
0:23:37	the final question
0:23:40	so
0:23:41	in the past
0:23:42	we started dealing with the channel problem
0:23:45	by projecting away stuff
0:23:49	and then with the development of softer approaches we started modeling the stuff instead of
0:23:56	explicitly projecting it away like in jfa and plda so
0:24:01	can you imagine a probabilistic model which
0:24:06	includes a variability component for the dataset actually thanks for the question i was thinking maybe to add
0:24:12	another slide on that i actually tried to
0:24:15	extend the plda model to add another
0:24:21	component which is the dataset mismatch which
0:24:24	behaves differently from the other components and i ran some
0:24:30	experiments i got some improvement
0:24:33	compared to the baseline but not as good as i got using this nap
0:24:37	approach
0:24:39	and i have some ideas why this is the case
0:24:43	but i would not be surprised if someone just does it in a different way and
0:24:48	gets better results
0:24:53	okay

Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition

Speaker Modeling II

Hagai Aronowitz