0:00:14 okay, so I'm Hagai, and the title of my talk is "Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition"
0:00:24 but I hope that the presentation is much simpler than this long title
0:00:30 okay, so the problem setup is quite similar to what we have already heard today, but with some twist
0:00:37 so, as we already heard, state-of-the-art speaker recognition can be quite accurate when trained on a lot of matched data
0:00:48 but in practice, many times we don't have the luxury to do that; so we have a lot of labeled data from a source domain, namely NIST data
0:00:58 and we may have limited unlabeled data, or none at all, from our target domain, which may be very different in many senses from the source domain
0:01:10 and this work addresses the setup where we don't have data at all from the target domain; okay, so it's a bit different from what we heard so far
0:01:21 and of course there are related applications; the first one is when you do have some small labeled dataset to adapt your model
0:01:31 but the problem is that that's not always the case, because for many applications you don't have data at all
0:01:36 or you may have a very limited amount of data, and you may want to be able to adapt properly using such scarce data
0:01:46 now, the second related setting is when you have a lot of unlabeled data to adapt the models, and that's what we already heard today
0:01:57 but for many applications, for example for text-dependent speaker recognition, and in other cases, you are not likely to have a lot of unlabeled data
0:02:07 and also, one of the problems with the methods that we already heard today is that most of them are based on clustering
0:02:16 and clustering may be a tricky thing to do, especially when you don't have the luxury of seeing the results of your clustering algorithm, and you cannot evaluate it if you don't really have labeled data from your target domain
0:02:34 so we don't really like to use clustering
0:02:39 okay, so this work started during the recent JHU speaker verification workshop
0:02:49 and it is a variant of the domain adaptation challenge that I call the domain robustness challenge
0:02:55 so the twist here is that we do not use adaptation data at all
0:02:59 that is, we don't use the mixer data at all, not even for centering or whitening the data
0:03:08 and we can see here some baseline results: if we train our system on mixer we get an equal error rate of two point four, and if we train only on switchboard, without any use of mixer for centering, we get eight point two
0:03:24 and the challenge is to try to bridge this gap
0:03:29 okay, so we are all familiar with PLDA modeling; just for notation, we parameterize the PLDA model by three hyperparameters: mu, which is the center or the mean of the distribution of all i-vectors; B, which is the between-speaker covariance matrix, or across-class covariance matrix; and W, which is the within-speaker covariance matrix
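For reference, these three hyperparameters can be estimated from labeled i-vectors roughly as follows; this is a minimal numpy sketch with my own naming and normalization choices, not the speaker's actual implementation:

```python
import numpy as np

def plda_hyperparams(ivectors, speaker_ids):
    """Estimate (mu, B, W): global mean, between-speaker covariance,
    and within-speaker covariance, from labeled i-vectors."""
    X = np.asarray(ivectors, dtype=float)
    ids = np.asarray(speaker_ids)
    mu = X.mean(axis=0)                       # center of all i-vectors
    speakers = np.unique(ids)
    dim = X.shape[1]
    B = np.zeros((dim, dim))                  # between-speaker (across-class)
    W = np.zeros((dim, dim))                  # within-speaker
    for s in speakers:
        Xs = X[ids == s]
        ms = Xs.mean(axis=0)
        d = ms - mu
        B += np.outer(d, d)                   # spread of speaker means
        C = Xs - ms
        W += C.T @ C                          # spread around each speaker mean
    B /= len(speakers)
    W /= len(X)
    return mu, B, W
```

In the method that follows, one such triplet (mu, B, W) is computed per dataset, and the variability of those triplets across datasets is what gets compensated.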
0:03:57 okay, so just before describing the method, I want to present some interesting experiments on the data
0:04:06 so if we estimate a PLDA model from switchboard, and we also do the centering on switchboard, then I get an equal error rate of eight point two
0:04:17 and using mixer to center the data helps, but not that much
0:04:24 okay, if we just use the NIST SRE ten data to center the evaluation data, then we get a very large improvement; so from here we can see that the centering is a big issue
0:04:39 however, if we do the same experiment on mixer, we see that centering with the NIST SRE ten data doesn't help that much
0:04:49 so the conclusions are quite mixed
0:04:51 basically, what we can say is that centering can account for some of the mismatch, but in quite a complicated way, and it's not really clear when centering will hurt or not
0:05:06 because we can see that centering with mixer is not good enough for switchboard, but it is quite good enough for the mixer build
0:05:15 okay, so the proposed method, which is partly unrelated to these experiments and their upshot, is the following: the basic idea is that we hypothesize that some directions in the i-vector space are more important, and actually account for most of the dataset mismatch
0:05:35 this is the underlying assumption
0:05:38 and we want to find these directions and remove this subspace using a projection, which we will call P
0:05:45 and we call this method inter-dataset variability compensation, or IDVC
0:05:50 we actually presented this at the recent ICASSP, but what we did there was focus only on the center hyperparameter and try to estimate everything from the center hyperparameter
0:06:03 and what we do here is also address the other hyperparameters, mu, B and W; so basically, we want to find a projection that minimizes the variation, the variability, of these hyperparameters when trained over different datasets
0:06:20 that's the basic idea
0:06:22 okay, so how can we find this projection P?
0:06:25 well, given a set of datasets, we represent each one by a vector in the hyperparameter space, the triplet mu, B and W
0:06:35 so we can think of, for example, switchboard as one point in this space, and mixer as another point; or we can also look at different components of switchboard and mixer, different years, different deliveries, and each one can be represented as a point in this space
0:06:52 and the dataset mismatch problem is actually illustrated by the fact that these points have some variation, they are not the same point; if we had a very robust system, all the points would collapse into a single point, and that's the goal
0:07:09 so the idea is to try to find some projection, like is done in the NAP approach, but here the projection is on the i-vector space, that would effectively reduce the variability in this PLDA space
0:07:24 and in this work we do it independently for each one of the hyperparameters mu, B and W; for each one we compute this projection, or subspace, and then we combine them all into a single one
0:07:43 so the question is, how can we do this if we don't have the target data?
0:07:48 well, what we actually do is use the data that we do have, and we hope that what we find generalizes to unseen data
0:07:57 of course, if we do have some target data, we can also apply it there, not only on unseen data
0:08:02 so what we do is observe our development data, and the main point here is that, contrary to what we generally believed in the past, our data is not homogeneous; if it were homogeneous, this would not work
0:08:20 so we exploit the fact that the data is not homogeneous, and we try to divide it, to partition it, into distinct subsets; the goal is that each subset should be quite homogeneous
0:08:35 and the different subsets should be as far from each other as possible; in practice, what we did is observe that the switchboard data consists of six different deliveries, and we didn't even look at the labels
0:08:49 we just said, okay, we have six deliveries, so let's make a partition into six subsets; we also tried to partition according to gender
0:08:59 and we also tried to see what happens if we have only two partitions, but here we selected them intentionally, one to be the landline data and one to be the cellular data
0:09:11 so we have these three different partitions
0:09:15 so now, the outline of the method is as follows: first, we estimate the projection
0:09:22 what we do is take our development data and divide it into distinct subsets
0:09:27 now, in general, this doesn't have to be the actual development data that we intend to train the system on; we can just try to collect other sources of data which we believe represent many types of mismatch, and apply the method on the collection of all the datasets that we managed to get our hands on
0:09:49 so once we have these subsets, we can estimate a PLDA model for each
0:09:56 and then we find the projection, in the way that will be shown in the next slides
0:10:01 now, once we have found this projection P, we just apply it on all our i-vectors as a preprocessing step; we do it before everything else, before mean centering, whitening, length normalization and everything
0:10:18 and then we just retrain our whole PLDA model
0:10:21 so the hope is that this projection, in some sense, cleans up some of the dataset mismatch
0:10:31 okay, so first, how do we do it for the mu hyperparameter? it's very simple: we just take the collection of centers that we gathered from the different datasets, apply PCA on them, and construct the projection from the top eigenvectors
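The PCA-on-centers step might look like this (a hedged numpy sketch; the function name is mine, and `n_remove`, the number of top directions to project away, is a choice the talk explores later in the results):

```python
import numpy as np

def idvc_projection_from_centers(centers, n_remove):
    """PCA on the per-dataset i-vector means; return P = I - U U^T,
    which removes the n_remove top-variance directions."""
    M = np.asarray(centers, dtype=float)        # one row per dataset center
    Mc = M - M.mean(axis=0)                     # center the collection of means
    # Right singular vectors = principal directions of the dataset centers
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    U = Vt[:n_remove].T                         # top principal directions
    return np.eye(M.shape[1]) - U @ U.T         # projection removing them
```

Applying `P @ x` to every i-vector then removes the subspace in which the dataset centers differ most.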
0:10:49 now, for the matrices B and W it's a bit different; I will show how it's done for W, and the same can be done for B
0:10:59 so basically, we are given a set of covariance matrices W sub i, one for each dataset, and we also define the mean covariance W bar
0:11:11 now let us define a unit vector v, which is a direction in the i-vector space
0:11:17 what we can do is compute the variance of a given covariance matrix W sub i along this direction, the projection of the covariance onto this direction
0:11:30 this is v transpose W sub i times v
0:11:34 now, the goal we define here is to find the directions v that maximize the variance of this quantity, normalized by v transpose W bar v; so we normalize the variance along each direction by the average variance
0:11:53 and we want to find directions that maximize this, because if you find a direction that maximizes this quantity, it means that the PLDA models for different datasets behave very differently along this direction in the i-vector space, and this is what we want to remove, or maybe model in the future
0:12:12 but for the moment, we want to remove it
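Written out, the objective just described would be something like the following (my formalization of the wording above, not a formula shown on the slides):

```latex
v^{*} \;=\; \arg\max_{\lVert v \rVert = 1}\;
\operatorname{Var}_{i}\!\left( \frac{v^{\top} W_{i}\, v}{v^{\top} \bar{W}\, v} \right),
\qquad
\bar{W} \;=\; \frac{1}{n} \sum_{i=1}^{n} W_{i}
```

where the variance is taken over the n datasets; the top-scoring directions form the subspace removed by P.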
0:12:15 okay, so the algorithm to find this is quite straightforward: we first whiten the i-vector space with respect to W bar
0:12:26 and then we just compute the sum of the squares of the W sub i, find the top eigenvectors, and construct the projection P that removes these eigenvectors
0:12:41 the proof that this is actually the right thing to do is quite short, but I won't go over it because we have lunch waiting
0:12:50 it's in the paper, and it's very simple, it's immediate
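The whiten-then-eigendecompose recipe just described might be sketched as follows; this is my reading of the steps stated in the talk (the exact derivation and proof are in the paper), with my own function and variable names:

```python
import numpy as np

def idvc_projection_from_covariances(W_list, n_remove):
    """Find directions along which the per-dataset covariances W_i vary
    most relative to their mean, and build a projection removing them."""
    W_list = [np.asarray(W, dtype=float) for W in W_list]
    W_bar = sum(W_list) / len(W_list)           # mean covariance
    # Whitening transform T = W_bar^{-1/2}, so T W_bar T = I
    vals, vecs = np.linalg.eigh(W_bar)
    T = vecs @ np.diag(vals ** -0.5) @ vecs.T
    # Sum of squares of the whitened covariances
    S = sum((T @ W @ T) @ (T @ W @ T) for W in W_list)
    _, evecs = np.linalg.eigh(S)
    U = evecs[:, -n_remove:]                    # top eigenvectors of S
    # Projection acting in the whitened space
    P_white = np.eye(W_bar.shape[0]) - U @ U.T
    return P_white, T
```

Here the removed directions live in the whitened space, so the projection would be applied to whitened i-vectors, `P_white @ (T @ x)`; whether the original method folds the whitening back into the raw i-vector space is a detail I cannot confirm from the talk alone.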
0:12:54 okay, so now just one more thing that may be quite important: what happens, again, if we want to use other data sources to model the mismatch, but for those data sources we don't have speaker labels?
0:13:12 well, to estimate the mu hyperparameter it's quite easy, we don't need speaker labels at all; but for W and B we do need them, so what we can do in those cases is just replace W and B with T, the total covariance matrix, which of course we can estimate without speaker labels
0:13:32 and what we can see is that for typical datasets, where there is a large number of speakers, T can be approximated by W plus B
0:13:44 okay, so if that is the case, it means that if we have high inter-dataset variability for some directions of T, then the same directions, the same v, would be approximately optimal to model for either W or B as well, and vice versa; so instead of finding this subspace for W and for B, we can just find it for T, and it will be practically almost as good
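The claim that T is approximately B plus W when there are many speakers is easy to sanity-check with a toy simulation (my own construction, sampling from a Gaussian model with between- and within-speaker covariances; not data from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_spk, per_spk, dim = 500, 10, 4
B_true = np.diag([2.0, 1.0, 0.5, 0.25])   # between-speaker covariance
W_true = 0.5 * np.eye(dim)                # within-speaker covariance

# Speaker means drawn from B_true, sessions scattered around them by W_true
means = rng.multivariate_normal(np.zeros(dim), B_true, size=n_spk)
X = np.repeat(means, per_spk, axis=0) + \
    rng.multivariate_normal(np.zeros(dim), W_true, size=n_spk * per_spk)

# Total covariance: needs no speaker labels at all
T_emp = np.cov(X, rowvar=False)
# With many speakers, T_emp is close to B_true + W_true
print(np.abs(T_emp - (B_true + W_true)).max())
```

This is why, without speaker labels, the subspace can be estimated from T alone.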
0:14:15 okay, so now the results
0:14:17 first, a few results using only the mu hyperparameter, the center hyperparameter; we can see in the blue curve the results using this approach for a PLDA system trained on switchboard
0:14:34 we see that we started with eight point two; at first we have a slight degradation, because the first dimension that we remove turns out to be the gender information
0:14:52 but then we start to get gains
0:14:55 and we get an equal error rate of three point eight
0:14:58 we also get nice improvements in the DCF; and in the red curve and the black curve we see what happens if we apply the same IDVC method to the mixer-based build, so the red and black curves are when we train a PLDA model on mixer, and we still want to apply IDVC to see whether we get some gains, or at least don't lose anything; what we can see, in general, is that nothing much really happens when you apply IDVC on the mixer build, so at least we're not losing
0:15:39 okay, so now the same thing is done, but instead for the W hyperparameter and for the B hyperparameter
0:15:46 we can see in blue and light blue what happens when we train a system on switchboard and apply IDVC on either one of these hyperparameters
0:15:58 we see that we get very large improvements, even larger than we got for the center hyperparameter
0:16:04 and this goes on up to around one hundred dimensions, and then we start to get degradation
0:16:11 so if we remove too much, too many dimensions, then we start getting degradation
0:16:16 and note that for the center hyperparameter we cannot remove too much, because if we have, for example, only twelve subsets, then we can remove only eleven dimensions; but for W and B we can actually remove up to four hundred, so it's a bit different
0:16:34 okay, now in black and red we see what happens when we apply the same method to the mixer-based PLDA system
0:16:44 we see here again that we get a very slight, and I'm not sure significant, improvement, and then we start getting degradation
0:16:50 and what we see here, which is quite interesting, is that after a dimension of around, let's say, one hundred and fifty, all the systems actually behave the same; so my interpretation is that we managed to remove most or all of the dataset mismatch, but we also removed some of the good information from the system, and therefore we get some degradation; but the systems behave roughly the same after we remove a dimension of one hundred and fifty
0:17:23 okay, so what happens when we combine everything together? we started from the two point four of the mixer build and the eight point two of the switchboard build, and for the different partitions we get slightly different results, between an equal error rate of three and three point three
0:17:40 and if we just use the simplistic partition into only two subsets, and use only the hyperparameters that we can estimate without speaker labels, mu and T, we get three point five; so the conclusion is that it actually works also without speaker labels
0:18:04 so, to conclude, we have shown that IDVC can effectively reduce the influence of dataset variability
0:18:12 at least for this particular setup, for the domain robustness challenge
0:18:17 and we actually managed to recover roughly ninety percent of the error, compared to the purely switchboard build
0:18:31 and also, this IDVC method works well even when trained on two subsets only, and without speaker labels
0:18:40 thank you
0:18:55 Q: I wonder if you happen to know what would happen if you simply projected away the leading, sorry, one hundred eigenvectors of the total covariance of the training set corpus, without bothering to train the individual W and B matrices
0:19:15 A: you mean to just take the top of the total covariance and remove those dimensions? that's a method we call NAP, or total variability subspace removal; we have this in the ICASSP paper, we've done that, and you also get gains, but not this much, the gains were not as large
0:19:39 Q: if I understand correctly, you're saying the main benefit comes from trying to optimize the within-class covariance matrices across the datasets, trying to make them all look the same, so that the W matrices...
0:20:03 A: okay, so basically you're saying that this process is in some sense reasonable, that some directions in the i-vector space are more sensitive to mismatch, dataset mismatch, and some are not
0:20:17 it's just hard to observe it from the data, unless you do something like we did
0:20:28 Q: did you look at something simpler, like just doing per-dataset whitening? rather than only the per-dataset centering that you did, if you did whitening as well, would it change the performance?
0:20:40 A: we tried only whitening, and it didn't change the performance
0:20:44 Q: well, that's contrary to other sites' results, right, I think whitening is generally useful; so one question would be, if you did whitening per dataset, would you get the same effect in a soft form, versus the projection away?
0:20:57 A: I tried, not exactly that, but I tried a very quick experiment and it didn't really help; I don't know, maybe it's something I have to do more carefully
0:21:15 Q: just a question, if you do the projection on the data used to train the PLDA, do your within and between covariances become singular or something?
0:21:22 A: that's like the fifth time I hear this question, and that's why I'm prepared
0:21:26 so basically, it's the same as when you apply LDA, for example, before you build a PLDA model
0:21:36 so either you can just remove these dimensions and move to a lower dimension, and do everything in the lower dimension
0:21:47 or you can do some tricks to fix it, like adding some quantities to the covariance matrices
0:22:03 Q: I noticed in your paper that you contrasted against source normalization, and that was the parallel ICASSP paper; unfortunately I can't access it here, I was trying to look it up as we went along, and I wondered if you had also taken source normalization and extended that further
0:22:20 the reason I bring this up is that in this context the datasets are speaker-disjoint; what about the context where perhaps I have a system trained on telephone speech for a certain number of speakers, and then you suddenly acquire data from those same speakers in a different channel?
0:22:40 what's going to happen, I mean, on the testing side of things you acquire more data from the same speakers, from a microphone channel perhaps, and then you also acquire microphone data from a different set of speakers for training, for adapting the system
0:22:58 it appears that at this point the within-class variations are estimated independently on each dataset; does that mean that the difference between those datasets is going to be suppressed, or is it actually maintained under this framework?
0:23:19 A: I think it's not very sensitive, because the method looks at very broad hyperparameters, so it doesn't really matter if it's the same speaker or not
0:23:29 okay, maybe we can discuss it offline a bit
0:23:37 Q: the final question: in the past, we started dealing with the channel problem by projecting away stuff, as in NAP
0:23:49 and then the developments led to approaches like JFA, where we started modeling the stuff instead of explicitly projecting it away
0:24:01 so can you imagine a probabilistic model which includes a variability component for the dataset?
0:24:06 A: actually, thanks for the question, I was thinking maybe to add another slide on that; I actually tried to extend the PLDA model with another component, the dataset mismatch, which behaves a bit differently than the other components, and I ran some experiments; I got some improvement compared to the baseline, but not as good as I got using this NAP-style projection
0:24:39 and I have some ideas about why this is the case, but I would not be surprised if someone just did it in a different way and got better results