0:00:14 | okay so I'm Hagai and the title of my talk is compensating inter-dataset variability

0:00:19 | in PLDA hyper-parameters for robust speaker recognition

0:00:24 | but I hope that the presentation is much simpler than this long title

0:00:30 | okay so the problem setup is quite similar to what we have already

0:00:35 | heard today but with a twist

0:00:37 | so as we already heard, state-of-the-art speaker recognition can be quite

0:00:42 | accurate

0:00:44 | when trained on a lot of matched data

0:00:48 | but in practice many times we don't

0:00:50 | have the labels to do that, so we have a lot of labeled data from

0:00:55 | a source domain

0:00:56 | namely NIST data

0:00:58 | and we may have limited unlabeled data or no data at all from our target domain which

0:01:04 | may be very different

0:01:07 | in many senses from the

0:01:10 | source domain, and this work addresses the setup where we don't have data

0:01:14 | at all

0:01:16 | from the target domain, so it's a bit different from what we heard

0:01:19 | so far

0:01:21 | and of course there are related applications, the first

0:01:26 | one is when you do have some small labeled data set to adapt your model

0:01:31 | but the problem is that that's not always the case, because many times you don't

0:01:35 | have data at all

0:01:36 | or you may have a very limited amount of data and it may

0:01:40 | be you want to be able to adapt properly using such scarce data

0:01:46 | now the second related setting is when you have lots of

0:01:51 | unlabeled data to adapt the models, and that's what we already heard today

0:01:57 | but for many applications, for example for text-dependent systems, and in other cases

0:02:03 | you can't always get a lot of unlabeled data either

0:02:07 | and also one of the problems with all the methods that

0:02:11 | we already heard today is that most of them are based on clustering

0:02:16 | and clustering may be a tricky thing to do, especially when you don't have

0:02:22 | the luxury to

0:02:24 | see the results of your clustering algorithm and you cannot tune

0:02:28 | it if you don't really have labeled data from your target domain

0:02:34 | so we would rather not use clustering

0:02:39 | okay so |

0:02:41 | so this work was done, it started, during a recent JHU

0:02:46 | speaker verification workshop

0:02:49 | and this is a variant of the domain adaptation challenge

0:02:52 | I call it the domain robustness challenge

0:02:55 | so the twist here is that we do not use adaptation data at all

0:02:59 | that is, we don't use the mixer data at all, even for centering

0:03:05 | or whitening the data

0:03:08 | and again we can see here some baseline results, we see that

0:03:11 | if we train our system on mixer we get an equal error rate of two point

0:03:16 | four, if we train only on switchboard without any use of mixer

0:03:20 | even for centering

0:03:22 | we get eight point two

0:03:24 | and the challenge is to try to bridge this gap

0:03:29 | okay so we are all familiar here with PLDA modeling, just for notation we parameterize

0:03:36 | the PLDA model by three hyper-parameters: by mu, which is the center or the

0:03:41 | mean of the distribution of all i-vectors, B, which is the

0:03:46 | between-speaker covariance matrix or across-class covariance matrix, and W, which is

0:03:53 | the within-speaker covariance matrix
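
As a concrete reference point for this parameterization, here is a minimal numpy sketch of moment-based estimates of mu, B and W from labeled i-vectors. The function name is mine, and real PLDA systems typically fit these hyper-parameters with EM rather than plain moment statistics:

```python
import numpy as np

def plda_hyperparams(X, speaker_ids):
    """Moment-based estimates of the three PLDA hyper-parameters.

    X: (n, d) array of i-vectors; speaker_ids: length-n labels.
    Returns mu (center), B (between-speaker covariance),
    W (within-speaker covariance).
    """
    mu = X.mean(axis=0)
    groups = {}
    for x, s in zip(X, speaker_ids):
        groups.setdefault(s, []).append(x)
    means, resids = [], []
    for utts in groups.values():
        utts = np.asarray(utts)
        m = utts.mean(axis=0)
        means.append(m)
        resids.append(utts - m)
    means = np.stack(means)
    resid = np.concatenate(resids)
    # within-speaker covariance: one mean is estimated per speaker,
    # so the degrees of freedom are n minus the number of speakers
    W = resid.T @ resid / (len(X) - len(groups))
    # between-speaker covariance from the spread of the speaker means
    B = np.cov(means - mu, rowvar=False)
    return mu, B, W
```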

0:03:57 | okay so just before describing the method I want to present some interesting

0:04:02 | experiments

0:04:04 | on the data

0:04:06 | so if we estimate a PLDA model from switchboard and we do

0:04:12 | the centering also on switchboard, then I get an equal error rate of

0:04:16 | eight point two

0:04:17 | and using mixer to center the data helps, but not that much

0:04:24 | okay, if we just use the NIST

0:04:27 | SRE'10 data to center

0:04:29 | the evaluation data, then we get a very large improvement, so from here

0:04:35 | we can see that the centering is a big issue

0:04:39 | now however if we do the same experiment on mixer we see here that centering

0:04:43 | with the NIST SRE'10 data doesn't help that much

0:04:48 | so

0:04:49 | the conclusions are quite mixed

0:04:51 | so basically what we can say is that centering really can account

0:04:57 | for some of the mismatch, but in a quite complicated way

0:05:01 | and it's not really clear when centering will help or not

0:05:06 | because we can see that centering with mixer is not good enough for the switchboard

0:05:10 | build, but it is quite good enough for the mixer build

0:05:15 | okay so the proposed method, which is partly unrelated to the experiments I

0:05:21 | just showed, is the following: the basic idea here is that we hypothesize

0:05:26 | that some

0:05:28 | directions in the i-vector space are more important, that is, they account for most

0:05:33 | of the dataset mismatch

0:05:35 | this is the underlying assumption

0:05:38 | and we want to find these directions and remove this subspace

0:05:42 | using a projection, which we will call P

0:05:45 | and we call this method inter-dataset variability compensation or I D

0:05:49 | V C

0:05:50 | and we actually presented this at the recent ICASSP, but what we

0:05:55 | did there is we just focused on the center hyper-parameter and tried to

0:06:00 | estimate everything from the center hyper-parameter

0:06:03 | and what we do here is we also look at the other hyper-

0:06:06 | parameters B and W, so basically what we want to do is we want to

0:06:11 | find a projection that minimizes the variability of these hyper-parameters when

0:06:17 | trained over different datasets

0:06:20 | that's the basic idea

0:06:22 | okay so how can we find this projection P

0:06:25 | well, given a set of datasets, we represent each one

0:06:29 | by a vector in the hyper-parameter space, that is mu, B and W

0:06:35 | so we can think of, for example, switchboard as one point in

0:06:39 | this space and mixer as one point, or we can also look at different

0:06:42 | components of switchboard and mixer, different years, different deliveries, and each one can

0:06:48 | be represented as a point in this space

0:06:52 | and the dataset mismatch problem is actually illustrated by the

0:06:57 | fact that the points

0:06:58 | have some variation, they are not the same point; if we could have a very

0:07:02 | robust system then all the points would collide into a single point, that's

0:07:07 | the goal

0:07:09 | so the idea is to try to find some projection like

0:07:13 | is done in the

0:07:15 | NAP approach, but here the projection is on the i-vector space

0:07:19 | such that it would effectively reduce the variability in this PLDA space

0:07:24 | and what we do in this work, we do it independently for each one

0:07:28 | of the hyper-parameters mu, B and W, so for each one we

0:07:33 | just compute this projection or subspace, and then we just combine them all

0:07:39 | into a single one
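
The talk doesn't spell out how the three per-hyper-parameter subspaces are merged; one simple, assumed recipe is to stack the removed directions, orthonormalize their union, and build a single removal projection. The function name is mine:

```python
import numpy as np

def combine_projections(direction_sets):
    """Merge nuisance subspaces found for mu, B and W into one projection.

    direction_sets: list of (d, k_j) arrays of removed directions,
    assumed linearly independent across sets. QR gives an orthonormal
    basis of the union; the returned P projects that union away.
    """
    V = np.hstack(direction_sets)
    Q, _ = np.linalg.qr(V)            # orthonormal basis of the union
    return np.eye(V.shape[0]) - Q @ Q.T
```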

0:07:42 | okay |

0:07:43 | so now the question is how can we do this if

0:07:46 | we don't have the target data

0:07:48 | well, what we actually do is we use the data that we do have and

0:07:52 | we hope that what we find generalizes to unseen data

0:07:57 | of course, if we do have some data we can also apply it

0:08:00 | on the unseen data

0:08:02 | so what we do is we observe our development data, and the main

0:08:07 | point here is that, contrary to what we generally

0:08:13 | believed in the past, our data is not homogeneous; if it were homogeneous then

0:08:17 | this would not work

0:08:20 | so we observe the fact that the data is not homogeneous and we

0:08:24 | try to divide it, to partition it into distinct subsets, and

0:08:30 | the goal is that each subset is supposed to be quite

0:08:34 | homogeneous

0:08:35 | and the different subsets should be as far from

0:08:39 | each other as possible; so in practice what we did, we just observed

0:08:43 | that the switchboard data consists of six different deliveries, we

0:08:48 | didn't even look at the labels

0:08:49 | we just said okay, we have six deliveries, so let's make a partition

0:08:54 | into six subsets; we also tried to partition according to gender

0:08:59 | and we also tried to see what happens if we have only two

0:09:01 | partitions, but here we selected them intentionally, one to

0:09:07 | be the landline data and one to be the cellular data

0:09:11 | so we have these three different partitions

0:09:15 | so now the outline of the method is the following: first we estimate the projection

0:09:20 | P

0:09:22 | what we do is we take our development data and divide it into distinct subsets

0:09:27 | now, in fact, in

0:09:29 | general this doesn't have to be the actual development data that we

0:09:33 | intend to train the system on, we can just try to collect other sources of data

0:09:37 | which we believe will represent many types of

0:09:42 | mismatch, and we can try to apply it on

0:09:45 | the collection of all the datasets that we managed to get our hands on

0:09:49 | so once we have these subsets we can estimate the PLDA model for each

0:09:55 | one

0:09:56 | and then we find the projection, as shown in the

0:10:00 | next slides

0:10:01 | now once we have found this projection P, we just apply it on

0:10:05 | our i-vectors as a preprocessing step, we do it before everything else, before

0:10:11 | any centering, whitening, length normalization and everything

0:10:18 | and then we can just retrain our PLDA model

0:10:21 | so the hope is that this projection just cleans up

0:10:24 | in some sense

0:10:26 | some of the dataset mismatch from the data

0:10:31 | okay, so first how can we do it for the mu hyper-parameter; it's very

0:10:36 | simple, we just take the collection of centers that we gather from the different data-

0:10:41 | sets and we apply PCA on that and construct the projection from the top

0:10:47 | eigenvectors
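
The PCA-on-centers step for mu can be sketched in a few lines of numpy. Note that m subset centers span at most m - 1 directions, so at most that many can be removed this way. The function name is illustrative:

```python
import numpy as np

def idvc_mu_projection(ivectors_per_subset, k):
    """IDVC projection for the center hyper-parameter mu.

    ivectors_per_subset: list of (n_i, d) arrays, one per data subset.
    k: number of nuisance directions to remove (at most
    len(ivectors_per_subset) - 1 are meaningful).
    """
    centers = np.stack([X.mean(axis=0) for X in ivectors_per_subset])
    centered = centers - centers.mean(axis=0)
    # PCA of the subset centers via SVD; rows of vt are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    V = vt[:k].T                          # (d, k) top directions
    return np.eye(centers.shape[1]) - V @ V.T
```

The returned P is applied to every i-vector (x' = P x) before centering, whitening and PLDA training.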

0:10:49 | now for the matrices B and W

0:10:53 | it's a bit different, and I will show how it's done for W

0:10:56 | and the same can be done for B

0:10:59 | so basically, given a set of covariance matrices W sub i, we have one for

0:11:04 | each dataset

0:11:05 | we also define the mean covariance W bar

0:11:11 | and now let us define a unit vector v, which is a direction in the

0:11:15 | i-vector space

0:11:17 | now what we can do, we can compute the variance

0:11:23 | of a given covariance matrix W sub i along this direction, or the

0:11:27 | projection of the covariance onto this direction

0:11:30 | this is the

0:11:32 | quantity v transpose W sub i v

0:11:34 | now the goal that we define here is that we want to find such

0:11:37 | directions v

0:11:39 | that maximize the variance of this quantity, normalized by v transpose W bar

0:11:46 | v, so we normalize the variance along each direction by

0:11:52 | the average variance

0:11:53 | and we want to find directions that maximize this, because if you find a direction

0:11:58 | that maximizes this quantity it means that the PLDA models for different datasets

0:12:03 | behave very differently along this direction in the i-vector space, and

0:12:08 | this we want to remove, or maybe model in the future

0:12:12 | but at the moment we want to remove it

0:12:15 | okay, so the algorithm to find this is quite straightforward, we first whiten

0:12:20 | the i-vector space

0:12:23 | with respect to W bar

0:12:26 | and then we just compute the sum of the squares of the W sub

0:12:31 | i and again find the top eigenvectors, and

0:12:36 | we construct a projection P to remove these eigenvectors
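
The whiten-then-eigendecompose recipe for W (and likewise for B) can be sketched as follows. This is a literal reading of the spoken description (whiten by W bar, sum the squares of the whitened per-dataset covariances, remove the top eigenvectors), with illustrative names:

```python
import numpy as np

def idvc_w_projection(W_list, k):
    """IDVC removal transform built from per-subset W covariances.

    W_list: list of (d, d) within-speaker covariances, one per subset.
    k: number of nuisance directions to remove.
    Returns A; i-vectors are mapped x' = A @ x before retraining PLDA.
    """
    W_bar = np.mean(W_list, axis=0)
    # whiten with respect to W_bar: T @ W_bar @ T.T = identity
    evals, evecs = np.linalg.eigh(W_bar)
    T = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # sum of squares of the whitened per-dataset covariances
    Wp = [T @ W @ T.T for W in W_list]
    S = sum(W @ W for W in Wp)
    # top eigenvectors of S mark the directions of largest
    # normalized variance of v' W_i v across subsets
    _, vecs = np.linalg.eigh(S)           # eigenvalues ascending
    V = vecs[:, -k:]
    P = np.eye(W_bar.shape[0]) - V @ V.T  # remove them, in whitened space
    return P @ T
```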

0:12:41 | the proof that this is actually the right thing to do is quite short, but

0:12:45 | I won't go over it because we have lunch

0:12:49 | soon

0:12:50 | but it's in the paper and it's very simple

0:12:53 | it's immediate

0:12:54 | okay, so now just one thing that may

0:13:00 | be quite important: what happens again if we want to use other data sources to

0:13:05 | model the mismatch, but maybe for those data sources we

0:13:10 | don't have speaker labels

0:13:12 | so to estimate the mu hyper-parameter it's quite easy, we don't need

0:13:17 | speaker labels at all, but for W and B we do need them, so

0:13:21 | what we can do in those cases is we can just replace

0:13:26 | W and B with T, which is the total covariance matrix, and of course we

0:13:29 | can estimate it without speaker labels

0:13:32 | and what we can see here is that for typical datasets where

0:13:37 | there is a large number of speakers, T can be approximated

0:13:42 | by W plus B
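
The approximation T ≈ W + B is easy to check numerically: under the model "i-vector = speaker mean + residual", the total covariance is the sum of the between and within covariances, and the label-free sample estimate converges to it as the number of speakers grows. A small simulation with made-up covariance values:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_spk, n_per = 4, 3000, 3

B = np.diag(rng.uniform(0.5, 2.0, d))   # made-up between-speaker covariance
W = np.diag(rng.uniform(0.1, 0.5, d))   # made-up within-speaker covariance

spk_means = rng.multivariate_normal(np.zeros(d), B, n_spk)
X = np.repeat(spk_means, n_per, axis=0) + \
    rng.multivariate_normal(np.zeros(d), W, n_spk * n_per)

# total covariance: computable with no speaker labels at all
T = np.cov(X, rowvar=False)
err = np.max(np.abs(T - (B + W)))       # small when there are many speakers
```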

0:13:44 | okay so

0:13:46 | if that is the case, it means that if we have high inter-

0:13:49 | dataset variability in some directions for T

0:13:54 | then the same directions v would actually be optimal

0:13:59 | or close to optimal to model also for either W or B, and

0:14:04 | vice versa; so instead of finding this subspace for W and for B

0:14:08 | we can just find it for T and it will be practically almost as good

0:14:15 | okay, so now results

0:14:17 | first, a few results using only the mu hyper-

0:14:22 | parameter, the center hyper-parameter; we can see here in the blue curve

0:14:28 | the results using this approach for a PLDA system trained on switchboard

0:14:34 | we see here that we started with eight point two, and we

0:14:37 | have a slight degradation at first because we remove the gender, that is, we

0:14:43 | find that the first dimension that we remove is

0:14:49 | actually the gender information

0:14:52 | but then we start to get gains

0:14:55 | and we get an equal error rate of three point eight

0:14:58 | we also get nice improvements, and what we see here in

0:15:02 | the red curve and the black curve is what happens if we

0:15:06 | apply the same IDVC method to a mixer-based build, so

0:15:13 | the

0:15:13 | red and black curves are when we train a

0:15:18 | PLDA model on mixer and we still want to apply IDVC to see if

0:15:22 | maybe we're getting some gains or at least we're not losing anything, so what we

0:15:25 | can see in general is that

0:15:28 | nothing much really happens here when we apply IDVC on the

0:15:33 | mixer build, at least we're not losing

0:15:39 | okay, so now the same thing is being done, but instead for the

0:15:43 | W hyper-parameter and for the B hyper-parameter

0:15:46 | we can see here in blue

0:15:49 | and light blue what happens when we train a system on switchboard and apply I D

0:15:54 | V C

0:15:55 | on either one of these hyper-parameters

0:15:58 | we see that we get very large improvements, even larger than we get for the

0:16:01 | center hyper-parameter

0:16:04 | and it goes up to around, I don't know, one hundred dimensions, and then

0:16:09 | we start to get degradation

0:16:11 | so if we remove too much,

0:16:13 | too many dimensions, then we start getting degradation

0:16:16 | and this is different, because

0:16:18 | for the center hyper-parameter we cannot remove too much, because if we have for

0:16:21 | example only twelve subsets then we can remove only eleven dimensions

0:16:27 | but for W and B we can actually remove up to four hundred

0:16:31 | so it's a bit different

0:16:34 | and okay, now again in black and red we see what happens when

0:16:39 | we apply the same method on the mixer-based PLDA system

0:16:44 | we see here again that we get a very slight, I'm not sure it's significant, improvement and

0:16:48 | then we start getting degradation

0:16:50 | and what we see here that's quite interesting is that after a dimension of around, let's

0:16:55 | say, one hundred and fifty

0:16:57 | all the systems actually behave the same, so my interpretation is that

0:17:03 | we actually managed to remove most or all of the dataset mismatch, but we also

0:17:07 | removed some of the good information from the system, and therefore we

0:17:12 | get some degradation, but

0:17:14 | the systems actually behave roughly the same

0:17:18 | after we remove a dimension of one hundred fifty

0:17:23 | okay, so now what happens when we combine everything together: we started

0:17:28 | from the two point four for the mixer build and eight point two for the switchboard

0:17:32 | build, and for different partitions we get slightly different results, between equal error rates of

0:17:38 | three and three point three

0:17:40 | and if we just use the simplistic partition

0:17:44 | into only two subsets, and we use only the hyper-parameters

0:17:50 | that we can estimate without speaker labels, mu and T, we

0:17:55 | get three point five, so

0:17:58 | the conclusion is that it actually works also without the speaker labels

0:18:04 | so to conclude, we have shown that IDVC can effectively reduce

0:18:09 | the influence of dataset variability

0:18:12 | at least for this particular setup, for the domain robustness challenge

0:18:17 | and we actually managed to recover roughly ninety percent of the

0:18:23 | error

0:18:24 | compared to the switchboard-only build

0:18:31 | and also this IDVC method works well even when trained on two subsets only

0:18:37 | and without speaker labels

0:18:39 | okay

0:18:40 | thank you

0:18:55 | i wonder if you happen to know what would happen if you simply

0:18:58 | projected away the, sorry, one hundred leading eigenvectors from the

0:19:06 | total variability of the training set corpus

0:19:09 | without bothering to train the individual W and B matrices

0:19:15 | so why not just take the total covariance

0:19:20 | of everything, that's a method we call

0:19:25 | NAP on the total variability subspace removal; we have

0:19:31 | this in the ICASSP paper, we've done that, you also get gains but

0:19:35 | not as much, and the gains were

0:19:37 | not so very big

0:19:39 | if i understand correctly you're saying the main

0:19:42 | benefit comes from trying to

0:19:45 | equalize the within-class covariance matrices across the datasets

0:19:52 | trying to make those look the same

0:19:55 | so that the W matrix is

0:19:58 | okay, so basically, yes, you try it, and apparently

0:20:03 | the process is in some sense

0:20:08 | reasonable, that some directions in the i-vector space are more sensitive to mismatch,

0:20:13 | dataset mismatch, and some are not

0:20:16 | and

0:20:17 | it's hard to observe it from the data unless you do something like we did

0:20:28 | did you look at something simple like just doing per-dataset whitening

0:20:32 | so you're asking whether we should do the whitening; we did centering, but when we did only

0:20:36 | whitening it didn't change the performance

0:20:39 | right

0:20:40 | well, that's contrary to other sites, right

0:20:44 | i think the whitening is generally useful

0:20:46 | so one question would be, as you do that, if you did whitening per

0:20:50 | dataset

0:20:51 | would you get the same effect in a soft form versus the

0:20:55 | projection away

0:20:57 | i tried, i didn't try it in many ways, i just tried a very quick

0:21:00 | experiment and it didn't really work, yes, i don't know, maybe it's

0:21:04 | something i have to do more carefully but

0:21:07 | it didn't work

0:21:15 | just a question: if you do this projection on the data to train PLDA, do your within

0:21:19 | and between become singular or something

0:21:22 | that's like the fifth time i hear this question, and that's why, um

0:21:26 | so basically it's the same as when you apply LDA, for example, before you

0:21:31 | build a PLDA system, so the same solutions are used

0:21:36 | so either you can, not just, you

0:21:41 | can actually move to a lower dimension, remove these dimensions and put everything in the

0:21:46 | lower dimension

0:21:47 | or you can do some tricks to fix it

0:21:53 | like adding some

0:21:55 | quantities to the

0:21:58 | covariance matrices

0:22:03 | i think, so, can i just ask, in your paper

0:22:06 | you contrasted against source normalization, and that was presented in the parallel icassp paper

0:22:11 | unfortunately i can't access it here, i was trying to look it up as we went along

0:22:15 | and i'd be also keen on seeing source normalization extended further

0:22:20 | the reason i bring this up is

0:22:23 | in this context the datasets are speaker-disjoint across the different datasets; what about the

0:22:28 | context where perhaps i have a system trained on telephone speech for a

0:22:33 | certain number of

0:22:34 | speakers and then you suddenly acquire data from those

0:22:37 | same speakers in a different channel

0:22:40 | what's going to happen in terms of, i mean, in terms of the

0:22:45 | testing side of things you acquire more data by speakers

0:22:50 | from a microphone channel perhaps

0:22:52 | and then you also acquire microphone data from a different set of speakers for training, adapting

0:22:57 | the system

0:22:58 | it appears that at this point the within-class variations are estimated independently on each

0:23:04 | dataset

0:23:06 | so

0:23:07 | does that mean that the difference between those datasets is going to be suppressed

0:23:14 | or is it actually maintained

0:23:16 | under this framework

0:23:19 | i think it's not very sensitive

0:23:23 | because it looks at very broad hyper-parameters, so it doesn't really matter if it's

0:23:27 | the same speaker or not

0:23:29 | okay, maybe we can discuss it offline a bit

0:23:37 | the final question

0:23:40 | so

0:23:41 | in the past

0:23:42 | we started dealing with the channel problem

0:23:45 | by projecting away stuff, old school

0:23:49 | and then the field developed towards approaches where we model the stuff instead of

0:23:56 | explicitly projecting it away, JFA and so on, so

0:24:01 | can you imagine a probabilistic model which

0:24:06 | includes a variability for the dataset? actually, thanks for the question, i was thinking maybe to add

0:24:12 | another slide on that; i actually tried to

0:24:15 | extend the PLDA model with another, like, plus, to add another

0:24:21 | component which is the dataset mismatch, which

0:24:24 | behaves a bit differently than the other components, and i made some

0:24:30 | experiments, i got some improvement

0:24:33 | compared to the baseline but not as good as i got using this NAP

0:24:37 | approach

0:24:39 | and i have some ideas why that is the case

0:24:43 | but i will not be surprised if someone just does it in a different way and

0:24:48 | gets a better result

0:24:53 | okay