0:00:15 | all right, so I'm going to present something that we have been

0:00:19 | working on during the last

0:00:24 | summer workshop at Johns Hopkins: trying to explore

0:00:28 | whether there is any useful information in the GMM weights, because i-vectors,

0:00:32 | as you probably know, only

0:00:34 | try to adapt the means.

0:00:36 | so |

0:00:38 | so as you probably all know by now, i-vectors are related to adapting the

0:00:43 | means, and they have been

0:00:45 | applied very successfully for speaker, language, dialect, and other applications.

0:00:50 | and the story

0:00:53 | behind only adapting the means

0:00:55 | goes back to

0:00:57 | GMM MAP adaptation, with the UBM, the universal background model, as the basis,

0:01:02 | where it was found that adapting only the means works well. so here we try to revisit the

0:01:07 | assumption that the i-vectors hold all the useful information,

0:01:12 | and ask whether there is more in the weights or even in the variances.

0:01:15 | Patrick Kenny probably already tried the variances for JFA,

0:01:20 | but |

0:01:21 | so here, in this work, we try

0:01:24 | to do something with the weights. there have been a lot of

0:01:28 | techniques already proposed for the weights,

0:01:32 | and

0:01:34 | we have built a new one called non-negative factor analysis, which was

0:01:38 | actually developed with a student from Belgium who was visiting me at

0:01:42 | MIT. we first tried it for language ID, where we actually had some success

0:01:47 | with it.

0:01:48 | and the reason was that, you know, for language ID, you have a

0:01:52 | UBM, and the Gaussians you use are supposedly kind of phonemes, so if

0:01:57 | for some language a phoneme does not appear, the corresponding counts can go to zero,

0:02:02 | so the weights of these Gaussians can be useful information. that's what

0:02:07 | we

0:02:08 | found out, and that's what motivated us to check whether, for speakers, there is also information

0:02:13 | from the GMM weights that can be used,

0:02:16 | and this is ultimately the topic of this work.

0:02:18 | and we also compared this non-negative factor analysis, NFA, to an

0:02:24 | already existing technique that was proposed by BUT, the subspace multinomial model,

0:02:29 | so this presentation is kind of a comparison between the

0:02:34 | two in the case of GMM weight adaptation.

0:02:37 | so |

0:02:38 | so, for adapting the GMM means, a lot of

0:02:41 | techniques have already been applied: maximum a posteriori, maximum likelihood linear regression, and

0:02:48 | eigenvoices, which is

0:02:50 | the starting point of all the new technology like JFA and i-vectors.

0:02:54 | and there were also a lot of

0:02:56 | weight adaptation techniques, like for example maximum likelihood, non-negative matrix factorization, the subspace multinomial

0:03:03 | model,

0:03:05 | and the new one that we propose, non-negative factor analysis.

0:03:08 | so |

0:03:10 | so the idea behind this, for example the i-vector concept, and I don't want to bore you

0:03:13 | with this,

0:03:14 | is that you say that for a given utterance there is a UBM, which

0:03:17 | is kind of a prior of what all the sounds look like, and the

0:03:21 | i-vector tries to model the shift from this UBM to a given recording, which

0:03:26 | can be

0:03:26 | modeled by a low-dimensional

0:03:30 | subspace matrix representation,

0:03:33 | and the coordinates of this

0:03:36 | recording in this space are what we call the i-vector.

0:03:38 | so we tried to use the same concept, which was done for the means,

0:03:42 | for doing the same thing with the weights,

0:03:45 | and the only difference that we were facing is that the weights should be

0:03:49 | all positive and they should sum to one, so

0:03:52 | I can explain that later. so in order to model the weights, the

0:03:56 | first thing is, when you have one UBM,

0:03:59 | a universal background model,

0:04:01 | and you have a sequence of features, you can compute some counts, which are the

0:04:06 | posterior occupation probabilities of each Gaussian given the frames, which is given here in the

0:04:12 | equation.
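
As an illustration of these counts, the zero-order statistics, here is a small pure-Python sketch; the diagonal-covariance UBM and the helper names are assumptions made for this example, not the system used in the talk.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def occupation_counts(frames, weights, means, variances):
    """Zero-order statistics: for each UBM Gaussian, the sum over frames
    of the posterior probability that this Gaussian generated the frame."""
    counts = [0.0] * len(weights)
    for x in frames:
        # log w_c + log N(x | mu_c, Sigma_c) for each component c
        logp = [math.log(w) + gauss_logpdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        mx = max(logp)                         # stabilize the softmax
        post = [math.exp(lp - mx) for lp in logp]
        z = sum(post)
        for c, p in enumerate(post):
            counts[c] += p / z
    return counts
```

The counts across all components always sum to the number of frames, which is the "utterance length" used for normalization later.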

0:04:13 | so the objective function for the weights is this kind of

0:04:16 | auxiliary function;

0:04:18 | it works with

0:04:20 | the Kullback-Leibler divergence, so it is kind of trying to maximize

0:04:23 | an objective that amounts to minimizing the KL divergence

0:04:27 | between the normalized counts

0:04:29 | and the weights that you want to model.
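
A minimal sketch of that objective (the function names are mine, not from the paper): the auxiliary function is the sum of counts times log weights, and maximizing it over the simplex is equivalent to minimizing the KL divergence between the normalized counts and the weights.

```python
import math

def auxiliary(counts, weights):
    """Auxiliary function sum_c n_c * log w_c for candidate weights;
    maximizing it minimizes KL(normalized counts || weights)."""
    return sum(n * math.log(w) for n, w in zip(counts, weights))

def ml_weights(counts):
    """Its maximizer over the simplex: the counts normalized by the
    total count (the utterance length), i.e. the ML weight estimate."""
    total = sum(counts)
    return [n / total for n in counts]
```

This closed-form maximizer is exactly the "normalize the counts by the utterance length" step described next.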

0:04:33 | and so if you take these counts and you normalize

0:04:37 | them by the length of

0:04:38 | your utterance, you get the maximum likelihood estimate

0:04:42 | of the weights, which is easy to do. so for example the first

0:04:46 | technique, which unfortunately we could not compare with for

0:04:51 | this paper, is non-negative matrix factorization. so suppose you have the weights, and

0:04:56 | you say that okay, these weights can be

0:04:58 | split into two non-negative

0:05:01 | matrices,

0:05:02 | and the first one is going to be the basis of your space, and the second

0:05:06 | one is the coordinates in this basis,

0:05:08 | and this decomposition can be

0:05:15 | found by optimizing

0:05:18 | an auxiliary function.
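
As a sketch of that decomposition, here are the classic Lee-Seung multiplicative updates for a squared-error NMF objective; the paper's actual auxiliary function may differ, and the helper names are mine. Each utterance's weight vector (a row of W) is approximated by non-negative coordinates H times non-negative bases B.

```python
import random

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def nmf(W, rank, iters=500, eps=1e-9):
    """Factor a non-negative matrix W (utterances x components) as
    W ~= H @ B with H, B >= 0, via Lee-Seung multiplicative updates."""
    random.seed(0)
    n, m = len(W), len(W[0])
    H = [[random.random() for _ in range(rank)] for _ in range(n)]
    B = [[random.random() for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H update: H *= (W B^T) / (H B B^T)
        Bt = [list(col) for col in zip(*B)]
        num, den = matmul(W, Bt), matmul(matmul(H, B), Bt)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
        # B update: B *= (H^T W) / (H^T H B)
        Ht = [list(col) for col in zip(*H)]
        num, den = matmul(Ht, W), matmul(matmul(Ht, H), B)
        B = [[B[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
    return H, B
```

The multiplicative form keeps both factors non-negative as long as they are initialized non-negative, which is why no explicit projection is needed here.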

0:05:20 | okay, so this is the technique that unfortunately we didn't have time to do a

0:05:23 | comparison with.

0:05:24 | but what we did do is compare with this subspace model, because it

0:05:29 | looked promising,

0:05:30 | so we tried to compare with it too.

0:05:33 | so the idea behind this subspace multinomial model is that you have the

0:05:38 | normalized counts here,

0:05:40 | and you try to find a multinomial distribution

0:05:43 | that fits

0:05:44 | this,

0:05:45 | this distribution,

0:05:46 | and it can be defined with a softmax: there is an i-vector-like subspace term, this is the

0:05:50 | UBM weights term,

0:05:52 | and it is normalized

0:05:53 | so that the weights sum to one.
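
That softmax parameterization can be sketched like this (a toy version; t is the UBM log-weight offset, T the subspace matrix, r the recording's latent vector, names assumed for the example):

```python
import math

def smm_weights(t, T, r):
    """Subspace multinomial model: weights are a softmax of the UBM
    log-weight term t_c plus the low-rank subspace term T_c . r,
    so they are automatically positive and sum to one."""
    logits = [tc + sum(tv * rv for tv, rv in zip(Tc, r))
              for tc, Tc in zip(t, T)]
    mx = max(logits)                     # numerical stabilization
    e = [math.exp(l - mx) for l in logits]
    z = sum(e)
    return [ei / z for ei in e]
```

With r = 0 the model falls back to the UBM weights, which is why no explicit constraint handling is needed in this parameterization.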

0:05:57 | and they have several papers on how to do the optimization;

0:06:01 | they have a Hessian-based

0:06:04 | solution for that.

0:06:06 | so for example, for the SMM, suppose you have

0:06:10 | two Gaussians,

0:06:12 | and each point here is the maximum likelihood estimate,

0:06:16 | the maximum likelihood estimate of the weights for a given recording, okay,

0:06:20 | and for this example, these points were actually generated from

0:06:25 | a subspace multinomial distribution,

0:06:30 | so we generated them from this model,

0:06:32 | because we believe that in a high-dimensional space

0:06:36 | the data should be distributed like this, not all over the place, because if you

0:06:40 | take a lot of data with only two Gaussians, the data would

0:06:44 | be everywhere,

0:06:45 | but not in a high-dimensional space.

0:06:48 | so we tried to simulate that,

0:06:51 | to simulate a high-dimensional GMM with two Gaussians,

0:06:56 | and this is kind of what we did, following

0:07:00 | what Lukas and the other people at BUT did:

0:07:03 | we generated data from this model,

0:07:07 | and we show the difference between

0:07:10 | this model and non-negative factor analysis.

0:07:12 | so in non-negative factor analysis, let's say it is the same

0:07:16 | as the i-vectors: we suppose that we have a UBM,

0:07:19 | and for each recording the weights can be explained by a shift

0:07:23 | in the direction of the data,

0:07:26 | and this

0:07:28 | is the same as the i-vector, so this can be a low-rank matrix, and r

0:07:31 | is the new i-vector in this new space.

0:07:34 | so the only problem that we were facing is that

0:07:38 | the weights for each recording should always be positive and should sum to one.

0:07:43 | so here we developed some kind of an EM-like

0:07:47 | algorithm. so we have an EM-like procedure:

0:07:51 | we first fix r,

0:07:52 | we accumulate some statistics,

0:07:55 | and we do a gradient ascent

0:07:59 | to estimate L,

0:08:00 | and then,

0:08:00 | when we have obtained L,

0:08:02 | we do a projected gradient ascent for r,

0:08:07 | where the projection that we use tries to

0:08:10 | enforce the constraints that the weights should always sum to one and should

0:08:13 | always be positive.
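
To illustrate the projection idea on a simplified problem (in the actual NFA the gradient step is taken on r with w = b + Lr; here, as a sketch with assumed helper names, we ascend directly on the weights): each step follows the gradient n_c / w_c of the auxiliary function and then projects back onto the probability simplex.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_c >= 0, sum_c w_c = 1},
    using the standard sort-based algorithm."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def fit_weights(counts, steps=500, lr=0.01):
    """Projected gradient ascent on sum_c n_c log w_c: follow the
    gradient n_c / w_c, then project back onto the simplex."""
    w = [1.0 / len(counts)] * len(counts)   # start from uniform weights
    for _ in range(steps):
        grad = [n / wi for n, wi in zip(counts, w)]
        w = project_to_simplex([wi + lr * g for wi, g in zip(w, grad)])
    return w
```

On this simplified problem the fixed point is the normalized counts, so the sketch recovers the ML estimate while keeping the weights positive and summing to one at every step.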

0:08:15 | and that's what we actually did. if you want more explanation,

0:08:19 | I don't have time for that here, but you can find it in the paper. so,

0:08:22 | remember, these are the zero-order counts, and this is the auxiliary function

0:08:27 | for the GMM weights case; this is our weight model, and we

0:08:32 | would like to estimate these two

0:08:33 | parameters,

0:08:34 | subject to the constraint that they should sum to one. so what we did, we just

0:08:38 | multiply by a vector of ones, so that the weights

0:08:42 | should sum to one,

0:08:43 | and they should all be positive.

0:08:46 | okay |

0:08:47 | so these are the two constraints that allow us to keep that the weights

0:08:52 | sum to one and are positive. so in this case, for example, if

0:08:56 | you compare what the non-negative factor analysis is doing, compared

0:09:00 | to the subspace multinomial model,

0:09:03 | and what each model is doing,

0:09:05 | so for this case, for example,

0:09:08 | definitely the SMM is fitting the data well,

0:09:12 | because it was generated from it,

0:09:14 | but the NFA, like the i-vector, produces an approximation of the data. so it has

0:09:18 | benefits and it has disadvantages too. the

0:09:23 | SMM has a tendency to overfit the data, because

0:09:28 | it models the distribution of the training data really well, but when you go to

0:09:32 | the LID task,

0:09:33 | sometimes it doesn't generalize well.

0:09:35 | so what BUT did is use a regularization

0:09:39 | to try to control this overfitting;

0:09:41 | they have a regularization term that you tune for your data

0:09:45 | in order to do that. in our case, we are not

0:09:50 | suffering too much from this, because we don't fit the

0:09:54 | training data very closely,

0:09:56 | we approximate it, and sometimes generalize better,

0:09:59 | sometimes, than the SMM, but it depends on the application. to be honest, we

0:10:02 | compared them for several applications, and sometimes one is a bit better, sometimes the

0:10:06 | opposite.

0:10:07 | but anyway, the difference is that the SMM

0:10:11 | can fit the data really well,

0:10:13 | the training data, but can have an overfitting problem that we need to control with regularization,

0:10:18 | while the NFA approximates the data

0:10:22 | and will sometimes generalize better.

0:10:26 | so this is the experimental setup. we actually trained an

0:10:31 | i-vector system first on all the data that we have,

0:10:35 | and we tested it on the telephone condition of NIST 2010.

0:10:40 | and we have a UBM of

0:10:43 | two thousand forty-eight Gaussians, the usual technical setup; we have i-vectors of

0:10:46 | dimension six hundred, and we use the LDA, length normalization, PLDA scheme.
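
For reference, the length-normalization step in that chain just scales each vector to unit Euclidean norm before PLDA scoring (a generic sketch, not the authors' code):

```python
import math

def length_normalize(vec):
    """Scale a vector to unit Euclidean length, as applied to
    i-vectors between LDA and PLDA scoring."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]
```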

0:10:53 | and then we tried to use an i-vector for the means,

0:10:56 | and a vector for

0:10:58 | the weights, from the SMM and from the NFA,

0:11:01 | and we tried to do fusion, to see how we can combine them. for example, a

0:11:04 | simple fusion: we did score fusion,

0:11:07 | which didn't help

0:11:08 | at all, so we forgot about it and did i-vector-level fusion instead;

0:11:13 | it seems to be

0:11:15 | a little bit better, but not too much for speaker, which left us a little

0:11:18 | disappointed,

0:11:19 | but for language ID it actually helps a lot.
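
The two fusion levels just mentioned can be sketched as follows; whether the paper fused exactly this way (a weighted score sum, and concatenation before the LDA/PLDA backend) is an assumption of this example.

```python
def score_fusion(scores, alphas):
    """Score-level fusion: a weighted sum of the per-system scores."""
    return sum(a * s for a, s in zip(alphas, scores))

def vector_fusion(ivector, weight_vector):
    """Vector-level fusion: concatenate the mean-based i-vector with
    the weight-subspace vector so the backend sees both."""
    return list(ivector) + list(weight_vector)
```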

0:11:24 | so, for example, I tried to see how

0:11:27 | the dimensionality of this new weight adaptation behaves compared to, for

0:11:33 | example, the i-vectors.

0:11:34 | so I took the non-negative factor analysis and trained five hundred,

0:11:38 | one thousand,

0:11:39 | and

0:11:41 | one thousand five hundred dimensions; remember that the starting UBM was two thousand

0:11:45 | forty-eight.

0:11:46 | and this is the LDA dimensionality reduction applied before the length

0:11:52 | normalization,

0:11:53 | and you see that

0:11:54 | the difference is not really big when

0:11:58 | varying the dimension for LDA,

0:12:00 | and even

0:12:02 | if you compare between five hundred and one thousand dimensions, the difference is not really big.

0:12:07 | so we were a little bit surprised, especially for the NFA, and we have seen the

0:12:11 | same behavior for the SMM as well,

0:12:13 | and

0:12:15 | sometimes the SMM needs to be lower-dimensional compared

0:12:19 | to the

0:12:20 | NFA; the non-negative factor analysis tends to be higher-dimensional compared to

0:12:25 | the other one.

0:12:27 | so here, for example, we compare the best result that we obtained from

0:12:31 | the non-negative factor analysis

0:12:32 | with the one from the

0:12:34 | subspace multinomial model,

0:12:36 | for the core condition, male

0:12:39 | and female, and eight conversations. so we can see that there is actually not too

0:12:44 | much difference;

0:12:46 | sometimes the NFA is a little bit better than

0:12:49 | the SMM,

0:12:51 | and

0:12:52 | you see that for the eight-conversation condition you can get a very

0:12:57 | nice result even without using the GMM means,

0:13:01 | just the weights.

0:13:04 | now, if you compare with the i-vectors:

0:13:08 | so here this is the maximum likelihood estimate

0:13:11 | of the weights. we take

0:13:15 | the maximum likelihood weights, take the log, and feed that to LDA;

0:13:18 | maybe it's not the best way to do it,

0:13:22 | maybe you can do something cleverer. it seems that the log

0:13:25 | maximum likelihood estimate was worse

0:13:27 | compared to the SMM

0:13:29 | and the NFA weights, for all the conditions: eight conversations, male, female, and the core condition as

0:13:33 | well.

0:13:34 | so now we remove the maximum likelihood from the loop

0:13:37 | and we put the i-vectors here, and we can see that

0:13:41 | usually the i-vector is twice as good.

0:13:45 | here we take the counts rather than the weight vectors,

0:13:48 | and divide by the frame count, to get the weights.

0:13:52 | so,

0:13:53 | the i-vector is definitely much better than the weights,

0:13:57 | and

0:13:59 | it's not by too much if you go to the eight-conversation condition;

0:14:03 | it's actually pretty cool, because the error rate is very low.

0:14:06 | so even when you have a lot,

0:14:09 | a lot more recordings from a speaker,

0:14:12 | the weights can also give you

0:14:13 | almost as much useful information as

0:14:16 | the i-vector can give you.

0:14:18 | so that was sort of surprising, for that reason.

0:14:21 | so here we look at the new minimum DCF and

0:14:25 | the EER; here you have the

0:14:29 | baseline with the i-vector,

0:14:31 | for female and male.

0:14:33 | so when we fuse the i-vectors with the weights, we use the

0:14:37 | vector-level fusion here.

0:14:38 | so this is the NFA: if we add the

0:14:41 | NFA we win a little bit here,

0:14:44 | we gain a little on the error rate here, but not too much.

0:14:48 | but, for example for female, when we fuse with the

0:14:52 | SMM, we get some gains again for the new DCF,

0:14:55 | at that operating point, and even in the error rate.

0:14:59 | so for female the SMM was the best

0:15:02 | to fuse with;

0:15:03 | for male we can see that the NFA was much better

0:15:07 | for all of these, but not really in the new minimum DCF.

0:15:11 | so it was not really

0:15:13 | exciting, this fusion, to be honest; it was not the big improvement we really

0:15:17 | hoped for, compared to what was seen for language ID.

0:15:22 | so, since the i-vector is closely tied to the dimensionality of the supervector,

0:15:27 | we cannot really go and increase the UBM size; but for the GMM weights,

0:15:33 | the dimensionality is only related to how many Gaussians you have. so

0:15:37 | we

0:15:39 | tried to say, okay, let's try to increase and decrease the UBM size

0:15:43 | and see what happens with the results.

0:15:46 | here we tried the non-negative factor analysis as the only

0:15:49 | weight model.

0:15:50 | so we can see that, for example, if we increase the

0:15:53 | number of Gaussians in the UBM, we get a very nice improvement for

0:15:59 | both male and female,

0:16:01 | especially maybe

0:16:03 | in the new minDCF. so,

0:16:05 | so here, since the weights, unlike the i-vector, are not tied to the size of

0:16:09 | the supervector, you can increase

0:16:11 | the

0:16:12 | number

0:16:14 | of

0:16:17 | Gaussians in the UBM, so you can

0:16:20 | even think about using

0:16:22 | a speech recognizer and trying that if you want.

0:16:26 | so what we did here is actually take the baseline as well,

0:16:30 | which is,

0:16:32 | sorry,

0:16:40 | we took the standard i-vectors,

0:16:44 | and we tried to fuse them with

0:16:47 | the weight vectors from the different

0:16:49 | UBM sizes,

0:16:51 | and you can see that there is not really a

0:16:54 | clear conclusion

0:16:55 | that holds as you increase the size. even here, for example, we get,

0:17:00 | well, sorry, yes,

0:17:00 | you get better results with two thousand forty-eight Gaussians,

0:17:05 | but the fusion, for example for female, didn't help too much;

0:17:09 | to be honest it was actually worse,

0:17:11 | and for male it was a little bit worse as well.

0:17:15 | for the core condition too,

0:17:17 | getting better results with the weights doesn't mean the fusion will help over

0:17:19 | using only the i-vectors; that could be the question.

0:17:23 | so as a conclusion, here we tried to

0:17:28 | use the weights, and to ask whether it is worthwhile to find a better way

0:17:31 | of using the weights and adapting them as well,

0:17:34 | not only the means, which is

0:17:35 | what the i-vector is doing.

0:17:37 | and we have seen some slight improvements when we combine them;

0:17:41 | maybe we need to find a better way to combine them, for example

0:17:45 | similar to what subspace GMMs are doing for speech recognition,

0:17:50 | I don't know.

0:17:52 | we plan to keep working on that, and hopefully make some progress.

0:17:55 | I also tried it iteratively: for example, you estimate the weights, you update the GMM weights

0:18:01 | of the UBM,

0:18:02 | and then extract the Baum-Welch statistics again and the i-vectors; it didn't help for speaker, to be

0:18:07 | honest. I tried it and it gave the same results, no improvement, nothing,

0:18:11 | so,

0:18:13 | but I haven't tried it for language ID, only for speaker.

0:18:18 | thank you |

0:18:24 | so |

0:18:34 | you'll have a lot of time, I think, to understand my question.

0:18:38 | you know, we worked a lot on the weights in Avignon,

0:18:42 | mainly,

0:18:44 | and we are also looking at

0:18:47 | the weights with our approach,

0:18:51 | and Michel also has some results.

0:18:54 | it has seemed to me since the beginning, maybe just a feeling, that

0:18:59 | the weights are a very interesting, very nice source of information.

0:19:05 | in fact, it is kind of binary information.

0:19:08 | why? if you come back to GMM-UBM,

0:19:11 | and come back to Doug's results

0:19:14 | when he proposed the top-Gaussian scoring compression: you are using

0:19:18 | one Gaussian, putting one on this one and zero on all the others, and the

0:19:23 | loss of performance was quite small.

0:19:27 | after that, if you go to Nicolas's results,

0:19:32 | where he tried many things and

0:19:35 | did a lot of work very close to what you

0:19:38 | presented,

0:19:39 | at the end the best solution was to use rank-based normalization,

0:19:43 | and the rank-based approach is very close to

0:19:47 | putting one on some Gaussians and zero on all the others, based on the weights and

0:19:51 | the counts,

0:19:53 | and so on.

0:19:54 | and now if you look at the most recent results, it is a bit

0:19:58 | difficult to explain exactly, but most of the time, using just the zero-

0:20:03 | and-one information of the weights, it seems that we are able to find the information.

0:20:09 | so, according to me, the weights

0:20:13 | represent information, but it is

0:20:15 | yes-or-no information, not continuous information

0:20:20 | like

0:20:21 | what you are trying to model.

0:20:23 | so there is a good point here, because

0:20:27 | when I wanted to start working with the weights

0:20:31 | with non-negative factor analysis, my first question, my first thought, was

0:20:35 | about what Nicolas did:

0:20:37 | I

0:20:38 | want to introduce sparsity into the weights.

0:20:42 | what we're doing now is not able to do that,

0:20:47 | because, I agree,

0:20:48 | with top-one, top-five, you put ones for the top five.

0:20:51 | so, as I say, I want sparsity in the weights,

0:20:55 | so that instead of all of them responding, it is

0:20:57 | zero-one, keeping only the top five, for example, or something like that.
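
The sparsity being discussed, keep only the top few Gaussians and zero out the rest, could be sketched as a post-processing of the weights (a hypothetical illustration, not part of the presented model):

```python
def topk_sparsify(weights, k):
    """Keep the k largest weights, zero the others, and renormalize:
    a yes/no (top-Gaussian) use of the weight information."""
    top = sorted(range(len(weights)), key=lambda i: weights[i],
                 reverse=True)[:k]
    kept = set(top)
    out = [w if i in kept else 0.0 for i, w in enumerate(weights)]
    z = sum(out)
    return [w / z for w in out]
```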

0:21:00 | but for this system,

0:21:02 | well, for this model that we have, we're not doing that.

0:21:05 | that was actually my first comment, because this was discussed in the committee, and

0:21:09 | my first comment was: how can we make it sparse?

0:21:13 | because of exactly what you're saying.

0:21:25 | what if you

0:21:28 | extract the i-vectors adaptively:

0:21:31 | you adapt the UBM

0:21:34 | before you extract, so that

0:21:37 | for each frame

0:21:39 | there are very few Gaussians active.

0:21:44 | that's what happens; I don't know if

0:21:47 | it's a solution to your problem, but you would get sparsity

0:21:51 | that way, okay?

0:21:55 | thank you.

0:22:03 | so this kind of follows up on Patrick's question. you're doing sequential estimation for

0:22:09 | the L and the r;

0:22:12 | how many iterations do you go through

0:22:15 | to get that?

0:22:17 | around

0:22:19 | ten of the EM-style iterations, and within each one there is a gradient

0:22:23 | ascent: I guess five for r and three for L.

0:22:29 | so,

0:22:30 | I'm asking this because to me it's interesting to see the rate of convergence that

0:22:34 | you might actually get with this, and I know it's extra work, right.

0:22:38 | in your evaluations, I believe you're evaluating once you believe you've

0:22:42 | converged; did you run any earlier system?

0:22:46 | so let's say that before hitting five you try three, just

0:22:52 | to see where you actually are; maybe there are certain

0:22:56 | dimensions of the i-vector that get active,

0:22:59 | and you might actually see,

0:23:03 | there might be some insight into it.

0:23:06 | I tried this, but not in this context, not in the context of

0:23:11 | this work,

0:23:14 | and it's a little sensitive; you can see it

0:23:18 | more sometimes when you run it longer, say fifteen iterations:

0:23:22 | you kind of see the results degrading; after some point the degradation

0:23:27 | starts to be seen,

0:23:28 | and usually it's like between

0:23:31 | five to eight iterations you have already saturated.

0:23:40 | yes, we need to control that a little bit.

0:23:43 | yes, if you let it go until it saturates, actually the SMM sometimes,

0:23:47 | especially for sparsity, the SMM is much better, because it will

0:23:51 | really fit the data;

0:23:55 | it just fits,

0:23:56 | and the NFA would not do that, because it is like an approximation.

0:23:59 | so that's my issue with the NFA:

0:24:02 | the SMM would definitely give you some sparsity if you know how

0:24:05 | to control it, because maybe you might overfit.

0:24:09 | on that side,

0:24:10 | probably Marcel can say more than me;

0:24:13 | Marcel,

0:24:15 | you probably know more about that than me,

0:24:18 | because you were doing this, isn't that right?

0:24:30 | right.

0:24:35 | actually, when we did this work, we tried different optimization algorithms

0:24:42 | for this projected gradient ascent, and it converged

0:24:46 | quite well, and also, like the question before, we saw that even after a few iterations

0:24:52 | already,

0:24:54 | you got quite good results,

0:24:56 | and if you kept on iterating you got some degradation, so it looked

0:25:02 | like it starts overfitting the model.

0:25:04 | so I guess you saw something similar

0:25:16 | too.