0:00:14 | Hello, my name is Ville Vestman,

0:00:16 | and in this video I describe our work on neural i-vectors.

0:00:22 | This work was co-authored with me by Kong Aik Lee and Tomi Kinnunen.

0:00:27 | Tomi and I are from the University of Eastern Finland, and Kong Aik was with NEC at the time of writing.

0:00:37 | Our study proposes a new way of combining Gaussian mixture model based generative i-vector

0:00:43 | models with discriminatively trained deep neural network speaker embeddings for the speaker verification task.

0:00:51 | Our aim is to improve upon existing i-vector systems,

0:00:56 | and we also hope to gain some insight into what causes the performance differences, I

0:01:02 | mean,

0:01:03 | between generative i-vector embeddings and discriminatively trained DNN speaker embeddings.

0:01:09 | Our study also establishes connections between Gaussian mixture models and some

0:01:15 | of the existing

0:01:17 | DNN pooling layers.

0:01:21 | As background for our work, four different constructs are considered.

0:01:26 | The last three constructs presented here

0:01:31 | combine ideas from both i-vectors and DNNs.

0:01:36 | We pay special attention to the roles of the universal background models and i-vector extractors in all these

0:01:44 | constructs.

0:01:46 | Let's begin with the standard i-vector.

0:01:50 | So,

0:01:51 | the key components here are the two generative models:

0:01:56 | the Gaussian mixture model based universal background model and

0:02:00 | the i-vector extractor.

0:02:04 | So the UBM

0:02:05 | is used together with

0:02:07 | the acoustic features to compute the sufficient statistics for the

0:02:12 | i-vector extractor, which then

0:02:14 | extracts the i-vectors.

0:02:18 | Note that the features are rule-based and the rest of the components are

0:02:23 | generatively trained.
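To make the statistics computation concrete, here is a minimal sketch (a toy diagonal-covariance GMM in NumPy; this is an illustration, not the exact recipe or code used in the talk) of how frame posteriors from a UBM turn into the zeroth- and centered first-order sufficient statistics consumed by an i-vector extractor:

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Frame posteriors of a diagonal-covariance GMM.
    X: (T, D) frames; weights: (C,); means, variances: (C, D)."""
    # log N(x | mu_c, diag(var_c)) for every frame/component pair
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)        # (C,)
    diff = X[:, None, :] - means[None, :, :]                           # (T, C, D)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)   # (T, C)
    log_p = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    log_p -= log_p.max(axis=1, keepdims=True)                          # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def sufficient_statistics(X, weights, means, variances):
    """Zeroth- and centered first-order Baum-Welch statistics."""
    gamma = gmm_posteriors(X, weights, means, variances)   # (T, C) responsibilities
    N = gamma.sum(axis=0)                                  # (C,)  zeroth order
    F = gamma.T @ X - N[:, None] * means                   # (C, D) centered first order
    return N, F
```

The pair `(N, F)` is exactly what an i-vector extractor needs per utterance, regardless of whether the posteriors came from a GMM or, as later in this talk, from a network.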

0:02:28 | Then, in the DNN i-vectors construct, the universal background model is replaced by

0:02:36 | a DNN that takes acoustic features as an input and produces the senone posteriors as

0:02:42 | an output.

0:02:44 | These posteriors are used together with the acoustic features to compute the sufficient

0:02:50 | statistics for the i-vector extractor.

0:02:55 | So this construct differs from the standard i-vector in that the universal background model is discriminatively trained

0:03:03 | with phonetic targets.

0:03:06 | The third system is the end-to-end i-vector system.

0:03:09 | This system combines three neural network modules:

0:03:16 | these modules are a features-to-statistics

0:03:19 | network, a

0:03:20 | statistics-to-i-vector network, and a backend module

0:03:23 | that is responsible for

0:03:25 | scoring pairs of i-vectors.

0:03:29 | The training of this

0:03:31 | kind of network goes as follows.

0:03:34 | The data are first used to train these individual modules separately;

0:03:38 | briefly put, these can benefit from being initialized from the

0:03:45 | corresponding generative models.

0:03:49 | After these modules have been trained separately, they can be combined and end-to-end

0:03:55 | trained.

0:03:59 | So this construct

0:04:02 | utilizes generative models in the initialization stage,

0:04:07 | while end-to-end discriminative training is used for the whole network.

0:04:14 | Then, the last background construct discussed here

0:04:19 | is using a DNN with a mixture factor analysis pooling layer.

0:04:23 | In this work, the authors used this DNN to extract speaker

0:04:29 | embeddings.

0:04:31 | What is special about this construct

0:04:33 | is that it uses its own pooling layer:

0:04:38 | this pooling layer is basically an i-vector extractor implemented inside the DNN.

0:04:45 | The MFA pooling layer is based on another pooling layer known as the

0:04:49 | learned dictionary encoder;

0:04:52 | the learned dictionary encoder is used to obtain the frame alignments.

0:05:00 | So here, all the components of this last construct are discriminatively trained with

0:05:07 | speaker targets.

0:05:12 | Okay,

0:05:12 | next we move on to the proposed neural i-vectors.

0:05:18 | Before explaining the construct itself,

0:05:22 | we will need to go through some

0:05:24 | prerequisites for our model,

0:05:27 | and these are the NetVLAD and

0:05:30 | the LDE pooling layers. I will describe these two pooling layers by showing how they relate

0:05:36 | to the standard GMM,

0:05:38 | so the next slides will be quite math-heavy.

0:05:45 | So, first the NetVLAD.

0:05:48 | We will start with

0:05:50 | the posterior computation formula of a standard GMM

0:05:55 | and see how we get the

0:05:58 | NetVLAD formulation out of this equation.

0:06:04 | So, okay, here we have:

0:06:07 | C is the number of Gaussian components, and each Gaussian component has a

0:06:12 | covariance matrix, a mean vector, and an associated weight.

0:06:18 | Okay, NetVLAD assumes shared covariance matrices

0:06:23 | for all Gaussian components.

0:06:27 | We will rewrite this formula into this form

0:06:30 | by expanding the normal distributions.

0:06:38 | Then,

0:06:39 | by denoting

0:06:41 | this inverse covariance times mean vector term

0:06:45 | by w,

0:06:48 | and by denoting the remaining terms, minus one half mu transpose sigma inverse mu plus the log of the weight, by b,

0:06:55 | we get

0:06:56 | this.

0:06:59 | And this happens to be exactly the

0:07:02 | formula used in the NetVLAD

0:07:04 | paper from 2016.
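Written out (a reconstruction from the description above, with shared covariance $\boldsymbol{\Sigma}$, component weights $\pi_c$, and means $\boldsymbol{\mu}_c$), the derivation gives:

```latex
\gamma_c(\mathbf{x})
  = \frac{\pi_c\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_c,\boldsymbol{\Sigma})}
         {\sum_{k=1}^{C}\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma})}
  = \frac{\exp\!\left(\mathbf{w}_c^{\mathsf{T}}\mathbf{x} + b_c\right)}
         {\sum_{k=1}^{C}\exp\!\left(\mathbf{w}_k^{\mathsf{T}}\mathbf{x} + b_k\right)},
\qquad
\mathbf{w}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c,
\quad
b_c = -\tfrac{1}{2}\,\boldsymbol{\mu}_c^{\mathsf{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c + \log \pi_c .
```

The quadratic term $-\tfrac{1}{2}\mathbf{x}^{\mathsf{T}}\boldsymbol{\Sigma}^{-1}\mathbf{x}$ is shared by all components and cancels between numerator and denominator, which is exactly what the shared-covariance assumption buys: the posterior becomes a softmax over affine functions of $\mathbf{x}$.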

0:07:09 | So,

0:07:10 | basically, by assuming shared covariance matrices

0:07:15 | in the GMMs, we get the

0:07:19 | same formulation as in NetVLAD.

0:07:23 | Okay, in NetVLAD,

0:07:25 | the learnable parameters are

0:07:29 | these three:

0:07:31 | the weights,

0:07:33 | the biases, and the means.

0:07:36 | And estimating these weights and biases is decoupled from the means.

0:07:43 | We see from the posterior computation formula that it does not depend

0:07:49 | on the mean vectors, which is quite interesting when compared to the

0:07:54 | standard GMMs.

0:07:57 | But anyway,

0:07:59 | after we have computed the posteriors

0:08:03 | for all the

0:08:06 | input feature vectors,

0:08:09 | we can compute the component-wise

0:08:13 | outputs

0:08:14 | of the NetVLAD layer,

0:08:18 | formulated

0:08:19 | on the right side of the screen.

0:08:22 | And then,

0:08:23 | in the numerator,

0:08:26 | we have the first-order centered sufficient statistics;

0:08:32 | the denominator just length-normalizes them.

0:08:37 | So for each Gaussian component we get one

0:08:41 | vector,

0:08:42 | and finally,

0:08:44 | the NetVLAD layer

0:08:45 | concatenates these

0:08:48 | component-wise outputs to form a supervector.

0:08:53 | So this is very similar to

0:08:56 | standard

0:08:57 | GMM supervectors

0:08:59 | and how they are formed.
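As a rough sketch of the pooling just described (NumPy with hypothetical parameter shapes; an actual NetVLAD layer is trained end-to-end inside a network), the steps are: soft assignments from affine scores, centered first-order statistics per component, per-component length normalization, and concatenation:

```python
import numpy as np

def netvlad_pool(X, W, b, means, eps=1e-12):
    """NetVLAD-style pooling (sketch).
    X: (T, D) frame features; W: (C, D) assignment weights;
    b: (C,) assignment biases; means: (C, D) cluster centers."""
    scores = X @ W.T + b                                   # (T, C) affine assignment scores
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    gamma = np.exp(scores)
    gamma /= gamma.sum(axis=1, keepdims=True)              # softmax soft assignments
    # centered first-order statistics: sum_t gamma_tc * (x_t - mu_c)
    resid = gamma.T @ X - gamma.sum(axis=0)[:, None] * means           # (C, D)
    resid /= np.linalg.norm(resid, axis=1, keepdims=True) + eps        # length-normalize per component
    return resid.reshape(-1)                               # concatenate into a supervector
```

Note that `gamma` plays exactly the role of the GMM posteriors from the earlier slides, so the numerator rows are the same centered first-order statistics an i-vector extractor would consume.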

0:09:05 | Okay, next we

0:09:07 | do the same for the learned dictionary encoder

0:09:10 | pooling layer.

0:09:12 | So we start with the

0:09:13 | GMM posterior computation formula.

0:09:18 | Okay, this time we denote

0:09:21 | the log of the weight

0:09:23 | with one scalar term,

0:09:27 | and we get this

0:09:28 | by expanding the normal distributions.

0:09:34 | Okay,

0:09:35 | now if we assume

0:09:38 | isotropic,

0:09:39 | or spherical, covariance matrices,

0:09:44 | this formula

0:09:45 | will simplify into

0:09:47 | this form.

0:09:51 | And

0:09:52 | this is the

0:09:53 | formula used in a pooling layer known as the learned dictionary encoder,

0:09:59 | although in the

0:10:00 | original publication of the LDE,

0:10:05 | the bias term was not included; it was added later on by other authors.

0:10:14 | So the key point here was that

0:10:16 | by assuming isotropic covariance matrices, the LDE

0:10:22 | formulation follows from the standard GMM formulation.

0:10:28 | Then, the learnable parameters of this LDE layer are

0:10:34 | the scaling factors for the covariances, the mean vectors, and the

0:10:40 | bias terms.

0:10:44 | Similarly as with NetVLAD, we can then compute the component-wise outputs of

0:10:49 | this layer.

0:10:51 | So again, in the numerator we directly have the first-order sufficient statistics,

0:10:58 | but okay, unlike in NetVLAD, the denominator will be different: it is the

0:11:02 | sum of

0:11:04 | posteriors for

0:11:07 | each component.

0:11:09 | So this resembles the traditional maximum likelihood estimation of the

0:11:16 | component-wise mean offsets,

0:11:21 | and then the

0:11:22 | outputs are concatenated to form a supervector.
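A corresponding sketch of the LDE pooling (again a NumPy toy, with the later-added bias term included; `s` and `b` stand for the learnable scales and biases, `means` for the dictionary components, all hypothetical names):

```python
import numpy as np

def lde_pool(X, means, s, b, eps=1e-12):
    """Learned dictionary encoder pooling (sketch).
    X: (T, D) frames; means: (C, D); s: (C,) non-negative scales; b: (C,) biases."""
    diff = X[:, None, :] - means[None, :, :]                        # (T, C, D)
    # isotropic-covariance GMM scores: -s_c * ||x - mu_c||^2 + b_c
    scores = -s[None, :] * (diff ** 2).sum(axis=2) + b[None, :]     # (T, C)
    scores -= scores.max(axis=1, keepdims=True)                     # numerical stability
    gamma = np.exp(scores)
    gamma /= gamma.sum(axis=1, keepdims=True)                       # soft assignments
    N = gamma.sum(axis=0)                                           # (C,) sum of posteriors
    F = np.einsum('tc,tcd->cd', gamma, diff)                        # (C, D) centered 1st-order stats
    return (F / (N[:, None] + eps)).reshape(-1)                     # mean residuals -> supervector
```

The only structural difference from the NetVLAD sketch is the denominator: dividing by the summed posteriors `N` makes each component's output a posterior-weighted mean residual rather than a length-normalized sum.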

0:11:29 | Okay,

0:11:31 | so now we have the necessary

0:11:35 | constructs explained, so let's get to the proposed neural i-vectors.

0:11:41 | So we start with

0:11:43 | the standard

0:11:46 | x-vector extractor architecture,

0:11:50 | and we replace its

0:11:52 | standard pooling layer

0:11:54 | with either the NetVLAD or the LDE encoder.

0:12:00 | And as we saw from the previous slides, we can use

0:12:04 | these pooling layers to extract sufficient statistics.

0:12:09 | So we do that,

0:12:11 | and by using these sufficient statistics, we can train a

0:12:16 | regular i-vector extractor, and we can also then extract i-vectors from these statistics.

0:12:25 | So that's the idea.
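For reference, once a pooling layer has produced the zeroth-order statistics $N_c$ and centered first-order statistics $\mathbf{f}_c$, the standard i-vector point estimate (the posterior mean of the latent variable) is

```latex
\boldsymbol{\phi}
  = \Big(\mathbf{I} + \sum_{c=1}^{C} N_c\,\mathbf{T}_c^{\mathsf{T}}\boldsymbol{\Sigma}_c^{-1}\mathbf{T}_c\Big)^{-1}
    \sum_{c=1}^{C} \mathbf{T}_c^{\mathsf{T}}\boldsymbol{\Sigma}_c^{-1}\mathbf{f}_c ,
```

where $\mathbf{T}_c$ is the component-$c$ block of the total variability matrix and $\boldsymbol{\Sigma}_c$ the corresponding residual covariance. Note how the per-component frame counts $N_c$ enter the estimate: the i-vector is shrunk toward the prior for components with little data.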

0:12:32 | So now we can complete this table.

0:12:37 | So how does our proposed construct differ from the others?

0:12:45 | A notable difference in the roles is that our

0:12:48 | i-vector extractor is generative;

0:12:52 | otherwise the construct is much the same.

0:12:55 | If we compare our proposed neural i-vectors with the DNN i-vectors,

0:13:01 | we can see that the

0:13:02 | role of the i-vector extractor is the same, but our system

0:13:05 | uses a DNN that

0:13:08 | was trained with speaker targets,

0:13:13 | and also the features are obtained from the

0:13:17 | last layer before the pooling layer.

0:13:22 | Next,

0:13:23 | let's move on to the experiments and results.

0:13:27 | So we conducted speaker verification experiments on the Speakers in the Wild evaluation.

0:13:33 | First, we compare our proposed results with other i-vector systems

0:13:39 | that we could find from the literature, and these are some of the best ones.

0:13:46 | On the first line, we have a standard MFCC-based i-vector,

0:13:51 | and on the second line we have an i-vector system that uses perceptual linear prediction

0:13:56 | features together with additional features,

0:14:00 | and this WPE is a

0:14:03 | dereverberation method.

0:14:06 | So we can see from these results that the neural i-vectors perform the best.

0:14:13 | Okay,

0:14:14 | so let's next

0:14:18 | compare our results with

0:14:21 | the DNN speaker embeddings.

0:14:23 | So we can use the same DNN to extract either the sufficient statistics for the

0:14:27 | neural i-vectors,

0:14:30 | or we can extract the speaker embeddings directly from the DNN.

0:14:38 | So,

0:14:39 | here are our results.

0:14:45 | On the first line, we have

0:14:48 | the DNN with the learned dictionary encoder pooling,

0:14:53 | which obtained a 2.10 equal error rate.

0:14:57 | Then we have the corresponding neural i-vectors;

0:15:00 | that is, we used the same DNN to extract the sufficient statistics

0:15:05 | and then

0:15:06 | trained the generative

0:15:08 | i-vector extractor. So,

0:15:11 | with the neural i-vectors we got 1.93.

0:15:14 | So, not bad.

0:15:17 | Okay, on the third line we have a modification of the learned dictionary encoder:

0:15:23 | this uses

0:15:25 | diagonal

0:15:27 | covariance matrices instead of

0:15:29 | isotropic covariance matrices.

0:15:33 | So we got some improvements by doing

0:15:36 | this modification.

0:15:38 | The last two lines then show the

0:15:41 | corresponding results for the NetVLAD layer.

0:15:47 | So the interesting

0:15:49 | thing here,

0:15:51 | what I wonder, is

0:15:54 | what,

0:15:55 | what causes the performance difference between the

0:15:59 | generative i-vectors and the DNN embeddings,

0:16:02 | because these are using the same DNN,

0:16:05 | but...

0:16:09 | So there are two possible sources for this discrepancy.

0:16:14 | So the first one

0:16:16 | is the difference between the

0:16:20 | generative i-vector extractor and the

0:16:24 | part after the pooling layer.

0:16:27 | So,

0:16:29 | because after the pooling layer there is only one layer

0:16:33 | left in the DNN here,

0:16:35 | only this small part seems to really

0:16:40 | explain the differences

0:16:41 | in the equal error rate.

0:16:45 | So it seems that the discriminative

0:16:48 | training objective is better.

0:16:52 | Okay, there is another

0:16:54 | possible reason for this performance difference.

0:17:00 | So there is a kind of mismatch between how we trained the

0:17:04 | DNN,

0:17:06 | or, how we trained the DNN pooling layer, and

0:17:10 | how we use it in the i-vector approach.

0:17:14 | You can see that in the DNN training we explicitly form a supervector,

0:17:20 | and in the

0:17:21 | i-vector

0:17:24 | approach it is not so:

0:17:26 | the i-vector is obtained as a Bayesian estimate, so

0:17:30 | it

0:17:32 | takes into

0:17:33 | account how many alignments, how many frames are aligned to each of the Gaussian components.

0:17:39 | So this is missing from the supervector approach.

0:17:44 | So this is one of the

0:17:47 | future works:

0:17:49 | to modify how we use the DNN pooling layer so that it will resemble more the

0:17:55 | i-vector approach,

0:17:58 | so this mismatch will be gone then.

0:18:03 | Another

0:18:05 | idea for the future work is

0:18:07 | explained here.

0:18:09 | So,

0:18:10 | instead of using the DNN to extract sufficient statistics,

0:18:14 | we use the DNN as the universal background model, so that

0:18:19 | the posteriors come from this pooling layer,

0:18:24 | and

0:18:25 | by using these,

0:18:27 | we will then

0:18:30 | have a neural GMM-UBM system with frame-based scoring.

0:18:36 | So this might be useful for some

0:18:39 | special applications, for example for very short utterance speaker verification.
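For context, classical GMM-UBM scoring averages a per-frame log-likelihood ratio between a target speaker model and the UBM; in the neural variant sketched here, the component likelihoods would instead come from the pooling layer. A minimal classical version (toy diagonal-covariance GMMs passed as hypothetical `(weights, means, variances)` tuples):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)        # (C,)
    diff = X[:, None, :] - means[None, :, :]                           # (T, C, D)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)   # (T, C)
    log_p = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    m = log_p.max(axis=1, keepdims=True)                               # stable log-sum-exp
    return (m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))).ravel()

def gmm_ubm_score(X, target, ubm):
    """Average per-frame log-likelihood ratio: target model vs. UBM."""
    return float(np.mean(gmm_loglik(X, *target) - gmm_loglik(X, *ubm)))
```

Because the score is an average over frames, it degrades gracefully as the number of frames shrinks, which is why this kind of frame-based scoring is attractive for very short utterances.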

0:18:50 | Before I finish, I have two related announcements. The first one is that the program codes are

0:18:55 | available.

0:18:57 | So we have the i-vector extractor and embedding training systems, and in addition to Speakers in

0:19:02 | the Wild, we also have VoxCeleb recipes.

0:19:06 | The code is Python and PyTorch based,

0:19:09 | and we hope that it can benefit further research.

0:19:16 | The second announcement is that this study was also included in my dissertation,

0:19:23 | and the public examination of this dissertation will be coming in a few

0:19:30 | weeks, so

0:19:32 | anyone who wants to join is free to do so; the details can be found,

0:19:38 | well,

0:19:38 | here.

0:19:41 | So, see you there!