0:00:14 | Hello, my name is Ville Vestman,

0:00:16 | and in this video I describe our work on neural i-vectors.

0:00:22 | This work was co-authored with me by Kong Aik Lee and Tomi Kinnunen.

0:00:27 | Tomi and I are from the University of Eastern Finland, and Kong Aik was with NEC at the time of writing.

0:00:37 | Our study proposes a new way of combining Gaussian mixture model based generative i-vector

0:00:43 | models with discriminatively trained deep neural network speaker embeddings for the speaker verification task.

0:00:51 | Our aim is to improve upon existing i-vector systems,

0:00:56 | and we also hope to gain some insight into what causes the performance differences, I

0:01:02 | mean,

0:01:03 | between generative i-vector embeddings and discriminatively trained DNN speaker embeddings.

0:01:09 | Our study also establishes connections between Gaussian mixture models and some

0:01:15 | of the existing

0:01:17 | DNN pooling layers.

0:01:21 | As background for our work, four different constructs are considered.

0:01:26 | The last three constructs presented here

0:01:31 | combine ideas from both i-vectors and DNNs.

0:01:36 | We pay special attention to the roles of the universal background models and i-vector extractors in all these

0:01:44 | constructs.

0:01:46 | Let's begin with the standard i-vector.

0:01:50 | So,

0:01:51 | the key components here are the two generative models:

0:01:56 | the Gaussian mixture model based universal background model and

0:02:00 | the i-vector extractor.

0:02:04 | So the UBM

0:02:05 | is used together with

0:02:07 | the acoustic features to compute the sufficient statistics for the

0:02:12 | i-vector extractor, which then

0:02:14 | extracts the i-vectors.

0:02:18 | Note that the features are rule-based and the rest of the components are

0:02:23 | generatively trained.
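To make the statistics computation concrete, here is a minimal sketch (a toy diagonal-covariance GMM in NumPy; this is an illustration, not the exact recipe or code used in the talk) of how frame posteriors from a UBM turn into the zeroth- and centered first-order sufficient statistics consumed by an i-vector extractor:

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Frame posteriors of a diagonal-covariance GMM.
    X: (T, D) frames; weights: (C,); means, variances: (C, D)."""
    # log N(x | mu_c, diag(var_c)) for every frame/component pair
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)        # (C,)
    diff = X[:, None, :] - means[None, :, :]                           # (T, C, D)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)   # (T, C)
    log_p = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    log_p -= log_p.max(axis=1, keepdims=True)                          # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def sufficient_statistics(X, weights, means, variances):
    """Zeroth- and centered first-order Baum-Welch statistics."""
    gamma = gmm_posteriors(X, weights, means, variances)   # (T, C) responsibilities
    N = gamma.sum(axis=0)                                  # (C,)  zeroth order
    F = gamma.T @ X - N[:, None] * means                   # (C, D) centered first order
    return N, F
```

The pair `(N, F)` is exactly what an i-vector extractor needs per utterance, regardless of whether the posteriors came from a GMM or, as later in this talk, from a network.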

0:02:28 | Then, in the DNN i-vectors construct, the universal background model is replaced by

0:02:36 | a DNN that takes acoustic features as an input and produces the senone posteriors as

0:02:42 | an output.

0:02:44 | These posteriors are used together with the acoustic features to compute the sufficient

0:02:50 | statistics for the i-vector extractor.

0:02:55 | So this construct differs from the standard i-vector in that the universal background model is discriminatively trained

0:03:03 | with phonetic targets.

0:03:06 | The third system is the end-to-end i-vector system.

0:03:09 | This system combines three neural network modules:

0:03:16 | these modules are a features-to-statistics

0:03:19 | network, a

0:03:20 | statistics-to-i-vector network, and a backend module

0:03:23 | that is responsible for

0:03:25 | scoring pairs of i-vectors.

0:03:29 | The training of this

0:03:31 | kind of network goes as follows.

0:03:34 | The data are first used to train these individual modules separately;

0:03:38 | briefly put, these can benefit from being initialized from the

0:03:45 | corresponding generative models.

0:03:49 | After these modules have been trained separately, they can be combined and end-to-end

0:03:55 | trained.

0:03:59 | So this construct

0:04:02 | utilizes generative models in the initialization stage,

0:04:07 | while end-to-end discriminative training is used for the whole network.

0:04:14 | Then, the last background construct discussed here

0:04:19 | is using a DNN with a mixture factor analysis pooling layer.

0:04:23 | In this work, the authors used this DNN to extract speaker

0:04:29 | embeddings.

0:04:31 | What is special about this construct

0:04:33 | is that it uses its own pooling layer:

0:04:38 | this pooling layer is basically an i-vector extractor implemented inside the DNN.

0:04:45 | The MFA pooling layer is based on another pooling layer known as the

0:04:49 | learned dictionary encoder;

0:04:52 | the learned dictionary encoder is used to obtain the frame alignments.

0:05:00 | So here, all the components of this last construct are discriminatively trained with

0:05:07 | speaker targets.

0:05:12 | Okay,

0:05:12 | next we move on to the proposed neural i-vectors.

0:05:18 | Before explaining the construct itself,

0:05:22 | we will need to go through some

0:05:24 | prerequisites for our model,

0:05:27 | and these are the NetVLAD and

0:05:30 | the LDE pooling layers. I will describe these two pooling layers by showing how they relate

0:05:36 | to the standard GMM,

0:05:38 | so the next slides will be quite math-heavy.

0:05:45 | So, first the NetVLAD.

0:05:48 | We will start with

0:05:50 | the posterior computation formula of a standard GMM

0:05:55 | and see how we get the

0:05:58 | NetVLAD formulation out of this equation.

0:06:04 | So, okay, here we have:

0:06:07 | C is the number of Gaussian components, and each Gaussian component has a

0:06:12 | covariance matrix, a mean vector, and an associated weight.

0:06:18 | Okay, NetVLAD assumes shared covariance matrices

0:06:23 | for all Gaussian components.

0:06:27 | We will rewrite this formula into this form

0:06:30 | by expanding the normal distributions.

0:06:38 | Then,

0:06:39 | by denoting

0:06:41 | this inverse covariance times mean vector term

0:06:45 | by w,

0:06:48 | and by denoting the remaining terms, minus one half mu transpose sigma inverse mu plus the log of the weight, by b,

0:06:55 | we get

0:06:56 | this.

0:06:59 | And this happens to be exactly the

0:07:02 | formula used in the NetVLAD

0:07:04 | paper from 2016.
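Written out (a reconstruction from the description above, with shared covariance $\boldsymbol{\Sigma}$, component weights $\pi_c$, and means $\boldsymbol{\mu}_c$), the derivation gives:

```latex
\gamma_c(\mathbf{x})
  = \frac{\pi_c\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_c,\boldsymbol{\Sigma})}
         {\sum_{k=1}^{C}\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma})}
  = \frac{\exp\!\left(\mathbf{w}_c^{\mathsf{T}}\mathbf{x} + b_c\right)}
         {\sum_{k=1}^{C}\exp\!\left(\mathbf{w}_k^{\mathsf{T}}\mathbf{x} + b_k\right)},
\qquad
\mathbf{w}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c,
\quad
b_c = -\tfrac{1}{2}\,\boldsymbol{\mu}_c^{\mathsf{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c + \log \pi_c .
```

The quadratic term $-\tfrac{1}{2}\mathbf{x}^{\mathsf{T}}\boldsymbol{\Sigma}^{-1}\mathbf{x}$ is shared by all components and cancels between numerator and denominator, which is exactly what the shared-covariance assumption buys: the posterior becomes a softmax over affine functions of $\mathbf{x}$.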

0:07:09 | So,

0:07:10 | basically, by assuming shared covariance matrices

0:07:15 | in the GMMs, we get the

0:07:19 | same formulation as in NetVLAD.

0:07:23 | Okay, in NetVLAD,

0:07:25 | the learnable parameters are

0:07:29 | these three:

0:07:31 | the weights,

0:07:33 | the biases, and the means.

0:07:36 | And estimating these weights and biases is decoupled from the means.

0:07:43 | We see from the posterior computation formula that it does not depend

0:07:49 | on the mean vectors, which is quite interesting when compared to the

0:07:54 | standard GMMs.

0:07:57 | But anyway,

0:07:59 | after we have computed the posteriors

0:08:03 | for all the

0:08:06 | input feature vectors,

0:08:09 | we can compute the component-wise

0:08:13 | outputs

0:08:14 | of the NetVLAD layer,

0:08:18 | formulated

0:08:19 | on the right side of the screen.

0:08:22 | And then,

0:08:23 | in the numerator,

0:08:26 | we have the first-order centered sufficient statistics;

0:08:32 | the denominator just length-normalizes them.

0:08:37 | So for each Gaussian component we get one

0:08:41 | vector,

0:08:42 | and finally,

0:08:44 | the NetVLAD layer

0:08:45 | concatenates these

0:08:48 | component-wise outputs to form a supervector.

0:08:53 | So this is very similar to

0:08:56 | standard

0:08:57 | GMM supervectors

0:08:59 | and how they are formed.
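As a rough sketch of the pooling just described (NumPy with hypothetical parameter shapes; an actual NetVLAD layer is trained end-to-end inside a network), the steps are: soft assignments from affine scores, centered first-order statistics per component, per-component length normalization, and concatenation:

```python
import numpy as np

def netvlad_pool(X, W, b, means, eps=1e-12):
    """NetVLAD-style pooling (sketch).
    X: (T, D) frame features; W: (C, D) assignment weights;
    b: (C,) assignment biases; means: (C, D) cluster centers."""
    scores = X @ W.T + b                                   # (T, C) affine assignment scores
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    gamma = np.exp(scores)
    gamma /= gamma.sum(axis=1, keepdims=True)              # softmax soft assignments
    # centered first-order statistics: sum_t gamma_tc * (x_t - mu_c)
    resid = gamma.T @ X - gamma.sum(axis=0)[:, None] * means           # (C, D)
    resid /= np.linalg.norm(resid, axis=1, keepdims=True) + eps        # length-normalize per component
    return resid.reshape(-1)                               # concatenate into a supervector
```

Note that `gamma` plays exactly the role of the GMM posteriors from the earlier slides, so the numerator rows are the same centered first-order statistics an i-vector extractor would consume.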

0:09:05 | Okay, next we

0:09:07 | do the same for the learned dictionary encoder

0:09:10 | pooling layer.

0:09:12 | So we start with the

0:09:13 | GMM posterior computation formula.

0:09:18 | Okay, this time we denote

0:09:21 | the log of the weight

0:09:23 | with one scalar term,

0:09:27 | and we get this

0:09:28 | by expanding the normal distributions.

0:09:34 | Okay,

0:09:35 | now if we assume

0:09:38 | isotropic,

0:09:39 | or spherical, covariance matrices,

0:09:44 | this formula

0:09:45 | will simplify into

0:09:47 | this form.

0:09:51 | And

0:09:52 | this is the

0:09:53 | formula used in a pooling layer known as the learned dictionary encoder,

0:09:59 | although in the

0:10:00 | original publication of the LDE,

0:10:05 | the bias term was not included; it was added later on by other authors.

0:10:14 | So the key point here was that

0:10:16 | by assuming isotropic covariance matrices, the LDE

0:10:22 | formulation follows from the standard GMM formulation.

0:10:28 | Then, the learnable parameters of this LDE layer are

0:10:34 | the scaling factors for the covariances, the mean vectors, and the

0:10:40 | bias terms.

0:10:44 | Similarly as with NetVLAD, we can then compute the component-wise outputs of

0:10:49 | this layer.

0:10:51 | So again, in the numerator we directly have the first-order sufficient statistics,

0:10:58 | but okay, unlike in NetVLAD, the denominator will be different: it is the

0:11:02 | sum of

0:11:04 | posteriors for

0:11:07 | each component.

0:11:09 | So this resembles the traditional maximum likelihood estimation of the

0:11:16 | component-wise mean offsets,

0:11:21 | and then the

0:11:22 | outputs are concatenated to form a supervector.
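A corresponding sketch of the LDE pooling (again a NumPy toy, with the later-added bias term included; `s` and `b` stand for the learnable scales and biases, `means` for the dictionary components, all hypothetical names):

```python
import numpy as np

def lde_pool(X, means, s, b, eps=1e-12):
    """Learned dictionary encoder pooling (sketch).
    X: (T, D) frames; means: (C, D); s: (C,) non-negative scales; b: (C,) biases."""
    diff = X[:, None, :] - means[None, :, :]                        # (T, C, D)
    # isotropic-covariance GMM scores: -s_c * ||x - mu_c||^2 + b_c
    scores = -s[None, :] * (diff ** 2).sum(axis=2) + b[None, :]     # (T, C)
    scores -= scores.max(axis=1, keepdims=True)                     # numerical stability
    gamma = np.exp(scores)
    gamma /= gamma.sum(axis=1, keepdims=True)                       # soft assignments
    N = gamma.sum(axis=0)                                           # (C,) sum of posteriors
    F = np.einsum('tc,tcd->cd', gamma, diff)                        # (C, D) centered 1st-order stats
    return (F / (N[:, None] + eps)).reshape(-1)                     # mean residuals -> supervector
```

The only structural difference from the NetVLAD sketch is the denominator: dividing by the summed posteriors `N` makes each component's output a posterior-weighted mean residual rather than a length-normalized sum.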

0:11:29 | Okay,

0:11:31 | so now we have the necessary

0:11:35 | constructs explained, so let's get to the proposed neural i-vectors.

0:11:41 | So we start with

0:11:43 | the standard

0:11:46 | x-vector extractor architecture,

0:11:50 | and we replace its

0:11:52 | standard pooling layer

0:11:54 | with either the NetVLAD or the LDE encoder.

0:12:00 | And as we saw from the previous slides, we can use

0:12:04 | these pooling layers to extract sufficient statistics.

0:12:09 | So we do that,

0:12:11 | and by using these sufficient statistics, we can train a

0:12:16 | regular i-vector extractor, and we can also then extract i-vectors from these statistics.

0:12:25 | So that's the idea.
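For reference, once a pooling layer has produced the zeroth-order statistics $N_c$ and centered first-order statistics $\mathbf{f}_c$, the standard i-vector point estimate (the posterior mean of the latent variable) is

```latex
\boldsymbol{\phi}
  = \Big(\mathbf{I} + \sum_{c=1}^{C} N_c\,\mathbf{T}_c^{\mathsf{T}}\boldsymbol{\Sigma}_c^{-1}\mathbf{T}_c\Big)^{-1}
    \sum_{c=1}^{C} \mathbf{T}_c^{\mathsf{T}}\boldsymbol{\Sigma}_c^{-1}\mathbf{f}_c ,
```

where $\mathbf{T}_c$ is the component-$c$ block of the total variability matrix and $\boldsymbol{\Sigma}_c$ the corresponding residual covariance. Note how the per-component frame counts $N_c$ enter the estimate: the i-vector is shrunk toward the prior for components with little data.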

0:12:32 | So now we can complete this table.

0:12:37 | So how does our proposed construct differ from the others?

0:12:45 | A notable difference in the roles is that our

0:12:48 | i-vector extractor is generative;

0:12:52 | otherwise the construct is much the same.

0:12:55 | If we compare our proposed neural i-vectors with the DNN i-vectors,

0:13:01 | we can see that the

0:13:02 | role of the i-vector extractor is the same, but our system

0:13:05 | uses a DNN that

0:13:08 | was trained with speaker targets,

0:13:13 | and also the features are obtained from the

0:13:17 | last layer before the pooling layer.

0:13:22 | Next,

0:13:23 | let's move on to the experiments and results.

0:13:27 | So we conducted speaker verification experiments on the Speakers in the Wild evaluation.

0:13:33 | First, we compare our proposed results with other i-vector systems

0:13:39 | that we could find from the literature, and these are some of the best ones.

0:13:46 | On the first line, we have a standard MFCC-based i-vector,

0:13:51 | and on the second line we have an i-vector system that uses perceptual linear prediction

0:13:56 | features together with additional features,

0:14:00 | and this WPE is a

0:14:03 | dereverberation method.

0:14:06 | So we can see from these results that the neural i-vectors perform the best.

0:14:13 | Okay,

0:14:14 | so let's next

0:14:18 | compare our results with

0:14:21 | the DNN speaker embeddings.

0:14:23 | So we can use the same DNN to extract either the sufficient statistics for the

0:14:27 | neural i-vectors,

0:14:30 | or we can extract the speaker embeddings directly from the DNN.

0:14:38 | So,

0:14:39 | here are our results.

0:14:45 | On the first line, we have

0:14:48 | the DNN with the learned dictionary encoder pooling,

0:14:53 | which obtained a 2.10 equal error rate.

0:14:57 | Then we have the corresponding neural i-vectors;

0:15:00 | that is, we used the same DNN to extract the sufficient statistics

0:15:05 | and then

0:15:06 | trained the generative

0:15:08 | i-vector extractor. So,

0:15:11 | with the neural i-vectors we got 1.93.

0:15:14 | So, not bad.

0:15:17 | Okay, on the third line we have a modification of the learned dictionary encoder:

0:15:23 | this uses

0:15:25 | diagonal

0:15:27 | covariance matrices instead of

0:15:29 | isotropic covariance matrices.

0:15:33 | So we got some improvements by doing

0:15:36 | this modification.

0:15:38 | The last two lines then show the

0:15:41 | corresponding results for the NetVLAD layer.

0:15:47 | So the interesting

0:15:49 | thing here,

0:15:51 | what I wonder, is

0:15:54 | what,

0:15:55 | what causes the performance difference between the

0:15:59 | generative i-vectors and the DNN embeddings,

0:16:02 | because these are using the same DNN,

0:16:05 | but...

0:16:09 | So there are two possible sources for this discrepancy.

0:16:14 | So the first one

0:16:16 | is the difference between the

0:16:20 | generative i-vector extractor and the

0:16:24 | part after the pooling layer.

0:16:27 | So,

0:16:29 | because after the pooling layer there is only one layer

0:16:33 | left in the DNN here,

0:16:35 | only this small part seems to really

0:16:40 | explain the differences

0:16:41 | in the equal error rate.

0:16:45 | So it seems that the discriminative

0:16:48 | training objective is better.

0:16:52 | Okay, there is another

0:16:54 | possible reason for this performance difference.

0:17:00 | So there is a kind of mismatch between how we trained the

0:17:04 | DNN,

0:17:06 | or, how we trained the DNN pooling layer, and

0:17:10 | how we use it in the i-vector approach.

0:17:14 | You can see that in the DNN training we explicitly form a supervector,

0:17:20 | and in the

0:17:21 | i-vector

0:17:24 | approach it is not so:

0:17:26 | the i-vector is obtained as a Bayesian estimate, so

0:17:30 | it

0:17:32 | takes into

0:17:33 | account how many alignments, how many frames are aligned to each of the Gaussian components.

0:17:39 | So this is missing from the supervector approach.

0:17:44 | So this is one of the

0:17:47 | future works:

0:17:49 | to modify how we use the DNN pooling layer so that it will resemble more the

0:17:55 | i-vector approach,

0:17:58 | so this mismatch will be gone then.

0:18:03 | Another

0:18:05 | idea for the future work is

0:18:07 | explained here.

0:18:09 | So,

0:18:10 | instead of using the DNN to extract sufficient statistics,

0:18:14 | we use the DNN as the universal background model, so that

0:18:19 | the posteriors come from this pooling layer,

0:18:24 | and

0:18:25 | by using these,

0:18:27 | we will then

0:18:30 | have a neural GMM-UBM system with frame-based scoring.

0:18:36 | So this might be useful for some

0:18:39 | special applications, for example for very short utterance speaker verification.
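For context, classical GMM-UBM scoring averages a per-frame log-likelihood ratio between a target speaker model and the UBM; in the neural variant sketched here, the component likelihoods would instead come from the pooling layer. A minimal classical version (toy diagonal-covariance GMMs passed as hypothetical `(weights, means, variances)` tuples):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)        # (C,)
    diff = X[:, None, :] - means[None, :, :]                           # (T, C, D)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)   # (T, C)
    log_p = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    m = log_p.max(axis=1, keepdims=True)                               # stable log-sum-exp
    return (m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))).ravel()

def gmm_ubm_score(X, target, ubm):
    """Average per-frame log-likelihood ratio: target model vs. UBM."""
    return float(np.mean(gmm_loglik(X, *target) - gmm_loglik(X, *ubm)))
```

Because the score is an average over frames, it degrades gracefully as the number of frames shrinks, which is why this kind of frame-based scoring is attractive for very short utterances.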

0:18:50 | Before I finish, I have two related announcements. The first one is that the program codes are

0:18:55 | available.

0:18:57 | So we have the i-vector extractor and embedding training systems, and in addition to Speakers in

0:19:02 | the Wild, we also have VoxCeleb recipes.

0:19:06 | The code is Python and PyTorch based,

0:19:09 | and we hope that it can benefit further research.

0:19:16 | The second announcement is that this study was also included in my dissertation,

0:19:23 | and the public examination of this dissertation will be coming in a few

0:19:30 | weeks, so

0:19:32 | anyone who wants to join is free to do so; the details can be found,

0:19:38 | well,

0:19:38 | here.

0:19:41 | So, see you there!