0:00:13 So, you have already been told what an i-vector is, but let me go through it quickly again.
0:00:21 An i-vector is an information-rich, low-dimensional, fixed-length representation of a voice print extracted from an arbitrarily long utterance. We like i-vectors because they normalize for duration and turn the speaker-ID task into a pattern recognition problem, and the previous talks have already shown how to work with them.
0:00:40 Just to quickly go over i-vector estimation again: what we want to model is the data that comes in.
0:00:50 Here is an example of an utterance. We usually model the utterances using a Gaussian mixture model; we forget about the variances and keep only the means, and we stack the means into a supervector.
0:01:06 Now, if we look at more data and extract the means of all the utterances, we see that they follow some kind of distribution, and this is what we assume in the i-vector model. We see that they have some offset, which is represented by the UBM mean — the m symbol in this picture — with the individual utterance means shown by the dots. Then we have the total variability space, represented by the arrows, which describe in which direction we can shift the mean to adapt it to the incoming utterance; that is, they describe the directions of variability.
0:01:51 The vector w has a Bayesian nature, so we can impose a prior on it; we choose the standard normal prior. Then, given some incoming data X, we compute the posterior, which is also Gaussian, with mean w_X and precision matrix L_X, and what we call the i-vector is basically the mean of this posterior.
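In the usual i-vector notation, the model and the posterior just described can be summarized as follows (a reference summary with conventional symbol names, which may differ slightly from the ones on the slide):

```latex
% Total variability model: utterance-dependent supervector of GMM means
s = m + T\,w, \qquad w \sim \mathcal{N}(0, I)

% Posterior of w given the utterance data X; the i-vector is its mean
p(w \mid X) = \mathcal{N}\!\left(w \mid w_X,\; L_X^{-1}\right), \qquad \text{i-vector} := w_X
```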
0:02:22 Given any new data, this is just a cookbook recipe. To compute the i-vector we need the statistics extracted using the UBM: the zero-order statistics and the first-order statistics.
0:02:41 Before we go any further, we do a little trick: we center the data around the UBM, that is, we find which part of the data belongs to which component of the UBM and subtract the corresponding mean, and we also whiten the data using the UBM covariance matrices.
0:03:01 This covariance matrix, as you may have already realized, can be absorbed into the T matrix, which effectively makes the covariances of the individual GMM components equal to identity.
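A minimal sketch of this preprocessing step, assuming diagonal UBM covariances; the function and variable names are illustrative, not taken from the original implementation:

```python
import numpy as np

def preprocess_stats(N, F, ubm_means, ubm_covs):
    """Center the first-order stats around the UBM means and whiten them.

    N: (C,) zero-order statistics, F: (C, D) first-order statistics,
    ubm_means: (C, D) UBM means, ubm_covs: (C, D) diagonal UBM covariances.
    """
    F_centered = F - N[:, None] * ubm_means      # center around the UBM means
    F_whitened = F_centered / np.sqrt(ubm_covs)  # whiten; component covariances become identity
    return F_whitened
```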
0:03:22 So here is the cookbook equation for computing the i-vector: the i-vector w is basically a product of some terms of the posterior distribution, the factor loading matrix T, which describes the subspace, and the first-order statistics. The precision matrix L is basically a sum, over all the Gaussians, of the associated pieces of the T matrix, weighted by the zero-order statistics of the incoming utterance.
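A hedged sketch of this baseline extraction, assuming the whitened statistics from above (per-component covariances equal to identity); the names T, N, F and the shapes are assumptions of the sketch:

```python
import numpy as np

def extract_ivector(T, N, F):
    """Baseline i-vector extraction from preprocessed statistics.

    T: (C, D, M) factor-loading matrix split per GMM component,
    N: (C,) zero-order stats, F: (C, D) centered, whitened first-order stats.
    """
    C, D, M = T.shape
    # Precision: L = I + sum_c N_c * T_c^T T_c  -- the expensive per-utterance sum
    L = np.eye(M)
    for c in range(C):
        L += N[c] * T[c].T @ T[c]
    # Linear term: T^T F, summed over components
    b = sum(T[c].T @ F[c] for c in range(C))
    # The i-vector is the posterior mean, L^{-1} b
    return np.linalg.solve(L, b)
```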
0:03:59 Now for a little analysis of what this function actually computes. We have C GMM components, we have an F-dimensional feature space, and we have an M-dimensional subspace describing our total variability space.
0:04:21 The M-cubed term is the matrix inversion — there is not much we can do about that. The biggest problem is actually the sum in the precision computation, and then we have the products of the individual matrices with the first-order statistics.
0:04:45 As for the memory complexity, we have to store everything that can be precomputed — we can precompute the T_c^T T_c products in advance because they do not depend on the data — so the memory cost is really high for this model. If you consider that a typical model has thousands of Gaussians, this can get really large. The other thing we have to store is the T matrix itself. These two terms bound the memory complexity of the algorithm.
0:05:21 The motivation for simplifying this formula was that we wanted to port the application to small-scale devices, as part of the MOBIO project, and we also wanted to prepare the i-vector framework for discriminative training, where we thought that such equations could be quite difficult to compute gradients for.
0:05:43 Let's first take a look at the first simplification. What we assume here is that the proportion of the data generated by each Gaussian in the UBM is constant across utterances, and that this proportion is given by the UBM weights.
0:06:05 What happens then is that the sum in the precision computation becomes independent of the data, and we can effectively precompute it in advance. So each time we compute the precision, instead of evaluating the sum over all components, from the first one to the last one, we only have a scaled addition of two matrices.
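A sketch of this first simplification, under the stated assumption that the per-component occupation counts are proportional to the UBM weights; as before, names and shapes are assumptions of the sketch:

```python
import numpy as np

def precompute_weighted_sum(T, ubm_weights):
    """Offline: S = sum_c w_c * T_c^T T_c, which no longer depends on the utterance."""
    C, D, M = T.shape
    S = np.zeros((M, M))
    for c in range(C):
        S += ubm_weights[c] * T[c].T @ T[c]
    return S

def extract_ivector_simplified(T, S, N, F):
    """Run time: the precision is just a scaled addition of two matrices."""
    C, D, M = T.shape
    L = np.eye(M) + N.sum() * S               # assumes N_c ~ (total frame count) * w_c
    b = sum(T[c].T @ F[c] for c in range(C))
    return np.linalg.solve(L, b)
```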
0:06:42 A little analysis: in the computational complexity we only got rid of the C·M² term, but the memory complexity dropped significantly — basically we got rid of most of the data we were storing before.
0:06:59 Just a reminder before the results section: the typical number of Gaussians is in the thousands, and the typical size of the subspace is in the hundreds — four hundred, for example.
0:07:14 So that was the first simplification. We also had another thought: suppose we can find an orthogonalizing transformation G — some G that would diagonalize the T_c^T T_c terms, the component-associated parts of the factor loading matrix T, which are what bother us in the precision computation.
0:07:44 If we have such a transformation, then we can multiply the equation from both sides by it, getting something like this, and then, to recover the original precision, we would just multiply from both sides by the inverse of G — if our assumption were correct.
0:08:03 The nice thing here is that we would be summing diagonal matrices, which can be implemented very efficiently in C or in Matlab, and the other nice thing is that the resulting precision matrix is then diagonal as well.
0:08:21 If you remember, we were inverting the precision matrix in the i-vector extraction, so the inversion of a diagonal matrix is trivial here.
0:08:34 If we write this out efficiently, we can pack the diagonals of the G T_c^T T_c G^T terms into a single matrix and simply take a dot product with the vector of zero-order statistics to get the diagonal of the precision. The lowercase diag symbol here extracts the diagonal of a matrix into a column vector, and the capital Diag symbol maps a column vector back to a diagonal matrix.
0:09:13 The i-vector extraction is then given by the second equation here. The nice thing about it is that the G transpose in the middle can be projected directly into the T matrix, which again gives some benefit, and, as we said, the L matrix can now be inverted efficiently.
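An illustrative sketch of this diagonalized extraction, assuming an orthogonal G (as obtained from PCA below), so that the original precision can be recovered with G transposed; the packing of per-component diagonals and the variable names are assumptions of this sketch:

```python
import numpy as np

def precompute_diagonals(T, G):
    """Offline: pack diag(G T_c^T T_c G^T) for every component into a (C, M) matrix."""
    C = T.shape[0]
    return np.stack([np.diag(G @ T[c].T @ T[c] @ G.T) for c in range(C)])

def extract_ivector_diagonalized(T, G, D_packed, N, F):
    """Run time: the transformed precision is diagonal, so its inversion is trivial."""
    C, D, M = T.shape
    L_diag = 1.0 + N @ D_packed               # (M,) diagonal of the transformed precision
    b = sum(T[c].T @ F[c] for c in range(C))  # G @ T^T could also be folded into T offline
    return G.T @ ((G @ b) / L_diag)           # map back with G^{-1} = G^T for orthogonal G
```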
0:09:31 If we look at the analysis again: in the computational complexity we got rid of the terms that operated on full matrices, leaving only the diagonal ones, and in the memory complexity we got an extra C·M term, but we got rid of the C·M² term.
0:09:56 The question is how we compute the G matrix. The first idea was to use PCA, which we will see works. The second idea was to use HLDA — heteroscedastic linear discriminant analysis.
0:10:12 Here is a simple example of what it does: basically we want to rotate those two covariance matrices by forty-five degrees, so that the average within-class covariance becomes an identity matrix. That was the inspiration; it is used in LVCSR tasks.
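One plausible way to obtain an orthogonal G by PCA — a sketch under my own assumptions about the recipe, not necessarily exactly what was done here — is to eigendecompose a weighted average of the per-component T_c^T T_c matrices, so that the resulting basis approximately diagonalizes each individual term:

```python
import numpy as np

def compute_G_pca(T, ubm_weights):
    """Rows of G are the eigenvectors of the averaged T_c^T T_c (an orthogonal basis)."""
    C, D, M = T.shape
    avg = np.zeros((M, M))
    for c in range(C):
        avg += ubm_weights[c] * T[c].T @ T[c]
    eigvals, eigvecs = np.linalg.eigh(avg)    # symmetric eigendecomposition
    return eigvecs.T                          # G @ avg @ G.T is exactly diagonal
```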
0:10:36 Just a quick recap of how the T matrix is trained, for those who don't know: there is a pair of accumulators that we have to fill while training the T matrix. We go over all the training utterances, run some computation, accumulate into them, and do an update at the end of the procedure.
0:10:59 Without going into the theoretical explanation, inside this computation we see that we use w, which is the posterior mean, and the precision matrix. So if we know how to simplify this precision matrix, we can simplify the i-vector extraction and this training procedure as well.
0:11:20 The memory usage is reduced by this simple trick — we get down to about half of the memory — which means we can effectively try to increase the other parameters, for example the number of Gaussians or the subspace dimensionality, for comparison.
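A very rough sketch of the accumulators collected during T training, to show where the posterior mean w and the precision L enter; the update equations and names here are my assumptions, not the exact training code:

```python
import numpy as np

def _posterior(T, N, F):
    """Posterior mean and precision of w for one utterance (preprocessed stats)."""
    C, D, M = T.shape
    L = np.eye(M) + sum(N[c] * T[c].T @ T[c] for c in range(C))
    w = np.linalg.solve(L, sum(T[c].T @ F[c] for c in range(C)))
    return w, L

def accumulate_T_stats(T, utterances):
    """One pass over (N, F) statistics pairs, filling the pair of accumulators."""
    C, D, M = T.shape
    A = np.zeros((C, M, M))   # accumulates N_c * E[w w^T]
    B = np.zeros((C, D, M))   # accumulates F_c * E[w]^T
    for N, F in utterances:
        w, L = _posterior(T, N, F)
        Eww = np.linalg.inv(L) + np.outer(w, w)   # second moment of the posterior
        for c in range(C):
            A[c] += N[c] * Eww
            B[c] += np.outer(F[c], w)
    return A, B               # update step: the new T_c solves T_c @ A[c] = B[c]
```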
0:11:42 As for the experimental setup, we used MFCC features — the standard thing — with short-time cepstral mean and variance normalization, and deltas and double deltas.
0:11:56 For the training set we used different combinations of Switchboard II Phases 2 and 3, Switchboard Cellular, and the NIST 2004 to 2006 data, which were used for training the T matrix.
0:12:09 As the test set, we evaluated on the NIST SRE 2010 extended core condition 5, which is telephone-telephone, female and male.
0:12:18 One thing to mention on this slide is that we used exactly the same scoring as was described in the previous talk, that is, the cosine distance with within-class covariance normalization.
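A small sketch of that scoring step — cosine similarity between i-vectors after within-class covariance normalization (WCCN). The Cholesky-based projection is one common formulation; treat the details as illustrative rather than the exact setup used here:

```python
import numpy as np

def wccn_matrix(within_class_cov):
    """Projection B with B @ B.T = W^{-1}, estimated from the training i-vectors."""
    return np.linalg.cholesky(np.linalg.inv(within_class_cov))

def cosine_score(w_enroll, w_test, B):
    """Cosine distance scoring in the WCCN-projected space."""
    a, b = B.T @ w_enroll, B.T @ w_test
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```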
0:12:28 For the performance part, because we measured both the speed and the memory demands, we used the Matlab environment, set to single-core, single-thread operation, running on one of our internal servers.
0:12:46 We measured the speed on fifty randomly picked utterances from the Mixer corpus, for which we had the statistics computed in advance, so the statistics collection is not included in the analysis.
0:12:59 The UBM was a diagonal-covariance, 2048-component UBM, and it was trained on the Fisher data.
0:13:09 A summary of the numbers: we used 2048 Gaussians, the feature dimension was sixty, and we used a 400-dimensional subspace. 400 was chosen as a trade-off between performance and technical conditions — by which I mean the configuration of the machines that computed the i-vectors.
0:13:31 As I said, with one of our simplifications we were able to decrease the memory demands, so to be fair we also had to include an 800-dimensional configuration, just to see what happens.
0:13:50 Here is a little constellation plot of the results. The cross here is the baseline, and the block lower down corresponds to the 800-dimensional traditional i-vector extraction.
0:14:09 You can see that the simplified systems perform slightly worse than the baseline, but this is just an informative picture. We see that the best non-traditional i-vector extractor goes from about 3.6 to about 3.8 percent equal error rate, and the same can be said, analogously, for the normalized DCF.
0:14:37 So the systems are slightly worse, but this work was mainly aimed at the analysis of the speed.
0:14:42 So let's look at the analysis of the speed, at how fast the computation is. With the baseline, extracting those fifty i-vectors took around thirteen seconds.
0:15:01 You can see the relative numbers here: for the 800-dimensional baseline there is a huge drop in speed, because the complexity grows.
0:15:17 A nice outcome is that if we are able to train such a system without additional hardware, we can afford to use 800-dimensional i-vectors and still need only about ten percent of the original time that was necessary to compute those fifty i-vectors.
0:15:38 Now let's take a look at the comparison of memory usage. For the baseline, the second column shows what is constant — what we cannot change and have to keep in memory.
0:15:53 The numbers show a dramatic decrease in memory needs for the simplified algorithms. Even if we want to use 800-dimensional i-vectors, we still need only a fraction of the memory that the traditional 800-dimensional baseline system would need — a configuration which, again, is hardly practical.
0:16:24 So this just shows that we can use these simplifications in the i-vector training procedure as well: we save space, and the simplifications also make the process a lot faster.
0:16:38 And these numbers show that the difference between training with the traditional i-vector extraction and training with the simplified i-vector extraction is small, so we can really use the simplified one.
0:16:57 So the conclusion is that we managed to simplify the state-of-the-art technique in terms of speed and memory, while sacrificing some of the recognition performance.
0:17:11 We have also simplified the formula so that it is easily differentiable, which matters for future work — the discriminative training of the i-vector extractor: the matrix T, or G, and others.
0:17:27 And finally, we managed to fit the i-vector based system into a cellphone application, which was one of the tasks in the MOBIO project, aimed at mobile speaker recognition.
0:17:47 Thank you.
0:17:53 [Session chair] Thank you. We have time for one or two questions.
0:18:02 Any questions?
0:18:05 [Audience] I have a question. You made two assumptions to simplify your algorithm. Did you verify in some way which of the assumptions was violated — did you check it on the data, or just by looking at the scores?
0:18:23 [Speaker] No.
0:18:23 [Audience] So you don't know whether one or the other assumption was wrong?
0:18:30 [Speaker] Well, we were only looking at the recognition performance. But yes, there was a mismatch, of course: the proportion of the data generated by each Gaussian is different, it is not always equal to the UBM weight.
0:18:50 And because we were using 2048 Gaussians, finding one single orthogonalization matrix is probably also not entirely appropriate here. But we tried it, and it worked to some extent.
0:19:09 [Session chair] Okay. More questions?
0:19:20 [Audience question, inaudible.]
0:19:21 [Speaker] No, I did not combine the techniques — we have not combined the techniques.
0:19:31 [Speaker] I'm sorry, I didn't catch that.
0:19:33 [Exchange partly inaudible — whether it was better than PCA, for G.]
0:19:40 [Speaker] Than the baseline? No.
0:19:44 Yes, thank you — that's a good point.
0:19:52 [Session chair] Okay, if there are no other questions, let's thank the speaker again.