As was already said, you have been told what an i-vector is, so let me go through it only quickly. An i-vector is an information-rich, low-dimensional, fixed-length representation of a voice print extracted from an arbitrarily long utterance. We like these little vectors because they remove the time dimension and turn the speaker identification task into a plain pattern recognition problem, and you have already been shown what can be done with them.

Just to go quickly through the estimation again: what we want to model is the data that come in. Here we have an example of an utterance. We usually model it with a Gaussian mixture model, forget about the variances, keep only the means, and construct a supervector of the means. Now, if we do this for more data, extract the means of all the utterances and look at how they are distributed, we see that they have some kind of structure, and this is exactly what we assume in the i-vector model: there is an offset, represented by the UBM mean, the m symbol in this picture, shown by the dots, and then there is the total variability space, represented by the arrows, which tells us in which directions we can shift the means to adapt the model to the incoming utterance; it describes the directions of variability. In other words, the utterance supervector is modelled as m = m_UBM + T w.

The vector w is a latent variable, so we can impose a prior on it; we choose a standard normal prior. Given some incoming data X, we compute the posterior of w, which is also Gaussian, with mean w_X and precision matrix L_X, and what we call the i-vector is simply the mean of this posterior given the data.

This slide is just a cookbook. To compute the i-vector we need the sufficient statistics extracted with the UBM, that is, the zero-order and the first-order statistics. Before we go any further we apply two little tricks: we center the data around the UBM, i.e. we find which cluster of the UBM the data belong to and shift them accordingly, and we also whiten the data using the UBM covariance matrices, which virtually makes the covariances of the individual GMM components equal to identity.

Here are the cookbook equations for computing the i-vector. The i-vector w is basically a product of the covariance of the posterior distribution, the factor loading matrix T which describes the subspace, and the first-order statistics, w = L_X^{-1} T^T f_X. The precision matrix is basically a sum over all the Gaussian-associated blocks of the T matrix, weighted by the zero-order statistics of the incoming utterance, L_X = I + sum_c N_c(X) T_c^T T_c.

Now a little analysis of what this computation costs. We have C GMM components, an F-dimensional feature space, and an M-dimensional subspace describing the total variability space. The M^3 term in the computational complexity is the matrix inversion, and there is not much we can do about that.
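To make the baseline concrete, here is a minimal NumPy sketch of the extraction just described. Everything in it is illustrative rather than taken from the talk: the sizes are scaled down from a realistic configuration, the T matrix is a random stand-in, and the statistics would normally come from a Baum-Welch alignment against the UBM.

    import numpy as np

    # Illustrative, scaled-down sizes; a real system is closer to C=2048, F=60, M=400.
    C, F, M = 64, 20, 50
    rng = np.random.default_rng(0)
    T = 0.1 * rng.standard_normal((C * F, M))       # total variability matrix (random stand-in)

    # Baseline: pre-compute the per-component products T_c^T T_c (memory: C * M^2 floats).
    TT = np.stack([T[c * F:(c + 1) * F].T @ T[c * F:(c + 1) * F] for c in range(C)])

    def extract_ivector(N, f):
        # N: (C,) zero-order stats, f: (C*F,) centered, UBM-whitened first-order stats.
        L = np.eye(M) + np.einsum('c,cij->ij', N, TT)   # precision: O(C * M^2) per utterance
        return np.linalg.solve(L, T.T @ f)              # posterior mean = i-vector, O(M^3)

    # Stand-in statistics in place of real Baum-Welch statistics.
    N = rng.uniform(0.0, 10.0, size=C)
    f = rng.standard_normal(C * F)
    w = extract_ivector(N, f)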
The biggest problem is actually the sum in the precision computation, and then there is the product of the individual matrices with the first-order statistics. As for the memory complexity, we do not have to store everything while computing: the T_c^T T_c products can be pre-computed in advance, because they do not depend on the data. That, however, means the memory cost is really high for this model; if we consider that a typical model has thousands of Gaussians, it can become prohibitive. We also have to store the T matrix itself, and these two terms bound the memory complexity of the extractor.

The motivation for simplifying these formulas was that we wanted to port the application to small-scale devices, as part of the MOBIO project, and we also wanted to prepare the i-vector framework for discriminative training, where we thought that the original equations would be quite difficult to compute gradients for.

Let's first take a look at the first simplification. The assumption, illustrated in the first picture, is that the proportion of the data generated by each Gaussian of the UBM is constant across utterances, and that this proportion is given by the UBM weights. What happens then is that the sum in the precision computation becomes independent of the data and can effectively be pre-computed in advance, so each time we compute the precision, instead of summing over all the components we are left with only a scaled addition of two matrices: L_X is approximately I + n(X) * sum_c w_c T_c^T T_c, where n(X) is the number of frames of the utterance and w_c are the UBM weights.

A little analysis again: we got rid of the C M^2 term in the computational complexity, and in the memory complexity we got rid of most of the data we were storing before, because the per-component matrices are no longer needed. Just to recall the numbers before the results section: the number of Gaussians is in the thousands, as I said, and the typical size of the subspace is in the hundreds, say four hundred.
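Continuing the sketch above (reusing T, TT, C and M), this is roughly what the first simplification could look like; the uniform ubm_weights are again only a stand-in for the real UBM weights.

    # First simplification: assume each component accounts for a fixed proportion of
    # the frames, N_c ~= n * w_c, with w_c the UBM weights and n the frame count.
    # The weighted sum of T_c^T T_c then no longer depends on the utterance and is
    # pre-computed once; the full TT stack is no longer needed at run time
    # (memory drops from C*M^2 to M^2).
    ubm_weights = np.full(C, 1.0 / C)                   # stand-in for the real UBM weights
    W_avg = np.einsum('c,cij->ij', ubm_weights, TT)     # pre-computed once

    def extract_ivector_simplified1(N, f):
        n = N.sum()                                     # total frame count of the utterance
        L = np.eye(M) + n * W_avg                       # scaled addition of two matrices, O(M^2)
        return np.linalg.solve(L, T.T @ f)              # the O(M^3) inversion remains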
So that was the first simplification. We also had the idea that we could try to find some orthogonalization transformation G that would diagonalize the T_c^T T_c terms, the component-associated parts of the factor loading matrix T that are bothering us in the precision computation. If we have such a transformation, we can multiply the equation by it from both sides, and to get the original precision back we just multiply from both sides by the inverse of G, provided our assumption was correct. The nice thing is that we would then be summing diagonal matrices, which can be implemented very efficiently in C or in Matlab, and the resulting precision matrix is diagonal. If you remember, we are inverting this matrix in the i-vector extraction, and the inversion of a diagonal matrix is trivial.

Written efficiently, we can pack the diagonals of the transformed T_c^T T_c terms of all the Gaussians into a single matrix, and the diagonal of the precision is then simply a dot product of that matrix with the vector of zero-order statistics. In the equations, the lowercase diag symbol stands for taking the diagonal of a matrix as a column vector, and the capital Diag symbol maps a column vector back onto a diagonal matrix. The i-vector extraction is then given by the second equation here, and the nice thing about it is that the G T^T term in the middle can be projected directly into the T matrix, which gives some additional benefit, and, as we said, the precision matrix can now be inverted very efficiently.

If we look at the analysis again, in the computational complexity we got rid of the terms coming from operating on full matrices, since we only work with the diagonals, and in the memory complexity we get an extra C M term for the packed diagonals, but we got rid of the C M^2 term.

The question is how to compute the G matrix. The first option was to use PCA, which, as we will see, works. The second option was to use heteroscedastic linear discriminant analysis, HLDA. Here is a simple example of what it basically does: we want to rotate those two covariance matrices by forty-five degrees so that the average within-class covariance becomes an identity matrix. That was the inspiration; it is the same trick used in LVCSR tasks.
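A sketch of the second simplification in its PCA flavour, again continuing from the blocks above (reusing T, TT, W_avg, C and M). Taking G from the eigenvectors of the weight-averaged matrix is my assumption of how the PCA variant could be set up; the assumption that one such G approximately diagonalizes every T_c^T T_c is exactly the one described in the talk, and the HLDA variant is not shown.

    # Second simplification, PCA variant: one orthogonal transform G is assumed to
    # approximately diagonalize every T_c^T T_c.  With orthogonal G,
    # L ~= G^T Diag(k) G where k = 1 + sum_c N_c(X) * diag(G T_c^T T_c G^T),
    # so the inversion of the precision becomes trivial.
    _, eigvecs = np.linalg.eigh(W_avg)
    G = eigvecs.T                                       # orthogonal M x M rotation

    D = np.stack([np.diag(G @ TT[c] @ G.T) for c in range(C)])   # packed diagonals, shape (C, M)
    T_rot = T @ G.T                                     # project G into T once: T_rot = T G^T

    def extract_ivector_simplified2(N, f):
        k = 1.0 + N @ D                                 # diagonal of the precision: a dot product
                                                        # with the vector of zero-order statistics
        return G.T @ ((T_rot.T @ f) / k)                # no full M x M inversion needed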
Just a quick note about the training of the T matrix, for those who do not know it: there is a pair of accumulators that we have to accumulate while training the T matrix. We go over all training utterances, run some computation, accumulate, and do an update at the end of the procedure. Without going into the theoretical explanation, inside this computation we use w, which is the final i-vector, and the precision matrix, so if we can simplify the precision matrix, we simplify the i-vectors, and the whole training procedure gets simplified as well. With this simple trick we get to about half of the memory during training, so we can effectively try to increase the other parameters of the model.

Now to the experimental setup. We used MFCC features, the standard thing, with short-time cepstral mean and variance normalization and delta and double-delta coefficients. For the training set we used different combinations of Switchboard II phases 2 and 3 and the NIST 2004 to 2006 data, which served for training the T matrix. We evaluated on the NIST SRE 2010 extended core condition 5, which is the telephone-telephone condition. I want to mention that we used exactly the same scoring as in the previous talk, that is, cosine distance with within-class covariance normalization. Because we also measured the speed and the memory demands, we used the Matlab environment, set to a single core and a single thread of operation, running on one of our internal machines. We measured the speed of extracting i-vectors for fifty randomly picked utterances from the Mixer corpus; the statistics were pre-computed, so the statistics collection is not included in the analysis. The UBM was a diagonal-covariance, 2048-component UBM trained on a subset of Fisher.

A summary of the numbers: we used 2048 Gaussians, the feature dimension was 60, and we used a 400-dimensional subspace; 400 was chosen as a trade-off between performance and the technical constraints, I mean the configuration of the machines that compute the i-vectors. As I said, with one of our simplifications we were able to decrease the memory demands, so to be fair we also used M equal to 800, just to see what happens.
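Before the results, a quick aside on the scoring mentioned above: a minimal sketch of cosine scoring after a WCCN projection, continuing the NumPy sketches (only numpy and the dimension M are reused). The within-class covariance here is a placeholder where a matrix estimated on labelled training i-vectors would normally go.

    # Cosine distance between two i-vectors after a WCCN projection.
    within_cov = np.eye(M)                              # placeholder for the estimated within-class covariance
    B = np.linalg.cholesky(np.linalg.inv(within_cov))   # WCCN projection: inv(W) = B B^T

    def cosine_score(w_enroll, w_test):
        a, b = B.T @ w_enroll, B.T @ w_test
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))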
Here is a little constellation plot of the results. The cross here is the baseline, and the lowest point corresponds to the 800-dimensional traditional i-vector extraction. You can see that the simplified systems perform slightly poorer than the baseline, but this is really just an informative picture: the best non-traditional i-vector extractor goes from about 3.6 to about 3.8 percent equal error rate, and the same can be said, analogously, about the new normalized DCF. So the systems are slightly worse, but this work was aimed at the analysis of the speed, so let's look at the speed of the computation.

With the baseline, extracting those fifty i-vectors took about thirteen seconds, and you can see the relative numbers here. For the 800-dimensional baseline there is a huge drop in speed, because the complexity grows quadratically with the subspace dimension. The nice thing is that if we are able to train such a system at all, we can afford to use 800-dimensional i-vectors with the simplified extraction and still get to about ten percent of the original time that was necessary to compute those fifty i-vectors.

Now let's take a look at the comparison of memory usage. For the baseline, the second column is what is constant, something we have to store in memory regardless of the data. The specific numbers show a dramatic decrease in the memory needs of the simplified algorithms: even if we want to use 800-dimensional i-vectors, we still need only a fraction of the memory of the traditional 800-dimensional baseline system, which again counts practically.

This slide is just a proof that we can also use those simplifications in the T-matrix training procedure. We save space, but the simplifications also make the training a lot faster, and these numbers show that the difference between training with the traditional i-vector extraction and training with the simplified i-vector extraction is negligible, so we can really use the simplified version.

So the conclusion is that we managed to simplify a state-of-the-art technique in terms of speed and memory, while sacrificing a little of the recognition performance. We have also simplified the formulas so that they are easily differentiable, which matters for the future work, namely discriminative training of the i-vector extractor, of the matrix T or the matrix G and others. And finally, we managed to fit the i-vector based system into a cellphone application, which was one of the tasks we had in the MOBIO project on mobile speaker recognition. Thank you.

Thank you. We have time for one or two questions.

You made two assumptions to simplify your algorithm. Did you verify in some way which of the assumptions was the stronger one, that is, did you check it on the data, and not just by looking at the scores, whether one or the other assumption was wrong?

Well, judging by the recognition performance, both were only slightly wrong. I mean, yes, there is a mismatch, of course: the proportion of the data generated by each Gaussian does differ, it is not always equal to the UBM weight. And because we are using 2048 Gaussians, finding one single orthogonalization matrix for all of them is also probably not entirely appropriate. But we tried it and it worked reasonably well.

Other questions? No, I did not combine the two techniques; we have not tried combining them. Sorry, could you repeat that? Whether it was better than PCA for diagonalizing, compared to the baseline? No. Yeah, thank you, that is a good point.

Okay, there are no other questions, so let's thank the speaker again.