0:00:15 The next presentation is "Factor Analysis of Acoustic Features Using a Mixture of Probabilistic Principal Component Analyzers for Robust Speaker Verification."

So in the introduction, what I want to say is that factor analysis is a very popular technique when applied to GMM supervectors, and the main assumption there is that, for a randomly chosen speaker, the GMM supervector lies in a low-dimensional subspace. We argue that the acoustic features also lie in low-dimensional subspaces, and this phenomenon is not really taken into consideration in GMM-supervector-based factor analysis. So we propose to see what happens if we do factor analysis on the acoustic features in addition to the supervector-based approach.

To say more about the motivation: we know that speech spectral components are highly correlated, and in the MFCC features we have the DCT, as an approximation to PCA, to decorrelate them. There has been a lot of work on decorrelating features, and it has been shown that the first few eigen-directions of the feature covariance matrix are the more speaker-dependent ones. So what we believe is that retaining all the eigen-directions of the features might actually be harmful; there might be some directions that are not benefiting us. We also get evidence from the full-covariance-based i-vector system, which works better than the diagonal-covariance system, and this motivated us to investigate further. If you look at the covariance matrix of a full-covariance UBM component, this is how it looks, and if you look at the eigenvalue distribution, you see that most of the energy is concentrated in the first, say, thirty-two eigenvalues in this case; they are pretty compact. So I thought there is reason to believe that there are some components in each mixture which are not really needed.
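The energy-concentration observation above can be checked numerically. This is a minimal sketch using synthetic correlated data as a stand-in for the features aligned to one full-covariance UBM component; the dimensions are illustrative, not the talk's actual statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for features aligned to one UBM component: 60-dim vectors
# with strong low-rank correlation plus a small isotropic noise floor.
D = 60
A = rng.standard_normal((D, 8))                      # low-rank structure
X = rng.standard_normal((4000, 8)) @ A.T + 0.1 * rng.standard_normal((4000, D))

cov = np.cov(X, rowvar=False)                        # D x D feature covariance
eigvals = np.linalg.eigvalsh(cov)[::-1]              # sorted, largest first

# Fraction of total variance captured by the leading 32 eigenvalues.
energy_32 = eigvals[:32].sum() / eigvals.sum()
print(round(float(energy_32), 3))
```

With data like this, almost all of the variance sits in the leading eigenvalues, which is the situation the talk describes for the full-covariance UBM components.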
So we use factor analysis on the acoustic features. This is the basic formulation, very simple: x = W y + u + eps, where x is the feature vector, W is the factor loading matrix, y is the vector of acoustic factors, which are basically the hidden variables, u is the mean vector, and eps is the isotropic noise. So this is basically PPCA, and the interpretation is that the covariance of the acoustic features is now modeled partly by the hidden variables, with the residual variance modeled by the noise term, so the pdf of the model is x ~ N(u, W W' + sigma^2 I). What we try to do here is replace the acoustic features by the acoustic factors, basically the point estimates of the acoustic factors, and use them as the features, believing that the acoustic factors carry more of the speaker-dependent information while the full feature vector might have some nuisance components.

A transformation matrix is then derived; it also comes from Tipping and Bishop's paper. First you have to select the number of coefficients Q you want to keep: suppose we have sixty features and I want to keep forty, then Q would be forty. The noise variance estimate is the average of the remaining, discarded eigenvalues, sigma^2 = (1/(D - Q)) * sum_{j=Q+1..D} lambda_j, where lambda_j is the j-th sorted eigenvalue of the covariance matrix of x. The maximum-likelihood estimate of the factor loading matrix, also from the Tipping and Bishop paper, is W = U_Q (Lambda_Q - sigma^2 I)^(1/2), with U_Q the leading eigenvectors and Lambda_Q the corresponding eigenvalues. Then we estimate the acoustic factors as the posterior mean, y_hat = E[y | x] = M^(-1) W' (x - u) with M = W' W + sigma^2 I, so it is basically removal of the mean followed by multiplication with this matrix; it is just a linear transformation. And if you look at the mean and covariance of the transformed feature vector, it is zero-mean Gaussian distributed with a diagonal covariance matrix, as given in the paper.
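The closed-form estimates above (noise variance, ML factor loading, posterior-mean transform) can be sketched in a few lines. This uses synthetic data and the talk's illustrative sizes (sixty features, keep forty), and takes the arbitrary rotation in the ML solution to be the identity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic full-rank correlated data, D features, keep Q factors.
D, Q = 60, 40
X = rng.standard_normal((5000, D)) @ rng.standard_normal((D, D)) / np.sqrt(D)

u = X.mean(axis=0)                        # mean vector
S = np.cov(X, rowvar=False)               # feature covariance
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]            # eigenvalues sorted, largest first

sigma2 = lam[Q:].mean()                   # noise variance: mean of discarded eigenvalues
W = U[:, :Q] * np.sqrt(lam[:Q] - sigma2)  # ML factor loading (rotation R = I)

# Posterior mean of the acoustic factors: y_hat = M^{-1} W' (x - u),
# with M = W' W + sigma2 * I -- a single linear transform of each feature.
M = W.T @ W + sigma2 * np.eye(Q)
A = np.linalg.solve(M, W.T)               # the transformation matrix
Y = (X - u) @ A.T                         # Q-dimensional transformed features

# The transformed features are zero-mean with a diagonal covariance.
C_y = np.cov(Y, rowvar=False)
off_diag = np.abs(C_y - np.diag(np.diag(C_y))).max()
print(Y.shape, off_diag < 1e-8)
```

The final check confirms the diagonal-covariance property mentioned in the talk: with the eigendecomposition-based W, the matrix M is itself diagonal, so the transform decorrelates the features exactly.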
Then we build a mixture of these models, which is basically Tipping and Bishop's mixture-of-PPCA formulation. It behaves like a Gaussian mixture model, and the nice thing about it is that you can directly compute the FA parameters from the full-covariance UBM, which becomes really handy.

Next I would like to talk about how we use the transformations. You have, say, a thousand twenty-four mixtures, and each mixture has a transformation. What you could do is take a feature vector, find the most likely mixture, transform the feature with that mixture's matrix, and replace the original vector, right? But what we saw is that this is not really the optimal way of doing it. If you plot the posterior probability of the top-scoring mixture over your development data, this is roughly the distribution you get, and what it tells you is that it is very rare for an acoustic feature to be unambiguously aligned to one mixture; most of the time the maximum posterior is something like 0.4 or 0.5. That means you cannot really say that a feature vector comes from one particular mixture; it belongs to several mixtures at once. So what we want to do is keep the transformations from all of the mixtures.

This is how we do it, basically by integrating the process within the total variability model, that is, within the i-vector system. We train a full-covariance UBM, then we compute the FA parameters: we set the value of Q (fifty-four, I think, in this case), and for each mixture we find the noise variance, the factor loading matrix, and the transformation; these are all different for each mixture. And the transformation applies directly to the first-order statistics, so you do not actually have to apply it frame by frame: you compute the statistics, and since the estimation is just a linear transformation, you can simply transform the first-order statistics. So it becomes very simple to apply.
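The point that the per-mixture linear transform can be applied to the first-order Baum-Welch statistics rather than frame by frame can be verified with a small sketch. The sizes, posteriors, and transforms here are random stand-ins, not the talk's actual UBM:

```python
import numpy as np

rng = np.random.default_rng(2)

C, D, Q = 4, 10, 6          # mixtures, feature dim, factor dim (illustrative)
n_frames = 500

X = rng.standard_normal((n_frames, D))
means = rng.standard_normal((C, D))

# Frame posteriors gamma[t, c] from a (stand-in) UBM; random but normalized.
gamma = rng.random((n_frames, C))
gamma /= gamma.sum(axis=1, keepdims=True)

# Zeroth- and centered first-order statistics per mixture.
N = gamma.sum(axis=0)                                # (C,)
F = gamma.T @ X - N[:, None] * means                 # (C, D)

# One (random stand-in) transform per mixture; in the system these come
# from the PPCA parameters of each full-covariance UBM component.
T = rng.standard_normal((C, Q, D))

# The transform is linear, so it commutes with the posterior-weighted sum
# over frames: transforming the statistics == transforming every frame.
F_q = np.einsum('cqd,cd->cq', T, F)                  # (C, Q) transformed stats

# Check the equivalence frame by frame for mixture 0.
per_frame = sum(gamma[t, 0] * (T[0] @ (X[t] - means[0])) for t in range(n_frames))
print(np.allclose(per_frame, F_q[0]))
```

This is exactly why the soft alignment stays cheap: all C transforms are kept, but each is applied once per mixture to the accumulated statistics instead of once per frame.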
The transformation is completely integrated within the i-vector training. These are the differences with respect to conventional T-matrix training: the feature size becomes Q instead of D, the supervector size becomes M times Q, and the total variability matrix becomes smaller. Most importantly, the UBM gets replaced by the distribution of the transformed features; since we are not using the original features in the subsequent processing, this is not really a UBM any more, it is basically how the parameters get replaced. The i-vector extraction procedure is similar to the standard system.

For the setup, we have a phone-recognizer-based voice activity detector and sixty-dimensional cepstral features with cepstral mean normalization. We have a gender-dependent UBM with a thousand twenty-four mixtures; we train it with full covariances and variance flooring, which is the only parameter we investigated there, and which sets a fixed minimum value for the entries of the covariance matrices. The i-vector size was four hundred and we used five iterations. Then we have a PLDA backend with a full-covariance noise model, where the only free parameter is the eigenvoice size. Next to that we have the FA which I just talked about; we derive all its parameters from the UBM directly. We performed experiments on SRE 2010, conditions one to five, using the male trials.

These are the initial results. We varied the eigenvoice size, and we used Q equal to fifty-four, forty-eight, and forty-two; our feature size is sixty, so you can see we are taking off six components, and so on. You can see we get a nice improvement using the proposed technique. Here is a table showing some of the systems that we fused; the baseline is sitting here, and we are getting a nice improvement in all three performance measures. About the value of Q, it is kind of hard to say in advance which value would work best; a single global value may not be optimal, and it could take a different value in each mixture depending on how the covariance structure of that mixture looks. I also did some further work on that, which will probably appear at Interspeech.
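As a back-of-the-envelope illustration of the size reductions described earlier (supervector M times Q instead of M times D), using the talk's configuration: 1024 mixtures, 400-dimensional i-vectors, sixty-dimensional features reduced to Q = 42 factors.

```python
# Supervector and T-matrix sizes before and after the feature reduction.
n_mix, D, Q, R = 1024, 60, 42, 400

supervec_full = n_mix * D        # supervector size with original features
supervec_fa = n_mix * Q          # supervector size with acoustic factors
t_full = n_mix * D * R           # T-matrix parameters, original features
t_fa = n_mix * Q * R             # T-matrix parameters, acoustic factors

print(supervec_full, supervec_fa, t_full, t_fa)
```

Both the supervector and the total variability matrix shrink by the ratio Q/D (here 42/60 = 0.7), which is where the speed-up mentioned in the conclusions comes from.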
Anyway, when we fuse the systems, and it is late, score-level fusion, we can see that we still get a pretty nice improvement from different combinations, so these systems do carry complementary information.

These are extra experiments performed after the paper was submitted, shown for the other conditions. In condition one, Q equal to forty-eight works nicely; in condition two, Q equal to forty-two; in condition three, Q equal to forty-eight and fifty-four. In condition four the new DCF did not improve, but in the other conditions you can clearly see that the proposed technique works well: it reduces all three performance measures, and after fusion you can see a nice improvement in all three as well. Here is the DET curve, for condition two, where we just picked the Q-equals-forty-two system; you can see the FA system is better than the baseline almost everywhere, and with fusion we get further improvement.

So, to conclude: we have proposed a factor analysis framework for acoustic features, with a mixture-dependent feature transformation and a compact representation, and we proposed a probabilistic feature alignment method instead of hard-clustering each feature vector to a mixture. We showed that it provides better performance when integrated with the i-vector system. As a kind of nice side effect, it also makes the system faster, because reducing the feature dimensionality in turn reduces the supervector size and the total variability matrix size, and, as discussed in the paper, the computational complexity is proportional to the supervector size. For future work, Q does not have to be global; it can be mixture-dependent. Here we kept a common feature dimension, say forty-eight, for all the mixtures, but it could be different per mixture.
One of my papers accepted at Interspeech deals with trying to optimize the parameter Q in each mixture. Some of the future work will also be to use the iterative EM techniques proposed in Tipping and Bishop's paper for the mixture of PPCA. Most of all, this actually opens up the idea of applying other transformations mixture-wise as well, which might also be interesting: for example, conventional transformations such as LDA or NAP, applied in each mixture and then integrated with the i-vectors. That is all I have.

[Question, partly inaudible, about whether one can go back to the acoustic features.] Yes, we need to train the UBM from scratch. Yes, I tried; I have seen some papers on this too, and I think the way I did it... sure, you can.

[Question about hard-clustering the features.] To cluster a feature you have to have some kind of measurement; usually you find the mixture that gives the highest posterior probability. But in the distribution I am showing, it is not always a one-to-one assignment to a mixture, because sometimes the maximum posterior probability over the mixtures is, say, 0.2, and there are other mixtures close to that value. If you take the 0.2 mixture as the winner and use only that mixture's transformation, it will be suboptimal. So yes, you could do it with hard clustering, but I tried the soft approach because I had seen this distribution and I thought it would be a nicer, more general way of putting things together.

[Question, largely inaudible, about the number of trials and normalization.] Yes, I think I normalized... maybe what you are saying is true; I would have to check the conditions. I do not know if I have the folding problem; I believe so. Yeah, I think that