0:00:15 | the next presentation is on factor analysis of acoustic features using a mixture of probabilistic

0:00:20 | principal component analyzers |

0:00:22 | for robust speaker verification

0:00:26 | i |

0:00:48 | and |

0:00:50 | that is |

0:00:53 | factor analysis of acoustic features using a mixture of probabilistic principal

0:00:57 | component analyzers

0:00:59 | for robust speaker verification

0:01:05 | so in the introduction what i want to say is |

0:01:09 | so factor analysis is a very popular technique when applied to gmm supervectors

0:01:14 | and the main assumption there is |

0:01:17 | for a randomly chosen speaker the gmm supervector lies in a low-dimensional subspace

0:01:24 | but actually it is known that the acoustic features also lie in low-dimensional subspaces

0:01:32 | and this phenomenon is not really |

0:01:35 | taken into consideration in gmm supervector based factor analysis

0:01:40 | so we propose to try to see |

0:01:44 | what happens if we do factor analysis on the acoustic features |

0:01:48 | in addition to the i-vector based approach

0:01:53 | so just to say more about the motivation |

0:01:57 | we know that speech spectral components are highly correlated, so in the

0:02:03 | mfcc features

0:02:06 | we have the dct, an approximation of pca, to decorrelate these

0:02:10 | there has been a lot of work on trying to decorrelate the features

0:02:14 | it has been shown that the first few eigen directions of the feature covariance matrix

0:02:19 | are more speaker-dependent

0:02:22 | so what we believe is that retaining all the eigen directions of the

0:02:33 | features might actually be harmful; there might be some

0:02:38 | directions that are not beneficial

0:02:41 | we also get evidence from the full covariance based i-vector system, which

0:02:45 | works better than the diagonal covariance system

0:02:48 | which |

0:02:49 | so motivates us to investigate this further |

0:02:54 | so if you look at a full covariance matrix,

0:02:58 | the covariance matrix of a full covariance ubm, this is how it kind of looks

0:03:03 | and if you look at the eigenvalue distribution you see most of the energy is compressed

0:03:09 | in the first

0:03:10 | thirty two eigenvalues in this case

0:03:12 | so they're pretty much compact

0:03:14 | so we

0:03:15 | kind of thought okay, we have

0:03:19 | reason to believe that there are some components in each mixture

0:03:26 | which are not really useful

0:03:28 | so we use factor analysis

0:03:33 | on acoustic features; so this is the basic formulation, very simple

0:03:37 | so you have a feature vector x, and V is the factor loading matrix

0:03:42 | y is the vector of acoustic factors, which are basically

0:03:45 | the hidden variables

0:03:47 | mu is the mean vector and

0:03:49 | epsilon is the isotropic noise

0:03:51 | so this is basically a ppca |

0:03:54 | and the interpretation is that the covariance of the acoustic features

0:04:00 | is now modeled by the hidden variables

0:04:03 | and the residual variance is modeled by the noise model

0:04:09 | this is the pdf of the model
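the model just described, x = V y + mu + epsilon, can be sketched numerically; this is an illustrative numpy toy, not the talk's actual system, and all sizes and values below are made up for the example:

```python
import numpy as np

# Sketch of the PPCA model from the talk: x = V y + mu + eps, with
# y ~ N(0, I) the hidden acoustic factors and eps ~ N(0, sigma2 * I)
# isotropic noise. The marginal of x is N(mu, C) with C = V V^T + sigma2 I.
rng = np.random.default_rng(0)
D, Q = 6, 3                      # illustrative sizes, not the talk's 60/48
V = rng.standard_normal((D, Q))  # factor loading matrix
mu = rng.standard_normal(D)      # mean vector
sigma2 = 0.1                     # isotropic noise variance

# Model covariance of the observed features
C = V @ V.T + sigma2 * np.eye(D)

# Draw samples from the generative model and check the empirical covariance
n = 200_000
y = rng.standard_normal((n, Q))                    # acoustic factors
eps = rng.standard_normal((n, D)) * np.sqrt(sigma2)
x = y @ V.T + mu + eps

emp_C = np.cov(x, rowvar=False)
print(np.abs(emp_C - C).max())                     # small for large n
```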

0:04:13 | and so what we try to do here is we want to replace the acoustic

0:04:17 | features by the acoustic factors, basically the estimates of the acoustic factors

0:04:23 | and try to use them as the features |

0:04:26 | believing that these acoustic factors |

0:04:29 | have more speaker-dependent information and the full feature vector might have some nuisance components |

0:04:35 | so a transformation matrix is derived |

0:04:38 | it's also coming from the tipping and bishop paper; you can see first you have

0:04:42 | to select the number of

0:04:44 | coefficients you want to keep

0:04:46 | suppose we have sixty features and we want to keep

0:04:49 | forty

0:04:50 | then Q would equal forty

0:04:52 | and the noise variance estimation is done by this, the average of the remaining components,

0:04:57 | that is, the discarded

0:05:00 | eigenvalues,

0:05:02 | the sorted eigenvalues

0:05:04 | so lambda i is the i-th eigenvalue of the covariance matrix of x

0:05:07 | and this is the factor loading matrix the maximum likelihood estimate |

0:05:12 | and it's also from the tipping and bishop paper
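the closed-form estimates just mentioned, from tipping and bishop's ppca solution, can be sketched as follows; the function name and toy sizes are illustrative, not from the talk:

```python
import numpy as np

# Minimal sketch of the maximum-likelihood PPCA estimates: keep the top Q
# eigen directions of the feature covariance S, estimate the noise variance
# as the average of the discarded eigenvalues, and build the factor loading
# matrix from the retained ones.
def ppca_ml(S, Q):
    """S: (D, D) feature covariance; Q: number of factors to keep."""
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1]          # sort eigenvalues descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    D = S.shape[0]
    sigma2 = eigval[Q:].mean()                # average of discarded eigenvalues
    # ML factor loading matrix: U_Q (Lambda_Q - sigma2 I)^{1/2}
    V = eigvec[:, :Q] * np.sqrt(eigval[:Q] - sigma2)
    return V, sigma2

# Toy check: if S itself has the PPCA form V V^T + sigma2 I, the estimates
# reproduce the model covariance exactly.
rng = np.random.default_rng(1)
D, Q = 8, 3
V_true = rng.standard_normal((D, Q))
S = V_true @ V_true.T + 0.2 * np.eye(D)
V_hat, s2_hat = ppca_ml(S, Q)
print(s2_hat)                                 # ≈ 0.2
print(np.abs(V_hat @ V_hat.T + s2_hat * np.eye(D) - S).max())  # ≈ 0
```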

0:05:17 | so this is how we estimate the acoustic factors which is basically |

0:05:23 | the expected value of the posterior mean of the acoustic factors |

0:05:27 | and it can be shown to reduce to the

0:05:30 | expression here; so it's basically removal of the mean and a transformation by this matrix

0:05:36 | which is given by this

0:05:38 | and so it's just a linear transformation

0:05:42 | and this is the transformed feature vector, as we would like to

0:05:47 | call it

0:05:47 | and if you look at the mean and covariance matrix of this quantity it's a |

0:05:51 | zero-mean gaussian distributed with |

0:05:54 | a diagonal covariance matrix given by this |
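the posterior-mean transform and its zero-mean, diagonal-covariance property can be checked numerically; again a hedged toy sketch with made-up dimensions and eigenvalues:

```python
import numpy as np

# Sketch of the acoustic-factor transform: the posterior mean of y given x is
# a linear map, y_hat = (V^T V + sigma2 I)^{-1} V^T (x - mu). With the ML
# factor loading matrix (orthogonal columns), the transformed features come
# out zero-mean with a diagonal covariance matrix.
rng = np.random.default_rng(2)
D, Q, sigma2 = 6, 3, 0.2
U, _ = np.linalg.qr(rng.standard_normal((D, D)))   # random orthogonal basis
lam = np.array([5.0, 3.0, 1.0])                    # assumed top-Q eigenvalues
V = U[:, :Q] * np.sqrt(lam - sigma2)               # ML loading matrix

mu = rng.standard_normal(D)

# Samples from the PPCA model x ~ N(mu, V V^T + sigma2 I)
n = 200_000
x = (rng.standard_normal((n, Q)) @ V.T
     + rng.standard_normal((n, D)) * np.sqrt(sigma2) + mu)

M = V.T @ V + sigma2 * np.eye(Q)                   # here M = diag(lam)
A = np.linalg.solve(M, V.T)                        # the linear transformation
y_hat = (x - mu) @ A.T                             # transformed feature vectors

emp_cov = np.cov(y_hat, rowvar=False)
print(np.diag(emp_cov))                            # ≈ (lam - sigma2) / lam
```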

0:06:00 | the derivations are

0:06:01 | in the paper

0:06:05 | so then we do a mixture of these models, which is basically the mixture

0:06:09 | of ppca formulation

0:06:11 | so it's basically like a gaussian mixture model, the same idea

0:06:16 | but the good thing about this is you can

0:06:18 | directly compute the parameters,

0:06:22 | the fa parameters,

0:06:23 | from the full covariance ubm

0:06:25 | and that becomes really handy, as we'll see

0:06:28 | oh |
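the point about reading the mixture-of-ppca parameters straight off a full-covariance ubm can be sketched like this; `mixture_ppca_from_ubm` and the toy covariances are illustrative stand-ins, not the talk's code:

```python
import numpy as np

# Because each full-covariance UBM component already carries a mean and a
# covariance matrix, the mixture-of-PPCA parameters can be computed directly
# per component, with no extra EM training. `covs` below stands in for the
# component covariances of a trained full-covariance UBM.
def mixture_ppca_from_ubm(ubm_covs, Q):
    """Return per-component (V_c, sigma2_c) from full covariance matrices."""
    params = []
    for S in ubm_covs:
        eigval, eigvec = np.linalg.eigh(S)
        order = np.argsort(eigval)[::-1]      # descending eigenvalues
        eigval, eigvec = eigval[order], eigvec[:, order]
        sigma2 = eigval[Q:].mean()            # noise variance per component
        V = eigvec[:, :Q] * np.sqrt(eigval[:Q] - sigma2)
        params.append((V, sigma2))
    return params

# Toy "UBM" with 4 components of dimension 8
rng = np.random.default_rng(3)
covs = []
for _ in range(4):
    A = rng.standard_normal((8, 8))
    covs.append(A @ A.T + 0.5 * np.eye(8))    # symmetric positive definite
params = mixture_ppca_from_ubm(covs, Q=5)
print([V.shape for V, _ in params])
```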

0:06:30 | next i'd like to talk about how we want to use the

0:06:34 | transformations; so you have a thousand and twenty four mixtures and each mixture has

0:06:38 | a transformation, so what you could do is you take a feature vector and

0:06:42 | you find the most likely mixture and you transform the feature and then

0:06:47 | you replace the original vector, right

0:06:49 | but what we saw is |

0:06:52 | actually it's kind of not the optimal way of doing it because

0:06:57 | so if you find the top scoring mixture posteriors of, say, your development data across the

0:07:02 | gaussians

0:07:02 | this is kind of the distribution

0:07:05 | so what this tells you is

0:07:07 | it's very rare that the acoustic feature is unambiguously aligned to one

0:07:11 | mixture; most of the time the top posterior you get is like point

0:07:16 | four or point five

0:07:17 | so what that kind of means is

0:07:20 | you can't really say that this feature vector comes from this mixture; it kind of

0:07:24 | belongs to a lot of mixtures

0:07:26 | maybe more than one; so what we want to do is keep all

0:07:30 | the transformations

0:07:32 | that are given by all of the mixtures

0:07:35 | so this is how we do it |

0:07:36 | basically |

0:07:38 | integrating the process within the total variability model |

0:07:43 | so with the i-vector system |

0:07:45 | so first we train the full covariance ubm

0:07:48 | and then we compute the fa parameters; we set the value of Q to, say,

0:07:53 | fifty

0:07:54 | i think

0:07:55 | then from the data we find the noise variances, which are different for each mixture

0:07:59 | and for each mixture you find a

0:08:01 | factor loading matrix and the transformation

0:08:03 | so how it applies is basically

0:08:06 | directly onto the first order statistics; you don't actually have to do it frame by frame, so

0:08:13 | you compute the statistics and you can just take a transformation of the statistics

0:08:17 | so it becomes very simple, you just transform the first order statistics

0:08:22 | and now the transformation is completely integrated within the statistics
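the statistics-level shortcut just described can be verified on toy data: applying each mixture's linear transform to the centered first-order statistics gives the same result as a frame-by-frame soft-weighted transform. all arrays here are random stand-ins, not real posteriors or transforms:

```python
import numpy as np

# Soft alignment: instead of transforming each frame with the single
# top-scoring mixture, every mixture's transform is kept. Because each
# per-mixture transform A_c is linear, it can be applied directly to the
# centered first-order Baum-Welch statistics rather than frame by frame.
rng = np.random.default_rng(4)
T, D, Q, C = 500, 6, 4, 3                  # frames, feat dim, factors, mixtures
X = rng.standard_normal((T, D))            # acoustic features
gamma = rng.random((T, C))
gamma /= gamma.sum(axis=1, keepdims=True)  # per-frame mixture posteriors
mus = rng.standard_normal((C, D))          # mixture means (stand-ins)
As = rng.standard_normal((C, Q, D))        # per-mixture transforms (stand-ins)

# Frame-by-frame: accumulate each mixture's transformed, centered frames
F_frame = np.zeros((C, Q))
for t in range(T):
    for c in range(C):
        F_frame[c] += gamma[t, c] * (As[c] @ (X[t] - mus[c]))

# Statistics-level: compute zeroth/first-order stats once, then transform
N = gamma.sum(axis=0)                              # zeroth-order stats
F = gamma.T @ X                                    # first-order stats
F_stats = np.stack([As[c] @ (F[c] - N[c] * mus[c]) for c in range(C)])

print(np.abs(F_frame - F_stats).max())             # ≈ 0
```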

0:08:29 | so these are the differences with the conventional t-matrix training

0:08:34 | so the feature size becomes Q instead of D

0:08:36 | the supervector size becomes M times Q

0:08:39 | and the total variability matrix size becomes smaller

0:08:41 | and most importantly the ubm gets replaced by the distribution of the transformed features; so

0:08:48 | since we are not using the original features in the subsequent processing,

0:08:53 | this is not really the ubm, this is basically how the parameters get replaced

0:08:59 | and the i-vector extraction

0:09:01 | procedure is similar
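as a quick worked example of the size reduction, using numbers that appear in this talk (1024 mixtures, sixty-dimensional features, Q equal to forty eight):

```python
# Supervector size before and after the reduction, assuming the talk's setup:
# M = 1024 mixtures, feature dimension D = 60, reduced dimension Q = 48.
M, D, Q = 1024, 60, 48
baseline_supervector = M * D   # 1024 * 60 = 61440
reduced_supervector = M * Q    # 1024 * 48 = 49152
print(baseline_supervector, reduced_supervector)
```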

0:09:05 | in our system we have a phone recognizer based vad and sixty-dimensional

0:09:10 | mfcc features with

0:09:12 | cepstral mean normalization

0:09:13 | we have a gender dependent ubm with a thousand and twenty four mixtures

0:09:18 | we train

0:09:20 | we train the full covariance ubm with

0:09:22 | variance flooring; it's a parameter we investigated, and it sets the

0:09:27 | minimum value of the covariance matrix to

0:09:30 | a fixed value

0:09:32 | and the i-vector size was four hundred |

0:09:36 | and we used five iterations |

0:09:38 | so we have the plda backend where we have a full covariance model

0:09:45 | and the only free parameter is the eigenvoice size

0:09:49 | next we have the fa system which i just talked about; we derive all

0:09:53 | the parameters from the ubm directly

0:09:55 | and we performed experiments on sre twenty ten, basically

0:10:00 | conditions one to five, and we used the male trials

0:10:05 | so these are the initial results; as we can see

0:10:08 | we change the

0:10:10 | eigenvoice size from fifty

0:10:14 | and we use Q equals fifty four, forty eight and forty two

0:10:18 | our feature size is sixty, so you can see

0:10:21 | we are taking off six components and so on

0:10:25 | and you can see we get a nice improvement using the proposed technique

0:10:31 | so here's a

0:10:33 | table showing some of the systems

0:10:36 | that we fused

0:10:38 | so the baseline is sitting here

0:10:39 | and we are getting nice improvement for all three values of Q

0:10:45 | it's kind of hard to say which Q would work best

0:10:47 | in advance; that's a challenge

0:10:50 | and also

0:10:52 | this Q parameter

0:10:53 | may not be optimal, and it can have a different value in each mixture depending on

0:10:58 | how the covariance structure is in the mixture

0:11:02 | i also did some work on that and

0:11:05 | it will probably

0:11:06 | appear at interspeech

0:11:09 | oh |

0:11:10 | so anyway

0:11:12 | when we fuse the systems, it's a late fusion, we can see we

0:11:17 | can still get a pretty nice improvement

0:11:20 | by fusing

0:11:21 | different combinations

0:11:23 | so these systems do have complementary information

0:11:27 | so these are actually extra experiments that were performed after

0:11:31 | this paper was submitted, so the results are shown

0:11:33 | for the other conditions

0:11:37 | in condition one Q equals forty eight works nicely, for condition two Q equals forty two works

0:11:43 | well

0:11:44 | for condition

0:11:45 | three

0:11:47 | Q equals forty eight and fifty four work

0:11:50 | but in condition four

0:11:54 | the dcf,

0:11:56 | the new dcf, didn't improve

0:11:59 | for a few of the conditions

0:12:02 | but you can see clearly that

0:12:04 | the proposed

0:12:06 | technique works well; it reduces

0:12:09 | all three

0:12:10 | error metrics in these conditions

0:12:12 | and after fusion you can actually see nice

0:12:15 | improvements in all three of the metrics

0:12:22 | so here is the det curve for one of the conditions one to five

0:12:28 | and we just picked the Q equals forty two system

0:12:32 | and you can see

0:12:34 | almost everywhere

0:12:36 | the fa system is

0:12:37 | better than the baseline

0:12:39 | and with fusion we get

0:12:41 | further improvement

0:12:45 | so |

0:12:47 | we have proposed a factor analysis framework for acoustic features, a mixture-dependent feature transformation

0:12:56 | and a compact representation as well

0:12:59 | and we proposed a probabilistic feature alignment method

0:13:04 | instead of hard-clustering a feature vector to a mixture |

0:13:08 | and so we showed that

0:13:10 | it provides better performance

0:13:12 | when we integrate it with the i-vector system

0:13:15 | and as a kind of

0:13:18 | nice side effect it makes the system faster because

0:13:22 | you're reducing the feature vector dimensionality, which in turn reduces the

0:13:27 | supervector size and tv matrix size

0:13:29 | and as

0:13:30 | is discussed in the paper,

0:13:34 | the computational complexity is proportional to the

0:13:37 | supervector size

0:13:39 | so for future work

0:13:45 | Q can be mixture dependent, basically; so far

0:13:47 | we obtained an equal feature dimension, like say

0:13:51 | forty eight, from all the mixtures

0:13:53 | but it can be different; so one of my papers that was accepted at interspeech

0:13:58 | deals with trying to

0:13:59 | optimize the parameter in each mixture

0:14:03 | and also |

0:14:05 | some of the future work will be

0:14:07 | using iterative em techniques, as proposed in tipping and bishop's method,

0:14:12 | to train the full mixture of ppca

0:14:16 | most of all, actually,

0:14:18 | this opens up

0:14:22 | using other transformations mixture-wise, which might also be interesting to

0:14:26 | explore, with conventional transformations

0:14:33 | like nap or other techniques

0:14:34 | which actually take

0:14:36 | transformations in each mixture and then

0:14:40 | basically integrate them with the i-vectors

0:14:45 | so |

0:14:46 | that is all i have, thank you

0:15:15 | sorry, could you go back to the acoustic features slide

0:15:20 | i |

0:15:23 | yeah |

0:15:28 | yeah |

0:15:29 | i |

0:15:35 | whether we need to train the ubm from scratch

0:15:40 | oh yeah, i tried; i've seen some papers too

0:15:44 | i didn't think i think the way i did i thought |

0:15:50 | or |

0:15:52 | sure |

0:15:53 | you can |

0:16:01 | so |

0:16:02 | i |

0:16:03 | to cluster a feature vector to a mixture you have to have some kind of measurement

0:16:07 | usually you can find the mixture

0:16:12 | that gives you the highest posterior probability

0:16:15 | but in this distribution i'm showing that

0:16:18 | it's not always one to one with a mixture, because sometimes the maximum value

0:16:23 | of

0:16:23 | the posterior probability over the mixtures is, say, point two, and there

0:16:27 | are other mixtures

0:16:29 | around

0:16:30 | point something, which means

0:16:32 | if you take the point two mixture as the maximum and use only that mixture's transformation it

0:16:36 | will be suboptimal

0:16:38 | so |

0:16:40 | so yeah, we could do it that way

0:16:42 | but i tried this because i had seen this distribution and i thought

0:16:45 | it would be nicer to generalize things

0:16:48 | a bit

0:16:51 | together what is |

0:17:05 | i |

0:17:17 | oh |

0:17:20 | so a number of trials |

0:17:23 | i |

0:17:24 | yeah |

0:17:26 | yeah |

0:17:27 | i |

0:17:36 | i think i normalized in a binary invariance |

0:17:40 | oh |

0:17:56 | although i |

0:18:10 | right yes |

0:18:11 | oh maybe what you're saying is true |

0:18:14 | since i get |

0:18:15 | maybe |

0:18:17 | conditions |

0:18:21 | maybe i don't know if i the folding problem |

0:18:25 | i believe |

0:18:26 | just to |

0:18:28 | well |

0:18:43 | yeah i think that |