0:00:15the next presentation is "factor analysis of acoustic features using a mixture of probabilistic
0:00:20principal component analyzers"
0:00:22for robust speaker verification
0:00:53so the title is factor analysis of acoustic features using a mixture of probabilistic principal
0:00:57component analyzers
0:00:59for robust speaker verification
0:01:05so in the introduction what i want to say is
0:01:09factor analysis is a very popular technique when applied to gmm supervectors
0:01:14and the main assumption there is
0:01:17that for a randomly chosen speaker the gmm supervector lies in a low-dimensional subspace
0:01:24but it is often overlooked that the acoustic features themselves also lie in low-dimensional
0:01:30subspaces
0:01:32and this phenomenon is not really
0:01:35taken into consideration in gmm supervector based factor analysis
0:01:40so we propose to try to see
0:01:44what happens if we do factor analysis on the acoustic features
0:01:48in addition to the i-vector based approach
0:01:53so just to say more about the motivation
0:01:57we know that speech spectral components are highly correlated so in the
0:02:03mfcc features
0:02:06we have a pca-like dct to decorrelate these
0:02:10there has been a lot of work on trying to decorrelate features
0:02:14and it has been shown that the first few eigen directions of the feature covariance matrix
0:02:19are more speaker-dependent
0:02:26so what we believe is that retaining all the
0:02:33eigen directions
0:02:34of the features might actually be harmful there might be some
0:02:38directions that are not beneficial
0:02:41we also get evidence from the full covariance based i-vector system which
0:02:45works better than the diagonal covariance system
0:02:49and that motivates us to investigate this further
0:02:54so if you look at a full covariance matrix
0:02:58the covariance matrix of a full covariance ubm this is how it looks
0:03:03and if you look at the eigenvalue distribution you see most of the energy is concentrated
0:03:09in the first
0:03:10say thirty two eigenvalues in this case
0:03:12so they're pretty compact
0:03:15so we have
0:03:19reason to believe that there are some components in the features
0:03:26which are not really useful
0:03:28so we use factor analysis
0:03:33on acoustic features and this is the basic formulation very simple
0:03:37you have a feature vector x and a factor loading matrix
0:03:42y is the acoustic factors which are basically the
0:03:45hidden variables
0:03:47mu is the mean vector and
0:03:49epsilon is the isotropic noise
0:03:51so this is basically ppca
0:03:54and the interpretation is that the covariance of the acoustic features
0:04:03is now modeled by the hidden variables while the residual variance is modeled by the noise model
0:04:09and this is the pdf of the model
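As a rough sketch of the ppca model just described, x = W y + mu + eps with y ~ N(0, I) and isotropic noise eps, the implied feature covariance is W W^T + sigma^2 I; the dimensions and parameter values below are illustrative stand-ins of mine, not the talk's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: D observed feature dimensions, Q hidden acoustic factors.
D, Q = 6, 3
W = rng.standard_normal((D, Q))   # factor loading matrix
mu = rng.standard_normal(D)       # mean vector
sigma2 = 0.1                      # isotropic noise variance

# Generative model: x = W y + mu + eps,  y ~ N(0, I),  eps ~ N(0, sigma2 I).
n = 200_000
y = rng.standard_normal((n, Q))
eps = np.sqrt(sigma2) * rng.standard_normal((n, D))
x = y @ W.T + mu + eps

# The model implies cov(x) = W W^T + sigma2 I; check against the sample covariance.
model_cov = W @ W.T + sigma2 * np.eye(D)
sample_cov = np.cov(x, rowvar=False)
max_err = np.abs(sample_cov - model_cov).max()
```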
0:04:13and so what we try to do here is we want to replace the acoustic
0:04:17features by the acoustic factors basically the estimates of the acoustic factors
0:04:23and try to use them as the features
0:04:26believing that these acoustic factors
0:04:29have more speaker-dependent information while the full feature vector might have some nuisance components
0:04:35so a transformation matrix is derived
0:04:38it also comes from tipping and bishop's paper so first you have
0:04:42to select the number of
0:04:44components you want to retain
0:04:46suppose you have sixty features and you want to keep
0:04:49forty
0:04:50then Q would be forty
0:04:52and the noise variance estimate is given by the average of the remaining components of the
0:05:02sorted eigenvalues
0:05:04where lambda i is the i-th eigenvalue of the covariance matrix of x
0:05:07and this is the maximum likelihood estimate of the factor loading matrix
0:05:12which is also from the tipping and bishop paper
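A minimal sketch of these closed-form ppca estimates on a synthetic covariance matrix, assuming the usual forms sigma^2 = average of the discarded eigenvalues and W = U_Q (Lambda_Q - sigma^2 I)^{1/2} with the rotation taken as identity; the toy sizes here are mine:

```python
import numpy as np

rng = np.random.default_rng(1)

# A synthetic stand-in for the feature covariance matrix S (toy sizes).
D, Q = 6, 4
A = rng.standard_normal((D, D))
S = A @ A.T

eigvals, eigvecs = np.linalg.eigh(S)   # ascending order
order = np.argsort(eigvals)[::-1]      # re-sort descending
lam = eigvals[order]
U = eigvecs[:, order]

# Noise variance: average of the D - Q discarded eigenvalues.
sigma2 = lam[Q:].mean()

# Factor loading matrix: W = U_Q (Lambda_Q - sigma2 I)^{1/2}.
W = np.asarray(U[:, :Q]) @ np.diag(np.sqrt(lam[:Q] - sigma2))

# Sanity check: W W^T + sigma2 I reproduces the top-Q eigenvalues of S exactly
# and replaces the discarded ones by sigma2.
recon = W @ W.T + sigma2 * np.eye(D)
```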
0:05:17so this is how we estimate the acoustic factors which is basically
0:05:23the posterior mean of the acoustic factors given the observation
0:05:27and it can be shown to be given by the
0:05:30expression here so it's basically removal of the mean and a transformation by this matrix
0:05:36which is given by this
0:05:38and so it's just a linear transformation
0:05:42and this is the transformed feature vector which we would like to
0:05:47call the acoustic factor estimate
0:05:47and if you look at the mean and covariance matrix of this quantity it's
0:05:51zero-mean gaussian distributed with
0:05:54a diagonal covariance matrix given by this
0:06:01expression in the paper
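A minimal sketch of this posterior-mean estimate, y_hat = M^{-1} W^T (x - mu) with M = W^T W + sigma^2 I; to illustrate the diagonal-covariance property i use a loading matrix with orthogonal columns (as the maximum likelihood estimate has), and all sizes and values are my own toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)

D, Q = 6, 3
# Loading matrix with orthogonal columns of distinct norms (ML-style), so that
# W^T W, and hence M, is diagonal.
Wq, _ = np.linalg.qr(rng.standard_normal((D, Q)))
W = Wq @ np.diag([2.0, 1.5, 1.0])
mu = rng.standard_normal(D)
sigma2 = 0.2

# Posterior-mean transform: y_hat = M^{-1} W^T (x - mu).
M = W.T @ W + sigma2 * np.eye(Q)
T = np.linalg.solve(M, W.T)            # the Q x D transformation matrix

# Sample from the ppca model and transform: the result should be zero mean
# with diagonal covariance I - sigma2 M^{-1}.
n = 100_000
y = rng.standard_normal((n, Q))
eps = np.sqrt(sigma2) * rng.standard_normal((n, D))
X = y @ W.T + mu + eps
Y_hat = (X - mu) @ T.T
emp_cov = np.cov(Y_hat, rowvar=False)
```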
0:06:05so then we go to a mixture of these models which is basically a mixture
0:06:09of ppca
0:06:11so it's basically like a gaussian mixture model
0:06:16but the good thing about this is you can
0:06:18directly compute the
0:06:22fa parameters
0:06:23from the full covariance ubm
0:06:25and that becomes really handy as you will see
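One way this derivation from the full covariance ubm might be sketched: one eigendecomposition per mixture, no iterative training. The gmm here is a random stand-in of mine; in practice the means and covariances would come from the trained ubm:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy sizes: C mixtures, D-dim features, Q retained factors per mixture.
C, D, Q = 4, 6, 3

# Stand-in for the full covariance ubm: random means and covariances.
means = rng.standard_normal((C, D))
covs = []
for _ in range(C):
    A = rng.standard_normal((D, D))
    covs.append(A @ A.T + 0.1 * np.eye(D))

# Per-mixture ppca parameters and Q x D transforms, read off each covariance.
transforms = []
for c in range(C):
    lam, U = np.linalg.eigh(covs[c])
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    sigma2_c = lam[Q:].mean()                          # per-mixture noise variance
    W_c = np.asarray(U[:, :Q]) @ np.diag(np.sqrt(lam[:Q] - sigma2_c))
    M_c = W_c.T @ W_c + sigma2_c * np.eye(Q)
    transforms.append(np.linalg.solve(M_c, W_c.T))     # posterior-mean transform
```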
0:06:30next i'd like to talk about how we want to use the
0:06:34transformations so you have a thousand and twenty four mixtures and each mixture has
0:06:38a transformation so what you could do is you take a feature vector and
0:06:42you find the most likely mixture and you transform the feature and then
0:06:47you know replace the original vector right
0:06:49but what we saw is
0:06:52that is actually kind of not the optimal way of doing it because
0:06:57if you plot the top scoring mixture posterior of say your development data
0:07:02this is kind of the distribution
0:07:05so what this tells you is
0:07:07it's very rare that an acoustic feature is unquestionably aligned to
0:07:11one mixture most of the time the top posterior is something like point
0:07:16four or point five
0:07:17so what that kind of means is
0:07:20you can't really say that this feature vector comes from this mixture it kind of
0:07:24belongs to a lot of mixtures
0:07:26maybe more than one so what we want to do is keep all
0:07:30the transformations
0:07:32that are done by all the mixtures
0:07:35so this is how we do it
0:07:36basically by
0:07:38integrating the process within the total variability model
0:07:43so within the i-vector system
0:07:45first we train the full covariance ubm
0:07:48and then we compute the parameters we set the value of Q to just
0:07:53say
0:07:53fifty
0:07:55then from the data we find the noise variances these are all different per mixture
0:07:59and for each mixture you find
0:08:01a factor loading matrix and the transformation
0:08:03so how it applies is basically
0:08:06directly on the first order statistics you don't actually have to apply it frame by frame so
0:08:13you compute the statistics and you can just take a transformation of them
0:08:17so it becomes very simple you just transform the first order statistics
0:08:22and then the transformation is completely integrated within the statistics estimation
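The statistics-level idea just described might look roughly like this sketch: accumulate soft, posterior-weighted Baum-Welch statistics and then apply each mixture's transform to its own (centered) first order statistic, so every mixture's transform is kept instead of hard-assigning frames. The exact statistic definitions, posteriors, and sizes below are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sizes: C mixtures, D-dim frames, Q retained factors, a short utterance.
C, D, Q, n_frames = 4, 6, 3, 500
frames = rng.standard_normal((n_frames, D))
means = rng.standard_normal((C, D))          # stand-in ubm means
T_c = rng.standard_normal((C, Q, D))         # per-mixture transforms (from the ubm)

# Soft posteriors gamma[t, c]; here random, normally computed from the ubm.
g = rng.random((n_frames, C))
gamma = g / g.sum(axis=1, keepdims=True)

N = gamma.sum(axis=0)                        # zeroth order statistics, per mixture
F = gamma.T @ frames - N[:, None] * means    # centered first order statistics

# Transform the statistics, not the frames: F_c -> T_c F_c, now Q-dimensional.
F_tilde = np.einsum('cqd,cd->cq', T_c, F)
```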
0:08:29so these are the differences with the conventional t-matrix training
0:08:34the feature size becomes Q instead of D
0:08:36the supervector size becomes C times Q
0:08:39and the t-matrix size becomes smaller
0:08:41and most importantly the ubm gets replaced by the distribution of the transformed features so
0:08:48since we are not using the original features in the subsequent processing
0:08:53this is not really the ubm this is basically how the parameters get replaced
0:08:59and the i-vector extraction
0:09:01procedure is similar
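Using the sizes mentioned elsewhere in the talk (a thousand and twenty four mixtures, sixty-dimensional features, four-hundred-dimensional i-vectors), the shrinkage can be checked with a quick computation; the arithmetic is mine, not from the slides:

```python
# Sizes from the talk: C mixtures, D-dim features, R-dim i-vectors.
C, D, R = 1024, 60, 400

# Baseline supervector length C*D and t-matrix entry count C*D*R.
baseline = (C * D, C * D * R)

# With Q retained acoustic factors per mixture, both shrink proportionally.
sizes = {Q: (C * Q, C * Q * R) for Q in (54, 48, 42)}
```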
0:09:05for the system we have a phone recognizer based vad
0:09:10sixty-dimensional features with
0:09:12cepstral mean normalization
0:09:13we have a gender dependent ubm with a thousand and twenty four mixtures
0:09:20we train the full covariance ubm with
0:09:22variance flooring which sets the
0:09:27minimum eigenvalue of the covariance matrices to
0:09:30a fixed value
0:09:32and the i-vector size was four hundred
0:09:36and we used five iterations
0:09:38so we have the plda backend where we have a full covariance noise model
0:09:45and the only free parameter is the eigenvoice size
0:09:49next we have the fa which i just talked about we derive all
0:09:53the parameters from the ubm directly
0:09:55and we performed experiments on sre twenty ten basically
0:10:00conditions one to five and we used the male trials
0:10:05so these are the initial results as you can see
0:10:08we varied the
0:10:10eigenvoice size of the plda
0:10:14and we used Q equal to fifty four forty eight and forty two
0:10:18our feature size is sixty so that means
0:10:21taking off six components and so on
0:10:25and you can see we get a nice improvement using the proposed technique
0:10:31so here's a
0:10:33table showing some of the systems
0:10:36that we fused
0:10:38the baseline is sitting here
0:10:39and we are getting a nice improvement in all three metrics for a couple of values of Q
0:10:45it's kind of hard to say which Q would work best
0:10:47that's a challenge
0:10:50and also
0:10:52this Q is not necessarily
0:10:53optimal it can have a different value in each mixture depending on
0:10:58how the covariance structure is in the mixture
0:11:02i also did some work on that and
0:11:05you can probably
0:11:06see it at interspeech
0:11:10so anyway
0:11:12when we fuse the systems it's late fusion and we can see we
0:11:17can still get a pretty nice improvement
0:11:20by fusing
0:11:21different combinations
0:11:23so these systems do have complementary information
0:11:27so these are actually extra experiments performed after
0:11:31this paper was submitted so only one is shown
0:11:33in the paper for the other conditions in condition one
0:11:37Q equal to forty eight works nicely and in condition two Q equal to forty two works
0:11:43best
0:11:45in condition
0:11:47three it's Q equal to forty eight and fifty four
0:11:50but in condition four
0:11:54the new dcf
0:11:56didn't improve
0:11:59for a few of the conditions
0:12:02but you can see clearly that
0:12:04the proposed
0:12:06technique works well it reduces
0:12:09all three
0:12:10performance metrics in these conditions
0:12:12and after fusion you can actually see a nice
0:12:15improvement in all three metrics
0:12:22so here is the det curve on conditions one to five
0:12:28and we just picked the Q equal to forty two system
0:12:32and you can see that
0:12:34almost everywhere
0:12:36the fa system is
0:12:37better than the baseline
0:12:39and with fusion we get
0:12:41further improvement
0:12:45so to conclude
0:12:47we have proposed a factor analysis framework for acoustic features with mixture-dependent feature transformations
0:12:56giving a compact representation
0:12:59and we proposed a probabilistic feature alignment method
0:13:04instead of hard-clustering a feature vector to a mixture
0:13:08and so we showed that
0:13:10it provides better performance
0:13:12when we integrate it with the i-vector system
0:13:15and as a kind of
0:13:18nice side effect it makes the system faster because
0:13:22you are reducing the feature vector dimensionality which in turn reduces the super
0:13:27vector size and the t-matrix size
0:13:29and
0:13:30as is discussed in this paper
0:13:34the computational complexity is proportional to the
0:13:37supervector size
0:13:39so for future work
0:13:45Q can be mixture dependent basically so
0:13:47here we retained an equal feature dimension like say
0:13:51forty eight for all the mixtures
0:13:53but it can be different so one of my papers that was accepted at interspeech
0:13:58deals with trying to
0:13:59optimize this parameter in each mixture
0:14:03and also
0:14:05some of the future work will be
0:14:07using the iterative em techniques proposed in tipping and bishop's method
0:14:12for the mixture of ppca
0:14:16most of all actually
0:14:18this opens up
0:14:21the possibility of
0:14:22using other transformations mixture-wise which might also be interesting
0:14:26for example conventional transformations like
0:14:33nap or other techniques
0:14:34which would actually take
0:14:36transformations in each mixture and then
0:14:40basically integrate them with the i-vectors
0:14:45so
0:14:46that is all i have thank you
0:15:15sorry could you go back to the acoustic features
0:15:35would we need to train the ubm from scratch
0:15:40oh yeah i tried that i've seen some papers too
0:15:44but i think the way i did it
0:16:01so
0:16:03to cluster a feature you have to have some kind of measure
0:16:07usually you can find the mixture with
0:16:12the highest posterior probability
0:16:15but in this distribution i'm showing that
0:16:18it's not always a one to one feature to mixture mapping because sometimes if the maximum value
0:16:23of
0:16:23the posterior probability of the mixture is say point two there are
0:16:27other mixtures
0:16:29with
0:16:30point something which means
0:16:32if you take the point two mixture as the maximum and use that mixture's transformation it
0:16:36will be suboptimal
0:16:40yeah we could do it that way
0:16:42but i tried this because i had seen this distribution and i thought
0:16:45it would be nicer to integrate things
0:17:20so the number of trials
0:17:36i think i normalized by the noise variance
0:18:10right yes
0:18:11maybe what you're saying is true
0:18:21maybe i don't know if i had the flooring problem
0:18:25i believe so
0:18:43yeah i think that is it