Speech Transcript - Factor Analysis of Acoustic Features using a Mixture of Probabilistic Principal Component Analyzers for robust Speaker Verification

the next presentation is factor analysis of acoustic features using a mixture of probabilistic

principal component analyzers

for robust speaker verification

and

that is

factor analysis of acoustic features using a mixture of probabilistic principal

component analyzers

for robust speaker verification

so in the introduction what i want to say is

so factor analysis is a very popular technique when applied to gmm supervectors

and the main assumption there is

that for a randomly chosen speaker the gmm supervector lies in a low-dimensional subspace

but actually it's kind of known that the acoustic features also lie in a low

dimensional subspace

and this phenomenon is not really

taken into consideration in gmm supervector based factor analysis

so we propose to try to see

what happens if we do factor analysis on the acoustic features

in addition to the i-vector based approach

so just to say more about the motivation

we do know that speech spectral components are highly correlated so in the

mfcc features

we have a pca or dct to decorrelate these

there is a lot of work on trying to decorrelate the features

it has been shown that the first few eigen directions of the feature covariance matrix

are more speaker-dependent

so by maximizing

back into the

so what we believe is that retaining the full feature all the directions all the

eigen directions

of the features might actually be harmful there might be some

directions that are not beneficial

we also get evidence from the full covariance based i-vector system that

works better than the diagonal system

which

so motivates us to investigate this further

so if you look at a full covariance matrix

the covariance matrix of a full covariance ubm this is how it kind of looks

and if you look at the eigenvalue distribution you see most of the energy is compressed

in the first

say thirty two eigenvalues in this case

so they're pretty much compact

so i

i kind of thought okay that there might be

a reason to believe that there are some components there

which are not really

so we use factor analysis

on acoustic features so this is the basic formulation very simple

so you have a feature vector x w is the factor loading matrix

y is the acoustic factors which is basically the

the hidden variables

mu is the mean vector and

epsilon is the isotropic noise

so this is basically a ppca

and the interpretation is that the covariance is now modeled by the hidden variables

the covariance of the acoustic features

and the residual variance is modeled by a noise model

so this is the pdf of the model
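
For reference, the formulation described here matches the standard probabilistic PCA model of Tipping and Bishop. The slide is not part of the transcript, so the symbols below (W for the loading matrix, y for the acoustic factors, sigma squared for the noise variance, D and q for the original and reduced dimensions) are assumed notation:

```latex
\begin{align}
\mathbf{x} &= \mathbf{W}\mathbf{y} + \boldsymbol{\mu} + \boldsymbol{\epsilon},
\qquad \mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_q),
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_D) \\
p(\mathbf{x}) &= \mathcal{N}\!\left(\mathbf{x};\ \boldsymbol{\mu},\ \mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}_D\right)
\end{align}
```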

and so what we try to do here is we want to replace the acoustic

features by the acoustic factors basically the estimate of the acoustic factors

and try to use them as the features

believing that these acoustic factors

have more speaker-dependent information and the full feature vector might have some nuisance components

so a transformation matrix is derived

so it's also coming from the tipping and bishop paper you can see first you have

to select the number of

coefficients you want to retain

suppose i have sixty features and i want to keep

forty

so q would be equal to forty

and the noise variance estimation is done by this so that's the average of the remaining components

of the

eigenvalues

sorted eigenvalues

so this is the i-th eigenvalue of the covariance matrix of x
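
A minimal sketch of the noise variance estimate described here, assuming the standard probabilistic PCA result that it is the average of the D - q smallest eigenvalues. The function name and the toy eigenvalues are made up for illustration:

```python
def ppca_noise_variance(eigenvalues, q):
    """Average of the D - q smallest eigenvalues of the feature covariance."""
    lam = sorted(eigenvalues, reverse=True)  # sort in descending order
    d = len(lam)
    if not 0 < q < d:
        raise ValueError("q must satisfy 0 < q < D")
    return sum(lam[q:]) / (d - q)


# Toy spectrum (made-up numbers): D = 6 eigenvalues, keep q = 4 directions.
toy_eigenvalues = [5.0, 3.0, 2.0, 1.0, 0.4, 0.2]
sigma2 = ppca_noise_variance(toy_eigenvalues, q=4)
```

With these toy numbers the two discarded eigenvalues are 0.4 and 0.2, so the estimate is their average.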

and this is the factor loading matrix the maximum likelihood estimate

and it's also from the tipping and bishop paper
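
The maximum likelihood loading matrix can be sketched as follows, assuming the closed form from probabilistic PCA with the arbitrary rotation set to identity: each of the top q eigenvectors is scaled by the square root of its eigenvalue minus the noise variance. The function name and the 2-D toy covariance are illustrative only:

```python
import math


def ppca_loading_matrix(eigvals, eigvecs, q):
    """Closed-form loading matrix W = U_q (Lambda_q - sigma^2 I)^(1/2).

    eigvals: eigenvalues sorted in descending order (length D).
    eigvecs: eigvecs[i] is the unit-norm eigenvector paired with eigvals[i].
    Returns (W, sigma2), where W is a D x q matrix stored as a list of rows.
    """
    d = len(eigvals)
    sigma2 = sum(eigvals[q:]) / (d - q)  # average of the discarded eigenvalues
    scales = [math.sqrt(eigvals[i] - sigma2) for i in range(q)]
    W = [[eigvecs[i][r] * scales[i] for i in range(q)] for r in range(d)]
    return W, sigma2


# Toy 2-D covariance [[2, 1], [1, 2]] has eigenvalues 3 and 1.
s = 1.0 / math.sqrt(2.0)
W, sigma2 = ppca_loading_matrix([3.0, 1.0], [[s, s], [s, -s]], q=1)

# Covariance implied by the model: W W^T + sigma^2 I.
model_cov = [
    [sum(W[r][k] * W[c][k] for k in range(1)) + (sigma2 if r == c else 0.0)
     for c in range(2)]
    for r in range(2)
]
```

In this toy case only one direction is discarded, so the model covariance reproduces the original covariance; with more discarded directions their eigenvalues are replaced by the common average.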

so this is how we estimate the acoustic factors which is basically

the expected value of the posterior mean of the acoustic factors

and it can be shown to be equal to the

expression here so it's basically removal of the mean and the transformation by this matrix

which is given by this

and so it's just a linear transformation
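
A hedged sketch of this linear transformation, assuming the standard probabilistic PCA posterior mean A (x - mu) with A = (W^T W + sigma^2 I)^(-1) W^T. With the closed-form loading matrix, W^T W + sigma^2 I is diagonal with the top q eigenvalues, so no matrix inversion is needed. All names here are hypothetical:

```python
import math


def ppca_transform(eigvals, eigvecs, q):
    """Rows of A = (W^T W + sigma^2 I)^(-1) W^T for the closed-form W.

    Row i equals sqrt(lambda_i - sigma^2) / lambda_i times eigenvector i,
    because W^T W + sigma^2 I reduces to diag(lambda_1, ..., lambda_q).
    """
    d = len(eigvals)
    sigma2 = sum(eigvals[q:]) / (d - q)
    return [
        [math.sqrt(eigvals[i] - sigma2) / eigvals[i] * u for u in eigvecs[i]]
        for i in range(q)
    ]


def acoustic_factors(A, mean, x):
    """Remove the mean, then apply the q x D transformation."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(a * c for a, c in zip(row, centered)) for row in A]


# Toy 2-D example: covariance [[2, 1], [1, 2]], mean [1, 1], keep q = 1.
s = 1.0 / math.sqrt(2.0)
A = ppca_transform([3.0, 1.0], [[s, s], [s, -s]], q=1)
z = acoustic_factors(A, mean=[1.0, 1.0], x=[3.0, 1.0])
```

The output has q entries per frame instead of D, which is what lets it stand in for the original feature vector.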

and if you take this this is the transformed feature vector which we'd like to

call it

and if you look at the mean and covariance matrix of this quantity it's

zero-mean gaussian distributed with

a diagonal covariance matrix given by this

burgers

in the paper
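
The zero mean and diagonal covariance can be checked with a short derivation. This is a sketch in assumed notation (z for the transformed feature, M for W^T W + sigma^2 I, Lambda_q for the diagonal matrix of the top q eigenvalues), using the closed-form loading matrix with the rotation set to identity; the expression on the slide may be written differently:

```latex
\begin{align}
\mathbf{z} &= \mathbf{M}^{-1}\mathbf{W}^{\top}(\mathbf{x} - \boldsymbol{\mu}),
\qquad \mathbf{M} = \mathbf{W}^{\top}\mathbf{W} + \sigma^2 \mathbf{I}_q,
\qquad \mathbb{E}[\mathbf{z}] = \mathbf{0} \\
\operatorname{Cov}[\mathbf{z}]
 &= \mathbf{M}^{-1}\mathbf{W}^{\top}\left(\mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}_D\right)\mathbf{W}\,\mathbf{M}^{-1}
  = \mathbf{M}^{-1}\mathbf{W}^{\top}\mathbf{W}\,\mathbf{M}\,\mathbf{M}^{-1}
  = \mathbf{M}^{-1}\mathbf{W}^{\top}\mathbf{W} \\
 &= \mathbf{I}_q - \sigma^2 \mathbf{M}^{-1}
  = \mathbf{I}_q - \sigma^2 \boldsymbol{\Lambda}_q^{-1}
\end{align}
```

The last step uses W^T W = Lambda_q - sigma^2 I_q, so M = Lambda_q and the covariance is diagonal.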

so what we do is a mixture of fa models which is basically the mixture

of ppca

so it's basically like a gaussian mixture model the same

but the good thing about this is you can

directly compute the parameters

the fa parameters

from the full covariance ubm

and that becomes really handy

next i'd like to talk about how we want to use the

the transformation so you have say ten twenty four mixtures and each mixture has

a transformation so what you could do is you take a feature vector and

you find the most likely mixture and you transform the feature and then

you know replace the original vector right

but what we saw is

actually it's kind of not the optimal way of doing it because

so if you find the top scoring mixture of say your development data across the

again

so this is kind of the distribution

so what this tells you is

it's very rare that the acoustic feature is unquestionably aligned

to one mixture most of the time you get a maximum posterior like

point four point five

so what that kind of means is

you can't really say that this feature vector comes from this mixture it kind of

belongs to a lot of mixtures

maybe more than one so what we want to do is keep all

the transformations

that are done by all the mixtures
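
To illustrate why a hard decision can be a poor fit, here is a small sketch that computes mixture posteriors for one frame. It assumes a toy 1-D, 3-component model with made-up parameters (the real system uses a full covariance UBM with many more mixtures), and the function names are hypothetical:

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian (1-D only to keep the sketch short)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given the frame x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Toy model with overlapping components (made-up numbers).
weights = [0.3, 0.4, 0.3]
means = [-1.0, 0.0, 1.0]
variances = [1.0, 1.0, 1.0]

post = mixture_posteriors(0.5, weights, means, variances)
top = max(range(len(post)), key=post.__getitem__)  # index a hard decision would pick
```

For this frame the winning component holds a little under half of the posterior mass, so using only its transformation would ignore the components that share the rest.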

so this is how we do it

basically

integrating the process within the total variability model

so with the i-vector system

so first we train the ubm with full covariance

and then we compute the parameters like we set the value of q to say

fifty

i think

then we find the noise variance these are all for different mixtures

for each mixture you find a

a factor loading matrix and the transformation

so how it applies is basically

directly on the first order statistics you don't actually have to go frame-by-frame so

you compute the statistics and you can just take a transformation of the statistics

so it becomes very simple you just transform the first order statistics

and actually now the transformation is completely integrated within the system

so these are the differences with conventional t-matrix training

so the feature size becomes q instead of d

the supervector size becomes m q

and the t-v matrix size becomes smaller

and most importantly the ubm gets replaced by the distribution of the transformed features so

since we are not using the original features in the subsequent processing we will use

this it's not really the ubm this is basically how the parameters get replaced

and the i-vector extraction

procedure is similar
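
A quick back-of-the-envelope sketch of the size reduction, using figures quoted elsewhere in the talk (ten twenty four mixtures, sixty dimensional features, four hundred dimensional i-vectors, and q of fifty four, forty eight or forty two). It assumes the total variability matrix has one row per supervector element and one column per i-vector dimension; the function name is hypothetical:

```python
def model_sizes(num_mixtures, feat_dim, ivector_dim):
    """Return (supervector dimension, number of entries in the T matrix)."""
    supervector_dim = num_mixtures * feat_dim
    return supervector_dim, supervector_dim * ivector_dim


baseline = model_sizes(1024, 60, 400)  # original features, d = 60
reduced = {q: model_sizes(1024, q, 400) for q in (54, 48, 42)}

for q, (sv_dim, t_entries) in reduced.items():
    ratio = t_entries / baseline[1]
    print(f"q={q}: supervector {sv_dim}, T entries {t_entries}, ratio {ratio:.2f}")
```

The ratio is simply q over d, so keeping forty eight of sixty dimensions shrinks both the supervector and the T matrix to eighty percent of their baseline size.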

for our system we have a phone recognizer based vad

sixty dimensional features

cepstral mean normalization

we have a gender dependent ubm with ten twenty four mixtures

oh we train

we train the full covariance ubm with

variance flooring it's a parameter that sets the

minimum value of the covariance matrix to be

a fixed value

and the i-vector size was four hundred

and we used five iterations

so we have the plda backend where we have a full covariance noise model

and the only free parameter is the eigenvoice size

next we have the fa which i just talked about we derive all

the parameters from the ubm directly

and we performed experiments on sre twenty ten basically

conditions one to five we use the male trials

so this is the initial results as we can see

we change the

P of the inside the eigenvoice size from fifteen

then we use q equals fifty four forty eight and forty two

our feature size is sixty so you can see

we are taking off six components and so on

so overall we can get nice improvement using the proposed technique

so here's

table showing you some of the systems

that we fused

so the baseline is sitting here

and we are getting nice improvement in all three for a couple of values of q

it's kind of hard to say which q would work

that's a challenge

and also

this to that kind of

it can be optimal and it can have different value in each mixture depending on

how the mixture how the covariance structure is in the mixture

i also did some work on that and

probably

see interspeech

so anyway

when we fuse the systems it's just late fusion and we can see still we

can get a pretty nice improvement

by fusing

different combinations

so these systems do have complementary information

so these are actually extra experiments that we performed after

this paper was submitted so these are shown

for the other conditions it works in condition one

maybe q equals forty eight works nicely for condition two q equals forty two works

yeah

condition

three

q equals forty eight and fifty four

but in condition four we have

maybe the dcf

the new dcf didn't improve

but in the other conditions

but you can see clearly that

the proposed

technique works well it reduces

all three

performance indices

and after fusion you can actually see nice

improvement in all three of the parameters

so here is the det curve it's on condition one to five

and we just picked the q equals forty two system

oh and you can see it's

almost all

the fa system is

better than the baseline

and with fusion we get

for the

we have proposed a factor analysis framework for acoustic features a mixture-dependent feature transformation

and a compact representation as well

and we proposed a probabilistic feature alignment method

instead of hard-clustering a feature vector to a mixture

and so we show that

it provides better performance

when we integrate it with the i-vector system

and as a kind of

nice artifact it kind of makes it faster because

you know you're reducing the feature vector dimensionality which actually in turn reduces the super

vector size and t-v matrix size

and

you can see in this paper it is discussed that

the computational complexity is proportional to the

supervector size

so and future work

there's nothing to like

not

it can be mixture dependent basically so

we used a global feature dimension like say

forty eight for all the mixtures

but it can be different so one of my papers that is submitted to interspeech

deals with trying to

optimize the parameter in each mixture

and also

some of future work will be

using the iterative techniques proposed in tipping and bishop's method

for the mixture of ppca

most of all actually

this opens up

we have

using other transformations also mixture wise which might also be interesting to

people where you actually apply conventional transformations

and

nap or other techniques

which actually sort of take

transformations in each mixture and then

yeah so

and then basically integrated with the i-vectors

that is all i have thank you

sorry how do you can go back to the acoustic features

yeah

what we need to train the ubm from scratch

oh yeah i did i tried i've seen some papers to

i didn't think i think the way i did i thought

sure

you can

cluster a feature dimension you have to have some kind of measurement

usually you can find the mixture by the most

the mixture that gives you the highest posterior probability

but in this distribution i'm showing that

oh it's not always a one to one mixture because sometimes if the maximum value

of the posterior probability of the mixture is if it's giving you point two then

there are other mixtures

with

point one something that means

if you take point two as the maximum mixture and use that mixture's transformation it

will be

yeah we can you 'cause to do it

but i try because i just have seen this and i thought

it would be nicer generate things that make things

are

together what is

so a number of trials

yeah

i think i normalized in a binary invariance

although i

right yes

oh maybe what you're saying is true

since i get

maybe

conditions

maybe i don't know if i the folding problem

i believe

just to

well

yeah i think that

SESSION 08: Features for Speaker Recognition

Taufiq Hasan