0:00:15the next presentation is "factor analysis of acoustic features using a mixture of probabilistic
0:00:20principal component analyzers"
0:00:22for robust speaker verification
0:00:53so the title is factor analysis of acoustic features using a mixture of probabilistic principal
0:00:57component analyzers
0:00:59for robust speaker verification
0:01:05so in the introduction what i want to say is
0:01:09factor analysis is a very popular technique when applied to gmm supervectors
0:01:14and the main assumption there is
0:01:17that for a randomly chosen speaker the gmm supervector lies in a low-dimensional subspace
0:01:24but it is often overlooked that the acoustic features themselves also lie in low-dimensional
0:01:30subspaces
0:01:32and this phenomenon is not really
0:01:35taken into consideration in gmm supervector based factor analysis
0:01:40so we propose to try to see
0:01:44what happens if we do factor analysis on the acoustic features
0:01:48in addition to the i-vector based approach
0:01:53so just to say more about the motivation
0:01:57we know that speech spectral components are highly correlated so in the
0:02:03mfcc features
0:02:06we have a pca-like dct to decorrelate these
0:02:10there has been a lot of work on trying to decorrelate features
0:02:14and it has been shown that the first few eigen directions of the feature covariance matrix
0:02:19are more speaker-dependent
0:02:26so what we believe is that retaining all the
0:02:33eigen directions
0:02:34of the features might actually be harmful there might be some
0:02:38directions that are not beneficial
0:02:41we also get evidence from the full covariance based i-vector system which
0:02:45works better than the diagonal covariance system
0:02:49and that motivates us to investigate this further
0:02:54so if you look at a full covariance matrix
0:02:58the covariance matrix of a full covariance ubm this is how it looks
0:03:03and if you look at the eigenvalue distribution you see most of the energy is concentrated
0:03:09in the first
0:03:10say thirty two eigenvalues in this case
0:03:12so they're pretty compact
0:03:15so we have
0:03:19reason to believe that there are some components in the features
0:03:26which are not really useful
0:03:28so we use factor analysis
0:03:33on acoustic features and this is the basic formulation very simple
0:03:37you have a feature vector x and a factor loading matrix
0:03:42y is the acoustic factors which are basically the
0:03:45hidden variables
0:03:47mu is the mean vector and
0:03:49epsilon is the isotropic noise
0:03:51so this is basically ppca
0:03:54and the interpretation is that the covariance of the acoustic features
0:04:03is now modeled by the hidden variables while the residual variance is modeled by the noise model
0:04:09and this is the pdf of the model
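As a rough sketch of the ppca model just described, x = W y + mu + eps with y ~ N(0, I) and isotropic noise eps, the implied feature covariance is W W^T + sigma^2 I; the dimensions and parameter values below are illustrative stand-ins of mine, not the talk's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: D observed feature dimensions, Q hidden acoustic factors.
D, Q = 6, 3
W = rng.standard_normal((D, Q))   # factor loading matrix
mu = rng.standard_normal(D)       # mean vector
sigma2 = 0.1                      # isotropic noise variance

# Generative model: x = W y + mu + eps,  y ~ N(0, I),  eps ~ N(0, sigma2 I).
n = 200_000
y = rng.standard_normal((n, Q))
eps = np.sqrt(sigma2) * rng.standard_normal((n, D))
x = y @ W.T + mu + eps

# The model implies cov(x) = W W^T + sigma2 I; check against the sample covariance.
model_cov = W @ W.T + sigma2 * np.eye(D)
sample_cov = np.cov(x, rowvar=False)
max_err = np.abs(sample_cov - model_cov).max()
```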
0:04:13and so what we try to do here is we want to replace the acoustic
0:04:17features by the acoustic factors basically the estimates of the acoustic factors
0:04:23and try to use them as the features
0:04:26believing that these acoustic factors
0:04:29have more speaker-dependent information while the full feature vector might have some nuisance components
0:04:35so a transformation matrix is derived
0:04:38it also comes from tipping and bishop's paper so first you have
0:04:42to select the number of
0:04:44components you want to retain
0:04:46suppose you have sixty features and you want to keep
0:04:49forty
0:04:50then Q would be forty
0:04:52and the noise variance estimate is given by the average of the remaining components of the
0:05:02sorted eigenvalues
0:05:04where lambda i is the i-th eigenvalue of the covariance matrix of x
0:05:07and this is the maximum likelihood estimate of the factor loading matrix
0:05:12which is also from the tipping and bishop paper
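A minimal sketch of these closed-form ppca estimates on a synthetic covariance matrix, assuming the usual forms sigma^2 = average of the discarded eigenvalues and W = U_Q (Lambda_Q - sigma^2 I)^{1/2} with the rotation taken as identity; the toy sizes here are mine:

```python
import numpy as np

rng = np.random.default_rng(1)

# A synthetic stand-in for the feature covariance matrix S (toy sizes).
D, Q = 6, 4
A = rng.standard_normal((D, D))
S = A @ A.T

eigvals, eigvecs = np.linalg.eigh(S)   # ascending order
order = np.argsort(eigvals)[::-1]      # re-sort descending
lam = eigvals[order]
U = eigvecs[:, order]

# Noise variance: average of the D - Q discarded eigenvalues.
sigma2 = lam[Q:].mean()

# Factor loading matrix: W = U_Q (Lambda_Q - sigma2 I)^{1/2}.
W = np.asarray(U[:, :Q]) @ np.diag(np.sqrt(lam[:Q] - sigma2))

# Sanity check: W W^T + sigma2 I reproduces the top-Q eigenvalues of S exactly
# and replaces the discarded ones by sigma2.
recon = W @ W.T + sigma2 * np.eye(D)
```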
0:05:17so this is how we estimate the acoustic factors which is basically
0:05:23the posterior mean of the acoustic factors given the observation
0:05:27and it can be shown to be given by the
0:05:30expression here so it's basically removal of the mean and a transformation by this matrix
0:05:36which is given by this
0:05:38and so it's just a linear transformation
0:05:42and this is the transformed feature vector which we would like to
0:05:47call the acoustic factor estimate
0:05:47and if you look at the mean and covariance matrix of this quantity it's
0:05:51zero-mean gaussian distributed with
0:05:54a diagonal covariance matrix given by this
0:06:01expression in the paper
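A minimal sketch of this posterior-mean estimate, y_hat = M^{-1} W^T (x - mu) with M = W^T W + sigma^2 I; to illustrate the diagonal-covariance property i use a loading matrix with orthogonal columns (as the maximum likelihood estimate has), and all sizes and values are my own toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)

D, Q = 6, 3
# Loading matrix with orthogonal columns of distinct norms (ML-style), so that
# W^T W, and hence M, is diagonal.
Wq, _ = np.linalg.qr(rng.standard_normal((D, Q)))
W = Wq @ np.diag([2.0, 1.5, 1.0])
mu = rng.standard_normal(D)
sigma2 = 0.2

# Posterior-mean transform: y_hat = M^{-1} W^T (x - mu).
M = W.T @ W + sigma2 * np.eye(Q)
T = np.linalg.solve(M, W.T)            # the Q x D transformation matrix

# Sample from the ppca model and transform: the result should be zero mean
# with diagonal covariance I - sigma2 M^{-1}.
n = 100_000
y = rng.standard_normal((n, Q))
eps = np.sqrt(sigma2) * rng.standard_normal((n, D))
X = y @ W.T + mu + eps
Y_hat = (X - mu) @ T.T
emp_cov = np.cov(Y_hat, rowvar=False)
```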
0:06:05so then we go to a mixture of these models which is basically a mixture
0:06:09of ppca
0:06:11so it's basically like a gaussian mixture model
0:06:16but the good thing about this is you can
0:06:18directly compute the
0:06:22fa parameters
0:06:23from the full covariance ubm
0:06:25and that becomes really handy as you will see
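One way this derivation from the full covariance ubm might be sketched: one eigendecomposition per mixture, no iterative training. The gmm here is a random stand-in of mine; in practice the means and covariances would come from the trained ubm:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy sizes: C mixtures, D-dim features, Q retained factors per mixture.
C, D, Q = 4, 6, 3

# Stand-in for the full covariance ubm: random means and covariances.
means = rng.standard_normal((C, D))
covs = []
for _ in range(C):
    A = rng.standard_normal((D, D))
    covs.append(A @ A.T + 0.1 * np.eye(D))

# Per-mixture ppca parameters and Q x D transforms, read off each covariance.
transforms = []
for c in range(C):
    lam, U = np.linalg.eigh(covs[c])
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    sigma2_c = lam[Q:].mean()                          # per-mixture noise variance
    W_c = np.asarray(U[:, :Q]) @ np.diag(np.sqrt(lam[:Q] - sigma2_c))
    M_c = W_c.T @ W_c + sigma2_c * np.eye(Q)
    transforms.append(np.linalg.solve(M_c, W_c.T))     # posterior-mean transform
```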
0:06:30next i'd like to talk about how we want to use the
0:06:34transformations so you have a thousand and twenty four mixtures and each mixture has
0:06:38a transformation so what you could do is you take a feature vector and
0:06:42you find the most likely mixture and you transform the feature and then
0:06:47you know replace the original vector right
0:06:49but what we saw is
0:06:52that is actually kind of not the optimal way of doing it because
0:06:57if you plot the top scoring mixture posterior of say your development data
0:07:02this is kind of the distribution
0:07:05so what this tells you is
0:07:07it's very rare that an acoustic feature is unquestionably aligned to
0:07:11one mixture most of the time the top posterior is something like point
0:07:16four or point five
0:07:17so what that kind of means is
0:07:20you can't really say that this feature vector comes from this mixture it kind of
0:07:24belongs to a lot of mixtures
0:07:26maybe more than one so what we want to do is keep all
0:07:30the transformations
0:07:32that are done by all the mixtures
0:07:35so this is how we do it
0:07:36basically by
0:07:38integrating the process within the total variability model
0:07:43so within the i-vector system
0:07:45first we train the full covariance ubm
0:07:48and then we compute the parameters we set the value of Q to just
0:07:53say
0:07:53fifty
0:07:55then from the data we find the noise variances these are all different per mixture
0:07:59and for each mixture you find
0:08:01a factor loading matrix and the transformation
0:08:03so how it applies is basically
0:08:06directly on the first order statistics you don't actually have to apply it frame by frame so
0:08:13you compute the statistics and you can just take a transformation of them
0:08:17so it becomes very simple you just transform the first order statistics
0:08:22and then the transformation is completely integrated within the statistics estimation
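The statistics-level idea just described might look roughly like this sketch: accumulate soft, posterior-weighted Baum-Welch statistics and then apply each mixture's transform to its own (centered) first order statistic, so every mixture's transform is kept instead of hard-assigning frames. The exact statistic definitions, posteriors, and sizes below are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sizes: C mixtures, D-dim frames, Q retained factors, a short utterance.
C, D, Q, n_frames = 4, 6, 3, 500
frames = rng.standard_normal((n_frames, D))
means = rng.standard_normal((C, D))          # stand-in ubm means
T_c = rng.standard_normal((C, Q, D))         # per-mixture transforms (from the ubm)

# Soft posteriors gamma[t, c]; here random, normally computed from the ubm.
g = rng.random((n_frames, C))
gamma = g / g.sum(axis=1, keepdims=True)

N = gamma.sum(axis=0)                        # zeroth order statistics, per mixture
F = gamma.T @ frames - N[:, None] * means    # centered first order statistics

# Transform the statistics, not the frames: F_c -> T_c F_c, now Q-dimensional.
F_tilde = np.einsum('cqd,cd->cq', T_c, F)
```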
0:08:29so these are the differences with the conventional t-matrix training
0:08:34the feature size becomes Q instead of D
0:08:36the supervector size becomes C times Q
0:08:39and the t-matrix size becomes smaller
0:08:41and most importantly the ubm gets replaced by the distribution of the transformed features so
0:08:48since we are not using the original features in the subsequent processing
0:08:53this is not really the ubm this is basically how the parameters get replaced
0:08:59and the i-vector extraction
0:09:01procedure is similar
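Using the sizes mentioned elsewhere in the talk (a thousand and twenty four mixtures, sixty-dimensional features, four-hundred-dimensional i-vectors), the shrinkage can be checked with a quick computation; the arithmetic is mine, not from the slides:

```python
# Sizes from the talk: C mixtures, D-dim features, R-dim i-vectors.
C, D, R = 1024, 60, 400

# Baseline supervector length C*D and t-matrix entry count C*D*R.
baseline = (C * D, C * D * R)

# With Q retained acoustic factors per mixture, both shrink proportionally.
sizes = {Q: (C * Q, C * Q * R) for Q in (54, 48, 42)}
```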
0:09:05for the system we have a phone recognizer based vad
0:09:10sixty-dimensional features with
0:09:12cepstral mean normalization
0:09:13we have a gender dependent ubm with a thousand and twenty four mixtures
0:09:20we train the full covariance ubm with
0:09:22variance flooring which sets the
0:09:27minimum eigenvalue of the covariance matrices to
0:09:30a fixed value
0:09:32and the i-vector size was four hundred
0:09:36and we used five iterations
0:09:38so we have the plda backend where we have a full covariance noise model
0:09:45and the only free parameter is the eigenvoice size
0:09:49next we have the fa which i just talked about we derive all
0:09:53the parameters from the ubm directly
0:09:55and we performed experiments on sre twenty ten basically
0:10:00conditions one to five and we used the male trials
0:10:05so these are the initial results as you can see
0:10:08we varied the
0:10:10eigenvoice size of the plda
0:10:14and we used Q equal to fifty four forty eight and forty two
0:10:18our feature size is sixty so that means
0:10:21taking off six components and so on
0:10:25and you can see we get a nice improvement using the proposed technique
0:10:31so here's a
0:10:33table showing some of the systems
0:10:36that we fused
0:10:38the baseline is sitting here
0:10:39and we are getting a nice improvement in all three metrics for a couple of values of Q
0:10:45it's kind of hard to say which Q would work best
0:10:47that's a challenge
0:10:50and also
0:10:52this Q is not necessarily
0:10:53optimal it can have a different value in each mixture depending on
0:10:58how the covariance structure is in the mixture
0:11:02i also did some work on that and
0:11:05you can probably
0:11:06see it at interspeech
0:11:10so anyway
0:11:12when we fuse the systems it's late fusion and we can see we
0:11:17can still get a pretty nice improvement
0:11:20by fusing
0:11:21different combinations
0:11:23so these systems do have complementary information
0:11:27so these are actually extra experiments performed after
0:11:31this paper was submitted so only one is shown
0:11:33in the paper for the other conditions in condition one
0:11:37Q equal to forty eight works nicely and in condition two Q equal to forty two works
0:11:43best
0:11:45in condition
0:11:47three it's Q equal to forty eight and fifty four
0:11:50but in condition four
0:11:54the new dcf
0:11:56didn't improve
0:11:59for a few of the conditions
0:12:02but you can see clearly that
0:12:04the proposed
0:12:06technique works well it reduces
0:12:09all three
0:12:10performance metrics in these conditions
0:12:12and after fusion you can actually see a nice
0:12:15improvement in all three metrics
0:12:22so here is the det curve on conditions one to five
0:12:28and we just picked the Q equal to forty two system
0:12:32and you can see that
0:12:34almost everywhere
0:12:36the fa system is
0:12:37better than the baseline
0:12:39and with fusion we get
0:12:41further improvement
0:12:45so to conclude
0:12:47we have proposed a factor analysis framework for acoustic features with mixture-dependent feature transformations
0:12:56giving a compact representation
0:12:59and we proposed a probabilistic feature alignment method
0:13:04instead of hard-clustering a feature vector to a mixture
0:13:08and so we showed that
0:13:10it provides better performance
0:13:12when we integrate it with the i-vector system
0:13:15and as a kind of
0:13:18nice side effect it makes the system faster because
0:13:22you are reducing the feature vector dimensionality which in turn reduces the super
0:13:27vector size and the t-matrix size
0:13:29and
0:13:30as is discussed in this paper
0:13:34the computational complexity is proportional to the
0:13:37supervector size
0:13:39so for future work
0:13:45Q can be mixture dependent basically so
0:13:47here we retained an equal feature dimension like say
0:13:51forty eight for all the mixtures
0:13:53but it can be different so one of my papers that was accepted at interspeech
0:13:58deals with trying to
0:13:59optimize this parameter in each mixture
0:14:03and also
0:14:05some of the future work will be
0:14:07using the iterative em techniques proposed in tipping and bishop's method
0:14:12for the mixture of ppca
0:14:16most of all actually
0:14:18this opens up
0:14:21the possibility of
0:14:22using other transformations mixture-wise which might also be interesting
0:14:26for example conventional transformations like
0:14:33nap or other techniques
0:14:34which would actually take
0:14:36transformations in each mixture and then
0:14:40basically integrate them with the i-vectors
0:14:45so
0:14:46that is all i have thank you
0:15:15sorry could you go back to the acoustic features
0:15:35would we need to train the ubm from scratch
0:15:40oh yeah i tried that i've seen some papers too
0:15:44but i think the way i did it
0:16:01so
0:16:03to cluster a feature you have to have some kind of measure
0:16:07usually you can find the mixture with
0:16:12the highest posterior probability
0:16:15but in this distribution i'm showing that
0:16:18it's not always a one to one feature to mixture mapping because sometimes if the maximum value
0:16:23of
0:16:23the posterior probability of the mixture is say point two there are
0:16:27other mixtures
0:16:29with
0:16:30point something which means
0:16:32if you take the point two mixture as the maximum and use that mixture's transformation it
0:16:36will be suboptimal
0:16:40yeah we could do it that way
0:16:42but i tried this because i had seen this distribution and i thought
0:16:45it would be nicer to integrate things
0:17:20so the number of trials
0:17:36i think i normalized by the noise variance
0:18:10right yes
0:18:11maybe what you're saying is true
0:18:21maybe i don't know if i had the flooring problem
0:18:25i believe so
0:18:43yeah i think that is it