0:00:22 | okay |

0:00:23 | the talk i'm going to give is about how to make uncertainty propagation run fast |

0:00:29 | and also consume less memory |

0:00:32 | my name is man-wai mak, from the hong kong polytechnic university |

0:00:37 | so here is the outline i will first give an overview of i-vector |

0:00:42 | p lda and explain how uncertainty propagation can |

0:00:49 | model the uncertainty of the i-vector |

0:00:52 | and how to make the uncertainty propagation run faster and also use less |

0:00:58 | memory |

0:00:59 | and we evaluate the proposed approach on nist sre 2012 |

0:01:08 | and finally we give a conclusion okay so here is an overview of the i-vector p lda |

0:01:13 | framework probably you all already know this so i will quickly go |

0:01:17 | through these slides here |

0:01:22 | we use the posterior mean of the latent factor |

0:01:31 | as a low dimensional representation of the speaker so given the mfcc vectors of |

0:01:39 | an utterance we compute the posterior mean of the |

0:01:44 | latent factor and we call this the i-vector |

0:01:47 | okay and t is the total variability matrix that defines the channel and speaker subspaces |

0:01:55 | or you can say it represents the subspace where the i-vectors vary |

0:02:00 | so here's the procedure for the i-vector extraction given a sequence of mfcc vectors |

0:02:05 | we extract the |

0:02:07 | i-vector using the posterior mean of the latent factor |

0:02:11 | and because we would like to use the gaussian p lda we need |

0:02:17 | to suppress the non-gaussian behavior of the i-vectors through some preprocessing |

0:02:24 | okay for example whitening and also length normalization |
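As a concrete illustration of this extraction step, here is a minimal numpy sketch. All quantities (the total variability matrix `T`, the diagonal UBM precisions, and the Baum-Welch statistics) are synthetic placeholders with toy dimensions, not the speaker's actual system:

```python
import numpy as np

rng = np.random.default_rng(0)

C, F, D = 8, 10, 5                         # Gaussians, feature dim, i-vector dim (toy)
T = 0.1 * rng.standard_normal((C * F, D))  # total variability matrix (synthetic)
Sigma_inv = np.ones(C * F)                 # diagonal UBM precisions (synthetic)

# Baum-Welch statistics of one utterance (synthetic placeholders)
N = rng.uniform(1.0, 50.0, size=C)         # zeroth-order stats (soft frame counts)
f = rng.standard_normal(C * F)             # centred first-order stats

# Posterior precision of the latent factor: L = I + sum_c N_c T_c' S_c^{-1} T_c
L = np.eye(D)
for c in range(C):
    Tc = T[c * F:(c + 1) * F]
    L += N[c] * Tc.T @ (Sigma_inv[c * F:(c + 1) * F, None] * Tc)

cov = np.linalg.inv(L)                     # posterior covariance (the uncertainty)
w = cov @ (T.T @ (Sigma_inv * f))          # posterior mean = the i-vector

# length normalization, part of the preprocessing mentioned above
w_ln = w / np.linalg.norm(w)
```

Note that the posterior covariance `cov` falls out of the same computation for free; it is exactly the quantity that uncertainty propagation, discussed later in the talk, keeps rather than discards.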

0:02:29 | and after this preprocessing step we get the preprocessed i-vector pairs |

0:02:38 | and then this preprocessed i-vector can be modeled by the p lda |

0:02:44 | so the idea is that in the p lda modeling v |

0:02:47 | represents the speaker subspace |

0:02:50 | and h i is the speaker factor |

0:02:55 | and as you can see for the j-th session of the i-th speaker we |

0:03:01 | only have one |

0:03:04 | latent factor h i okay |

0:03:07 | and epsilon i j represents the variability that cannot be represented by the |

0:03:12 | speaker subspace |
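In symbols, the generative model just described can be written as follows (I am supplying `m` for the global mean of the preprocessed i-vectors and `Sigma` for the residual covariance, for readability):

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{h}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\quad
\boldsymbol{\epsilon}_{ij} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})
```

Here the single speaker factor $\mathbf{h}_i$ is shared across all sessions $j$ of speaker $i$, while $\boldsymbol{\epsilon}_{ij}$ absorbs the per-session variability outside the speaker subspace spanned by $\mathbf{V}$.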

0:03:14 | so now let's turn to the scoring |

0:03:17 | at test time we have the |

0:03:20 | test i-vector |

0:03:21 | we have a test i-vector w t |

0:03:24 | and also we have this target speaker i-vector w s |

0:03:30 | and we compute the likelihood assuming that w s and w t come |

0:03:36 | from the same speaker |

0:03:38 | and we also have the alternative hypothesis where w s and w t |

0:03:43 | come from different speakers |

0:03:45 | therefore after some mathematical manipulation we get this very nice equation so |

0:03:51 | in this equation we only have matrix and vector |

0:03:55 | multiplications and |

0:03:57 | the nice thing is that the matrices |

0:04:00 | can all be precomputed as you can see from this set of equations here |

0:04:05 | at the bottom |

0:04:06 | and all these sigma ac sigma tot |

0:04:11 | and so on can be precomputed from the |

0:04:16 | p lda model parameters and that explains why the |

0:04:20 | p lda scoring is very fast |
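For intuition, here is a minimal numpy sketch of this style of precomputed scoring. The closed-form log-likelihood ratio for gaussian plda can be arranged so that only two symmetric matrices (named `P` and `Q` here, my names rather than the slide's, both precomputable from the model) ever touch the two i-vectors; the model parameters are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
D, q = 6, 3
V = rng.standard_normal((D, q))          # speaker loading matrix (toy)
Sigma = 0.5 * np.eye(D)                  # within-speaker residual covariance

Sigma_ac  = V @ V.T                      # across-speaker covariance
Sigma_tot = Sigma_ac + Sigma             # total covariance of an i-vector

# Everything below depends on model parameters only, so it is precomputable
tot_inv = np.linalg.inv(Sigma_tot)
K = np.linalg.inv(Sigma_tot - Sigma_ac @ tot_inv @ Sigma_ac)
Q = tot_inv - K                          # quadratic term for each i-vector
P = tot_inv @ Sigma_ac @ K               # cross term between the two i-vectors

def plda_llr(ws, wt):
    """Twice the verification log-likelihood ratio, up to a constant:
    scoring needs only matrix-vector products."""
    return ws @ Q @ ws + wt @ Q @ wt + 2.0 * ws @ P @ wt

ws, wt = rng.standard_normal(D), rng.standard_normal(D)
```

Because `P` and `Q` never change between trials, the per-trial cost is a handful of matrix-vector products, which is the point being made on the slide.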

0:04:26 | but one problem of this conventional i-vector p lda framework |

0:04:34 | is that it does not have the ability to represent the reliability of the i-vector |

0:04:39 | so whether the utterance is very long or very short we still use a |

0:04:45 | low dimensional i-vector to represent |

0:04:48 | the speaker characteristics |

0:04:50 | of the whole utterance |

0:04:52 | so this poses a problem for short utterance speaker verification |

0:04:57 | it is not a problem for very long utterances say when we have three minutes |

0:05:02 | or more of speech |

0:05:04 | but if the utterance is only about ten seconds or three seconds |

0:05:09 | then the variability or uncertainty of the i-vector will be so high |

0:05:14 | that the p lda score will favor the same speaker hypothesis |

0:05:19 | even if the test utterance is given by an impostor |

0:05:24 | and the reason is that if the utterance is very short we will not have enough |

0:05:30 | acoustic vectors for the map estimation or in other words we do not have enough acoustic vectors |

0:05:36 | to compute the posterior mean of the latent factor in the factor analysis model |

0:05:44 | so in the idea of uncertainty propagation |

0:05:48 | we not only extract i-vectors but also quantify the |

0:05:52 | posterior covariance matrix |

0:05:55 | so this diagram here illustrates the idea |

0:06:00 | this gaussian represents the posterior density of the latent factor |

0:06:05 | and the i-vector is its mean so it is a point estimate |

0:06:12 | and this equation shows the procedure for computing it |

0:06:16 | okay so t c is the c-th partition of the total variability matrix but |

0:06:22 | as you can see |

0:06:24 | if the variance of this gaussian is very large |

0:06:28 | then the point estimate will not be very accurate |

0:06:32 | and this happens when the utterance is very short and the reason is |

0:06:36 | if the utterance is very short then n c which is the zeroth order sufficient statistic |

0:06:42 | will be very small |

0:06:43 | so this term will be very small and the whole posterior covariance matrix l inverse will be |

0:06:48 | very big |

0:06:49 | and that means these variances will be large and as a result the point |

0:06:54 | estimate might not be very reliable |

0:06:57 | so |

0:07:00 | that's why in two thousand and thirteen |

0:07:02 | kenny proposed the p lda with uncertainty propagation |

0:07:06 | so the idea is that in addition to extracting the i-vector we also extract the posterior covariance |

0:07:11 | matrix of |

0:07:14 | the latent factor |

0:07:15 | and that represents the uncertainty of the i-vector |

0:07:19 | and with some preprocessing as i have mentioned because we want to use the |

0:07:24 | gaussian p lda |

0:07:27 | as the final stage of the modeling for the scoring |

0:07:32 | therefore we also need to preprocess this |

0:07:35 | matrix |

0:07:38 | which is the preprocessed version of the posterior covariance matrix and after |

0:07:45 | that we can do the p lda modeling now how does |

0:07:49 | the uncertainty |

0:07:52 | propagation come about the propagation comes from this generative model |

0:07:56 | in the generative model we have v h i plus u i j z i j and as you can see |

0:08:02 | this u |

0:08:03 | is unlike that in the conventional |

0:08:06 | p lda model with the eigenchannels |

0:08:09 | so this is like my eigenchannel but instead this u depends on |

0:08:13 | the session |

0:08:15 | so it depends on the i-th |

0:08:17 | speaker |

0:08:18 | and the j-th session of the i-th speaker |

0:08:20 | and as a result the z also depends on the i and j |
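In symbols, the uncertainty-propagation model being described is (again with `m` and `Sigma` supplied by me for readability; $\mathbf{U}_{ij}$ is the session-dependent loading matrix):

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}_{ij}\mathbf{z}_{ij} + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{z}_{ij} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
```

Compared with the eigenchannel model, the only structural change is that the channel loading matrix carries the session indices $i,j$, because it is derived from that session's posterior covariance.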

0:08:26 | now the trouble with this is |

0:08:28 | for every test utterance |

0:08:30 | we also need to compute the u i j so unlike the eigenchannel |

0:08:34 | situation |

0:08:35 | where we only need to precompute the matrices and make use of them during scoring now in |

0:08:40 | uncertainty propagation this u i j has to be computed during |

0:08:45 | the scoring time |

0:08:47 | because it is session dependent |

0:08:49 | and to compute this u i j we perform a cholesky decomposition |

0:08:53 | of the posterior covariance matrix |

0:08:56 | and that's why we have this intra speaker covariance matrix like this |
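A minimal numpy sketch of this step, with a synthetic positive-definite matrix standing in for the posterior covariance that the i-vector extractor would produce:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
Sigma_wc = 0.4 * np.eye(D)        # within-speaker residual covariance (toy)

# Session-dependent posterior covariance from the i-vector extractor;
# here a synthetic positive-definite matrix stands in for it.
A = rng.standard_normal((D, D))
post_cov = A @ A.T / D + 1e-3 * np.eye(D)

# Cholesky decomposition: U is the session-dependent loading matrix, so
# U z with z ~ N(0, I) has exactly the posterior covariance U U'.
U = np.linalg.cholesky(post_cov)

# Expanded intra-speaker covariance seen by the PLDA back-end this session
Sigma_session = Sigma_wc + U @ U.T
```

The cost problem the talk goes on to describe is visible here: `U` (and everything downstream of it) must be recomputed for every session, because `post_cov` changes with each utterance.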

0:09:05 | so now finally during the scoring |

0:09:08 | with the p lda u p we have this equation |

0:09:13 | which is very similar to the equation |

0:09:18 | of the conventional p lda right as you can see this is just |

0:09:23 | matrix and vector multiplication |

0:09:26 | but the difference is |

0:09:28 | this time the a b c and d all depend on the test utterance |

0:09:33 | as you can see from this set of equations the a s t b s |

0:09:38 | t c s t and the d s t all depend on the test utterance |

0:09:42 | and that means they cannot be precomputed |

0:09:46 | so only a very small number of matrices can be precomputed this set has to be |

0:09:51 | computed during scoring time |

0:09:55 | and this set can be computed before the scoring time |

0:09:59 | so we will not save much computation because we must compute the covariance matrices |

0:10:07 | so this slide summarizes the computation that needs to be performed for the conventional p |

0:10:13 | lda we almost have nothing |

0:10:15 | to compute all we need to compute is this |

0:10:19 | matrix and vector multiplication |

0:10:22 | but for the p lda with u p |

0:10:24 | we have to compute all these sets of matrices on the right |

0:10:28 | so as you can see this will increase the computational complexity a lot |

0:10:34 | and it will also increase the memory consumption |

0:10:39 | because we need to store |

0:10:43 | these matrices a b c d for every target speaker |

0:10:47 | so we propose a way of speeding up the computation and |

0:10:53 | also |

0:10:54 | a way to reduce the memory consumption |

0:10:57 | the whole idea comes from |

0:10:59 | this equation |

0:11:01 | okay from this equation the posterior covariance matrix only depends on |

0:11:06 | n c |

0:11:09 | and at testing time n c will be the zeroth order sufficient |

0:11:12 | statistics of the test utterance |

0:11:16 | okay so if two i-vectors or two utterances have similar durations |

0:11:21 | we assume that |

0:11:25 | their posterior covariance matrices are similar because as you can see the covariance depends |

0:11:30 | on the mfcc or the acoustic vectors only through |

0:11:34 | the zeroth order sufficient statistics |
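This is easy to check in code: in a factor-analysis extractor the posterior covariance is a function of the zeroth-order statistics alone, so the first-order stats never enter. A toy numpy sketch (synthetic `T` and precisions) that contrasts a short and a long utterance:

```python
import numpy as np

rng = np.random.default_rng(5)
C, F, D = 4, 8, 5
T = 0.1 * rng.standard_normal((C * F, D))    # total variability matrix (toy)
Sigma_inv = np.ones(C * F)                   # diagonal UBM precisions (toy)

def posterior_cov(N):
    """Posterior covariance of the latent factor: it depends on the
    zeroth-order statistics N only -- first-order stats never appear."""
    L = np.eye(D)
    for c in range(C):
        Tc = T[c * F:(c + 1) * F]
        L += N[c] * Tc.T @ (Sigma_inv[c * F:(c + 1) * F, None] * Tc)
    return np.linalg.inv(L)

short_cov = posterior_cov(np.full(C, 2.0))     # a few frames per Gaussian
long_cov  = posterior_cov(np.full(C, 500.0))   # a long utterance
```

Since duration roughly fixes the total frame count, two utterances of similar duration give similar `N` and hence similar posterior covariances, which is exactly the hypothesis behind the grouping.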

0:11:36 | so having this hypothesis |

0:11:39 | we |

0:11:42 | propose |

0:11:43 | to group the i-vectors according to their reliability |

0:11:50 | and we use a scalar to define the reliability |

0:11:55 | basically for each group the i-vector reliability is modeled by one representative covariance |

0:12:00 | matrix |

0:12:01 | and we obtain the posterior covariance matrices from the development data |

0:12:06 | okay so here |

0:12:08 | the index k stands for the |

0:12:12 | group |

0:12:13 | and this u k |

0:12:16 | is independent of the session |

0:12:18 | if you look |

0:12:19 | at the bottom of the slide |

0:12:22 | we have u i j which depends on the |

0:12:26 | session |

0:12:27 | but now if you look at this here |

0:12:30 | we have successfully |

0:12:33 | made the u i j which is session dependent |

0:12:37 | become session independent |

0:12:40 | now with this u k being session independent we can |

0:12:45 | do a lot of precomputation |

0:12:48 | so the first step |

0:12:50 | is to |

0:13:00 | group the i-vectors |

0:13:01 | using these three approaches one is based on the utterance duration which is intuitive |

0:13:07 | we group the i-vectors based on the |

0:13:09 | utterance duration because we believe that duration is related to |

0:13:14 | the uncertainty or related to the reliability of the i-vector |

0:13:19 | we have also tried using the mean of the diagonal elements of the posterior covariance matrix |

0:13:23 | and this is a nice thing to do because |

0:13:27 | the mean of the diagonal elements is a scalar so grouping will become |

0:13:31 | very easy |

0:13:32 | okay and the last one we have tried is the largest eigenvalue of the posterior covariance |

0:13:37 | matrix |

0:13:38 | so this slide basically tells us how to perform the grouping if for example you |

0:13:43 | use the duration axis |

0:13:45 | then this end corresponds to extremely short utterances |

0:13:49 | this goes to medium length utterances |

0:13:52 | and the other end is the |

0:13:54 | long utterances and for each group we find one representative |

0:13:58 | okay from the k groups |

0:14:00 | that represents the whole group |

0:14:02 | so this |

0:14:03 | u one or u one tilde will represent |

0:14:06 | the posterior covariance matrix of the extremely short utterances |

0:14:12 | and u k or u k tilde corresponds to the posterior covariance matrix |

0:14:17 | or the uncertainty of the very long utterances |
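A toy numpy sketch of this grouping step using the duration criterion. All data here is synthetic, and the covariance model `I/(1+d)` is only an illustration of "longer utterance, smaller uncertainty", not the real extractor:

```python
import numpy as np

rng = np.random.default_rng(6)
D, K = 6, 4

# Development data: durations (seconds) and a posterior covariance per
# utterance that shrinks as the utterance gets longer (synthetic model).
durations = rng.uniform(2.0, 60.0, size=200)
post_covs = np.stack([np.eye(D) / (1.0 + d) for d in durations])

# Group 0 holds the shortest (least reliable) utterances, group K-1 the longest
edges = np.quantile(durations, np.linspace(0.0, 1.0, K + 1))
group = np.clip(np.searchsorted(edges, durations, side="right") - 1, 0, K - 1)

# One representative covariance per group; its Cholesky factor U_k is what
# replaces the session-dependent U_ij at scoring time.
U = [np.linalg.cholesky(post_covs[group == k].mean(axis=0)) for k in range(K)]
```

The mean-of-diagonal and largest-eigenvalue criteria work the same way; only the scalar fed to the quantizer changes.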

0:14:22 | so now all we need to do |

0:14:25 | during the scoring time is to find |

0:14:28 | the reliability |

0:14:30 | so by using the three approaches to quantify the reliability that i just gave |

0:14:36 | we will be able to find the group indices m and n so that we can turn |

0:14:42 | all the session dependent |

0:14:45 | matrices into |

0:14:47 | a m n b m n |

0:14:49 | c m n and d m n |

0:14:51 | now compared with the conventional original p lda u p |

0:14:56 | these a s t and so on are all session dependent because |

0:15:00 | t is the test utterance |

0:15:03 | so t stands for the test utterance and s stands for the target speaker |

0:15:08 | utterance |

0:15:09 | but now these a m n |

0:15:12 | b m n c m n and d m n have all been precomputed already |

0:15:17 | using the development data |

0:15:20 | so as you can see the reason |

0:15:22 | for this computation saving is that we are |

0:15:25 | using the precomputed matrices rather than computing the covariance matrices on the fly |

0:15:32 | so again in this slide there is some more analysis of |

0:15:35 | the computation saving that we can achieve |

0:15:38 | so this is the p lda with u p using our proposed fast |

0:15:43 | scoring okay so we only need to determine the group identities m and n |

0:15:49 | but for the conventional p lda with |

0:15:51 | uncertainty propagation we have to compute all these matrices during the scoring |
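To make the saving concrete, here is a toy numpy sketch of the lookup-style scoring. The dictionaries of per-group-pair matrices are random placeholders for what would be precomputed offline from the plda parameters and the K representative covariances; the names are mine:

```python
import numpy as np

rng = np.random.default_rng(7)
D, K = 6, 3

def sym(m):
    return (m + m.T) / 2.0

# Hypothetical precomputed score matrices, one set per pair of reliability
# groups (m = target utterance's group, n = test utterance's group).
# Random placeholders stand in for the offline-built matrices.
A  = {(m, n): sym(rng.standard_normal((D, D))) for m in range(K) for n in range(K)}
B  = {(m, n): sym(rng.standard_normal((D, D))) for m in range(K) for n in range(K)}
C  = {(m, n): rng.standard_normal((D, D)) for m in range(K) for n in range(K)}
k0 = {(m, n): rng.standard_normal() for m in range(K) for n in range(K)}

def fast_score(ws, m, wt, n):
    """Scoring collapses to a table lookup plus matrix-vector products."""
    key = (m, n)
    return ws @ A[key] @ ws + wt @ B[key] @ wt + 2.0 * ws @ C[key] @ wt + k0[key]

ws, wt = rng.standard_normal(D), rng.standard_normal(D)
score = fast_score(ws, 0, wt, 2)
```

At test time the only session-dependent work left is determining `m` and `n` from the reliability scalar, which is exactly the point of the slide.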

0:15:58 | so we performed experiments on |

0:16:02 | sre 2012 common condition two |

0:16:06 | using the classical sixty dimensional mfcc vectors one thousand and twenty four gaussians |

0:16:11 | and five hundred total factors in the total variability matrix |

0:16:16 | and we tried these three different ways of grouping the i-vectors |

0:16:21 | that is of quantizing the |

0:16:25 | posterior covariance matrices |

0:16:28 | okay so this diagram summarizes the results as you can see the conventional p lda is ultra fast |

0:16:34 | the vertical axis here represents the |

0:16:38 | scoring time or in fact the total time for the whole evaluation |

0:16:42 | on this common condition two |

0:16:47 | but unfortunately the performance is not very good |

0:16:50 | well the reason that the performance is not very good is because |

0:16:55 | we use utterances of arbitrary durations so we did some segmentation |

0:17:01 | cutting the utterances into short medium and long |

0:17:12 | that is we do not use the original data for training and testing but instead |

0:17:16 | we make some of the utterances very short some of the utterances |

0:17:19 | medium length and some of the utterances very long so we create a situation |

0:17:23 | with a big variety |

0:17:25 | of durations in both the training and test utterances |

0:17:29 | now the p lda with u p performed extremely well |

0:17:33 | unfortunately the scoring time is also very high |

0:17:37 | and with our fast scoring approach we successfully reduced the scoring time from here |

0:17:43 | to here |

0:17:45 | with only a very small increase in the eer |

0:17:48 | if we use more groups okay so if we make |

0:17:54 | the number of groups larger we can make the eer almost the same |

0:17:59 | as the one achieved by the |

0:18:01 | p lda with uncertainty propagation so what happens is |

0:18:05 | we successfully reduce the computation time without increasing |

0:18:10 | the eer |

0:18:11 | and the same situation occurs in the mindcf the details are |

0:18:17 | in the paper |

0:18:18 | and also we only show system two here because the performance of system two and system |

0:18:23 | three are very similar so i only show system two |

0:18:29 | and system one is based on utterance duration we want to show this because it is |

0:18:33 | the most intuitive way of doing |

0:18:36 | the grouping |

0:18:39 | and the memory consumption has a similar trend |

0:18:44 | the p lda uses a very small amount of memory |

0:18:49 | and the p lda with u p uses a much larger amount of memory |

0:18:54 | because we need to store all of the posterior covariance matrices of the utterances |

0:19:01 | and we are talking about |

0:19:03 | gigabytes here |

0:19:08 | and with system one we reduce the memory consumption almost by half |

0:19:14 | okay and system two and three have about the same |

0:19:19 | memory consumption |

0:19:21 | and if we increase the number of groups |

0:19:24 | obviously the memory requirement will increase |

0:19:27 | but even if the |

0:19:29 | number of groups is as large as forty five |

0:19:33 | it still uses less memory than |

0:19:36 | the original p lda with uncertainty propagation |

0:19:41 | so this is the det curve and |

0:19:45 | as you can see |

0:19:51 | this curve here corresponds to the conventional p lda whose performance is the poorest |

0:19:56 | and all the other systems one two three and also the one with u p are |

0:20:02 | much better |

0:20:04 | because with the uncertainty propagation you can model the utterances of arbitrary durations |

0:20:11 | and what we have observed is that system one is slightly poorer |

0:20:16 | than systems two and three |

0:20:19 | but system one has the largest |

0:20:22 | reduction in terms of computation time |

0:20:27 | so in conclusion |

0:20:29 | we proposed a very fast scoring method for p lda with uncertainty propagation |

0:20:35 | and the whole idea is to group the |

0:20:39 | posterior covariance matrices |

0:20:42 | or the loading matrices representing the reliability of the i-vectors |

0:20:46 | and to precompute |

0:20:47 | all of them |

0:20:49 | as much as possible and in order to do this precomputation we need |

0:20:55 | to do the grouping first during the development time |

0:20:58 | and we found three ways of performing the grouping |

0:21:02 | and all these groupings |

0:21:04 | are based on some scalar just like the k-means algorithm where you need |

0:21:10 | to use the distance |

0:21:12 | as a |

0:21:14 | criterion for finding |

0:21:18 | the clusters so what do we mean by the scalar here we use the |

0:21:26 | mean of the diagonal elements of the posterior covariance matrix |

0:21:32 | or the maximum |

0:21:35 | eigenvalue of the posterior covariance matrix or the duration as the |

0:21:44 | criterion for the grouping |

0:21:47 | and all of these are computationally light |

0:21:53 | and as a result |

0:21:54 | the proposed method performs similarly to the standard u p but |

0:22:00 | requires only two point three percent of the scoring time |

0:22:03 | thank you |

0:22:12 | we have time for questions yes |

0:22:17 | we do not truncate them randomly but instead for every one second interval |

0:22:23 | we have a duration |

0:22:27 | so three seconds four seconds five seconds and we randomly extract segments from the |

0:22:33 | speech data |

0:22:36 | also where we extract from is chosen randomly |

0:22:40 | so the durations range between three seconds and as much |

0:22:45 | as the whole test utterance so some utterances will belong to the group of |

0:22:51 | five seconds and some to another group therefore for different utterances we will have a different grouping |

0:22:58 | my experience i wonder if you could just comment on this my experience with this |

0:23:02 | with this method |

0:23:04 | is that i found that it works well in situations other than the specific |

0:23:13 | problem for which it was intended |

0:23:15 | okay if there is a gross mismatch between enrollment and test such as telephone |

0:23:22 | enrollment and microphone channels |

0:23:24 | or a huge mismatch in the duration |

0:23:28 | then i found that this works well but i was a bit disappointed with the |

0:23:33 | with the performance on the specific problem that you're addressing here which is the problem of just |

0:23:39 | the duration variability |

0:23:43 | in fact in our experiments we also have duration mismatch because |

0:23:50 | we deliberately generated duration mismatch in order to create a situation having |

0:23:56 | utterances of arbitrary durations therefore the test utterance and the target speaker utterance will |

0:24:03 | have a different k |

0:24:05 | of course if both of them are very short |

0:24:10 | they will use u one or u two |

0:24:14 | but if the utterances have various |

0:24:18 | durations then |

0:24:21 | they will use different groups |

0:24:22 | so because everything is random there will be a lot of utterances which are very short |

0:24:28 | utterances which are medium and also utterances which are very long so there will be a |

0:24:33 | duration mismatch |

0:24:35 | between the enrollment and the test |

0:24:40 | i'd be very interested to see what happens in the in the upcoming |

0:24:44 | nist evaluation where this problem is going to be in the |

0:24:49 | forefront that would be excellent thank you |

0:24:53 | as you know with the truncation the durations will be truncated to between ten seconds |

0:24:59 | and sixty seconds |

0:25:02 | so i think we're all looking at up to five percent equal error rate you know |

0:25:08 | before we even move to the |

0:25:12 | more difficult |

0:25:14 | verification trials |

0:25:17 | okay then let's thank the speaker |