0:00:39 | i will present |

0:00:43 | some techniques for extracting i-vectors |

0:00:46 | efficiently |

0:00:49 | we went looking for some way to address |

0:00:52 | the memory occupation of the i-vector extractor |

0:00:58 | so state-of-the-art technology nowadays is based on i-vectors which give |

0:01:04 | very good accuracy |

0:01:08 | but the computation of i-vectors can be quite demanding in terms of memory and |

0:01:13 | time |

0:01:15 | so while some solutions |

0:01:19 | have been proposed for i-vector extraction with low memory requirements namely |

0:01:24 | the diagonalized approximation to i-vector extraction |

0:01:31 | these were also shown to |

0:01:38 | suffer some degradation of accuracy so we |

0:01:42 | were looking for a solution which does not incur such degradation |

0:01:46 | but still allows us |

0:01:49 | to greatly reduce the amount of memory required to store the extractor |

0:01:54 | so here is an outline of the presentation |

0:01:59 | first we recall the original bayesian derivation of i-vector extraction which |

0:02:04 | you may have seen in the previous talks |

0:02:07 | then we present our variational bayes and conjugate gradient approaches for i-vector extraction and finally present some experimental |

0:02:14 | results for these techniques |

0:02:17 | so |

0:02:20 | i guess everybody here knows what i-vectors are but here is a brief introduction |

0:02:25 | they are low dimensional informative representations of each utterance which are |

0:02:30 | derived from a generative model |

0:02:34 | so the most widely used |

0:02:36 | i-vector framework |

0:02:39 | assumes that |

0:02:41 | most of the speaker and channel variability lies in a small subspace of the supervector space |

0:02:47 | then we assume a gaussian prior for the latent variable representing this variability |

0:02:54 | and |

0:02:55 | approximating the data likelihood by means of the zero and first order statistics we can compute the posterior |

0:03:00 | of this latent variable |

0:03:02 | and then we compute the i-vector as the maximum a posteriori estimate of the latent variable |

0:03:11 | we can show that the posterior |

0:03:13 | is gaussian and these expressions correspond to the posterior covariance |

0:03:17 | and to the i-vector |
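The exact extraction just described can be sketched as follows. This is a minimal NumPy illustration with tiny placeholder dimensions; the names `T`, `Sigma_inv`, `N`, `F` are assumptions for the eigenvoice matrix, inverse UBM covariances, and zero/first-order statistics, not notation taken from the slides:

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N, F):
    """Exact MAP i-vector for one utterance.
    T: (C, F_dim, M) eigenvoice matrix, one block per Gaussian
    Sigma_inv: (C, F_dim, F_dim) inverse UBM covariances
    N: (C,) zero-order statistics
    F: (C, F_dim) centered first-order statistics"""
    C, F_dim, M = T.shape
    L = np.eye(M)                    # posterior precision, starting from the prior
    b = np.zeros(M)
    for c in range(C):
        TtS = T[c].T @ Sigma_inv[c]  # M x F_dim
        L += N[c] * (TtS @ T[c])     # accumulate N_c * T_c' Sigma_c^-1 T_c
        b += TtS @ F[c]              # project the statistics onto the subspace
    w = np.linalg.solve(L, b)        # posterior mean = MAP i-vector
    return w, L
```

Note that the per-utterance precision `L` depends on the zero-order statistics, so it must be rebuilt (or solved implicitly) for every utterance; that is the cost the talk goes on to analyze.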

0:03:19 | so as you can see here |

0:03:22 | computing the i-vector requires computing this precision matrix |

0:03:26 | which entails a multiplication of the inverse of this matrix times the projected statistics |

0:03:32 | that is |

0:03:33 | the inversion of |

0:03:38 | a matrix |

0:03:41 | with a dimensionality which is the i-vector dimensionality |

0:03:47 | so |

0:03:48 | we can see |

0:03:50 | how the different extraction techniques can be compared |

0:03:55 | here C represents the number of gaussians and F the feature dimensionality |

0:04:02 | and M is the i-vector dimensionality |

0:04:05 | so if we don't precompute anything we have a |

0:04:09 | complexity which is |

0:04:11 | quadratic in the i-vector dimensionality |

0:04:15 | and linear in the number of gaussians and in the dimensionality of the features |

0:04:21 | we can reduce the time complexity by precomputing and storing these matrices |

0:04:28 | but in this case we have a severe memory constraint which is again quadratic in |

0:04:33 | the i-vector dimensionality and proportional to the number of gaussians |

0:04:38 | with |

0:04:39 | typical values like two thousand forty eight gaussians in the ubm as used in this work this is |

0:04:46 | easily the most expensive |

0:04:47 | part in terms of memory of an i-vector extractor |
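As a rough sanity check on that claim, the memory for the precomputed per-Gaussian matrices grows as C times M squared. The snippet below assumes the 2048 Gaussians stated in the talk, an illustrative i-vector dimension of 400 (not a value confirmed by the transcript), and 4-byte floats:

```python
# Rough memory footprint of precomputing T_c' Sigma_c^-1 T_c for every Gaussian.
C = 2048                 # UBM components, as stated in the talk
M = 400                  # i-vector dimension: an illustrative assumption
bytes_per_float = 4      # single precision

full = C * M * M * bytes_per_float   # one dense M x M matrix per Gaussian
print(f"{full / 2**30:.2f} GiB")     # about 1.22 GiB; symmetric storage halves it
```

Even with symmetric storage this dwarfs the rest of the extractor, which is the motivation for the approximations that follow.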

0:04:52 | so in one of the latest works an optimization based on an eigen |

0:04:58 | decomposition of the per gaussian matrices was proposed |

0:05:02 | essentially i forgot to mention that we can obtain the same i-vectors |

0:05:10 | from this form just by performing a normalization of the first order statistics and |

0:05:15 | in this case of the eigenvoice matrix |

0:05:18 | then we can assume that these matrices are simultaneously diagonalizable by some matrix |

0:05:23 | Q and then we can compute an approximation of the posterior covariance which is |

0:05:30 | diagonal so that |

0:05:32 | i-vector extraction |

0:05:34 | can be performed in a much faster way with very limited additional memory requirements |

0:05:40 | however this |

0:05:43 | approximation can cause a degradation of recognition accuracy |

0:05:47 | so we wanted to do better in terms of accuracy |

0:05:52 | so |

0:05:53 | as we said the problem is the computation of the covariance matrix |

0:05:59 | the problem is that the covariance matrix is not diagonal |

0:06:03 | if it were |

0:06:05 | this would mean that the i-vector components would be uncorrelated |

0:06:10 | and the posterior would factorize |

0:06:14 | so even though the exact posterior cannot be factorized over the different components we |

0:06:19 | look for an approximation of the posterior which factorizes over subsets of the i-vector |

0:06:25 | components |

0:06:27 | so we partition the i-vector components into disjoint sets |

0:06:32 | and we assume that the |

0:06:33 | posterior can be approximated by |

0:06:36 | a distribution which factorizes over these sets |

0:06:41 | the variational bayes framework provides a |

0:06:44 | way to estimate this approximate posterior |

0:06:48 | by minimizing the kl divergence between the original posterior and this approximation |

0:06:55 | so |

0:06:58 | here i need to introduce some notation |

0:07:00 | namely we |

0:07:03 | denote the |

0:07:05 | subset of the eigenvoices associated to each block |

0:07:09 | of the i-vector components |

0:07:12 | each V i is associated with a block w i of i-vector components |

0:07:18 | and these are the complements of those |

0:07:20 | subsets so that we can express |

0:07:24 | the factorization in this way |

0:07:26 | so if we apply the standard update for each |

0:07:31 | factor of the approximate posterior |

0:07:35 | its distribution is again gaussian with an expression which is very similar to the |

0:07:41 | original i-vector expression |

0:07:43 | the difference is that the precision matrix here is computed using the eigenvoices relative |

0:07:49 | to this subset |

0:07:51 | and for the mean of the posterior we are essentially centering the statistics around a |

0:07:58 | slightly different ubm |

0:08:00 | essentially we |

0:08:02 | can say that |

0:08:04 | if we assume that the other components of the i-vector are fixed then we |

0:08:10 | can absorb their contribution into the statistics of this new ubm |

0:08:18 | this also allows us to see what the complexity of this approach would be |

0:08:24 | but we do have to take |

0:08:27 | care in implementing this technique because |

0:08:32 | if we just recompute these centered statistics every time with a block |

0:08:37 | of size one |

0:08:39 | the complexity is again quadratic in the i-vector dimensionality because every time we would be re |

0:08:44 | centering the full set of statistics |

0:08:47 | so what we need is |

0:08:50 | to keep a supervector of first order statistics which is always kept centered around |

0:08:57 | the current i-vector estimate |

0:09:00 | when we update a block the new mean is computed by removing the contribution |

0:09:07 | of only those components that we are estimating and then after we update the |

0:09:13 | mean we update the vector of first order statistics so that it stays centered |

0:09:19 | around the current i-vector estimate |

0:09:22 | so this way if we set aside the contribution of computing the precision matrices the |

0:09:28 | complexity of this approach is proportional to the dimensionality of the i-vector and to the number of |

0:09:35 | iterations that we need to perform |

0:09:37 | to compute the i-vector |
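At the linear-algebra level, the block-wise variational update just described amounts to a block Gauss-Seidel sweep on the system L w = b, where L is the full posterior precision and b the projected statistics. The sketch below uses an explicit L for clarity rather than the statistics re-centering trick the speaker describes, so it illustrates the iteration, not the memory savings:

```python
import numpy as np

def vb_ivector(L, b, block, n_iter=50):
    """Block Gauss-Seidel sweep on L w = b, mirroring the block-factorized
    variational update (L: (M, M) posterior precision, b: (M,) projected
    statistics, block: size of the disjoint component subsets)."""
    M = len(b)
    w = np.zeros(M)
    for _ in range(n_iter):
        for start in range(0, M, block):
            i = slice(start, min(start + block, M))
            # right-hand side re-centered around the current estimate of the
            # other blocks: remove every contribution except this block's own
            r = b[i] - L[i] @ w + L[i, i] @ w[i]
            w[i] = np.linalg.solve(L[i, i], r)  # solve the small block system
    return w
```

With `block=1` this reduces to scalar Gauss-Seidel; larger blocks solve bigger sub-systems per step, which matches the speed/memory trade-off discussed next.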

0:09:41 | you can see here the similarity of this form with the original i-vector |

0:09:46 | formulation the covariance matrices are essentially the diagonal blocks of the original precision matrix |

0:09:52 | and we can adopt |

0:09:54 | again |

0:09:55 | two different techniques to compute them |

0:09:59 | we can perform the computation of these covariance matrices every time |

0:10:04 | or we can store the diagonal blocks of the precision matrix and in this |

0:10:09 | case we get |

0:10:11 | faster extraction time but slightly higher memory and the memory requirements depend on the |

0:10:16 | size we choose for the blocks |

0:10:19 | so essentially we can show that this variational bayes approach |

0:10:26 | implements a gauss seidel approach to the solution of this linear system |

0:10:32 | and we also investigated different |

0:10:35 | techniques for |

0:10:37 | solving linear systems namely the jacobi method and the conjugate gradient method |

0:10:43 | what we found out is that the jacobi method is very similar to this approach |

0:10:47 | but instead of updating the |

0:10:50 | i-vector after each block the i-vector is updated only after all components have |

0:10:55 | been estimated |

0:10:56 | and in our experience this causes slightly slower |

0:11:02 | convergence rates |

0:11:06 | then we analyzed the conjugate gradient method |

0:11:09 | what is nice about conjugate gradient is that we don't need to store |

0:11:16 | the precision matrix here |

0:11:19 | in fact we don't even need to compute it explicitly because we |

0:11:23 | just need the product of this matrix times a generic vector which is |

0:11:27 | what the conjugate gradient algorithm requires |

0:11:31 | so if we write the computation of this |

0:11:34 | product in this way we can see that it |

0:11:38 | can be evaluated with a cost which is linear |

0:11:41 | in the number of components of the |

0:11:46 | ubm |

0:11:47 | the number of features and the dimensionality of the i-vector |

0:11:50 | so we have a complexity which is the same as the variational bayes approach |
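The matrix-free product the speaker refers to can be sketched like this: conjugate gradient only ever needs products L v, and each product can be accumulated Gaussian by Gaussian without forming the M by M precision matrix. Array names are illustrative placeholders, not notation from the slides:

```python
import numpy as np

def precision_matvec(v, T, Sigma_inv, N):
    """L v = v + sum_c N_c T_c' Sigma_c^-1 (T_c v), without building L."""
    out = v.copy()                       # identity term from the prior
    for c in range(len(N)):
        out += N[c] * (T[c].T @ (Sigma_inv[c] @ (T[c] @ v)))
    return out

def cg_ivector(T, Sigma_inv, N, b, n_iter=50, tol=1e-10):
    """Conjugate gradient solve of L w = b using the implicit product."""
    w = np.zeros_like(b)
    r = b - precision_matvec(w, T, Sigma_inv, N)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Lp = precision_matvec(p, T, Sigma_inv, N)
        alpha = rs / (p @ Lp)
        w += alpha * p
        r -= alpha * Lp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:        # stop on the residual norm, as in the talk
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w
```

Each matrix-vector product costs O(C F M), so beyond the statistics themselves the method needs essentially no extra memory, matching the point made next.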

0:11:57 | so |

0:11:59 | what is also nice about this technique is that it does not require any kind |

0:12:03 | of additional memory |

0:12:05 | and as for the variational bayes approach we can use this technique with a |

0:12:11 | full covariance ubm if we pre whiten the statistics and the eigenvoice matrix with the |

0:12:18 | ubm covariances once |

0:12:21 | so |

0:12:22 | now i will show you some results on the female dataset extended |

0:12:28 | telephone condition |

0:12:33 | our setup uses sixty dimensional features and a ubm with |

0:12:37 | two thousand forty eight components |

0:12:44 | we use |

0:12:44 | a plda classifier on length normalized i-vectors |

0:12:51 | [partially inaudible] so let me show you |

0:12:55 | the results |

0:13:00 | so |

0:13:01 | before seeing the results let me just point out that |

0:13:05 | with enough iterations |

0:13:08 | these approaches converge to |

0:13:13 | the exact i-vector solution |

0:13:17 | so if we iterate long enough we can recover exactly the same |

0:13:21 | accuracy as the original classifier |

0:13:26 | so what is interesting is to |

0:13:28 | see if we can |

0:13:31 | stop earlier and still |

0:13:33 | achieve good results with a |

0:13:35 | faster extraction process of course |

0:13:42 | so here i am showing the results of the baseline system with exact i-vectors |

0:13:49 | the approximated i-vectors |

0:13:52 | and variational bayes with block |

0:13:54 | sizes one ten and twenty |

0:14:09 | convergence was evaluated using the norm of the difference between |

0:14:16 | two successive variational bayes i-vector estimates |

0:14:19 | so essentially this experiment is doing between two and three iterations per estimate and this one |

0:14:26 | between three and four |

0:14:29 | for conjugate gradient the stopping criterion is the two norm of the residual |

0:14:37 | so essentially what we see here is that |

0:14:40 | most of the baseline system performance is recovered |

0:14:45 | [inaudible] |

0:14:55 | so what is also interesting |

0:15:00 | is the extraction time required by these systems |

0:15:07 | and |

0:15:08 | the fastest system is the one which employs the diagonal approximation |

0:15:12 | and its time is comparable to the variational bayes approach with the largest block size |

0:15:22 | [inaudible] |

0:15:36 | however |

0:15:37 | note that |

0:15:38 | the diagonal approximation as we can see |

0:15:43 | loses some accuracy with respect to the quite high baseline |

0:15:45 | while with the variational bayes approach we can obtain accurate results in just |

0:15:52 | a few percent more time |

0:15:55 | compared to |

0:15:57 | the time required to compute the zero and first order statistics which is |

0:16:03 | the dominant cost |

0:16:07 | so in addition |

0:16:11 | here you can also see how the extraction time and memory depend on |

0:16:14 | the size of the blocks |

0:16:17 | and |

0:16:19 | we can see that using |

0:16:19 | bigger blocks of course increases the memory requirements |

0:16:26 | but in this case the extraction time decreases |

0:16:29 | significantly |

0:16:34 | and becomes comparable to that of the conjugate gradient method |

0:16:40 | while using smaller block sizes allows us to |

0:16:45 | keep the memory requirements low |

0:16:51 | so |

0:16:53 | to conclude |

0:16:56 | we have presented some new efficient and accurate i-vector extraction |

0:17:00 | techniques |

0:17:01 | which are based on variational bayes estimation |

0:17:05 | and on the conjugate gradient method |

0:17:11 | they approximate the accuracy of the baseline very closely |

0:17:16 | and they allow us to trade the accuracy of the i-vectors we obtain |

0:17:23 | against the time required to extract the i-vectors themselves |

0:17:31 | [remainder inaudible] |

0:17:56 | so let's thank the speaker |

0:17:59 | we have |

0:18:00 | a few minutes for questions |

0:18:10 | [inaudible question from the audience] |

0:18:17 | [inaudible answer] |

0:19:01 | [inaudible exchange] |

0:20:05 | yes as well |

0:20:08 | with the plda classifier |

0:20:10 | i would say that |

0:20:13 | the classifier |

0:20:16 | is very fast |

0:20:21 | [remainder inaudible] |

0:20:44 | any more questions |

0:20:47 | let me ask |

0:20:49 | have you seen any difference if before applying what you did you |

0:20:52 | try to |

0:20:54 | rotate |

0:20:55 | the space of eigenvoices so that |

0:20:57 | it would be orthogonal or do you start from the same matrix |

0:21:01 | [answer partially inaudible] |

0:21:30 | but have you in fact compared with what we did basically we tried to diagonalize the |

0:21:34 | eigenvoice matrix first and then exploit the diagonal structure |

0:21:43 | [answer inaudible] |

0:22:07 | so let's thank the speaker again |