| 0:00:15 | Given an i-vector w, the model assumes that it |
|---|
| 0:00:18 | can be decomposed into a |
|---|
| 0:00:20 | speaker part and a residual part, around the mean μ |
|---|
| 0:00:24 | with |
|---|
| 0:00:27 | the matrix Φ |
|---|
| 0:00:29 | whose columns constitute the basis of |
|---|
| 0:00:32 | the eigenvoice subspace |
|---|
| 0:00:34 | and it is weighted by y |
|---|
| 0:00:37 | which is called the speaker factor, normally distributed |
|---|
| 0:00:42 | with zero mean and identity covariance |
|---|
| 0:00:43 | and the residual ε |
|---|
| 0:00:45 | which is normally distributed |
|---|
| 0:00:48 | with a full covariance matrix Σ |
|---|
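In symbols, a plausible reconstruction of the generative model described above (notation w, μ, Φ, y, ε, Σ as used in the talk):

```latex
w = \mu + \Phi y + \varepsilon, \qquad
y \sim \mathcal{N}(0, I), \qquad
\varepsilon \sim \mathcal{N}(0, \Sigma)
```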
| 0:00:50 | and this is the most commonly used |
|---|
| 0:00:53 | PLDA system for i-vectors, in which the channel effect is kept in the residual |
|---|
| 0:00:59 | The decision score |
|---|
| 0:01:01 | proposed by Simon Prince |
|---|
| 0:01:03 | is a log-likelihood ratio |
|---|
| 0:01:06 | in which we can see that |
|---|
| 0:01:09 | computing the score depends only on the |
|---|
| 0:01:12 | mean μ, on the |
|---|
| 0:01:13 | matrix ΦΦᵀ of |
|---|
| 0:01:16 | speaker |
|---|
| 0:01:18 | variability |
|---|
| 0:01:19 | and on ΦΦᵀ plus Σ |
|---|
| 0:01:23 | which corresponds to the total variability |
|---|
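A minimal sketch of that score, assuming the usual two-covariance form of the PLDA verification log-likelihood ratio; mu, B and T are placeholders standing for the quantities just listed, to be estimated on a development corpus:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(w1, w2, mu, B, T):
    """Log-likelihood ratio between the same-speaker and
    different-speaker hypotheses for two i-vectors w1 and w2.

    B : speaker variability, Phi Phi^T
    T : total variability, Phi Phi^T + Sigma
    A sketch of the scoring described in the talk, not the authors' code.
    """
    # Same-speaker hypothesis: the pair is jointly Gaussian with
    # cross-covariance B between the two i-vectors.
    joint_mean = np.concatenate([mu, mu])
    joint_cov = np.block([[T, B], [B, T]])
    log_same = multivariate_normal.logpdf(
        np.concatenate([w1, w2]), joint_mean, joint_cov)
    # Different-speaker hypothesis: the two i-vectors are independent.
    log_diff = (multivariate_normal.logpdf(w1, mu, T)
                + multivariate_normal.logpdf(w2, mu, T))
    return log_same - log_diff

# Toy usage with a random low-rank speaker subspace.
rng = np.random.default_rng(0)
p, R = 20, 5
Phi = rng.standard_normal((p, R))
B = Phi @ Phi.T                 # speaker variability
T = B + np.eye(p)               # total variability (Sigma = I here)
mu = np.zeros(p)
print(plda_llr(rng.standard_normal(p), rng.standard_normal(p), mu, B, T))
```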
| 0:01:30 | The Gaussian PLDA modeling can provide good performance, but |
|---|
| 0:01:34 | it has been shown that the best performance is achieved only if a conditioning procedure |
|---|
| 0:01:40 | follows the extraction of the i-vectors; these conditioning procedures |
|---|
| 0:01:46 | are summarized here; the most commonly used |
|---|
| 0:01:51 | is a whitening, that is a standardization, followed by length normalization |
|---|
| 0:01:58 | The matrix |
|---|
| 0:01:59 | of variability chosen for the standardization |
|---|
| 0:02:02 | can be the total covariance matrix |
|---|
| 0:02:05 | or the within-speaker covariance matrix |
|---|
| 0:02:09 | and eventually we iterate this process |
|---|
| 0:02:13 | The parameters are computed on the i-vectors present in the training corpus and applied to the test |
|---|
| 0:02:18 | i-vectors |
|---|
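A sketch of this conditioning step, assuming whitening by the total covariance followed by length normalization; `dev` and `test` are hypothetical arrays of row i-vectors:

```python
import numpy as np

def fit_whitening(dev):
    """Estimate whitening parameters on the development i-vectors."""
    mu = dev.mean(axis=0)
    cov = np.cov(dev, rowvar=False)       # total covariance matrix
    # Inverse square root of the covariance via its eigendecomposition.
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return mu, W

def condition(w, mu, W):
    """Whiten, then project onto the unit sphere (length normalization)."""
    x = (w - mu) @ W.T
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical usage: parameters fitted on dev, applied to test i-vectors.
rng = np.random.default_rng(0)
dev, test = rng.standard_normal((500, 40)), rng.standard_normal((10, 40))
mu, W = fit_whitening(dev)
dev_c, test_c = condition(dev, mu, W), condition(test, mu, W)
```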
| 0:02:22 | The assumptions of the Gaussian PLDA are, firstly, the Gaussianity |
|---|
| 0:02:27 | then the linearity of eigenvoices; it means that |
|---|
| 0:02:31 | the speaker part can be constrained to a linear subspace |
|---|
| 0:02:36 | then the homoscedasticity of the residual |
|---|
| 0:02:39 | it means that the Gaussian PLDA model assumes |
|---|
| 0:02:42 | that the speaker classes |
|---|
| 0:02:46 | share the same statistics: channel effects can be modeled |
|---|
| 0:02:50 | in a speaker-independent way |
|---|
| 0:02:53 | so that all the distributions share the same covariance matrix |
|---|
| 0:02:58 | then the independence between the residual |
|---|
| 0:03:01 | and the speaker factor |
|---|
| 0:03:03 | and the equality of covariance |
|---|
| 0:03:06 | across classes; it means that the residuals, between the actual |
|---|
| 0:03:12 | i-vectors of a class |
|---|
| 0:03:13 | and the model parameter |
|---|
| 0:03:14 | computed |
|---|
| 0:03:16 | by the PLDA, are assumed to be uncorrelated |
|---|
| 0:03:20 | normally distributed, and not explained by |
|---|
| 0:03:24 | the specific sample |
|---|
| 0:03:27 | of the development corpus |
|---|
| 0:03:30 | drawn randomly |
|---|
| 0:03:31 | and that they do not vary with the effects being modeled |
|---|
| 0:03:38 | On the left, the graph |
|---|
| 0:03:41 | is a simple depiction of the PLDA model, with a speaker factor of one dimension |
|---|
| 0:03:47 | a one-dimensional subspace |
|---|
| 0:03:50 | where we assume |
|---|
| 0:03:52 | a standard normal prior for the speaker factor |
|---|
| 0:03:56 | and some classes with the same |
|---|
| 0:03:59 | variability matrix |
|---|
| 0:04:01 | Our remark is that i-vectors now lie on |
|---|
| 0:04:04 | a nonlinear and finite connected subset of the i-vector space |
|---|
| 0:04:10 | and so does the distribution of the i-vectors |
|---|
| 0:04:13 | which is referred to as a spherical distribution |
|---|
| 0:04:19 | We think that the assumption that there exists a unique, speaker-independent parameter |
|---|
| 0:04:24 | of within-speaker variability is questionable |
|---|
| 0:04:27 | that is, that channel effects can be modeled in a speaker-independent way |
|---|
| 0:04:31 | It is difficult to show that such an assumption is right or wrong |
|---|
| 0:04:37 | For example, if we find a correlation, a significant correlation, between |
|---|
| 0:04:42 | the residual |
|---|
| 0:04:44 | and the class parameter |
|---|
| 0:04:46 | this effect would dramatically invalidate the estimation of the random variables |
|---|
| 0:04:54 | First, we present the deterministic approach |
|---|
| 0:04:58 | Why a deterministic approach to compute the PLDA parameters fast? |
|---|
| 0:05:03 | Because, first, when we tried this |
|---|
| 0:05:07 | deterministic approach, we remarked that the other approaches were |
|---|
| 0:05:11 | not |
|---|
| 0:05:12 | more relevant, and sometimes not as well suited |
|---|
| 0:05:17 | especially if the EM-ML estimation is not optimal for the i-vector spherical distribution |
|---|
| 0:05:22 | Can we replace the sophistication of the expectation-maximization maximum-likelihood |
|---|
| 0:05:28 | estimation of |
|---|
| 0:05:30 | parameters |
|---|
| 0:05:32 | by a simple and straightforward deterministic approach? |
|---|
| 0:05:37 | So we want to know if |
|---|
| 0:05:40 | the application of the maximum-likelihood |
|---|
| 0:05:44 | approach to compute the parameters of the PLDA |
|---|
| 0:05:49 | brings a significant improvement of performance |
|---|
| 0:05:54 | To do that, we first diagonalize the between-speaker covariance |
|---|
| 0:05:59 | matrix, computed |
|---|
| 0:06:00 | on our development corpus |
|---|
| 0:06:04 | A singular value decomposition of the between-speaker covariance matrix |
|---|
| 0:06:08 | gives a matrix P |
|---|
| 0:06:10 | whose columns are |
|---|
| 0:06:12 | the eigenvectors of the between-speaker |
|---|
| 0:06:16 | variability, and a diagonal matrix D of eigenvalues |
|---|
| 0:06:21 | sorted in decreasing order |
|---|
| 0:06:24 | Given a rank R less than p |
|---|
| 0:06:27 | we can |
|---|
| 0:06:30 | compute |
|---|
| 0:06:31 | the rank-R principal between-speaker variability |
|---|
| 0:06:36 | and summarize it in the matrix Φ times Φᵀ |
|---|
| 0:06:40 | defined by equation (4) |
|---|
| 0:06:45 | The first matrix, P one to R, is the p-by-R |
|---|
| 0:06:49 | matrix composed of the first R columns of P |
|---|
| 0:06:52 | and the diagonal matrix D one to R |
|---|
| 0:06:57 | is only comprised of the R |
|---|
| 0:06:59 | highest |
|---|
| 0:07:01 | eigenvalues |
|---|
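A sketch of this deterministic construction, assuming equation (4) is the rank-R truncation of the eigendecomposition of the between-speaker covariance B; variable names are illustrative, not the authors' code:

```python
import numpy as np

def between_within_cov(X, spk):
    """Between- and within-speaker covariances from row i-vectors X
    with speaker labels spk (any hashable labels)."""
    labels, inv = np.unique(spk, return_inverse=True)
    means = np.array([X[inv == k].mean(axis=0) for k in range(len(labels))])
    B = np.cov(means, rowvar=False, bias=True)           # between-speaker
    W = np.cov(X - means[inv], rowvar=False, bias=True)  # within-speaker
    return B, W

def rank_r_phi_phit(B, R):
    """Rank-R principal between-speaker variability:
    P_{1:R} D_{1:R} P_{1:R}^T, keeping the R highest eigenvalues."""
    vals, vecs = np.linalg.eigh(B)      # eigenvalues in increasing order
    idx = np.argsort(vals)[::-1][:R]    # indices of the R largest
    P_R, D_R = vecs[:, idx], np.diag(vals[idx])
    return P_R @ D_R @ P_R.T

# Toy usage with random data and random speaker labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
spk = rng.integers(0, 30, size=300)
B, W = between_within_cov(X, spk)
PhiPhiT = rank_r_phi_phit(B, R=10)
```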
| 0:07:03 | And so we propose to |
|---|
| 0:07:07 | carry out |
|---|
| 0:07:09 | experiments with, on one hand |
|---|
| 0:07:11 | the LW conditioning |
|---|
| 0:07:14 | that is, a standardization according to the |
|---|
| 0:07:19 | within-class covariance matrix |
|---|
| 0:07:21 | followed by length normalization |
|---|
| 0:07:23 | and, on the other hand, the direct estimation of the parameters of the |
|---|
| 0:07:28 | PLDA, without the EM algorithm |
|---|
| 0:07:31 | on the development corpus |
|---|
| 0:07:35 | So the scoring matrices are replaced: the total covariance matrix |
|---|
| 0:07:40 | ΦΦᵀ plus Σ |
|---|
| 0:07:42 | is estimated by |
|---|
| 0:07:44 | the total covariance of the development corpus |
|---|
| 0:07:47 | and the speaker variability matrix ΦΦᵀ by B one to R |
|---|
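In symbols, a plausible reading of this substitution, with T the total covariance of the development corpus and B one to R the rank-R between-speaker covariance of equation (4):

```latex
\Phi\Phi^{T} + \Sigma \;\leftarrow\; T, \qquad
\Phi\Phi^{T} \;\leftarrow\; B_{1:R} = P_{1:R}\, D_{1:R}\, P_{1:R}^{T}
```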
| 0:07:55 | This proposal can be justified if we consider the decomposition of the data of the development corpus |
|---|
| 0:08:02 | We can express the factors and the parameters |
|---|
| 0:08:06 | the speaker and residual |
|---|
| 0:08:08 | factors |
|---|
| 0:08:10 | where w-bar of s is the mean i-vector of speaker s |
|---|
| 0:08:15 | We show in the article that, with these covariance matrices |
|---|
| 0:08:19 | we obtain, as desired, that the speaker factor is standardized, with mean zero and identity matrix of |
|---|
| 0:08:26 | variability |
|---|
| 0:08:27 | and the uncorrelatedness between the random variables |
|---|
| 0:08:32 | Remark that only the nullity of the covariance, which is a necessary condition |
|---|
| 0:08:36 | of independence |
|---|
| 0:08:39 | is achieved |
|---|
| 0:08:40 | and this is what is required to obtain the PLDA scoring |
|---|
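A hedged reconstruction of that decomposition, splitting each development i-vector of speaker s into its class mean and a residual; the second line is one consistent choice of standardized speaker factor (the exact normalization is in the article):

```latex
w_{s,i} \;=\; \bar{w}_s + (w_{s,i} - \bar{w}_s), \qquad
y_s \;\approx\; D_{1:R}^{-1/2}\, P_{1:R}^{T}\, (\bar{w}_s - \mu)
```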
| 0:08:48 | Length normalization is known to improve the Gaussianity, so we compute the distributions |
|---|
| 0:08:53 | of the speaker and residual factors of the development corpus |
|---|
| 0:08:57 | before and after length normalization |
|---|
| 0:09:00 | The top graphs show |
|---|
| 0:09:03 | the distribution of the squared norms of the standardized factors |
|---|
| 0:09:09 | on the left, the speaker factors; on the right, the residuals |
|---|
| 0:09:14 | The dashed line is |
|---|
| 0:09:17 | the distribution that the squared norm of |
|---|
| 0:09:20 | the speaker factor must follow |
|---|
| 0:09:25 | a chi-squared with P degrees of freedom |
|---|
| 0:09:31 | where P is the dimension of the i-vector space |
|---|
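A sketch of this diagnostic, assuming we compare the empirical squared norms of standardized factors with the chi-squared law with P degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

def norm_vs_chi2(factors):
    """Compare squared norms of standardized factors to chi2(P).

    factors : (n, P) array of standardized speaker (or residual) factors.
    A quick check of the Gaussianity assumption discussed in the talk.
    """
    sq_norms = np.sum(factors ** 2, axis=1)
    P = factors.shape[1]
    # Under the N(0, I_P) assumption, sq_norms ~ chi2(P): mean P, var 2P.
    print(f"empirical mean {sq_norms.mean():.1f}  vs  chi2 mean {P}")
    print(f"empirical var  {sq_norms.var():.1f}  vs  chi2 var  {2 * P}")
    # Density values for plotting against a histogram of sq_norms.
    grid = np.linspace(chi2.ppf(0.001, P), chi2.ppf(0.999, P), 200)
    return sq_norms, grid, chi2.pdf(grid, P)

# Toy usage: truly Gaussian factors match; length-normalized ones may not.
rng = np.random.default_rng(0)
norm_vs_chi2(rng.standard_normal((5000, 400)))
```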
| 0:09:37 | We show with these graphs that |
|---|
| 0:09:39 | for the |
|---|
| 0:09:40 | development corpus |
|---|
| 0:09:43 | and also for the evaluation |
|---|
| 0:09:46 | datasets |
|---|
| 0:09:48 | there is a mismatch |
|---|
| 0:09:50 | between them |
|---|
| 0:09:54 | and the distribution of the norms deviates from the expected chi-squared distribution |
|---|
| 0:10:01 | Remark also the dataset shift between the |
|---|
| 0:10:05 | development and evaluation datasets |
|---|
| 0:10:09 | After length normalization |
|---|
| 0:10:12 | on the right |
|---|
| 0:10:14 | we carried out experiments with |
|---|
| 0:10:18 | the ML-computed parameters and with the deterministic approach |
|---|
| 0:10:23 | In both cases |
|---|
| 0:10:24 | we can see that |
|---|
| 0:10:26 | the mismatch between the norms and the chi-squared distribution |
|---|
| 0:10:29 | is partially reduced |
|---|
| 0:10:31 | as is the shift |
|---|
| 0:10:34 | between the development and evaluation sets |
|---|
| 0:10:38 | Remark that the deterministic approach |
|---|
| 0:10:42 | improves the Gaussianity |
|---|
| 0:10:43 | in a similar manner to the ML technique |
|---|
| 0:10:47 | Here are the results on the NIST recognition task, in terms of equal error rate and minimum DCF |
|---|
| 0:10:53 | always with |
|---|
| 0:10:54 | three systems |
|---|
| 0:10:58 | We evaluate over all the conditions of the NIST speaker recognition evaluations of 2008, 2010 |
|---|
| 0:11:05 | and 2012, telephone data |
|---|
| 0:11:08 | including the noisy environments |
|---|
| 0:11:11 | The first system |
|---|
| 0:11:13 | uses a length normalization |
|---|
| 0:11:15 | following a standardization by the total covariance matrix |
|---|
| 0:11:19 | and then the LW conditioning in |
|---|
| 0:11:21 | two cases |
|---|
| 0:11:24 | one with the ML |
|---|
| 0:11:26 | estimate of the parameters, and one with the deterministic |
|---|
| 0:11:29 | estimate of the parameters |
|---|
| 0:11:32 | We can see |
|---|
| 0:11:34 | you can see that the results are the same in terms of |
|---|
| 0:11:39 | equal error rate |
|---|
| 0:11:40 | between the last two techniques |
|---|
| 0:11:43 | In terms of DCF, the probabilistic approach remains superior |
|---|
| 0:11:48 | And we remark that the LW conditioning performed better |
|---|
| 0:11:53 | than the conditioning by the total covariance |
|---|
| 0:11:56 | even with the deterministic approach |
|---|
| 0:12:04 | So now we consider that maybe the fact that |
|---|
| 0:12:09 | the EM-ML approach doesn't bring the expected |
|---|
| 0:12:13 | improvement of performance |
|---|
| 0:12:15 | is due to the fact that the G-PLDA model is |
|---|
| 0:12:20 | not optimal for i-vector spherical distributions |
|---|
| 0:12:26 | So we compute |
|---|
| 0:12:28 | two series for the development corpus |
|---|
| 0:12:32 | First, the average log-likelihood of the residues of the observations |
|---|
| 0:12:36 | given the model |
|---|
| 0:12:38 | given |
|---|
| 0:12:39 | the parameters μ, Φ and Σ |
|---|
| 0:12:45 | which we consider as the likelihood |
|---|
| 0:12:48 | of |
|---|
| 0:12:49 | the residue |
|---|
| 0:12:51 | of a class |
|---|
| 0:12:55 | Then we compare this |
|---|
| 0:12:57 | likelihood to a parameter of position of the class; we consider as indicator of probabilistic class |
|---|
| 0:13:04 | position the posterior likelihood of the speaker factor of |
|---|
| 0:13:09 | the class |
|---|
| 0:13:12 | And we display |
|---|
| 0:13:16 | the two series |
|---|
| 0:13:19 | with, on the horizontal |
|---|
| 0:13:21 | axis, the parameter of class position, and on the vertical axis |
|---|
| 0:13:26 | the likelihood |
|---|
| 0:13:28 | of the residue |
|---|
| 0:13:30 | according |
|---|
| 0:13:31 | to the model |
|---|
| 0:13:35 | The first graph |
|---|
| 0:13:37 | shows the results without length normalization |
|---|
| 0:13:41 | with the i-vectors as provided by the extractor |
|---|
| 0:13:44 | and we remark here that |
|---|
| 0:13:47 | no correlation occurs between the position of the class |
|---|
| 0:13:53 | and the likelihood |
|---|
| 0:13:55 | of the residue |
|---|
| 0:13:58 | Each time, we display the coefficient of determination R-squared |
|---|
| 0:14:02 | which goes from zero to one |
|---|
| 0:14:05 | and which indicates how well the data points fit a line |
|---|
| 0:14:10 | Here the R-squared is equal to 0.04 |
|---|
| 0:14:13 | close to zero |
|---|
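A sketch of this coefficient of determination for the two series, assuming a simple least-squares line of residue likelihood against class position; the array names are hypothetical:

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of the least-squares line y ~ a*x + b.

    x : per-class position indicator (likelihood of the speaker factor)
    y : per-class average log-likelihood of the residues
    R^2 ranges from 0 (no linear relation) to 1 (perfect linear fit).
    """
    a, b = np.polyfit(x, y, deg=1)   # slope and intercept of the fit
    residuals = y - (a * x + b)
    return 1.0 - residuals.var() / y.var()

# Toy usage: independent series give R^2 near 0, as before normalization.
rng = np.random.default_rng(0)
print(r_squared(rng.standard_normal(1000), rng.standard_normal(1000)))
```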
| 0:14:16 | After length normalization |
|---|
| 0:14:19 | a significant correlation |
|---|
| 0:14:21 | appears between the likelihoods of the class factors and the likelihood of the residue |
|---|
| 0:14:27 | with R-squared values equal to 0.59 and 0.64 |
|---|
| 0:14:35 | So there is a dependency between |
|---|
| 0:14:37 | the actual variability |
|---|
| 0:14:41 | matrix of a class and the probabilistic position of this class, expressed by the likelihood of |
|---|
| 0:14:46 | the factor |
|---|
| 0:14:48 | So we can say that this shows a heteroscedasticity |
|---|
| 0:14:51 | of the residue |
|---|
| 0:14:58 | We computed the previous |
|---|
| 0:15:02 | results with a training set |
|---|
| 0:15:05 | in which the data are not evenly distributed across speakers |
|---|
| 0:15:10 | So one could object that the correlations are due to the quantity of information per speaker |
|---|
| 0:15:15 | or something of the sort |
|---|
| 0:15:17 | So we compute the same graphs as before |
|---|
| 0:15:21 | but keeping only the speaker |
|---|
| 0:15:25 | training classes |
|---|
| 0:15:27 | with a minimum number of sessions per training speaker |
|---|
| 0:15:33 | We vary this minimal number of sessions per speaker |
|---|
| 0:15:36 | from two to sixty-two |
|---|
| 0:15:41 | and each time, for only the segments of the speakers which |
|---|
| 0:15:46 | have more than this minimum |
|---|
| 0:15:48 | we compute the R-squared score |
|---|
| 0:15:50 | We see that before length normalization there is no problem |
|---|
| 0:15:54 | because the |
|---|
| 0:15:55 | two series are independent |
|---|
| 0:15:57 | and after normalization we see |
|---|
| 0:16:00 | that even for |
|---|
| 0:16:04 | subsets of speaker classes with the |
|---|
| 0:16:07 | maximum number of sessions |
|---|
| 0:16:10 | the same |
|---|
| 0:16:11 | effect occurs |
|---|
| 0:16:14 | with R-squared values which are higher than 0.6 |
|---|
| 0:16:24 | So we remark |
|---|
| 0:16:27 | that the Gaussian PLDA modeling is a good model |
|---|
| 0:16:32 | but if we are obliged to |
|---|
| 0:16:34 | project the data onto a nonlinear surface, the sphere |
|---|
| 0:16:38 | the problem is to be sure that a homoscedastic model, with equality of |
|---|
| 0:16:44 | covariance |
|---|
| 0:16:46 | will conform to this assumption |
|---|
| 0:16:51 | One idea to fix this heteroscedasticity could be to replace the overall within-class |
|---|
| 0:16:55 | variability parameter by a class-dependent parameter |
|---|
| 0:16:58 | taking into account the local position of the class, to fit the |
|---|
| 0:17:01 | actual distortions |
|---|
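In symbols, a hedged sketch of that idea: let the residual covariance depend on the class position instead of being shared across classes:

```latex
\varepsilon_s \sim \mathcal{N}\bigl(0,\; \Sigma(y_s)\bigr)
\quad \text{instead of} \quad
\varepsilon_s \sim \mathcal{N}(0,\; \Sigma)
```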
| 0:17:05 | But such a modeling is difficult to carry out |
|---|
| 0:17:11 | because it induces a complex density |
|---|
| 0:17:14 | by making the within-class variability parameter a nonlinear function |
|---|
| 0:17:19 | Or we could give up length normalization and |
|---|
| 0:17:22 | pursue approaches which preserve the i-vector norm |
|---|
| 0:17:27 | attempting to find an adequate prior, such as the heavy-tailed PLDA |
|---|
| 0:17:31 | or discriminative classifiers, pairwise discriminative training |
|---|
| 0:17:36 | Or we can ask why we are obliged to ignore the norm: maybe because it does not |
|---|
| 0:17:41 | contain the expected |
|---|
| 0:17:43 | variabilities |
|---|
| 0:17:46 | maybe it is related to some parameters |
|---|
| 0:17:49 | acoustic ones |
|---|
| 0:17:51 | Just one last remark |
|---|
| 0:17:54 | the LW conditioning |
|---|
| 0:17:59 | transforms the within-class variability into the identity matrix |
|---|
| 0:18:03 | and an identity matrix has no |
|---|
| 0:18:06 | principal components |
|---|
| 0:18:08 | maybe it alleviates this constraint of |
|---|
| 0:18:13 | homoscedasticity |
|---|
| 0:18:16 | thank you |
|---|
| 0:18:33 | Can you comment on the experiments where you replaced the probabilistic approach of estimating the |
|---|
| 0:18:42 | parameters with the one on the screen |
|---|
| 0:18:47 | the deterministic one? |
|---|
| 0:18:49 | I think that |
|---|
| 0:18:50 | in the limit, if your training set has many speakers |
|---|
| 0:18:53 | these two solutions are exactly the same; the only difference is that you're putting the |
|---|
| 0:18:58 | prior in, in the one case |
|---|
| 0:19:00 | Okay, so it depends on the number of speakers, the average number |
|---|
| 0:19:06 | of speakers; and I guess that it matters when you go to a small number of |
|---|
| 0:19:10 | speakers, when you train the model |
|---|
| 0:19:12 | Yes, and a difference is that the deterministic approach is not intended to compete |
|---|
| 0:19:19 | with the ML method, which remains the best way |
|---|
| 0:19:22 | but I was surprised by the slight gap of performance |
|---|
| 0:19:28 | and so |
|---|
| 0:19:31 | I assume that it is maybe because the EM-ML cannot |
|---|
| 0:19:35 | be optimal, because there is a problem of sphericity of the data |
|---|
| 0:19:40 | but the deterministic approach is not affected |
|---|
| 0:19:43 | And that's exactly this topic, where we try to show whether the norm |
|---|
| 0:19:49 | of the speaker factors |
|---|
| 0:19:52 | follows the awaited distribution |
|---|
| 0:19:56 | Yes, I guess, because you have to treat them as random variables, because they're |
|---|
| 0:20:01 | not simply points |
|---|
| 0:20:03 | under the PLDA scheme, they have a posterior distribution |
|---|
| 0:20:07 | Okay, so a better way |
|---|
| 0:20:09 | to consider whether they follow the distribution |
|---|
| 0:20:13 | would be |
|---|
| 0:20:14 | probably to add the trace of |
|---|
| 0:20:16 | the posterior covariance matrix; it should also be added when you compute |
|---|
| 0:20:20 | the norm, in order to see the overall distribution rather than |
|---|
| 0:20:25 | dot products, okay? |
|---|
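The suggestion corresponds to a standard identity for a Gaussian posterior (my gloss of the question, not a slide from the talk): the expected squared norm of the latent factor adds the trace of its posterior covariance to the squared norm of the point estimate:

```latex
\mathbb{E}\bigl[\|y\|^2 \mid w\bigr]
= \|\hat{y}\|^2 + \operatorname{tr}\bigl(\mathrm{Cov}(y \mid w)\bigr),
\qquad \hat{y} = \mathbb{E}[y \mid w]
```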
| 0:20:30 | Regarding that |
|---|
| 0:20:31 | we did the same analysis with the evaluation and test i-vectors |
|---|
| 0:20:37 | as with the development corpus vectors |
|---|
| 0:20:42 | Same effect, okay, but the difference is that, before length normalization, the R-squared score is |
|---|
| 0:20:47 | not close to zero |
|---|
| 0:20:49 | The score for the test |
|---|
| 0:20:52 | i-vectors, before normalization, as provided by the extractor |
|---|
| 0:20:58 | is equal to 0.3 |
|---|
| 0:21:02 | whereas the test vectors are not used for training the PLDA factor analysis; so |
|---|
| 0:21:08 | there is a shift, not only for the mean |
|---|
| 0:21:11 | but also for |
|---|
| 0:21:13 | this problem of homoscedasticity |
|---|
| 0:21:19 | Just one quick question; I just missed your point when you said |
|---|
| 0:21:23 | I think you were saying that |
|---|
| 0:21:27 | trying to make the data spherically distributed, you thought, was inconsistent with being Gaussian |
|---|
| 0:21:32 | why is that? |
|---|
| 0:21:36 | It's empirical, but Gaussians in a high-dimensional space are on a sphere |
|---|
| 0:21:41 | yes, nearly |
|---|
| 0:21:45 | but |
|---|
| 0:21:47 | we constrain the speaker factors to lie on |
|---|
| 0:21:52 | a sphere |
|---|
| 0:21:53 | and so the doubt |
|---|
| 0:21:55 | is whether the within-class variability can be assumed to be the same, because |
|---|
| 0:22:03 | it will be affected |
|---|
| 0:22:05 | by the position |
|---|
| 0:22:07 | But that's the posterior; |
|---|
| 0:22:09 | the prior distribution of the i-vectors |
|---|
| 0:22:11 | zero mean and unit identity covariance, in a high-dimensional space, will be approximately spherical |
|---|
| 0:22:19 | So that's, I mean, that's what happens, mathematically, in a |
|---|
| 0:22:23 | high-dimensional space; so why is it inconsistent? |
|---|
| 0:22:28 | Here we actually observe a spherical distribution as well |
|---|
| 0:22:38 | and applying a model with equality of covariances on this surface is difficult |
|---|
| 0:22:44 | Maybe we can say that length normalization is a whole technique which projects onto the sphere |
|---|
| 0:22:50 | instead of adjusting to it, taking the norm information into account, I think |
|---|
| 0:22:59 | Good, but we should |
|---|
| 0:23:02 | move on; thank you for the |
|---|
| 0:23:05 | discussion |
|---|