0:00:15 | hello,

0:00:16 | so this presentation deals with investigations about discriminative training

0:00:22 | applied to i-vectors that have been previously normalized

0:00:28 | here is the system on which we focus:

0:00:33 | the usual i-vector based system for speaker recognition:

0:00:37 | first normalization — within-class covariance normalization, then length normalization —

0:00:42 | then Gaussian modeling, the PLDA modeling, providing parameters:

0:00:48 | the mean value mu and the covariance matrices,

0:00:52 | and the LLR score.

0:00:57 | some works have proposed to

0:01:01 | optimize the parameters of this PLDA modeling

0:01:06 | in a discriminative way.

0:01:09 | these discriminative classifiers use logistic regression

0:01:15 | maximization,

0:01:16 | applied to the score coefficients of PLDA

0:01:21 | or directly to the PLDA parameters.

0:01:30 | the goal here is to add a new step, an additional step, to the normalization

0:01:36 | procedure,

0:01:37 | which doesn't modify the distances between i-vectors,

0:01:41 | and which then introduces constraints for the discriminative training.

0:01:49 | once this additional normalization step

0:01:52 | is carried out, it's possible to

0:01:56 | train the discriminative classifier with a limited number of coefficients to optimize:

0:02:03 | the number of coefficients to optimize in a discriminative way

0:02:08 | is reduced to d, the dimension of the i-vector.

0:02:13 | then we carry out the state-of-the-art logistic regression based

0:02:18 | discriminative training,

0:02:19 | and also a new approach, the orthonormal discriminative classifier,

0:02:25 | which is a novelty.

0:02:28 | first some notation: in the PLDA model, the residual term

0:02:32 | epsilon

0:02:35 | is assumed to be statistically

0:02:38 | independent of the speaker term, and the speaker term

0:02:42 | is constrained to lie in a low-rank subspace,

0:02:50 | the eigenvoice subspace.

0:02:53 | then a few comments about the two-covariance model,

0:02:56 | which is nowadays

0:03:00 | the most commonly used model

0:03:04 | in speaker recognition.

0:03:07 | so the LLR score can be written as a second-degree polynomial function

0:03:11 | of the components of the two vectors of the trial, w1

0:03:15 | and w2,

0:03:17 | which can be written

0:03:20 | analytically with matrices P and Q.
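This quadratic form can be made concrete. Below is a hedged NumPy sketch of two-covariance LLR scoring (`mu`, `B`, `W` are toy placeholders for the model parameters, not values from the talk): expanding the two block-Gaussian log-densities yields exactly a second-degree polynomial in w1 and w2, with two matrices P and Q.

```python
import numpy as np

def llr_score(w1, w2, mu, B, W):
    """LLR of the two-covariance model: speaker mean y ~ N(mu, B), w | y ~ N(y, W).
    Compares the 'same speaker' pair density against the 'different speakers' one."""
    d = len(mu)
    S = B + W                                   # marginal covariance of one i-vector
    x = np.concatenate([w1 - mu, w2 - mu])
    same = np.block([[S, B], [B, S]])           # pair covariance under same speaker
    diff = np.block([[S, np.zeros((d, d))],
                     [np.zeros((d, d)), S]])    # pair covariance under different speakers

    def log_gauss(v, C):
        _, logdet = np.linalg.slogdet(C)
        return -0.5 * (v @ np.linalg.solve(C, v) + logdet + len(v) * np.log(2 * np.pi))

    return log_gauss(x, same) - log_gauss(x, diff)
```

Multiplying out the two quadratic forms gives the P and Q matrices of the slide.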

0:03:28 | recall that the state-of-the-art

0:03:31 | logistic regression based

0:03:33 | discriminative classifiers

0:03:35 | try to optimize coefficients initialized by the PLDA modeling.

0:03:42 | they use the log probability of correctly classifying all training

0:03:48 | trials — target as target, non-target as non-target — called the total cross entropy,

0:03:55 | by using gradient descent with respect to these coefficients.

0:03:59 | the coefficients

0:04:01 | that have to be optimized can be

0:04:03 | the PLDA score coefficients,

0:04:06 | that is, the matrices P and Q of the

0:04:09 | previous slide,

0:04:11 | and, following this way proposed by Burget et al.,

0:04:16 | the LLR score can be written

0:04:18 | as a dot product

0:04:20 | between an expanded vector of the trial

0:04:23 | and a vector omega, which is initialized with the PLDA parameters.
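A hedged sketch of this expanded-vector formulation (the exact expansion used by Burget et al. is not spelled out in the talk; this is one plausible variant that reproduces the quadratic score above):

```python
import numpy as np

def expand_trial(w1, w2):
    # one plausible expansion (assumed form): quadratic monomials of the trial
    return np.concatenate([
        (np.outer(w1, w1) + np.outer(w2, w2)).ravel(),   # within-vector terms
        (np.outer(w1, w2) + np.outer(w2, w1)).ravel(),   # cross-vector terms
        w1 + w2,
        [1.0],
    ])

def omega_from_plda(P, Q, c, k):
    # stacking the PLDA score coefficients gives the initial omega,
    # so that score = omega . expand_trial(w1, w2)
    return np.concatenate([P.ravel(), Q.ravel(), c, [k]])
```

With symmetric P and Q, the dot product equals w1'Pw1 + w2'Pw2 + 2 w1'Qw2 + c'(w1+w2) + k.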

0:04:30 | Borgström and McCree proposed in two thousand

0:04:34 | thirteen to

0:04:36 | optimize the PLDA parameters — the mean value mu,

0:04:40 | the eigenvoice subspace matrix

0:04:43 | Phi, and the nuisance variability matrix Lambda —

0:04:48 | by using this

0:04:50 | total cross entropy

0:04:51 | function.

0:04:56 | discriminative training suffers from some limitations; recall that the issues are,

0:05:00 | first,

0:05:01 | overfitting —

0:05:02 | overfitting on development data —

0:05:05 | and the respect of the parameter conditions:

0:05:09 | covariance matrices must be positive

0:05:14 | definite,

0:05:16 | and the matrices P and Q have to remain negative or positive

0:05:21 | semi-definite.

0:05:22 | so,

0:05:23 | some solutions have been proposed.

0:05:27 | constrained discriminative training

0:05:30 | attempts to train only a small number of parameters —

0:05:33 | of order

0:05:35 | d, where d is the dimension of the i-vector —

0:05:37 | rather than the d-squared coefficients of the score.

0:05:42 | such solutions, proposed for example by Rohdin et al.,

0:05:48 | optimize only some coefficients for each dimension of the i-vector.

0:05:53 | indeed, for approaches like this quadratic scoring,

0:06:04 | you can see that the score is composed of a sum of

0:06:08 | several terms;

0:06:10 | it is possible to optimize a parametric coefficient for

0:06:16 | each of these terms.

0:06:21 | also, only the mean vector

0:06:24 | and the eigenvalues of the PLDA matrices

0:06:27 | can be trained, or we can optimize only a scaling factor,

0:06:33 | a unique scalar for each matrix.

0:06:39 | it's also possible to use a singular value decomposition of P to reparameterize,

0:06:44 | in order to respect the semi-definiteness parameter conditions.

0:06:50 | if discriminative training

0:06:53 | has provided interesting results when i-vectors were not normalized,

0:06:58 | it struggles to improve

0:07:00 | speaker detection once i-vectors have been first normalized,

0:07:04 | whereas this configuration achieves the best performance.

0:07:09 | now we present our additional normalization step, a simple rotation,

0:07:14 | proposed and intended to constrain the discriminative training.

0:07:19 | recall that the within-class covariance matrix W is isotropic after WCCN;

0:07:25 | after length normalization, it has been shown that it remains

0:07:30 | almost exactly isotropic —

0:07:32 | I mean, an identity matrix multiplied by a scalar.

0:07:37 | we propose simply to

0:07:40 | rotate by the eigenvector basis of the between-class covariance matrix B of the training

0:07:45 | dataset,

0:07:46 | computed via the eigendecomposition of B,

0:07:49 | and we apply this matrix of eigenvectors of B to each i-vector,

0:07:56 | training or test.

0:07:58 | this is a very simple operation which doesn't modify the distances between i-vectors.

0:08:03 | after this deterministic rotation, B is diagonal and W remains almost exactly

0:08:09 | isotropic,

0:08:11 | and therefore diagonal,

0:08:13 | because the eigenvector basis of B is orthogonal.
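The rotation step can be sketched as follows (toy data; `B` here is a hypothetical between-class covariance built from random "speaker means", not the one from the talk). An orthogonal rotation leaves all pairwise distances unchanged while diagonalizing B:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical between-class covariance from toy "speaker means"
speaker_means = rng.standard_normal((20, 5))
B = np.cov(speaker_means.T)

eigvals, E = np.linalg.eigh(B)      # eigenvector basis of B (orthogonal matrix)
w = rng.standard_normal((10, 5))    # toy i-vectors (training or test)
w_rot = w @ E                       # the proposed rotation, applied to each i-vector
B_rot = E.T @ B @ E                 # B expressed in the new basis: diagonal
```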

0:08:16 | we assume —

0:08:18 | the key point is that we assume — that the PLDA matrices Phi Phi-transposed and Lambda become almost

0:08:23 | diagonal, and even isotropic for Lambda.

0:08:27 | as a consequence, the matrices P and Q involved in the LLR score become

0:08:32 | almost diagonal.

0:08:36 | moreover, the solution of LDA is

0:08:39 | almost exactly

0:08:41 | spanned by the first eigenvectors of B,

0:08:45 | because the within-class covariance is almost exactly equal to

0:08:48 | the identity, up to a multiplicative constant.

0:08:52 | so the first components of an i-vector approximate its projection into the LDA

0:08:57 | subspace.

0:09:00 | so the score can be written as a sum of

0:09:04 | terms, plus a residual term:

0:09:08 | there is one term for each dimension of the i-vector,

0:09:14 | and the residual term gathers

0:09:17 | the off-diagonal terms of the initial scoring,

0:09:22 | the diagonal terms beyond the last dimension,

0:09:25 | and the offsets.

0:09:29 | we show that a major proportion of the PLDA score can be concentrated into this

0:09:34 | sum of

0:09:36 | terms, one for each

0:09:38 | dimension — independent

0:09:39 | terms.
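A small sketch of this per-dimension decomposition, with toy nearly-diagonal score matrices (an assumption standing in for the rotated PLDA score matrices): keeping only the diagonals of P and Q gives one term per dimension, and everything else goes into the residual.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
# toy nearly-diagonal score matrices (small off-diagonal perturbation)
P = np.diag(rng.standard_normal(d)) + 0.01 * rng.standard_normal((d, d))
P = (P + P.T) / 2
Q = np.diag(rng.standard_normal(d)) + 0.01 * rng.standard_normal((d, d))
Q = (Q + Q.T) / 2
w1, w2 = rng.standard_normal(d), rng.standard_normal(d)

full = w1 @ P @ w1 + w2 @ P @ w2 + 2 * w1 @ Q @ w2
# one term per dimension, using only the diagonals of P and Q
per_dim = np.diag(P) * (w1**2 + w2**2) + 2 * np.diag(Q) * w1 * w2
residual = full - per_dim.sum()     # gathers the off-diagonal contributions
```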

0:09:42 | here is an analysis of the PLDA parameters before and after this rotation.

0:09:46 | we measure the diagonality, via the entropy, of the matrices:

0:09:52 | a maximal value of one indicates that the matrix is exactly diagonal.

0:09:58 | we can see that, right after

0:10:02 | the rotation,

0:10:03 | all the values are close to one,

0:10:05 | so the PLDA matrices are very close to being diagonal,

0:10:09 | and so are the score matrices

0:10:11 | P and Q.
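The exact entropy-based diagonality measure is not given in the talk; a simple stand-in that is likewise maximal at one for exactly diagonal matrices can be sketched as:

```python
import numpy as np

def diagonality(M):
    # fraction of the matrix "energy" carried by the diagonal;
    # equals 1 exactly when M is diagonal (a simple stand-in for the
    # entropy-based measure mentioned in the talk, whose exact form is not given)
    M = np.asarray(M, dtype=float)
    return np.sum(np.diag(M) ** 2) / np.sum(M ** 2)
```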

0:10:14 | we also compare with LDA, by using the distance between orthogonal projections

0:10:19 | onto the subspaces:

0:10:23 | the first eigenvectors of B span

0:10:25 | almost exactly the LDA subspace,

0:10:30 | so the difference is negligible.

0:10:33 | to assess the variance of the residual term, we

0:10:36 | compute, on the last line of the table,

0:10:39 | the ratio between the variance

0:10:42 | of the residual term and the variance of the whole score,

0:10:46 | and we can see that, for both the male and female

0:10:50 | training sets, the values are close to zero.

0:10:55 | in terms of performance,

0:10:57 | we compare the full PLDA baseline with the simplified scoring

0:11:01 | in which we have removed

0:11:05 | the residual term, and we can see that this residual

0:11:08 | plays little or no

0:11:13 | role in speaker detection.

0:11:18 | so we can

0:11:20 | carry out discriminative training applied to these rotated vectors.

0:11:26 | first, the state-of-the-art logistic regression based

0:11:30 | approach, following Burget:

0:11:33 | since only d coefficients are of interest, the discriminative training can be

0:11:38 | performed by optimizing a

0:11:42 | vector omega; the

0:11:44 | score is a dot product between an expanded vector of the trial, given its two i-vectors,

0:11:51 | and omega. remark that the score can now be written

0:11:54 | with a vector of size

0:11:58 | d plus one, instead of the d-squared coefficients of the initial

0:12:04 | discriminative training.
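A minimal sketch of this reduced logistic-regression training on synthetic trials (the `expand` function below is a hypothetical per-dimension variant with a d+1-sized omega, not the talk's exact expansion):

```python
import numpy as np

def expand(w1, w2):
    # hypothetical reduced expansion: one product term per dimension, plus an offset
    return np.append(w1 * w2, 1.0)

def total_cross_entropy(omega, Phi, y):
    # binary cross entropy of the sigmoid of the scores, and its gradient
    p = 1.0 / (1.0 + np.exp(-(Phi @ omega)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, Phi.T @ (p - y) / len(y)

# synthetic target / non-target trials
rng = np.random.default_rng(2)
d = 4
spk = rng.standard_normal((50, d))
tar = np.stack([expand(s + 0.1 * rng.standard_normal(d),
                       s + 0.1 * rng.standard_normal(d)) for s in spk])
non = np.stack([expand(rng.standard_normal(d), rng.standard_normal(d))
                for _ in range(50)])
Phi = np.vstack([tar, non])
y = np.concatenate([np.ones(50), np.zeros(50)])

omega = np.zeros(d + 1)
losses = []
for _ in range(200):                 # plain gradient descent on the d+1 weights
    loss, grad = total_cross_entropy(omega, Phi, y)
    losses.append(loss)
    omega -= 0.5 * grad
```

Only d+1 parameters are trained, against d-squared in the unconstrained version.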

0:12:07 | our second approach is based on the works of Borgström and McCree.

0:12:13 | it can be remarked that, as the matrices are close to being diagonal,

0:12:18 | they are close to

0:12:22 | their diagonal matrices of eigenvalues,

0:12:23 | and so, following Borgström and McCree, we only

0:12:28 | perform a discriminative training

0:12:31 | intended to optimize the diagonal of Phi Phi-transposed, the scalar of

0:12:37 | Lambda,

0:12:38 | and the mean value mu.

0:12:44 | then we introduce an alternative to the logistic regression

0:12:50 | discriminative training.

0:12:55 | we define this vector,

0:12:59 | the expanded vector of the score of the trial —

0:13:05 | a vector

0:13:08 | with one

0:13:10 | component for each dimension of the

0:13:13 | eigenvoice subspace, and a last component which is

0:13:18 | the residual term.

0:13:21 | so the score is equal to the dot product of this vector

0:13:25 | and a vector of ones.

0:13:28 | the goal here is to replace this

0:13:31 | unique normal vector —

0:13:32 | the all-ones vector — by a

0:13:35 | basis of discriminant axes, extracted by using the Fisher criterion.

0:13:40 | then, since

0:13:41 | we have extracted

0:13:43 | not one but

0:13:45 | several vectors, we have to combine this

0:13:48 | basis of discriminant axes to find the unique normal vector

0:13:53 | needed by speaker detection.

0:13:58 | so we can use the Fisher criterion to

0:14:02 | extract the discriminant axes

0:14:07 | in this space of expanded vectors.

0:14:11 | consider a dataset comprised of trials, target and non-target

0:14:15 | trials;

0:14:17 | for each one of them we

0:14:20 | compute the expanded vector

0:14:23 | of the trial.

0:14:25 | on this dataset, of constrained dimension,

0:14:31 | we can compute the statistics of the target and non-target trials:

0:14:37 | the within-class and between-class covariance matrices of

0:14:41 | this dataset —

0:14:45 | in this case a two-class classifier, target versus non-target — and we can extract the axis

0:14:51 | maximizing the Fisher criterion

0:14:54 | of equation nine.

0:15:01 | problem:

0:15:02 | you can easily see the problem —

0:15:05 | with two classes,

0:15:08 | the between-class matrix is of rank one, so we can only

0:15:12 | extract one nonzero

0:15:14 | eigenvalue.

0:15:16 | one axis only can be extracted, because we are

0:15:21 | limited by the number of classes.

0:15:25 | but some time ago, a recursive method was proposed in order

0:15:30 | to extract more axes than classes using the Fisher criterion — the method

0:15:35 | behind our orthonormal discriminative classifier.

0:15:39 | it was sometimes used in face recognition,

0:15:47 | in the two-thousands;

0:15:49 | researchers used it in those areas.

0:15:52 | the idea is the following: given a training corpus

0:15:56 | of expanded vectors

0:15:59 | of score trials,

0:16:01 | target and non-target trials,

0:16:03 | we compute the statistics and extract the vector

0:16:10 | which maximizes the Fisher criterion,

0:16:13 | and then

0:16:15 | we project the dataset onto the orthogonal subspace of this vector.

0:16:20 | so we extract a vector and we

0:16:24 | project the data onto the hyperplane orthogonal to this vector,

0:16:31 | and we iterate; so we can extract more axes

0:16:35 | than

0:16:37 | there are classes.

0:16:41 | it can be noted that the Fisher criterion is a geometrical approach which doesn't need

0:16:48 | assumptions of Gaussianity; and the expanded vectors corresponding to the scores

0:16:55 | are not Gaussian-distributed.

0:16:58 | it can be shown that each component of the expanded score, for one

0:17:04 | given dimension, follows a non-central chi-squared distribution, with distinct parameters

0:17:10 | for target trials and non-target trials.

0:17:14 | moreover, supposing that we

0:17:17 | carry out an experiment using expanded vectors of scores drawn from these chi-squared distributions,

0:17:24 | we obtain exactly the same values.

0:17:26 | so we set aside the idea of

0:17:30 | modeling them probabilistically, because the chi-squared modeling

0:17:33 | does not bring new information

0:17:36 | when i-vectors follow a standard normal prior;

0:17:41 | the distribution of the components of the score

0:17:46 | stays the same.

0:17:49 | so we use this method to extract the

0:17:54 | discriminant axes.

0:17:57 | the remaining issue to address is to combine this subspace of

0:18:02 | discriminant

0:18:03 | axes to

0:18:05 | obtain the unique

0:18:07 | normal vector needed by speaker detection: we need only

0:18:11 | one vector to apply.

0:18:14 | so we have to find weights to

0:18:18 | apply to each

0:18:19 | orthonormal discriminant vector.

0:18:25 | we propose

0:18:27 | weights equal to the norms of the vectors,

0:18:30 | because this way it can be shown that the variance of the scores along

0:18:37 | the axes

0:18:41 | is decreasing over the iterations.

0:18:43 | so this method is similar to a singular value decomposition,

0:18:48 | in which we extract the

0:18:51 | most important axes in terms of score variability, then

0:18:56 | the others,

0:18:58 | with decreasing variance; and remark that, at the end,

0:19:02 | the impact of the last

0:19:06 | discriminant vectors on the score is negligible.

0:19:11 | so,

0:19:14 | equation ten shows that, for a trial, we apply the rotation by B, compute the

0:19:20 | expanded vector between the two i-vectors,

0:19:24 | and take the dot product

0:19:27 | of this expanded vector with

0:19:31 | the single vector equal to the

0:19:33 | weighted sum of the Fisher criterion

0:19:37 | axes.
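A small sketch of this combination step (toy orthonormal axes and assumed decreasing weights): the several discriminant axes collapse into a single scoring vector omega, and scoring a trial is one dot product.

```python
import numpy as np

rng = np.random.default_rng(3)
D, k = 8, 3
# toy orthonormal discriminant axes (rows), e.g. from the recursive extraction
axes = np.linalg.qr(rng.standard_normal((D, k)))[0].T
weights = np.array([2.0, 1.0, 0.5])     # e.g. the norms, decreasing over iterations

omega = weights @ axes                  # single scoring vector: weighted sum of axes
phi = rng.standard_normal(D)            # expanded vector of one trial
score = phi @ omega                     # speaker detection needs only this one vector
```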

0:19:40 | for the training task: even if the dimension of the expanded vector

0:19:46 | is far lower than that of the full score,

0:19:50 | we can have more than one hundred million non-target

0:19:56 | trials,

0:19:57 | and we have to compute the covariance matrix of a

0:20:01 | set of

0:20:05 | several hundred

0:20:07 | million

0:20:09 | trials.

0:20:11 | we can parameterize these scores with the statistics of

0:20:17 | the training set;

0:20:18 | these statistics can be expressed

0:20:22 | as linear combinations

0:20:24 | of the statistics of subsets,

0:20:26 | so it's possible to split the task —

0:20:31 | in our experiments, to split the computation of these statistics over the

0:20:38 | huge training dataset.
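The subset-statistics trick can be sketched as follows (toy data; per-chunk first- and second-order accumulators combine linearly into the global mean and covariance, so the huge trial set never has to be held at once):

```python
import numpy as np

def merge_stats(chunks):
    # accumulate zeroth/first/second order statistics per subset, then combine:
    # the global mean and covariance are linear combinations of subset statistics
    n = sum(len(c) for c in chunks)
    s1 = sum(c.sum(axis=0) for c in chunks)     # first-order accumulator
    s2 = sum(c.T @ c for c in chunks)           # second-order accumulator
    mean = s1 / n
    cov = s2 / n - np.outer(mean, mean)         # biased (maximum-likelihood) covariance
    return mean, cov
```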

0:20:41 | another remark,

0:20:44 | which was not made by the authors of this method:

0:20:50 | the method needs,

0:20:51 | theoretically, to project the data onto the orthogonal subspace

0:20:55 | at each iteration,

0:20:56 | and if you have

0:20:59 | billions of data points, it's very long. but it is possible

0:21:06 | to extract the discriminant axes without

0:21:10 | the concern of projecting the data at each iteration: only by updating the statistics,

0:21:16 | it is possible to extract the axes without

0:21:21 | any effective projection of the data at each iteration.

0:21:26 | now the results,

0:21:28 | on the NIST recognition condition five

0:21:33 | of the two thousand ten telephone extended evaluation,

0:21:40 | with i-vectors provided by

0:21:44 | Brno University of Technology —

0:21:47 | thanks to them —

0:21:53 | for the male set and the female set.

0:21:55 | the first line, for each set, is the baseline,

0:22:00 | PLDA;

0:22:02 | then the two approaches using logistic regression, on the score coefficients and on the PLDA parameters;

0:22:09 | and the fourth line is our orthonormal discriminative classifier.

0:22:15 | we can see first that the logistic regression based approaches struggle to improve the

0:22:20 | performance of PLDA.

0:22:23 | maybe, even if the training

0:22:30 | is constrained,

0:22:34 | there is overfitting on the development data,

0:22:40 | and the results are not better than PLDA.

0:22:45 | maybe, also, after length normalization the i-vectors are more

0:22:50 | Gaussian —

0:22:51 | it improves Gaussianity —

0:22:53 | and thus logistic regression is unable

0:22:56 | to improve

0:22:59 | the performance any further.

0:23:01 | we remark that the orthonormal discriminative classifier is able to improve performance in terms of

0:23:08 | equal error rate

0:23:09 | and minimum detection cost,

0:23:12 | for the male set and even more for the female set.

0:23:17 | note that, to take into account the distortions in the detection cost region of

0:23:22 | low false alarm rates,

0:23:24 | it is possible to train it only on the trials providing the highest

0:23:32 | scores — the non-target trials providing the highest scores:

0:23:38 | keeping the densest and highest non-target

0:23:42 | trial scores,

0:23:44 | we trained the classifier

0:23:50 | with only a subset of the non-target set.

0:23:56 | we also tested on the recent Speakers in the Wild

0:24:01 | evaluation, which is a good way to assess the robustness of an approach,

0:24:04 | because the conditions are not controlled,

0:24:08 | with reverberation, noise, short durations, and mixing of

0:24:12 | male and female.

0:24:14 | we can see that this orthonormal discriminative classifier

0:24:19 | is able to slightly improve the performance of PLDA.

0:24:26 | note that the results indicated

0:24:32 | on the official scoreboard are better than ours,

0:24:39 | because we did

0:24:40 | not correctly calibrate

0:24:45 | our scores on the development set,

0:24:48 | and so the results differ between the

0:24:51 | two versions.

0:24:54 | future works: we are working on short-duration utterances, where the method

0:24:59 | is able to improve slightly —

0:25:02 | sometimes more —

0:25:03 | over the PLDA baseline,

0:25:06 | in particular when the estimation of the speaker variability is not very accurate,

0:25:14 | as is the case for short durations;

0:25:16 | and also on i-vector-like representations:

0:25:24 | following works which propose

0:25:27 | to extract low-dimensional speaker factors for speaker diarization

0:25:32 | by using deep neural networks,

0:25:35 | we showed that this PLDA framework is able to deal with

0:25:42 | these new representations

0:25:45 | and with session mismatch.

0:25:50 | thank you |