0:00:16 Hello. I am a researcher at the Computer Science Institute, which is affiliated with CONICET and with the University of Buenos Aires. The work I'm going to talk about today was done in collaboration with Mitchell McLaren from the STAR Lab at SRI International.
0:00:35 So let me start by describing one of the most standard speaker verification pipelines these days. The pipeline is composed of three stages. First we have the speaker embedding extractor, which is meant to transform the two input signals of a trial into fixed-length vectors, x_1 and x_2 here. Then we have a stage that does LDA, followed by mean and variance normalization, and then length normalization. The resulting vectors x_1 and x_2 are then processed by a PLDA stage, which computes a score for the trial, which can then be thresholded to make the final decision.
0:01:14 The PLDA scores are computed as LLRs, log-likelihood ratios, under certain Gaussian assumptions. The form of the LLR is this: it is the logarithm of the ratio between two probabilities, the probability of the two inputs given that the speakers are the same, and the probability of the inputs given that the speakers are different. Given the Gaussian assumptions in PLDA, this LLR can be computed with a closed form, which is a polynomial in x_1 and x_2; you can find the formula in the paper.
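For reference, the quantity in question is, in standard notation (the exact parameterization in the paper may differ):

    LLR(x_1, x_2) = log [ p(x_1, x_2 | same speaker) / p(x_1, x_2 | different speakers) ]

and under Gaussian PLDA it reduces to the well-known second-order polynomial form

    LLR(x_1, x_2) = 2 x_1^T Λ x_2 + x_1^T Γ x_1 + x_2^T Γ x_2 + c^T (x_1 + x_2) + k,

where the matrices Λ and Γ, the vector c, and the constant k are all derived from the PLDA model parameters.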
0:01:51 Now, the problem is that, in most cases, what comes out of PLDA are scores that are very miscalibrated. This means that the values we compute as LLRs actually are not LLRs. The cause of this mismatch is that the assumptions made by PLDA do not really match the real data.
0:02:16 Miscalibrated scores have the problem that they have no probabilistic interpretation. This means that we cannot use their absolute values; we can only use them relative to each other. So we could rank trials, for example, but we cannot interpret the scores themselves. Let's say that you get a score of minus one from a certain system for a certain trial. You would only be able to tell what this minus one means after you have seen a distribution of scores from some development data that has gone through the system. Once you see this distribution, you can interpret this minus one properly, and you could actually threshold the score and decide whether this is a target trial.
0:03:10 Okay, so we would like scores to be well calibrated, because then they have this nice property that they are LLRs, so we can interpret their values, and we can also use Bayes' rule to make a decision on the threshold without having to see any development data.
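For reference, this is the standard detection-theory result being invoked: for a well-calibrated LLR, the Bayes-optimal decision threshold depends only on the prior and the costs, not on any development data:

    θ = log [ C_fa (1 - P_tar) / (C_miss P_tar) ],

where P_tar is the prior probability of a target trial and C_miss and C_fa are the costs assigned to misses and false alarms.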
0:03:30 Calibration is generally done with an affine transformation that is trained using logistic regression. Let's say you know that your scores are miscalibrated. What you do then is train this alpha and beta, which are the two parameters of the affine transformation, so that they minimize the cross-entropy, which is the logistic regression objective function, and then you get properly calibrated LLRs at the output.
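As a minimal sketch of this step (my own illustration, not the paper's code; it assumes the scores s and labels y are available as NumPy arrays, with y = 1 for target trials and 0 for non-target trials):

    import numpy as np
    from scipy.optimize import minimize

    def cross_entropy(params, s, y):
        alpha, beta = params
        llr = alpha * s + beta
        # log(1 + exp(-llr)) for targets, log(1 + exp(llr)) for non-targets
        return np.mean(y * np.logaddexp(0.0, -llr) + (1 - y) * np.logaddexp(0.0, llr))

    def train_calibration(s, y):
        # fit alpha and beta by minimizing the binary cross-entropy
        result = minimize(cross_entropy, x0=[1.0, 0.0], args=(s, y))
        return result.x  # alpha, beta

In practice, a prior-weighted version of this cross-entropy is often used so that the calibration targets a particular effective prior.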
0:04:02 Okay, so basically what this means is that we take the pipeline we had and we just add one stage: the global calibration.
0:04:12 Now, the problem is that this doesn't really solve the problem in general. We are only solving the calibration problem for the exact set on which we trained the calibration parameters. If the calibration set doesn't match our test set, then we will still have a calibration problem.
0:04:36 These results illustrate this. I will explain the test sets later; for now, what is important is that I am showing three different PLDA systems that are identical up to the calibration stage. What differs is which training data was used to train the calibration parameters; those are the red bars. What is important here is to compare the height of each bar, which is the actual Cllr for that system, with the black line, which is the min Cllr for that system. If the difference between the two is small, it means the system is well calibrated; if it is big, it means it is not well calibrated.
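For reference, the metric shown is the standard Cllr, the cross-entropy between the labels and the scores interpreted as LLRs:

    Cllr = 1/2 [ (1/N_tar) Σ_targets log2(1 + e^{-s_i}) + (1/N_non) Σ_non-targets log2(1 + e^{s_j}) ],

and min Cllr is the value obtained after optimally recalibrating the scores with a monotonic transformation, so the gap between the two measures the calibration loss.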
0:05:28 What we see here is that the performance, the actual Cllr, is very sensitive to which set was used to train the calibration model. For example, the Vox plus SRE plus Switchboard set, which is mostly Vox in this case, matches the Speakers in the Wild dataset very well, so it gives very good calibration there, but horrible calibration for SRE. Similarly, the RATS data is a very good match for LASRS, but is not so good for SRE16. So basically this means we cannot get a single global calibration model that will work well across the board.
0:06:14 Alright, so the goal of this work is to design a system that does not require this recalibration for every new condition. It is quite an ambitious goal: we basically want a speaker verification system that can be used out of the box, without needing a development dataset for the conditions of interest.
0:06:34 Okay, so going back to the pipeline: the standard approach in the pipeline I showed is to train each of the stages separately, each one on the output data that comes out of the previous stage, and with different objectives. The first one, the speaker embedding extractor, is trained with a speaker classification objective; the LDA and the PLDA are trained to maximize the likelihood; and finally, the calibration stage is trained to optimize binary cross-entropy, which is a speaker verification objective.
0:07:19 Now, one simple thing we can do is just integrate the three stages into one model. We may think this is some solution to the calibration problem, and that it may actually solve our issue of needing calibration across conditions. What we do is basically keep the exact same functional form as in the standard pipeline, but instead of training the stages separately with different objectives, we train them jointly using stochastic gradient descent. For this, of course, we need mini-batches that are made of trials rather than samples. What we do is randomly select speakers, select two samples for each speaker, and then from that list of samples create all the possible trials, all the pairs among all the samples. With that we can compute the binary cross-entropy, and we optimize that.
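A minimal sketch of this batch construction (my own illustration under the assumptions above, not the paper's code; embeddings_by_speaker is assumed to map each speaker id to a list of embedding vectors):

    import itertools
    import random
    import numpy as np

    def make_trial_batch(embeddings_by_speaker, num_speakers):
        # randomly select speakers and take two samples from each
        speakers = random.sample(list(embeddings_by_speaker), num_speakers)
        samples = []
        for spk in speakers:
            samples += [(spk, e) for e in random.sample(embeddings_by_speaker[spk], 2)]
        # form all possible trials among the selected samples
        x1, x2, labels = [], [], []
        for (spk_a, emb_a), (spk_b, emb_b) in itertools.combinations(samples, 2):
            x1.append(emb_a)
            x2.append(emb_b)
            labels.append(1 if spk_a == spk_b else 0)  # 1 = same-speaker trial
        return np.stack(x1), np.stack(x2), np.array(labels)

The binary cross-entropy from the calibration sketch above can then be computed over the scores of all trials in the batch.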
0:08:31 This is not the first time something like this has been proposed, of course. Burget and others did something very similar years ago; at the time, they actually trained the backend with an SVM or with linear logistic regression instead of stochastic gradient descent, but the concept is basically the same. More recently, there have been a few papers on end-to-end speaker verification that use some flavor of this idea, where they train a backend, which usually has a very similar form to this one, discriminatively; the references are in the paper. These papers actually report improved discrimination performance, but they don't usually report calibration performance, which is what we care about in this work. And what we actually found in our previous paper is that this approach of just training the PLDA backend discriminatively is not sufficient to get good calibration across conditions. We know that from our previous papers, so this architecture trained jointly is not enough.
0:09:55 So what is the problem with this basic form? As we showed before, the calibration stage is a global one, the same as in the standard pipeline, and it seems that this does not give the model enough flexibility to adapt to the different conditions in the data. Even if you train the model with a lot of different conditions, it will just adapt to the majority condition.
0:10:27 So what we propose is to add a branch to this model. We keep the speaker verification branch the same, and we add a branch that is in charge of computing the calibration parameters as a function of both input vectors, x_1 and x_2. The form of this branch starts the same as the top one: an affine transformation followed by length normalization; of course, the parameters of this affine transformation are different from the ones in the top branch. Then we do dimensionality reduction; we go to a very low dimension (in the paper we use dimension five) to compute these vectors, which we call side-information vectors. Then we use these vectors to compute an alpha and a beta, using a very simple form which is quite similar to the PLDA form above. So, in the end, we have two branches: one is in charge of computing the score, and the other one is in charge of computing the calibration parameters for each trial.
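Schematically, and using my own notation rather than the paper's, the output of the combined model has the form

    llr(x_1, x_2) = α(q_1, q_2) · s(x_1, x_2) + β(q_1, q_2),

where s is the raw score from the speaker branch, q_1 and q_2 are the low-dimensional side-information vectors extracted by the new branch, and α and β are computed from q_1 and q_2 with a simple second-order form similar to the PLDA score.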
0:11:40 I'll show the results now, so let me talk about the data. We have a whole lot of training data. We used VoxCeleb 1 and 2; SRE data, that is, Speaker Recognition Evaluation data from 2005 to 2012, plus Mixer 6; and Switchboard. All of that data is actually shared with the embedding extractor training data; we just used half of what we used for embedding extractor training, simply for expediency in the experimentation. Then we have two more sets: RATS, which is telephone data in several non-English languages, and FVC Australian, which is forensic voice comparison data, a very clean dataset, studio microphone, in Australian English.

0:12:34 For testing we use SRE16, SRE18, Speakers in the Wild, LASRS, which is a bilingual set recorded over several different microphones, and the Chinese version of the forensic voice comparison data. The recording conditions of these last two sets are very similar, but the language is different. For three of these sets, SRE16, SRE18, and Speakers in the Wild, we also use the development part; with that we do all the parameter tuning, and we choose the best iteration for each of the models, things like that.
0:13:18 Okay, so here are the results. The red bars are the same ones as in the previous figure I showed, and I added the blue bar, which is the system we propose. As you can see, in most cases it is as good as or better than the best global calibration model. So we basically achieved what we wanted, which is to have a single model that adapts to the test conditions without us telling it what the test conditions are.
0:13:56 The only exception is this FVC Chinese case, which is not well calibrated at all. In fact, there is one global PLDA model that is better than the one we propose; it is still bad, but it is better than ours. The problem with that set is basically that it is a condition that is not seen in combination during training: we have clean data in training, but it is not in Chinese, and we have Chinese data in training, but it is not clean. So the model does not seem to be able to learn how to properly calibrate that data, unfortunately. This just means there is still work to be done; we have not fully achieved that ambitious goal I mentioned before, which was to have a completely general, out-of-the-box system.
0:14:54 Okay, so before finishing, I'd like to describe a few details of how this model is trained, because they are essential to getting good performance. One important thing is to do a non-random initialization. What we do, and many of the end-to-end training papers do similar things, is initialize the speaker branch with the parameters of a standard PLDA baseline; that is very standard. Then, for the side-information branch, we initialize its first stage with the bottom components of the LDA transform that we trained for the speaker branch. That means that what comes out of here is basically the worst you could do for speaker ID, which should be close to the best you can do for condition ID; we are trying to extract the condition information from the input. Then this matrix here, which does not have any reasonable default value, we just initialize randomly. And these two components here we initialize so that what comes out of here are the global calibration parameters at the first iteration of training. So basically, at initialization, the scores that come out of this model are the same ones that would come out of a standard PLDA pipeline.
0:16:28 Here are the results comparing three different initialization approaches: random; then one called partial, which means what I described before but without initializing this stage with the LDA bottom components, using random values there instead; and then the blue one, which is the full initialization. The blue one is the best of the three, so it means it is worth the trouble to take the time to find initial parameters in this smart way.
0:17:04 Another important thing is that we train the model in two stages. The first stage uses all the training data to train all the parameters. Then, in the second stage, we freeze the LDA and PLDA blocks and train only the rest of the parameters using domain-balanced data. This is important because, if the data is not balanced, most of the trials in any given batch would come from one domain, and we would just be optimizing things for that domain, the one that has more samples.
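A minimal sketch of one way to do this balancing (my own assumption of an implementation, not the paper's code):

    import random

    def sample_speakers_balanced(speakers_by_domain, speakers_per_domain):
        # speakers_by_domain: dict mapping domain name -> list of speaker ids
        # draw the same number of speakers from every domain so that no
        # single domain dominates the trials in the batch
        batch = []
        for domain, speakers in speakers_by_domain.items():
            batch += random.sample(speakers, speakers_per_domain)
        return batch

The resulting speaker list can then be fed to the trial-batch construction sketched earlier.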
0:17:42 Finally, the convergence of the model is kind of a big issue. The validation performance jumps around a lot from batch to batch; if you look at the optimization curve, it can change significantly from one batch to the next. So what we do is basically choose the best iteration using the validation sets that I mentioned before, and the good thing is that this approach seems to generalize well to other sets, even to sets that are not very well matched to the validation sets. We tried a bunch of tricks to smooth out the validation performance, like regularization and a slower learning rate, and they do succeed in smoothing out the validation curve, but they actually make the minimum worse. So we keep the wild original curves and just choose the minimum.
0:18:38 And, well, there is a GitHub repository with exactly this model implemented, for training. For evaluation, you just need to have pre-computed embeddings, and we provide an example with embeddings. Feel free to use it or modify it, and let me know if you find bugs. I will be happy to respond to questions and comments.
0:19:05 Okay, so, in conclusion: we developed a model that achieves excellent performance across a wide variety of conditions. It integrates the different stages of a speaker verification pipeline into one stage and trains the whole thing jointly. It also integrates an automatic extractor of side information, which it then uses to condition the calibration parameters, and this achieves our goal of getting good performance across different conditions. Of course, there are many open issues, like, for example, the training convergence; I do not think we are done with that, and I would like to see an easier-to-optimize model. And of course, we would like to plug this model in with the embedding extractor and train everything jointly.
0:19:56 Okay, thank you very much. If you have any questions, please write to me, and we can discuss in more detail through the conference platform. Thank you.