Okay, this is going to be somewhat technical. The topic is uncertainty modeling in text-dependent speaker recognition. One of the issues that I'm concerned with here is what to do in a speaker recognition context where you have very little data, so that the features you extract are necessarily going to be noisy in the statistical sense. This comes straight to the fore in text-dependent speaker recognition, where you may have just two seconds of data. It's also an important problem in text-independent speaker recognition, because of the need to be able to set a uniform threshold even in cases where your test utterances are of variable duration. It will be interesting to see what happens with that particular problem in the forthcoming NIST evaluation.
Some progress has been made with subspace methods: with i-vectors, you try to quantify the statistical noise in the i-vector extraction process and feed that into the PLDA model. But I've taken that possibility off the table for present purposes and said, look, subspace methods in general are not going to work in text-dependent speaker recognition because of the data distribution. Okay, so what I attempted to do was to tackle this problem of modeling speaker variability in the case where one is not able to characterize it by subspace methods.
I realized while preparing the presentation that the paper in the proceedings is very dense; it's rather difficult to read. But the idea, although it's a bit tricky, is fairly simple. So I have made an effort in the slides to communicate the core idea, and if you are interested then I recommend that you look at the slides rather than the paper; I posted the slides on my web page.
For this task we took RSR2015 Part III, that's the random digit portion of the RSR data. I'll just mention two things about this. Because of the design, you have five random digits at test time, while all ten digits were repeated three times at enrollment in random order, so you only see half of the digits at test time. And it actually turns out that under those conditions GMM methods have an advantage, because you can use all of your enrollment data no matter what the test utterance is, whereas if you pre-segment the data into digits you are constraining yourself to using only the enrollment data that corresponds to the digits that actually occur in the test utterance.
One other thing I should mention is that this paper is about the back end. We used a standard sixty-dimensional PLP front end, which is maybe not ideal; I think that will come up in the next talk. You can get much better results on female speakers if you use a low-dimensional front end, which I think others were the first to discover.
So the model that I was using here is a JFA model which uses low-dimensional hidden variables to characterize channel effects, but does not attempt to characterize speakers using subspace methods. The z vector that characterizes the speaker is used as a feature vector for speaker recognition.
And the problem I wanted to address was to design a back end that would take account of the fact that the number of observations available to estimate the components of this vector is very small: in general you have less than one frame per mixture component. If you have a two-second utterance and a UBM with five hundred and twelve Gaussians, a quick calculation shows that you have extremely sparse data.
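To make the sparsity concrete, here is the rough calculation, assuming the usual 10 ms frame shift (the frame rate is my assumption, not stated in the talk):

```python
# Back-of-the-envelope sparsity check (assuming a 10 ms frame shift)
frames = 2.0 / 0.010           # ~200 frames in a 2-second utterance
ubm_size = 512                 # number of UBM Gaussians
print(frames / ubm_size)       # ~0.39 frames per Gaussian on average
```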
So there are two back ends that I'll present. One, the joint density back end, uses point estimates of the features that are extracted at enrollment time and at test time, and models the correlation between the two in order to construct a likelihood ratio. The innovation in this paper, the hidden supervector back end, treats those two feature vectors as hidden variables, as in the original formulation of JFA. The key ingredient is to supply a prior distribution on the correlations between those hidden variables.
How much time do I have left? Sorry, I didn't catch that. How much? Okay, good. So let me just digress for a minute.
The way uncertainty modeling is usually tackled in text-independent speaker recognition is that you try to characterize the uncertainty in a point estimate of an i-vector using a posterior covariance matrix that is calculated from the zero-order statistics, and you do this on the enrollment side and on the test side independently. If you think about it, you realize that this isn't quite the right way to do it. The reason is that if you are hypothesizing a target trial, then what you see on the test side has to be highly correlated with what you see on the enrollment side; they are not statistically independent. And there has to be a benefit from using those correlations to quantify the uncertainty in the feature that comes out of the test utterance. There is something called the law of total variance that says, on average, when you condition one random variable on another you reduce the variance.
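For reference, the law of total variance being appealed to can be written as follows; the conditional variance is smaller on average than the unconditional one:

```latex
\operatorname{Var}(X)
= \underbrace{\mathbb{E}\!\left[\operatorname{Var}(X \mid Y)\right]}_{\text{expected residual uncertainty}}
+ \operatorname{Var}\!\left(\mathbb{E}[X \mid Y]\right)
\;\;\ge\;\; \mathbb{E}\!\left[\operatorname{Var}(X \mid Y)\right].
```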
Okay, so the critical thing that I introduced in this paper is this correlation between the enrollment side and the test side.
Okay, so here are the mechanics of how the joint density back end works; it's pretty straightforward. The features are point estimates. This was inspired by Cumani's work at the last Odyssey: he implemented this at the level of i-vectors, and there's nothing to stop us from doing it at the level of supervectors as well. You obviously can't train correlation matrices of supervector dimension, but you can implement the idea at the level of individual mixture components. So that gives you a trainable back end for text-dependent speaker recognition even if you can't use subspace methods. That's our best back end, and it's the one that I used as a benchmark for our experiments.
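A minimal sketch of per-component joint density scoring in this spirit; the variable names and the Gaussian parametrization below are illustrative assumptions, not the paper's notation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_density_llr(e, t, mu, C_tar, C_non):
    """Joint-density score for one mixture component.

    e, t   : point estimates of the enrollment and test features (dim d each)
    mu     : mean of the stacked vector [e; t] (dim 2d), trained on target trials
    C_tar  : 2d x 2d covariance with trained enrollment/test cross-correlations
    C_non  : the same covariance with the cross blocks zeroed (independence)
    """
    x = np.concatenate([e, t])
    return (multivariate_normal.logpdf(x, mu, C_tar)
            - multivariate_normal.logpdf(x, mu, C_non))

# The trial score is the sum of the per-component log likelihood ratios.
```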
So the hidden supervector back end is the hidden version of this. It says: you are in a position to observe Baum-Welch statistics, but you are not in a position to observe these z factors; you have to make inferences about the posterior distribution of those features and base your likelihood ratio on that calculation. Now, it turns out that the probability calculations are formally, mathematically equivalent to calculations with an i-vector extractor that has just two Gaussians in it. Take a mixture component from the UBM: you observe that mixture component once on the enrollment side and once on the test side, so you have two hidden Gaussians. You have a variable number of observations on the enrollment side and a variable number of observations on the test side, and that's exactly the type of situation that we model with an i-vector extractor. So there is an i-vector extractor here, but it's only being used to do probability calculations; it is not going to be used to extract features.
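In symbols, the per-component construction for a trial looks like the following; the notation is illustrative rather than the paper's, with N and F the occupation counts and centered first-order statistics and the prior on the stacked pair carrying the enrollment/test correlations:

```latex
z_c = \begin{pmatrix} z_{e,c} \\ z_{t,c} \end{pmatrix} \sim \mathcal{N}(\mu_{0,c}, \Sigma_{0,c}),
\qquad
F_{e,c} \mid z_c \sim \mathcal{N}\!\left(N_{e,c}\, z_{e,c},\; N_{e,c}\, \Sigma_c\right),
\qquad
F_{t,c} \mid z_c \sim \mathcal{N}\!\left(N_{t,c}\, z_{t,c},\; N_{t,c}\, \Sigma_c\right).
```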
Now, one thing about this i-vector extractor is that you are not going to use it to impose subspace constraints, because it has just the two Gaussians. You don't need to say that those two Gaussians lie in a low-dimensional subspace of the supervector space. So you might as well take the total variability matrix to be the identity matrix and shift all of the burden of modeling the data onto the prior distribution.
In i-vector modeling we always take a standard normal prior, zero mean and identity covariance matrix. That's because there is in fact no generality to be gained by using a non-standard prior: you can always compensate for a non-standard prior by fiddling with the total variability matrix. Here we take the total variability matrix to be the identity, but we have to train the prior. That involves doing posterior calculations: if you look at those formulas you'll see they look just like the standard ones, except that I now have a mean and a precision matrix, which would be zero and the identity matrix in the case of the standard normal prior.
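For concreteness, this is the standard i-vector posterior algebra with a general Gaussian prior N(mu_0, P_0^{-1}) plugged in (my notation); setting mu_0 = 0 and P_0 = I recovers the usual expressions, and in the present setting the total variability blocks T_c are the identity:

```latex
L = P_0 + \sum_{c} N_c\, T_c^{\top} \Sigma_c^{-1} T_c,
\qquad
\langle z \rangle = L^{-1}\Big( P_0\, \mu_0 + \sum_{c} T_c^{\top} \Sigma_c^{-1} F_c \Big),
\qquad
\operatorname{Cov}(z \mid \text{stats}) = L^{-1}.
```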
And you can do minimum divergence estimation, which is in effect a way of training the prior: if you think about what minimum divergence estimation actually does, you see that what you are doing is estimating a prior. Normally we then say, well, there's no gain in using a non-standard prior, so we standardize the prior and modify the total variability matrix instead. Here we just estimate the prior, and I put "estimate" in inverted commas, because estimating a prior is not something you're really supposed to do, but we do it all the time and it works.
So how would you train this? You have to organize your training data into target trials. For each trial and for each mixture component in the UBM, you would have observations on the enrollment side and on the test side, possibly multiple observations, so you have Baum-Welch statistics. Then you just run this minimum divergence estimation procedure, and you get a prior distribution that tells you what correlations to expect between the enrollment data and the test data in the case of a target trial.
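A sketch of the minimum divergence update applied to the stacked enrollment/test hidden variable, under the assumption that the per-trial posteriors have already been computed with the current prior (in practice one alternates this step with the posterior computation):

```python
import numpy as np

def minimum_divergence_prior(post_means, post_covs):
    """Re-estimate the prior N(mu0, Sigma0) from per-trial posteriors.

    post_means : (n_trials, d) posterior means of the stacked
                 enrollment/test hidden variables for target trials
    post_covs  : (n_trials, d, d) corresponding posterior covariances
    """
    mu0 = post_means.mean(axis=0)
    centered = post_means - mu0
    # Prior covariance = average posterior covariance + scatter of posterior means
    sigma0 = post_covs.mean(axis=0) + (centered.T @ centered) / len(post_means)
    return mu0, sigma0
```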
If you want to handle non-target trials, you just impose a statistical independence assumption: you zero out the correlations.
Okay, so the way you would use this machinery to calculate a likelihood ratio is that, given enrollment data and test data, you calculate the evidence. This is just the likelihood of the data that you get when you integrate out the hidden variables. It's not usually done, but I think everybody who has an implementation of i-vectors should always calculate the evidence, because it's a very good diagnostic that tells you whether your implementation is correct. You have to evaluate an integral, a Gaussian integral, and the answer can be expressed in closed form in terms of the Baum-Welch statistics, as in the paper. In order to use this for speaker recognition, you evaluate the evidence in two different ways, once with the prior for target trials and once with the prior for non-target trials; you take the ratio of the two, and that gives you your likelihood ratio for speaker recognition.
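A minimal per-component sketch of this evidence-based score, assuming the two-Gaussian construction above with the total variability matrix fixed to the identity; the paper works in closed form directly with the Baum-Welch statistics and sums the per-component scores over the UBM, whereas this just spells the Gaussian marginal out with numpy/scipy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0):
    """Log evidence for one UBM component (assumes both counts are nonzero).

    F_e, F_t    : centered first-order statistics (dim d) on each side
    N_e, N_t    : occupation counts on each side
    Sigma       : UBM covariance of the component (d x d)
    mu0, Sigma0 : prior mean (2d) and covariance (2d x 2d) of [z_e; z_t]
    """
    d = len(F_e)
    A = np.block([[N_e * np.eye(d), np.zeros((d, d))],
                  [np.zeros((d, d)), N_t * np.eye(d)]])
    noise = np.block([[N_e * Sigma, np.zeros((d, d))],
                      [np.zeros((d, d)), N_t * Sigma]])
    # Marginal of [F_e; F_t] after integrating out the hidden variable
    return multivariate_normal.logpdf(np.concatenate([F_e, F_t]),
                                      A @ mu0, A @ Sigma0 @ A.T + noise)

def trial_llr(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0_target):
    # Non-target prior: same marginals, cross-correlations zeroed out
    d = len(F_e)
    Sigma0_non = Sigma0_target.copy()
    Sigma0_non[:d, d:] = 0.0
    Sigma0_non[d:, :d] = 0.0
    return (log_evidence(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0_target)
            - log_evidence(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0_non))
```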
So the mechanics of getting this to work depend critically on how you prepare the Baum-Welch statistics that summarize the enrollment data and the test data. The first thing you need to do concerns the enrollment utterances: each of those is potentially contaminated by channel effects, so you take the raw Baum-Welch statistics and filter out the channel effects, just using the JFA model. In that way you get a set of synthetic Baum-Welch statistics which characterizes the speaker: you pool the Baum-Welch statistics together after you have filtered out the channel effects. You do that on the enrollment side, you do the same thing on the test side, and in a trial you end up having to compare one set of Baum-Welch statistics with another using this hidden supervector back end.
And here's a new wrinkle that really makes this work. The Achilles heel of JFA models is the Gaussian assumption, and the reason why we do length normalization in between extracting i-vectors and feeding them to PLDA is in order to patch up the Gaussian assumptions. We have to do a similar trick here, but the normalization is a bit tricky, because you have to normalize Baum-Welch statistics; you are not normalizing a vector. Obviously the magnitude of the first-order statistics is going to depend on the zero-order statistics, so it's not immediately obvious what to normalize. The recipe that I used comes from going back to the JFA model and seeing how the JFA model is trained: these z vectors are treated as hidden variables that come with both a point estimate and an uncertainty, a posterior covariance matrix that tells you how much the observations tell you about the underlying hidden vector. The thing that turns out to be convenient to normalize is the expected norm of that hidden variable: instead of making the norm equal to one, you make the expected norm equal to one.
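A sketch of the quantity being normalized, under an assumed relevance-MAP style parametrization of the hidden z-vector (D squared equal to the UBM variances divided by the relevance factor); this parametrization and the variable names are my assumptions, and the exact rescaling of the statistics is the one given in the paper:

```python
import numpy as np

def expected_sq_norm(N, F, sigma, r):
    """E[||z||^2] of the hidden z-vector given Baum-Welch statistics.

    N     : (C,) occupation counts per UBM component
    F     : (C, d) centered first-order statistics
    sigma : (C, d) diagonal UBM covariances
    r     : relevance factor (assumed parametrization: D^2 = sigma / r)
    """
    prec = 1.0 + N / r                                # diagonal posterior precision per component
    mean = (F / np.sqrt(r * sigma)) / prec[:, None]   # posterior mean of z
    trace = (F.shape[1] / prec).sum()                 # trace of the posterior covariance (the dominant term)
    return (mean ** 2).sum() + trace

# The statistics are then rescaled so that this quantity equals one;
# the precise rescaling of the first-order statistics is spelled out in the paper.
```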
A curious thing is that the second term, the trace of the posterior covariance matrix, is actually the dominant term. That's because the uncertainty is so large, and there is an experiment in the paper that shows that you had better not neglect that term. As for the role of the relevance factor in the experiments reported in the paper: as you fiddle with the relevance factor you are actually fiddling with the relative magnitude of this term, so you do have to sweep over a range of relevance factors in order to get this thing working properly.
Okay, so here are some results using what I call global feature vectors; that's where we don't bother to pre-segment the data into digits. Remember, I said at the beginning that there was an advantage on this task to not segmenting, in other words to just ignoring the left-to-right structure that you're given in the problem.
So there is a GMM/UBM benchmark, a joint density benchmark, and two versions of the hidden supervector back end, one without length normalization applied to the Baum-Welch statistics and the other with it. You can see that the length normalization really is the key to getting this thing to work. I should also mention the reduction in error rate on the female side, from eight percent to six percent: that appears to have to do with the front end. We fixed a standard front end for these experiments, but it appears that if you use lower-dimensional feature vectors for female speakers you get better results, and I think that's the explanation.
There's actually a fairly big improvement if you go from one hundred and twenty-eight Gaussians to five hundred and twelve, even though the uncertainty in the case of five hundred and twelve is necessarily going to be greater. It was this phenomenon that originally motivated us to look at the uncertainty modeling problem.
You can also implement this if you pre-segment the data into digits and extract what I call local z-vectors in the paper, and it works in that case as well. There is a trick that we use here which I refer to as component fusion: you can break the likelihood ratio up into contributions from the individual Gaussians and weight them, where the weights are calculated using logistic regression. That helps quite a lot.
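A sketch of that component fusion step, using scikit-learn's logistic regression as an illustrative choice of trainer; the per-component log likelihood ratios of each trial are treated as a feature vector and the fusion weights are learned on a development set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels, C=1.0):
    """dev_scores: (n_trials, n_components) per-component LLRs on the dev set.
    dev_labels: (n_trials,) 1 for target trials, 0 for non-target trials."""
    return LogisticRegression(C=C).fit(dev_scores, dev_labels)  # regularized

def fused_score(model, eval_scores):
    # Weighted combination of the per-component scores (log-odds of the target class)
    return model.decision_function(eval_scores)
```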
It requires, however, that you have a development set in order to choose the fusion weights. So in fact you'll see in the paper that with these local z-vectors, although we did obtain an improvement on the evaluation set, it was not as big an improvement as we obtained on the development set. You need data if you're going to use regularized logistic regression.
We found a way around that. Instead of pre-segmenting the data into individual digits, we used a speech recognition system to collect the Baum-Welch statistics. I mean, if in text-independent speaker recognition you can use a senone-discriminant neural network to collect Baum-Welch statistics, the obvious thing to do in text-dependent speaker recognition, where you know the phonetic transcription, is just to use a speech recognizer to collect the Baum-Welch statistics. Because individual senones are very unlikely to occur more than once in a digit string, you are implicitly imposing a left-to-right structure, so you don't have to do it explicitly. And that works just as well.
Okay, so there are some fusion results for the two approaches, with and without paying attention to the left-to-right structure. If you fuse them, you do get better results.
Okay, so just to summarize, one thing that I didn't dwell on is that this can be implemented very efficiently: you can basically set things up in such a way that the linear algebra that needs to be performed at runtime involves only diagonal matrices. So it's nothing like the i-vector back end that I presented at the last Interspeech conference, which was just a trial run for this and wasn't intended to be a realistic solution to the problem; that one involves essentially extracting an i-vector per trial, which is not something you would normally do. This, on the other hand, is computationally very reasonable, so it is practical. Okay, that's all I have to say. Thank you.
Okay, we have time for questions.
You are normalizing the channel effects in the Baum-Welch statistics; do you also normalize for the phoneme variability there as well?
Well, as I said, this is future work that I intend to do something about, and I think it is a problem we should pay attention to. We have done some preliminary work on it, but it's not something we can report here. The phonetics is really nailed down for you in text-dependent speaker recognition, so it's not so much of an issue as it is in text-independent speaker recognition, where it's really going to come from a neural network that's trained to discriminate senones.
Can I ask one more question: are you using a single channel point estimate?
That's right; that's what the recipe calls for, although I think it could somehow be refined. It's what the JFA recipe calls for: even though the channel variables are treated as hidden variables that have a posterior expectation and a posterior covariance matrix, if you look at the role that they play in the likelihood, and if you are merely interested in filtering out the channel effects, it turns out that all you need is the posterior expectation.
That's just what the model says; the model is very simple. Okay, thank you.