Alright, the next talk this morning is about multiclass discriminative training of i-vector language recognition, by Alan McCree from Johns Hopkins University.
I'd like to acknowledge some interesting discussions during this work with my current colleague Daniel, my previous colleagues Doug and Elliot, Doug and Pedro, and more recently with Niko.
So, as an introduction: as you know, and I think we had one discussion about this this morning, language ID using i-vectors is the state-of-the-art system. What I want to talk about are some particular aspects of it. It is typically done as a two-stage process: even after we've got the i-vectors, first we build a classifier, and then we separately build a backend which does the calibration and perhaps fusion as well. So I want to talk about two aspects that are a little different from that. First, what if we try to have one system that does the discrimination, the classification, and the calibration at once, using discriminative training? Nobody ever said we have to use two systems back to back; what if we do it all together? And secondly, I want to talk about an open-set extension to what is usually a closed-set language recognition task.
So in the talk I will start with a description of the Gaussian model in the i-vector space. It's something many of you have seen before, but I need to talk about some particular aspects of it in order to get into the details here. I'll also talk about how that relates to the open-set case, and there I'll go into some of the Bayesian machinery that we use in speaker recognition, how it could or couldn't be relevant in language recognition, and what the differences are. Then I will talk about the two key things here: the discriminative training that I'm using, which in particular is based on MMI, and then how I do the out-of-set model.
So, as a signal processing guy, I like to think of this as an additive Gaussian noise model; in signal processing this is one of the most basic things we see. In this context, what we're saying is that the observed i-vector was generated from a language, so it should look like that language's mean vector, but it's corrupted by additive Gaussian noise, which we typically call a channel for lack of a better word. From a pattern recognition point of view, we have an unknown mean for each of our classes, and we have a channel which is Gaussian and looks the same for all of the classes. That means our classifier is a shared-covariance Gaussian model: each language model is described by its mean, and the shared covariance is the channel, or within-class, covariance.
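As a hedged sketch of the model just described, in notation added here rather than taken from the slides: an observed i-vector $x$ from language $\ell$ is

$$ x = \mu_\ell + n, \qquad n \sim \mathcal{N}(0, W), \qquad\text{so}\qquad p(x \mid \ell) = \mathcal{N}(x;\, \mu_\ell,\, W), $$

where $\mu_\ell$ is the language mean and $W$ is the shared within-class (channel) covariance.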
So to build a language recognition system we need a training process and a scoring process. Training means we need to learn this shared within-class covariance, and then for each language we need to learn what its mean looks like. Testing is again this Gaussian scoring. And I guess, unlike some people, I am not particularly uncomfortable with closed-set detection. That gives you a sort of funny-looking form of Bayes' rule: if the target is this class, then the numerator is just the likelihood of this class, and that's easy. But the non-target hypothesis means that it's one of the other classes, so you need some implicit prior over the distribution of the other classes, which, the way the evaluation is designed, can be a flat prior given that it is not the target.
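A hedged reconstruction of the closed-set detection score being described, with a flat prior over the $L-1$ non-target languages (notation added here):

$$ \mathrm{LR}_\ell(x) \;=\; \frac{\mathcal{N}(x;\, \mu_\ell,\, W)}{\dfrac{1}{L-1}\displaystyle\sum_{m \neq \ell} \mathcal{N}(x;\, \mu_m,\, W)}. $$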
So the key question for building a language model is how we estimate the mean. Estimating the mean of a Gaussian is not one of the most complicated things in statistics, but there are multiple ways to do it. The simplest thing is just to take the sample mean, maximum likelihood, and that's mainly what I'm going to end up using in this work, but I want to emphasize that there are other things you could do; in speaker recognition we do not do that, we do something more complicated. The next, more sophisticated, option is MAP adaptation, which we all know from GMM-UBMs and Doug's work. You can do that in this context as well; it's a very simple formula, which requires, however, that you have a second covariance matrix, which we can call the across-class covariance. That is the prior distribution of what all models could look like, in this case the distribution that the means are drawn from. And then finally, instead of taking a point estimate, you can go to a Bayesian approach where you don't actually estimate the mean for each class; you estimate the posterior distribution of the mean of each class given the training data for that class. In that case you keep that posterior distribution, and then you score with what's called the predictive distribution, which is a bigger, fatter Gaussian: it includes the within-class covariance, but it also has an additional term, which is the uncertainty from how little data you have seen for that particular class.
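A minimal numpy sketch of the three mean-estimation options just described (ML sample mean, the Gaussian MAP/posterior mean, and the Bayesian predictive covariance); the function and variable names are mine, not from the paper:

```python
import numpy as np

def class_mean_estimates(X, W, B, mu0):
    """X: (n, d) training i-vectors of one language.
    W: shared within-class (channel) covariance.
    B: across-class covariance (prior on class means), mu0: prior mean."""
    n = X.shape[0]
    xbar = X.mean(axis=0)                          # ML estimate: sample mean

    # Gaussian-Gaussian conjugacy: posterior of the class mean given n samples.
    # Its mode is the MAP point estimate; its covariance is the remaining uncertainty.
    Binv, Winv = np.linalg.inv(B), np.linalg.inv(W)
    post_cov = np.linalg.inv(Binv + n * Winv)
    post_mean = post_cov @ (Binv @ mu0 + n * Winv @ xbar)

    # Bayesian predictive distribution for a test i-vector:
    #   p(x | class data) = N(x; post_mean, W + post_cov)
    pred_cov = W + post_cov
    return xbar, post_mean, pred_cov
```

Pretending n = 1 regardless of the true count, as in the "speaker-recognition style" system mentioned a bit later in the talk, simply makes post_cov, and hence the predictive covariance, wider.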
One little trick that I only learned recently, though I wish I had learned it a lot sooner, was developed many years ago (there's a reference in the paper), and it's really handy for all these kinds of systems. Everybody knows you can whiten with one covariance matrix, that is, apply a linear transform to your data so that that covariance becomes the identity; the fact is you can do it for two covariances simultaneously, and since we have two, this is really helpful. I have the formulas in the paper; it's actually not very hard. You end up with a linear transform where the within-class covariance is the identity, which we're used to (WCCN, for example, accomplishes that), but the across-class covariance is also diagonal, and it's sorted so the most important dimensions come first. It's a nice global transformation. It means that you can do linear discriminant analysis, you can do dimension reduction easily in this space, just by keeping the most interesting dimensions, the first ones. It's also a reminder that when you say you do LDA in your system, you should be a little careful, because there are a number of ways to formulate LDA: they all give the same subspace, but they don't give the same transformation within that subspace, because that's not part of the criterion. And that's the case here: this gives the same subspace, but it's not the same linear transformation.
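A minimal numpy sketch of the simultaneous diagonalization of the two covariances (whiten the within-class covariance, then rotate to diagonalize the across-class covariance); this is a standard construction, and the exact formulas in the paper may differ:

```python
import numpy as np

def joint_diagonalize(W, B):
    """Return T with T @ W @ T.T = I and T @ B @ T.T diagonal,
    sorted so the most important (largest across-class) dimensions come first."""
    evals, evecs = np.linalg.eigh(W)
    P = evecs / np.sqrt(evals)          # whitening: P.T @ W @ P = I
    d, V = np.linalg.eigh(P.T @ B @ P)  # diagonalize across-class in whitened space
    order = np.argsort(d)[::-1]
    T = (P @ V[:, order]).T
    return T, d[order]
```

LDA-style dimension reduction is then just keeping the first k rows of T.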
So I'm going to show some experiments here. I'll start with some simple ones before getting to the discriminative training next. I'm using acoustic i-vectors; I think it was mentioned here already that the main things you do differently in a LID system are shifted delta cepstra and vocal tract length normalization, which you might not do for speaker. I'm going to present LRE11 because it's the most recent LRE, but as I kind of hinted, I'm not going to use pair detection, because I'm not a big fan of pair detection. So the metric is the overall metric, Cavg, but you get similar performance rankings with pair detection as well. And within LRE you build your own train and dev sets; these are Lincoln's training data sets, which are zero-mean.
So, back to the generative Gaussian models: I mentioned that you can do ML and these other things, ML, MAP, and Bayesian. I have a nice slide here with three things, but it's actually not those three things, so you have to pay attention while I describe what each one is. With ML, what I'm doing here is there is no backend; there is just Bayes' rule applied, because that's the formula I showed you, straight out of the generative Gaussian model.
These numbers, for people who do LREs, are not very good numbers, but this is what happens straight out of the generative model. What I'm showing is Cavg and min Cavg; actual Cavg means you make hard decisions on the detections. So the ML system is the baseline. This one is the Bayesian system, where you make the Bayesian estimate of the mean; in the end you then don't actually have the same covariance for every class, because they had different counts, and that gives a different predictive uncertainty. But in fact they are very similar, because in language recognition you have many instances per class, so it almost degenerates to the same thing. The reason I didn't show MAP is that it's in between those two, and there's not much space in between those two, so it's not very interesting.
This last one is kind of interesting in that it's not right, but it actually works better from a calibration point of view; and when I say calibration, I mean that it works better through Bayes' rule. What I've done here is what we typically do in speaker recognition, where you use the right MAP formula but you pretend that there's only one cut instead of keeping the correct count of the number of cuts. In terms of the predictive distribution, that gives you a greater uncertainty and a wider covariance, and it so happens that this actually works a little better in this case.
But once you put a backend into the system, which is what everybody usually shows, these differences really disappear, so I'm going to use the ML system for the rest of the discriminative training work.
As I said, these numbers are not very good; they're about three times as bad as the state of the art. What is usually done is an additionally trained backend. The simplest one, which I think John had, is the scalar multiclass thing that was described before; that's logistic regression. You can do a full logistic regression with a matrix instead of a scalar, you can put a Gaussian backend in front of a logistic regression, which is something that we tried before, or you can use a discriminatively trained Gaussian as the backend, which is something we were doing at Lincoln for quite a while. These systems all work much better, and pretty similarly to each other. You can also build the classifier itself to be discriminative; one of the more common things to do is an SVM, one versus rest. That still doesn't solve the final task, but it can help. And if you do one-versus-rest logistic regression, you also still need a backend. Or, as has been done recently, you can do a multiclass training of the classifier itself followed by a multiclass backend.
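For context, here is a minimal sketch of a full (matrix) multiclass logistic regression backend of the kind mentioned above, trained by plain gradient descent on cross entropy over score vectors; this is only an illustration, not the specific backends used in the talk:

```python
import numpy as np

def train_lr_backend(S, y, n_classes, lr=0.1, iters=200):
    """S: (n, k) raw score vectors, y: (n,) integer class labels.
    Learns an affine map A, b so that softmax(S @ A.T + b) is calibrated."""
    A = np.eye(n_classes, S.shape[1])
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(iters):
        Z = S @ A.T + b
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)         # softmax posteriors
        G = (P - Y) / len(y)                      # gradient of mean cross entropy
        A -= lr * G.T @ S
        b -= lr * G.sum(axis=0)
    return A, b
```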
But what I want to talk about is trying to do everything together: one training of the multiclass system that won't need its own separate backend, ready to apply Bayes' rule straight out. MMI is not commonly used in backends, but in our field it is a very common thing in the GMM world, in speech recognition work. The criterion, if you're not familiar with it, is another name for cross entropy, which is the same metric that logistic regression uses. It is a multiclass "are your probabilities correct" kind of metric, and it is a closed-set discriminative training of the classes against each other.
The update equations, if you haven't seen them, are kind of cool, and they're kind of different: it's a bit of a weird derivation compared to the gradient descent that everybody is used to. It can be interpreted as gradient descent with a kind of magical step size, but it's quite effective. The way it's always done in speech recognition is, since you are applying this to a Gaussian system, you start with an ML version of the Gaussian and then you discriminatively update it, so to speak. That makes the convergence much easier. It gives a natural regularization, because you're starting with something that is already a reasonable solution; in fact the simplest form of regularization is just to not let it run very long, which is also a lot cheaper. It also gives you something you can tie back to: you can put in a penalty function that says don't be too different from the ML solution. So regularization is a straightforward thing to do with MMI. And the diagonalizing covariance transformation that I was talking about is really helpful here, because then we only discriminatively update diagonal covariances instead of full covariances, so we have fewer parameters than full-matrix logistic regression but more parameters than the scalar logistic regression.
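A minimal sketch of the closed-set cross-entropy (MMI) criterion over the Gaussian class models, optimized here by plain gradient ascent on the means with the shared diagonal covariance held fixed. The talk's actual update equations (the "gradient descent with a magical step size") are different, so treat this purely as an illustration of the criterion; all names are mine:

```python
import numpy as np

def mmi_mean_update(X, y, mus, w_diag, lr=0.01, iters=20):
    """X: (n, d) i-vectors, y: (n,) labels, mus: (L, d) class means,
    w_diag: (d,) shared diagonal within-class covariance (kept fixed here)."""
    Y = np.eye(len(mus))[y]
    for _ in range(iters):
        # Gaussian log-likelihoods with shared diagonal covariance (constants dropped)
        ll = -0.5 * (((X[:, None, :] - mus[None]) ** 2) / w_diag).sum(axis=2)
        P = np.exp(ll - ll.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # closed-set posteriors (flat prior)
        # Gradient of the mean log-posterior of the correct class w.r.t. each mean
        R = (Y - P) / len(y)
        grad = R.T @ (X / w_diag) - R.sum(axis=0)[:, None] * (mus / w_diag)
        mus = mus + lr * grad                      # ascend the MMI / cross-entropy objective
    return mus
```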
So now these are pretty much state-of-the-art numbers; remember, the previous numbers I showed were up here, essentially. This one is the ML Gaussian followed by an MMI-trained Gaussian backend in the score space, which was kind of our default way of doing things when I was at Lincoln. This score is kind of a disappointment: it's what you get if you take the training set, discriminatively train with MMI, and don't have a backend. It is in fact considerably better than the ML system, its equivalent that I started with, but it is nowhere near where we want to be, obviously.
So why not? One of the quirks of LRE, which I think is more data-dependent than realistic, is that the test data actually look different from the training set. So far this was trained only on the training set; it's not using any dev set at all. The most obvious thing is that the dev set and the test set are all approximately thirty seconds, while the training set is whatever length the conversation sides happen to be, so that's an obvious mismatch. So I took the training set and truncated everything to thirty seconds instead of the entire conversation side. Throwing away data in that way turned out to be very helpful, because it's now a much better match to what the test data look like. But it's not everything I wanted, so then I took the thirty-second training set, concatenated it with the dev set, which is a thirty-second set, and used the entire set at once for training the system. That in fact works as well as, and slightly better than, the two separate stages of a system followed by a discriminatively trained backend.
So I looked at a number of different permutations of this MMI system. Anybody who's done it for GMM MMI knows you can train this, that, or the other, and various combinations. The simplest thing is just to update the means only, and that is fairly effective. You can train the mean and the within-class covariance; of course, in the closed-set system the across-class covariance doesn't come into play, it's only the within-class covariance that has an effect. One thing I found kind of interesting is, instead of training the entire covariance matrix, to train a scale factor which scales the covariance; that's a slightly simpler system with fewer parameters. And you can also play with a sequential system; in particular I found it interesting to do the scale factor first and then the means. In the end that would give the same solution, but when you only run a limited number of iterations, the starting point in the sequence does affect what you get.
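A small sketch of the covariance scale factor mentioned above: a single trainable scalar alpha that multiplies the shared (diagonalized) within-class covariance in the scoring function, which mainly controls how sharp or flat the class posteriors are; the names are mine, not from the paper:

```python
import numpy as np

def class_loglik(X, mus, w_diag, alpha):
    """Log-likelihoods under N(mu_j, alpha * diag(w_diag)) for each class j
    (the 2*pi constant is dropped)."""
    d = X.shape[1]
    sq = (((X[:, None, :] - mus[None]) ** 2) / (alpha * w_diag)).sum(axis=2)
    return -0.5 * sq - 0.5 * (d * np.log(alpha) + np.log(w_diag).sum())
```

In the sequential recipe from the talk, alpha would be trained first, with the means fixed at their ML values, and the means updated afterwards.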
So, again the same sort of plot. This is what happens with purely no backend, just the discriminatively trained classifier itself. If you do means only, the actual Cavg is not terribly good, but the min Cavg is pretty close. So that is an indication: what calibration means in a multiclass detection task is kind of controversial, but one thing that I think I can say comfortably is that whenever you see this gap, it means that you're not calibrated. The absence of a gap doesn't necessarily mean that you are calibrated, because Bayes' rule is more complicated than that, but a gap like this means it is clearly not calibrated. So once we do something to the variance — this one is training the mean and the entire variance, this one the mean and the scale factor at the same time, and this one a two-stage process of the scale factor of the variance followed by the means — all of those work much better. So in order to get calibration you need to actually adjust the covariance matrix, which kind of makes sense: you need the scale factor or something. And once you fine-tune the numbers, as we typically do when we're actually working on these kinds of tasks, you actually see that the two-stage process is the best one, and it is better than our earlier two-step approach of a separate system followed by a backend.
Okay, so that's the discriminative training part. The other thing I want to talk about is the out-of-set problem that was mentioned in a question earlier, because oftentimes we're interested in situations where the input could be another language that is not one of the closed set. The nice thing about the two-covariance mathematics that we've been using for speaker recognition is that it hands you a model for what out-of-set is supposed to be. I already mentioned that, essentially, if you have a Gaussian distribution of what all models look like, then an out-of-set language is a randomly drawn language from that pool, and that's represented by the across-class Gaussian distribution.
Then at test time you have an even bigger Gaussian, because the uncertainty is both the channel plus which language it was. So now the out-of-set hypothesis is also a Gaussian, but it has the bigger covariance, while all the others share a covariance which is smaller, so you no longer have a linear system when you make a comparison. This is the most general formula, for when you have an open-set problem which is both out-of-set and closed-set, and this is how you would combine them: this part is what I had before, the sort of Bayes-rule composition of all the other closed-set classes, and this is the new distribution, the out-of-set distribution. If you want a pure out-of-set problem, which is what I'm going to talk about here, you just set the probability that a non-target is out-of-set to one, but in fact you could make a mixed distribution as well.
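A hedged reconstruction of the distributions being described, writing $B$ for the across-class covariance, $\mu_0$ for its mean, and $P_{\mathrm{oos}}$ for the prior that a non-target trial is out-of-set (all notation added here):

$$ p(x \mid \text{out-of-set}) = \mathcal{N}(x;\, \mu_0,\, W + B), $$

$$ p(x \mid \text{non-target of } \ell) = P_{\mathrm{oos}}\, \mathcal{N}(x;\, \mu_0,\, W + B) \;+\; \frac{1 - P_{\mathrm{oos}}}{L-1} \sum_{m \neq \ell} \mathcal{N}(x;\, \mu_m,\, W). $$

Setting $P_{\mathrm{oos}} = 1$ gives the pure out-of-set case discussed next.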
Okay, so I want to talk about the out-of-set case and just touch on what I have now. If I were to do the Bayesian numerator for each class that I mentioned before, and then this denominator, then I have what I like to call Bayesian speaker comparison; there's a paper about that. It is the same answer as PLDA or the two-covariance model, and I'd like to emphasize that they're set up differently, so the numerator and denominator are different in these two formulations, but the ratio is the same thing, because it's the same model and it's the same correct answer.
With a formalism like the one I'm talking about here, I find it much easier to understand it in this context, philosophically. Daniel and I have spent a lot of time on this, and I concede that only a few of us come at it from this perspective. In this terminology, we say that we have a model for each class and the covariances are hyperparameters; in the other terminology you like to say that there is no model and the parameters of the system are the covariance matrices. Again, it's the same system and the same answer from a different perspective, but when we're talking about closed sets and ML models, I know how to say that in this context, and I don't know so well how to say it in the PLDA one.
So, discriminative training of the out-of-set model that I described: as I've said, I now have this MMI hammer in my toolbox, and this is just one more covariance that I can train, since I've got an across-class mean and covariance. The ML out-of-set system just takes the sample covariance matrices for all of these. But I can do an MMI update of this out-of-set class as well. The simplest way for me to do that is to take the closed-set system I already presented, freeze the closed-set models, and then separately update the out-of-set model given the closed-set models. I can do that by scoring one-versus-rest instead of scoring with Bayes' rule, doing a round robin on the same training set. The advantage of this is that I can actually build a system without ever having any out-of-set data. I'd probably do better if I really did have out-of-set data, but in this case I don't, and I can still build a perfectly legitimate system.
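A minimal sketch of one plausible reading of the round-robin, one-versus-rest setup: every training i-vector is paired with every frozen closed-set model, is a target trial only for its own language, and otherwise should be won by the out-of-set Gaussian; here only a scale on the out-of-set covariance is learned, by gradient ascent on the binary cross entropy. The talk's actual MMI update differs, so this is illustration only:

```python
import numpy as np

def logpdf_diag(X, mu, var):
    """Diagonal-covariance Gaussian log-density (2*pi constant dropped)."""
    return -0.5 * (((X - mu) ** 2) / var + np.log(var)).sum(axis=1)

def train_oos_scale(X, y, mus, w_diag, b_diag, mu0, lr=0.05, iters=50):
    """Freeze closed-set models (mus, w_diag); learn log-scale s for the
    out-of-set covariance w_diag + exp(s) * b_diag via round-robin trials."""
    s = 0.0
    for _ in range(iters):
        oos_var = w_diag + np.exp(s) * b_diag
        grad = 0.0
        for j, mu in enumerate(mus):                  # one-vs-rest over all models
            llr = logpdf_diag(X, mu, w_diag) - logpdf_diag(X, mu0, oos_var)
            p_class = 1.0 / (1.0 + np.exp(-llr))      # posterior for "in class j"
            t = (y == j).astype(float)                # target only for its own language
            # d(log out-of-set likelihood)/ds through oos_var = w + e^s * b
            dll_ds = 0.5 * ((((X - mu0) ** 2) / oos_var - 1.0) / oos_var
                            * (np.exp(s) * b_diag)).sum(axis=1)
            grad += ((p_class - t) * dll_ds).sum()
        s += lr * grad / (len(X) * len(mus))
    return np.exp(s)
```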
So, for the performance of this system, what I've done here is score this LRE, even though there is no out-of-set data, without Bayes' rule, so that the system is not allowed to know what the other classes were; that simulates an open-set scoring function.
For the ML version of this, the actual Cavg is actually off the chart; those are the kind of bad numbers that I started with. With MMI training of the closed-set system and then the ML version of the across-class covariance, it is in fact already a lot better, so whatever is happening in the closed-set discriminative training is actually helping the open-set scoring as well. But explicitly retraining the out-of-set covariance matrix with the same mechanism, the scale factor and then the mean, in fact behaves pretty reasonably, and you get a system which is not obviously uncalibrated and has pretty reasonable performance. The closed-set scoring performance is still down here, but this has gotten a lot better, and it's perfectly feasible.
So the two contributions here were, first, the single-system concept: we don't have to do system design and then a backend; we can discriminatively train the system to already be calibrated. And second, we can model out-of-set using the same mathematics that we have in speaker recognition, but a simpler version, because we don't need to be Bayesian in this case, and I think it can also be discriminatively updated so that we can be reasonably calibrated for the open-set task as well.
So, thanks, Alan. It's very nice to see that you unified those two parts of the system; I wish we could do that in speaker recognition. My question is about your maximum-likelihood across-class covariance: you've got twenty-four languages to work with, in six-hundred-dimensional i-vectors, so how did you estimate or assign that parameter?
It is the sample covariance; everything here was done with a dimension reduction at the front, to twenty-three dimensions. I'm sorry, that's why the illustration already specified that there would be twenty-three dimensions: anything that has a prior is limited to twenty-three dimensions. In this case I just took the sample covariance matrix; if you regularized it somehow, you could make it appear to be bigger, to be full size.
Okay, so those formulas you showed with the covariances, that happens in twenty-three-dimensional space?
Yes.
So in this case you're doing LDA, and then, if I'm not mistaken, isn't that the same as doing a Gaussian backend and then another calibration?
Well, LDA and a regression backend — in this evaluation, you were computing the sample covariances in the full space, but the across-class is only rank twenty-three, so you take the six-hundred-dimensional within-class and map it to twenty-three?
Yes.
So if you do LDA and regression, or a Gaussian backend in the same subspace as LDA — whether you project with LDA to twenty-three, or you take the Gaussian scores and get twenty-four scores, it's almost the same thing, so you're still doing two steps.
It's still just two steps, in my view: the ML estimation, which in this case forces you to be twenty-three-dimensional, and then the update with those equations.
But LDA plus a Gaussian backend — the similarity there is very close.
Well, the way we would have done a system before would be LDA, then a Gaussian in that space, and then MMI training in the score space, on the likelihood ratios of the first stage. This is MMI training in the i-vector space directly. But these are not very complicated mathematics; the two are pretty closely related, yes.
So when you did the joint diagonalization, you then work with diagonal covariance matrices, but you're also updating those covariance matrices in training — is that diagonalization still valid then?
I mean, you do that as one static projection, so what it means when you force the update to stay diagonal — it's sort of like saying — I mean, the entire thing can be mapped back, by undoing the diagonalization, into a full covariance. So in some sense you are still updating a full covariance; you're just updating it in a constrained way. The matrix is still the full size, but the number of parameters that you discriminatively update is not the full set.
So, if I remember correctly, you're actually doing closed-set with twenty-three or twenty-four languages — is that correct? Twenty-four languages, right. So is it possible — I mean, I don't want to change your problem — but if you were to look at a subset, say pick twelve of them and treat the others as completely out-of-set data, so you do the discriminative training on only a portion, since we don't have actual out-of-set data: do you have some sense of how strong your solution would be if you didn't have access to those similar-sounding languages that you want to reject?
I think it's an interesting thought that you could test this out-of-set hypothesis more extensively by doing a hold-one-out or something and doing a round robin on that. I think that's an interesting idea, but I haven't done it.