Speech Transcript - A BASIS METHOD FOR ROBUST ESTIMATION OF CONSTRAINED MLLR

0:00:14	um hi everyone
0:00:16	i'm not gonna apologise for the quality of my slides because
0:00:20	uh i just don't think it really matters
0:00:23	if you make them too fancy people seem to be trying to compensate for something
0:00:29	that's my uh
0:00:31	yeah
0:00:32	philosophy
0:00:33	okay so this is actually very similar to some of the other talks
0:00:37	in this session and it's
0:00:39	it's a strange session in that almost all of these talks are about some kind of uh
0:00:44	you know
0:00:45	fast fMLLR uh
0:00:46	with some kind of factorization
0:00:49	now
0:00:51	i was going to introduce you
0:00:53	okay so should we call it CMLLR or should we call it fMLLR
0:00:58	for some reason i've gone over recently to the CMLLR side but i'm hedging my bets in the
0:01:02	actual talk
0:01:04	uh
0:01:06	i think everyone really understands that it's the same thing
0:01:11	yeah
0:01:12	and i'm not gonna go through this slide in detail because it's the same as many slides that have been
0:01:17	presented already
0:01:19	yeah
0:01:19	the notation is a little bit different
0:01:22	this notation with a little plus
0:01:25	is kind of my personal notation
0:01:28	but i use it because i'm not comfortable with all the greek letters and it's
0:01:32	hard for me to remember
0:01:33	what
0:01:34	xi is and that it's supposed to be the same as x
0:01:37	so i just think it's easier to
0:01:39	remember that the little plus means append a one
0:01:43	uh and in some uh
0:01:45	in some people's work it's a xi
0:01:49	um
0:01:52	and there's also another little confusing
0:01:55	difference between people's notation that sometimes people put the one on the end and sometimes at the beginning
0:02:00	i think the IBM habit is to put it on the end and that's what i'm doing
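
The "little plus" notation described here can be illustrated with a few lines of NumPy. This is a minimal sketch, not code from the talk: the function names `extend` and `apply_fmllr` and the toy numbers are made up for illustration, and it assumes the convention the speaker states (the 1 is appended on the end, so the offset is the last column of W).

```python
import numpy as np

def extend(x):
    """Return x+ : the feature vector x with a 1 appended on the end."""
    return np.append(x, 1.0)

def apply_fmllr(W, x):
    """Apply a d x (d+1) transform W = [A b] to x, which equals A x + b."""
    return W @ extend(x)

# Toy 3-dimensional example (real systems use e.g. 39 or 40 dimensions).
A = np.diag([2.0, 1.0, 0.5])
b = np.array([1.0, 0.0, -1.0])
W = np.hstack([A, b[:, None]])   # offset b is the last column of W
x = np.array([1.0, 2.0, 4.0])
y = apply_fmllr(W, x)
```

With the other convention (the 1 at the beginning), the offset would be the first column of W instead, which is why the two notations are easy to confuse.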
0:02:05	yeah
0:02:08	this is another kind of introductory slide but
0:02:12	since we've had so many introductions to CMLLR
0:02:16	or fMLLR
0:02:17	in this uh session i don't really think i need to go through this
0:02:22	now
0:02:24	a point i put in here
0:02:26	is that
0:02:27	fMLLR uh
0:02:29	i'm gonna switch them out
0:02:30	works with relatively little adaptation data
0:02:33	now here
0:02:35	well relatively little adaptation data means after about thirty seconds or so
0:02:39	and someone else actually in this
0:02:41	session did mention that figure
0:02:43	after about thirty seconds or so you get almost all of the improvement that you're gonna get
0:02:48	but thirty seconds is a little bit too much for many practical applications
0:02:53	if you have some telephone service with someone that's gonna like request a stock quote or
0:02:58	do a web search or something
0:03:00	you might only have
0:03:02	two or three
0:03:03	seconds or maybe five seconds of audio
0:03:06	and that's not really enough for fMLLR to work
0:03:09	in fact below about five seconds in my experience
0:03:13	it's not gonna give you an improvement and the thing is gonna actually make it worse
0:03:17	so you might as well turn it off
0:03:20	so that's the problem that this
0:03:21	talk is addressing and actually it's the same problem that many previous talks in the session have been
0:03:27	i think all of the previous talks in the session have been addressing this problem
0:03:32	so
0:03:36	okay this
0:03:37	slide is summarising some of the prior
0:03:40	approaches and i should emphasise that
0:03:43	i'm talking about the prior approaches to
0:03:46	somehow regularising CMLLR
0:03:49	obviously there's many other things that you can do like eigenvoices the stuff that uh the previous speaker mentioned
0:03:56	and using other parameters but i'm talking about
0:03:58	CMLLR
0:04:00	regularization
0:04:02	so the simple things you can do you can just
0:04:05	make the A matrix diagonal
0:04:08	and that's
0:04:08	kind of an option in HTK and
0:04:11	you know it's a good approach you get a lot of improvement but it's very ad hoc
0:04:16	you can also uh make it block diagonal
0:04:20	this approach had its origins in the cepstra and the delta
0:04:24	and delta-delta type features
0:04:26	so you have three blocks that are thirteen by thirteen
0:04:29	and nobody really uses those features unprocessed anymore or at least the serious
0:04:35	sites i don't think use them anymore
0:04:37	uh without some kind of transformation but you can still use the block diagonal in fact one of the baselines that
0:04:42	we'll be presenting
0:04:43	is we use these blocks even though they've lost
0:04:46	the original meaning
0:04:48	and you know it still seems to work
0:04:51	so another
0:04:53	thing is bayesian approaches
0:04:55	there have been a couple of different papers both called
0:04:58	fMAPLR
0:05:00	what that means is we do fMLLR but we have a
0:05:04	prior
0:05:05	and we pick the MAP estimate
0:05:08	there's been a paper from microsoft and one from IBM and they had slightly different priors but it was
0:05:13	the same basic idea
0:05:14	this is one of the baselines that we're going to
0:05:17	be using in our experiments
0:05:19	yeah
0:05:21	an issue with these approaches is that
0:05:24	you'd probably like to have
0:05:27	a prior that tells you how all of the rows of the transform correlate with each other
0:05:32	but in practice that's not really uh doable
0:05:35	so
0:05:36	people generally use priors
0:05:39	that are either row by row or even completely diagonal
0:05:42	so it's a prior over each individual parameter
0:05:46	yeah
0:05:47	the approach that we are using is
0:05:49	parameter uh
0:05:51	reduction using a basis
0:05:53	where the uh
0:05:54	fMLLR matrix is some kind of weighted
0:05:57	sum of uh
0:05:59	basis matrices or
0:06:02	prototype matrices
0:06:04	so
0:06:06	the basic idea note that this is similar to some of the previous talks
0:06:10	sometimes people have uh
0:06:12	there is a factorization and then a basis on you know the upper and lower half or something like
0:06:17	that
0:06:18	but we're talking about a basis expansion of just the raw
0:06:23	the raw transform
0:06:24	and the basic form of it is given here
0:06:27	where
0:06:29	these W subscript n
0:06:31	are the kind of like prototype or eigen
0:06:34	fMLLR matrices and
0:06:36	they're kind of computed in advance somehow
0:06:39	for a given speaker
0:06:41	you have to estimate these coefficients
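
The slide with the formula is not part of this transcript. One form consistent with the spoken description can be written as follows. It is a sketch under stated assumptions: the symbols d_n (the per-speaker coefficients), N (the number of basis matrices used for that speaker) and the offset W_0 (the default transform, an identity matrix with a zero offset column) are introduced here for illustration and may not match the slide.

```latex
W \;=\; W_0 + \sum_{n=1}^{N} d_n \, W_n ,
\qquad
W_0 = \begin{bmatrix} I & 0 \end{bmatrix}
```

Under this form the basis matrices W_n are shared across speakers and fixed in advance, and only the N scalars d_n are estimated from a speaker's adaptation data.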
0:06:44	yeah well
0:06:45	this is not really a convex problem but
0:06:48	uh it's
0:06:50	it's solvable in a kind of local sense and it's not really
0:06:53	i don't think it's a practical issue
0:06:56	so
0:06:57	in the previous work in this area
0:06:59	you kind of decide the basis size in advance
0:07:03	so let's say
0:07:04	they decide let's say we're gonna make it two hundred say
0:07:08	the number of parameters in the actual matrix which is thirty nine by forty is like
0:07:13	a couple of thousand so
0:07:15	uh
0:07:15	if you decide in advance we're gonna make it two hundred coefficients
0:07:19	that you know does pretty well for typical configurations if you have you know between ten and thirty seconds
0:07:25	of speech
0:07:26	but you're gonna get a degradation
0:07:28	once you have a lot of data because you're not really estimating all of the parameters that you could estimate
0:07:34	it eventually gets a bit worse
0:07:36	when you have a lot of adaptation data
0:07:38	so that's the
0:07:40	this is the closest prior work
0:07:43	so
0:07:45	there are a couple of differences
0:07:46	between what we're describing here and this prior work that was done at IBM
0:07:51	one is we allow the
0:07:53	basis
0:07:54	size to vary per speaker
0:07:57	and we have a very simple rule we just say
0:07:59	the more data that we have the more coefficients we can estimate
0:08:03	and we just make the number of coefficients proportional to the amount of data
0:08:08	of course you could do all kinds of fancy stuff with information criteria and so on but
0:08:13	i think this technique is you know easily complicated enough already without
0:08:17	introducing new aspects so we just picked a very simple rule
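
The rule described here (number of coefficients proportional to the amount of data) is simple enough to sketch directly. The function name `num_coeffs`, the constant of 20 coefficients per second and the cap at the full parameter count are hypothetical choices for illustration; the talk does not state the constant it used.

```python
def num_coeffs(seconds, coeffs_per_second=20.0, max_coeffs=39 * 40):
    """Basis size for one speaker, proportional to the amount of data.

    coeffs_per_second is an illustrative constant, not a value from the talk.
    The result is capped at the number of entries in a 39 x 40 transform,
    since there are no more parameters than that to estimate.
    """
    return int(min(max(coeffs_per_second * seconds, 0.0), max_coeffs))

# Basis sizes for a few utterance lengths (in seconds).
sizes = [num_coeffs(s) for s in (3.0, 10.0, 30.0, 1000.0)]
```

Because the basis is used in a fixed order, a speaker with more data simply uses a longer prefix of the same list of basis matrices.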
0:08:20	but the other aspect is we have a
0:08:23	we have a way of estimating these basis matrices W n
0:08:27	that's a little bit more clever than just doing PCA
0:08:32	yeah
0:08:33	and finally
0:08:36	we're just trying to popularise this type of method
0:08:38	uh
0:08:40	we have uh
0:08:41	we have a journal version of this paper in which we tried to explain very clearly how to implement it
0:08:46	'cause it really does
0:08:48	work
0:08:49	and it's uh it's robust and everything so i think it's something that i do recommend
0:08:54	to you
0:08:57	yeah
0:09:00	this
0:09:01	slide probably covers material that i've
0:09:05	oh i'm gonna go to this slide
0:09:06	so
0:09:09	yeah the simple methods like making it diagonal or block diagonal are all very well and good but
0:09:14	it's just a bit
0:09:16	ad hoc and it doesn't
0:09:18	give you that much improvement
0:09:20	also you always have to decide like
0:09:23	how much data do we have can we afford to do the full one or are we gonna make it diagonal
0:09:27	there's this trade-off and you can get into
0:09:30	having count cutoffs and stuff but it's a bit of a mess
0:09:34	yeah
0:09:35	i don't have anything against bayesian methods you know if we could have done this in the bayesian way
0:09:40	that's probably
0:09:42	i think that would be more optimal because by picking a basis
0:09:45	size you're kind of making a hard decision
0:09:48	with the bayesian method you could do that in a soft way
0:09:52	and you know i think making a soft decision is always better than a hard decision
0:09:56	but
0:09:57	the problem with the bayesian approach is first it's very hard to estimate the prior
0:10:02	because
0:10:04	the whole reason we're in this situation is we don't have a ton of data per speaker
0:10:08	right and
0:10:09	assuming the training data is matched to the testing condition
0:10:13	you're not gonna have a lot of data per speaker in your training data to estimate the
0:10:17	the uh W matrices
0:10:19	so how are you gonna estimate a prior because you don't have good estimates of the
0:10:23	uh
0:10:24	the things you're trying to get a prior on and of course you can do all of these
0:10:28	you know bayesian schemes where you integrate and stuff but it just becomes a big headache
0:10:32	plus there's always so many choices to make
0:10:35	the reason i like this basis type of method is you just say we're gonna use a basis
0:10:40	and everything just falls out and it's obvious what to do
0:10:44	so
0:10:46	i'm going to talk about how to estimate the uh
0:10:48	basis
0:10:49	so
0:10:51	because we're going to decide the number of coefficients at test time
0:10:56	we need a kind of ordered basis where you have the most important elements first
0:11:00	and the least important elements last
0:11:03	if we were gonna say we're gonna have n equals two hundred and that's it
0:11:07	then
0:11:09	it wouldn't really matter whether they were all mixed up what order they were in
0:11:12	but because uh
0:11:14	we're gonna decide the number of coefficients at test time we need to have this ordering
0:11:18	thing
0:11:19	so things like you know
0:11:21	PCA or you know SVD those kinds of
0:11:25	approaches
0:11:27	they actually give you this
0:11:29	but
0:11:30	i'm not very comfortable just
0:11:32	saying we're gonna do PCA because you know
0:11:35	who's to say that that makes sense
0:11:38	uh one obvious argument why it doesn't make sense
0:11:41	is that if you were to uh
0:11:45	scale the different dimensions of your
0:11:48	feature vector
0:11:49	that's gonna change the solution that PCA gives you
0:11:53	i mean it's gonna change it and then
0:11:55	that's gonna change the so basically it's gonna affect your decoding
0:11:59	and to me that's uh
0:12:02	it's not the right thing to do
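
The objection raised here, that plain PCA depends on how the dimensions are scaled, can be seen in a toy example. This is only an illustration of the general property on a made-up two-dimensional point cloud, not the fMLLR setting itself; the helper name `top_direction` and the data are hypothetical.

```python
import numpy as np

def top_direction(X):
    """Return the first principal direction of the rows of X."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    _, vecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
    return vecs[:, -1]              # eigenvector of the largest eigenvalue

rng = np.random.default_rng(0)
# Toy data: dimension 0 has standard deviation 3, dimension 1 has 1.
X = rng.normal(size=(1000, 2)) * np.array([3.0, 1.0])
v_original = top_direction(X)

# Rescale dimension 0 (a pure change of units) and repeat.
X_scaled = X * np.array([0.1, 1.0])
v_scaled = top_direction(X_scaled)
```

Before rescaling the leading direction lies along dimension 0, and afterwards it lies along dimension 1, even though the data carry the same information. A likelihood-based criterion does not have this arbitrariness, which is the motivation the speaker gives for not using raw PCA.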
0:12:04	so
0:12:06	the framework that i think is most natural is maximum likelihood
0:12:09	we're going to try to pick the basis that maximises the likelihood on test
0:12:15	and uh
0:12:17	i don't think i have time to go through the whole
0:12:22	and i wasn't planning to go through the whole you know
0:12:25	argument about what we're doing
0:12:27	but basically
0:12:29	we end up using PCA but in a slightly preconditioned uh
0:12:34	space
0:12:35	so
0:12:36	W is a thirty nine by forty matrix typically
0:12:40	but
0:12:41	we wanna consider the correlations between the rows and it's not very convenient to think of it as a matrix
0:12:45	so let's think of it as one big vector
0:12:48	of size thirty nine times forty where we concatenate the rows
0:12:52	now
0:12:53	uh
0:12:55	if
0:12:58	i don't know if i can easily
0:12:59	explain how this argument works but
0:13:01	uh
0:13:02	think about the objective function for each speaker
0:13:05	if that objective function were a quadratic function
0:13:09	if we could somehow do a change of variables so that the uh
0:13:12	quadratic part of that function is just proportional to the unit matrix
0:13:17	then it is possible to show that
0:13:19	the kind of right solution is just doing weighted PCA
0:13:23	there's some kind of derivation i mean it might be obvious to some people how you derive that
0:13:29	uh
0:13:29	but you know
0:13:31	let's just take it as given for now
0:13:33	that that's true
0:13:34	so the
0:13:36	thing to do is do this change of variables so that the objective function is quadratic for each speaker
0:13:41	and it's not quite possible to do that
0:13:45	because
0:13:46	well
0:13:47	okay for a couple of reasons
0:13:50	first the objective function is not really quadratic there's this log determinant and it's
0:13:54	nonlinear
0:13:55	secondly
0:13:56	you can take a taylor series approximation around the kind of default matrix
0:14:01	I and zero
0:14:03	take a taylor series around there
0:14:05	and the quadratic uh
0:14:09	the quadratic term in that taylor series
0:14:11	remember it is a big vector right so the quadratic term is like
0:14:15	a big matrix
0:14:17	uh
0:14:18	you know about two thousand by two thousand
0:14:20	that quadratic term depends on the data
0:14:24	so it's not just a constant
0:14:26	but
0:14:27	i don't think it really varies that much in an important way
0:14:31	so
0:14:32	it is possible to kind of do a
0:14:35	it's possible to make an argument that
0:14:37	once we
0:14:39	once we work out the uh average of these quadratic terms
0:14:43	and then kind of precondition so that that average is the unit matrix
0:14:47	it's possible to make a reasonable argument that
0:14:50	for each speaker uh
0:14:51	the uh
0:14:53	those matrices are approximately unit
0:14:56	and it is the situation where
0:14:59	we'd like it to be unit but it's not gonna make a big difference if it's not quite unit
0:15:03	because what all this is doing is
0:15:06	we're gonna
0:15:07	we're gonna pre-rotate everything and then do PCA
0:15:11	and if you don't quite pre-rotate correctly
0:15:14	if the rotation is not quite correct or the kind of pre-rotation
0:15:18	is inaccurate it's not gonna like totally change the result of PCA
0:15:22	what's gonna happen is that let's say the first eigenvector
0:15:27	is gonna be mixed up a little bit with the second and so on
0:15:30	so
0:15:32	it's pretty close to maximum likelihood
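
The "precondition, then do weighted PCA" idea described here can be sketched in NumPy. This is a hedged reconstruction from the spoken argument, not the procedure from the journal paper: it assumes that for each training speaker we already have the gradient of the objective at the default transform (as a vector with rows concatenated) and a frame count, and that `H` is the averaged per-frame quadratic term. All names (`estimate_basis`, `grads`, `counts`, `H`) and the toy data are hypothetical.

```python
import numpy as np

def estimate_basis(grads, counts, H):
    """Ordered basis from per-speaker statistics (illustrative sketch).

    grads  : (S, D) gradient of each speaker's objective at the default
             transform, with the transform flattened row by row
    counts : (S,) number of frames per speaker
    H      : (D, D) averaged per-frame quadratic term, symmetric positive definite
    """
    C = np.linalg.cholesky(H)                 # H = C C^T
    P = np.linalg.solve(C, grads.T).T         # rows p_s = C^{-1} g_s
    M = (P / counts[:, None]).T @ P           # sum_s p_s p_s^T / count_s
    evals, U = np.linalg.eigh(M)              # ascending eigenvalues
    order = np.argsort(evals)[::-1]           # most important direction first
    evals, U = evals[order], U[:, order]
    basis = np.linalg.solve(C.T, U).T         # rows C^{-T} u_n, back in raw coordinates
    return basis, evals

# Toy usage with a 2 x 3 transform (D = 6) and random statistics.
rng = np.random.default_rng(0)
D, S = 6, 50
B = rng.normal(size=(D, D))
H = B @ B.T + D * np.eye(D)
grads = rng.normal(size=(S, D))
counts = rng.integers(100, 1000, size=S).astype(float)
basis, evals = estimate_basis(grads, counts, H)
basis_matrices = basis.reshape(-1, 2, 3)      # each row reshaped into a matrix
```

In the preconditioned coordinates the averaged quadratic term is the unit matrix, so the eigenvectors with the largest eigenvalues are the directions that, under the quadratic approximation, give the largest expected likelihood gain. That is what provides the most-important-first ordering.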
0:15:36	now i think i've
0:15:38	covered this material
0:15:39	so
0:15:40	the training time computation
0:15:44	the training time computation basically involves computing
0:15:48	big matrices like this and doing like an SVD and stuff
0:15:51	it's all described in the journal paper
0:15:54	so at test time there's an iterative update to uh to compute the coefficients the
0:16:01	a subscript something
0:16:03	so
0:16:04	it's
0:16:04	it's a non-convex problem but it's pretty easy to get a local optimum
0:16:08	we just use like steepest ascent and you do a pretty exact line search
0:16:13	and it's all described in the
0:16:15	paper
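
The test-time step can be sketched as gradient ascent on the coefficients. This is an illustrative sketch, not the update from the paper: it assumes the standard fMLLR auxiliary function with per-row statistics `K` and `G`, works in raw (unpreconditioned) coordinates, and uses a crude step-halving line search in place of the fairly exact one the speaker mentions. All function names and the toy statistics are hypothetical.

```python
import numpy as np

def default_transform(d):
    """The default transform [I 0]: identity with a zero offset column."""
    return np.hstack([np.eye(d), np.zeros((d, 1))])

def objective(W, beta, K, G):
    """beta*log|det A| + sum_i (w_i . k_i - 0.5 * w_i G_i w_i), rows w_i of W."""
    _, logabsdet = np.linalg.slogdet(W[:, :-1])
    quad = sum(W[i] @ G[i] @ W[i] for i in range(W.shape[0]))
    return beta * logabsdet + np.sum(W * K) - 0.5 * quad

def gradient(W, beta, K, G):
    """Gradient of the objective with respect to W (G_i assumed symmetric)."""
    grad = K - np.stack([G[i] @ W[i] for i in range(W.shape[0])])
    grad[:, :-1] += beta * np.linalg.inv(W[:, :-1]).T
    return grad

def estimate_coeffs(basis, beta, K, G, iters=50):
    """Steepest ascent on the coefficients with a step-halving line search."""
    W0 = default_transform(K.shape[0])
    coeffs = np.zeros(len(basis))
    for _ in range(iters):
        W = W0 + np.tensordot(coeffs, basis, axes=1)
        g = np.tensordot(basis, gradient(W, beta, K, G), axes=2)
        f, step = objective(W, beta, K, G), 1.0
        while step > 1e-10:
            trial = coeffs + step * g
            W_try = W0 + np.tensordot(trial, basis, axes=1)
            if objective(W_try, beta, K, G) > f:
                coeffs = trial
                break
            step *= 0.5
        else:
            break   # no improving step found: treat as a local optimum
    return coeffs

# Toy usage: 2-dim features, one unit-variance Gaussian, 3 random basis matrices.
rng = np.random.default_rng(0)
d, T = 2, 200
X = rng.normal(size=(T, d)) * 2.0 + 1.0
Xp = np.hstack([X, np.ones((T, 1))])
mu = np.array([0.5, -0.5])
beta = float(T)
K = np.outer(mu, Xp.sum(axis=0))
G = np.stack([Xp.T @ Xp] * d)
basis = rng.normal(size=(3, d, d + 1))
coeffs = estimate_coeffs(basis, beta, K, G)
W0 = default_transform(d)
f_before = objective(W0, beta, K, G)
f_after = objective(W0 + np.tensordot(coeffs, basis, axes=1), beta, K, G)
```

Only improving steps are accepted, so the objective never decreases. The log-determinant term is what makes the problem non-convex, and the procedure only finds a local optimum, as the speaker notes.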
0:16:17	so this results slide requires a little bit of explanation
0:16:21	so we have two
0:16:23	types of
0:16:24	test data one is
0:16:25	short utterances one is long
0:16:28	it's like one is a digits type of task and stocks and stuff and one is voice
0:16:32	mail
0:16:33	yeah
0:16:34	yes
0:16:34	we divide each of those two
0:16:36	into uh four subsets based on the length
0:16:39	and the x axis
0:16:41	is
0:16:42	is the uh length of an utterance so ten to the zero is one second ten to the one
0:16:46	is ten seconds
0:16:48	and each of these kind
0:16:51	each of the uh
0:16:54	the
0:16:56	each of the kind of uh
0:16:57	points on the lines corresponds to a bin of test data
0:17:02	so
0:17:03	we divide it up into buckets and on the left hand side it's short
0:17:08	utterances on the right it's long
0:17:10	and each point is a relative improvement
0:17:13	yeah
0:17:13	the triangles the line with triangles on the bottom that's just regular
0:17:18	fMLLR
0:17:20	and it's actually making things worse for the first three bins
0:17:23	and then it helps a bit
0:17:26	the word error rate kind of jumps up and down a bit
0:17:30	because it's different types of data
0:17:32	so
0:17:33	this is maybe not the ideal uh data to test this on but
0:17:37	you've got to look at these things relative to each other
0:17:41	uh
0:17:42	the very top line is our method
0:17:44	it's doing the best so it's giving a lot more improvement than the other methods
0:17:49	the three-block one the diagonal one and fMAPLR
0:17:53	those
0:17:54	are a bit better uh than doing regular CMLLR
0:17:57	they get some improvement for the shorter amounts of data
0:18:01	but we get more improvement from our method
0:18:04	so
0:18:05	i mean the story is that
0:18:07	if you have let's say between about three and ten seconds of data
0:18:12	i think this method will be a big improvement versus
0:18:15	doing fMAPLR or diagonal
0:18:17	or whatever
0:18:19	so uh
0:18:21	but if you have you know let's say more than thirty seconds it really doesn't make a difference
0:18:26	so i think
0:18:27	i have pretty much covered all of this
0:18:30	i'm being asked to wrap up
0:18:33	yeah
0:18:34	i think we've covered all of this
0:18:36	so i recommend the journal paper if you want to implement this 'cause i do describe very clearly how to
0:18:41	do it and i think this stuff does work
0:18:44	okay
0:18:49	any questions
0:18:57	pretty stunned
0:19:03	okay well uh let's close the session okay thanks

A BASIS METHOD FOR ROBUST ESTIMATION OF CONSTRAINED MLLR

Adaptation for ASR

Presented by: Daniel Povey, Author(s): Daniel Povey, Kaisheng Yao, Microsoft Corporation, United States