| 0:00:13 | Our next talk is going to be given by Daniel Povey of Microsoft Corporation. |
|---|
| 0:00:17 | hello | 
|---|
| 0:00:19 | so | 
|---|
| 0:00:20 | um | 
|---|
| 0:00:23 | Some of the material from this |
|---|
| 0:00:26 | talk is a little bit redundant with the next speaker's, |
|---|
| 0:00:29 | because, uh, |
|---|
| 0:00:31 | the next speaker and I sort of ended up |
|---|
| 0:00:34 | giving talks on similar topics. |
|---|
| 0:00:36 | But I am going to go through the |
|---|
| 0:00:38 | introductory material anyway, because, uh, |
|---|
| 0:00:42 | it's necessary to understand my talk. |
|---|
| 0:00:44 | yeah | 
|---|
| 0:00:45 | I'm kind of assuming that people in this audience may or may not have heard of |
|---|
| 0:00:50 | SGMMs, |
|---|
| 0:00:52 | and will probably benefit from me going through the basics again. |
|---|
| 0:00:56 | so | 
|---|
| 0:00:58 | This is a technique that |
|---|
| 0:00:59 | we introduced fairly recently. |
|---|
| 0:01:01 | It's kind of |
|---|
| 0:01:03 | a factored form of a Gaussian mixture model based system. |
|---|
| 0:01:08 | now | 
|---|
| 0:01:10 | I'm going to get to it in stages, starting from something that everyone knows. |
|---|
| 0:01:14 | Now, first imagine you have |
|---|
| 0:01:16 | a full-covariance |
|---|
| 0:01:18 | system, |
|---|
| 0:01:19 | uh, |
|---|
| 0:01:20 | and I've just written down the equations for that. |
|---|
| 0:01:24 | and | 
|---|
| 0:01:25 | This is just a full-covariance mixture of Gaussians in each state. |
|---|
| 0:01:28 | At the bottom I've just enumerated what the parameters are: |
|---|
| 0:01:32 | the weights, the means, the variances. |
|---|
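To make the baseline concrete, this is the equation I believe was on the slide (notation reconstructed from the talk): each state $j$ has its own weights, means and full covariances,

$$p(\mathbf{x} \mid j) = \sum_{i} w_{ji}\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_{ji},\, \boldsymbol{\Sigma}_{ji}).$$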
| 0:01:35 | Now, |
|---|
| 0:01:36 | next we just make a very trivial change: we stipulate that |
|---|
| 0:01:40 | the number of Gaussians in each |
|---|
| 0:01:41 | state is the same, |
|---|
| 0:01:44 | and it would be a large number, let's say two thousand. |
|---|
| 0:01:47 | This is obviously an impractical system at this point, |
|---|
| 0:01:50 | but, uh, |
|---|
| 0:01:51 | I'm just making as small a change as possible each time. |
|---|
| 0:01:54 | So there's the same number of Gaussians in each state, and |
|---|
| 0:01:57 | when we list the parameters I'm really just listing the continuous ones, |
|---|
| 0:02:01 | so those are unchanged from before. |
|---|
| 0:02:04 | The next thing we do |
|---|
| 0:02:06 | is we say that the covariances are shared across states |
|---|
| 0:02:10 | but not shared across Gaussians. |
|---|
| 0:02:12 | So the equations don't change much; all that happens is we drop one index from the sigma. I'll just go |
|---|
| 0:02:17 | back; |
|---|
| 0:02:18 | you can see it was Sigma_{ji}, and now we just have Sigma_i. |
|---|
| 0:02:22 | So i is, like, the Gaussian index; it goes from, let's say, one to |
|---|
| 0:02:26 | two thousand, or one to a thousand, or something. |
|---|
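In the same reconstructed notation, the only change from the previous stage is the tied covariance index:

$$p(\mathbf{x} \mid j) = \sum_{i=1}^{I} w_{ji}\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_{ji},\, \boldsymbol{\Sigma}_{i}), \qquad I \approx 1000 \text{ to } 2000.$$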
| 0:02:29 | now | 
|---|
| 0:02:31 | The next thing we do, |
|---|
| 0:02:32 | or the next stage, is slightly more complicated, and it's the kind of key stage: |
|---|
| 0:02:37 | we restrict the means to a subspace. |
|---|
| 0:02:40 | So |
|---|
| 0:02:40 | the means are now no longer parameters. |
|---|
| 0:02:44 | Yeah, |
|---|
| 0:02:45 | the mean mu_{ji}, so mu_{ji} is a vector, |
|---|
| 0:02:47 | and j is the state, i is the Gaussian index. |
|---|
| 0:02:50 | So I'm saying mu_{ji} is M_i v_j. |
|---|
| 0:02:55 | You can interpret these quantities in various ways, but I'll just say |
|---|
| 0:02:59 | M is a matrix and v is a vector; I don't really give them much interpretation. |
|---|
| 0:03:03 | But each state, |
|---|
| 0:03:05 | each state j, now has a vector v_j |
|---|
| 0:03:08 | of dimension, let's say, forty or fifty, |
|---|
| 0:03:10 | and each Gaussian index i has this matrix M_i. |
|---|
| 0:03:14 | Let's say it might be thirty-nine by forty, or thirty-nine by fifty: |
|---|
| 0:03:18 | a matrix that says |
|---|
| 0:03:19 | how |
|---|
| 0:03:21 | the mean of that state varies when the vector of the... |
|---|
| 0:03:25 | sorry, |
|---|
| 0:03:26 | how the mean of that Gaussian index varies when the vector of that state changes. |
|---|
| 0:03:30 | So |
|---|
| 0:03:32 | what changed here is, we used to have... I'll go back one: |
|---|
| 0:03:37 | we used to have the mu_{ji} down there in the parameter list; now it's |
|---|
| 0:03:41 | v_j and M_i, |
|---|
| 0:03:43 | and of course then mu_{ji} is |
|---|
| 0:03:45 | the product of the two. |
|---|
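As a sketch of this key stage, with the dimensions stated in the talk (feature dimension 39, subspace dimension 40 or 50):

$$\boldsymbol{\mu}_{ji} = \mathbf{M}_i \mathbf{v}_j, \qquad \mathbf{M}_i \in \mathbb{R}^{39 \times 40}, \quad \mathbf{v}_j \in \mathbb{R}^{40}.$$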
| 0:03:48 | Now, that's the most important change, uh, |
|---|
| 0:03:52 | from a regular system, |
|---|
| 0:03:53 | and |
|---|
| 0:03:54 | there are a few more changes. |
|---|
| 0:03:57 | The next thing is that the weights |
|---|
| 0:03:58 | are no longer parameters. |
|---|
| 0:04:00 | The thing is, that would be a lot of weights: |
|---|
| 0:04:02 | I mean, |
|---|
| 0:04:04 | suppose there are a thousand or two thousand Gaussians per state; that's a |
|---|
| 0:04:07 | lot of parameters, and we don't want most of the parameters just to be in the weights, because |
|---|
| 0:04:12 | we've gotten accustomed to the weights being a rather small subset of the parameters. |
|---|
| 0:04:16 | So we say now that the weights, |
|---|
| 0:04:18 | the weights depend on these vectors v. |
|---|
| 0:04:21 | And what we do is make the weights, |
|---|
| 0:04:24 | well, |
|---|
| 0:04:25 | we make the unnormalized log weights a linear function of these v's. |
|---|
| 0:04:29 | So you see on the top the exp of w_i transpose v_j; w_i transpose v_j |
|---|
| 0:04:35 | is a scalar that we can interpret as an unnormalized log weight. |
|---|
| 0:04:39 | All this equation is doing is just normalizing it. |
|---|
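The weight equation, as I've reconstructed it from the slide, is the softmax-style normalization

$$w_{ji} = \frac{\exp(\mathbf{w}_i^{\top} \mathbf{v}_j)}{\sum_{i'=1}^{I} \exp(\mathbf{w}_{i'}^{\top} \mathbf{v}_j)}.$$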
| 0:04:42 | People ask me, so why the log weights, why not just the weights? Well, |
|---|
| 0:04:47 | you can't make the weights themselves depend linearly on the vector, because then |
|---|
| 0:04:52 | it would be hard to force the numbers to be positive. |
|---|
| 0:04:56 | Also, uh, |
|---|
| 0:04:58 | I think the whole optimization problem becomes non-convex if you choose any other formula apart from this, |
|---|
| 0:05:04 | you know, up to scaling and stuff. |
|---|
| 0:05:06 | Okay, so I'll just show you what changed here; |
|---|
| 0:05:09 | I'll go back. |
|---|
| 0:05:11 | The parameters were w_{ji}, v_j, et cetera; |
|---|
| 0:05:14 | now it's |
|---|
| 0:05:15 | w_i, v_j, and so on. |
|---|
| 0:05:18 | Instead of having the weights as parameters, we have these vectors: |
|---|
| 0:05:23 | the vectors w_i, one for each Gaussian index, so there are two thousand of these vectors, or one thousand |
|---|
| 0:05:28 | of these vectors. |
|---|
| 0:05:30 | yeah | 
|---|
| 0:05:31 | The next |
|---|
| 0:05:31 | thing... yeah, the next thing is speaker adaptation... |
|---|
| 0:05:37 | ah, no, |
|---|
| 0:05:37 | not the next thing; the next thing is substates. |
|---|
| 0:05:40 | What |
|---|
| 0:05:42 | we do is we just add another layer of mixture. |
|---|
| 0:05:44 | Now, you know, you can always add another layer of mixture, right? |
|---|
| 0:05:47 | It just happens to help in this particular |
|---|
| 0:05:50 | circumstance, and my intuition is that |
|---|
| 0:05:53 | there might be a particular |
|---|
| 0:05:56 | kind of phonetic state that can be realized in two very distinct ways: |
|---|
| 0:06:00 | like, you might pronounce the 't', or you might not pronounce it, |
|---|
| 0:06:04 | and |
|---|
| 0:06:05 | it just seems more natural to have, like, a mixture of two |
|---|
| 0:06:09 | of these vectors v, one to represent each of those two realizations; |
|---|
| 0:06:14 | otherwise you force the subspace to learn things that it really shouldn't have to learn. |
|---|
| 0:06:19 | So, okay, we've introduced these substates, and I'll just go back |
|---|
| 0:06:24 | and look at the parameters at the bottom. |
|---|
| 0:06:27 | It was w_i, v_j, etc.; now we have |
|---|
| 0:06:29 | c_{jm}, w_i, v_{jm}, and so on. |
|---|
| 0:06:32 | So |
|---|
| 0:06:33 | the new parameters here are the substate mixture weights c_{jm}, |
|---|
| 0:06:37 | and also we've added a new subscript on the v's, so now it's v_{jm}. |
|---|
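Putting the stages together, the model at this point, as I recall it from the published SGMM papers (notation reconstructed), is

$$p(\mathbf{x} \mid j) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi}\, \mathcal{N}(\mathbf{x};\, \mathbf{M}_i \mathbf{v}_{jm},\, \boldsymbol{\Sigma}_i), \qquad w_{jmi} = \frac{\exp(\mathbf{w}_i^{\top} \mathbf{v}_{jm})}{\sum_{i'} \exp(\mathbf{w}_{i'}^{\top} \mathbf{v}_{jm})}.$$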
| 0:06:42 | okay | 
|---|
| 0:06:43 | The next, |
|---|
| 0:06:44 | the next |
|---|
| 0:06:45 | stage is |
|---|
| 0:06:46 | speaker adaptation. |
|---|
| 0:06:48 | Yeah, we can do the normal things like fMLLR, MLLR and so on, |
|---|
| 0:06:52 | but there's a kind of special speaker adaptation that's specific to this model. |
|---|
| 0:06:57 | You see there's this v superscript s, and if I go back one you can see the change; |
|---|
| 0:07:02 | that was... |
|---|
| 0:07:03 | this is the new thing. |
|---|
| 0:07:05 | So |
|---|
| 0:07:06 | what it is, is we introduce a speaker-specific vector v superscript s. |
|---|
| 0:07:11 | We just put the s on top because sometimes we have both kinds of index on certain quantities, and |
|---|
| 0:07:15 | it becomes a mess otherwise. |
|---|
| 0:07:17 | So |
|---|
| 0:07:20 | that v superscript s is the speaker-specific vector that, |
|---|
| 0:07:24 | you know, |
|---|
| 0:07:25 | captures the information about that speaker. |
|---|
| 0:07:28 | So what we do here is we train |
|---|
| 0:07:30 | a kind of speaker subspace, and these N_i quantities tell you how each mean |
|---|
| 0:07:36 | varies with the speaker. |
|---|
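The adapted mean, which is the equation the speaker reads out again later in the talk, is, in the reconstructed notation,

$$\boldsymbol{\mu}_{jmi}^{(s)} = \mathbf{M}_i \mathbf{v}_{jm} + \mathbf{N}_i \mathbf{v}^{(s)}.$$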
| 0:07:38 | Typically the speaker subspace is of a dimension |
|---|
| 0:07:41 | of about forty, |
|---|
| 0:07:42 | the same dimension as the, uh, phonetic one. |
|---|
| 0:07:45 | So you have quite a few parameters to describe the speaker subspace. |
|---|
| 0:07:49 | And |
|---|
| 0:07:50 | to decode, you'd have to |
|---|
| 0:07:52 | do a first-pass decoding, |
|---|
| 0:07:54 | estimate these v superscript s, |
|---|
| 0:07:57 | and, uh, |
|---|
| 0:07:59 | decode again. |
|---|
| 0:08:01 | So we add the parameters N_i, |
|---|
| 0:08:04 | and there's also these v superscript s, but those are speaker-specific, so they're not really part of the model; |
|---|
| 0:08:09 | they're a little bit like |
|---|
| 0:08:10 | an fMLLR transform or something like that. |
|---|
| 0:08:14 | so | 
|---|
| 0:08:16 | I think we've come to the end of describing the SGMM, okay. |
|---|
| 0:08:20 | But, uh, |
|---|
| 0:08:22 | what I've described up to now is stuff that we've already published, |
|---|
| 0:08:25 | and I'll just give the punch line of what we already described, in case you haven't seen it. |
|---|
| 0:08:30 | It does better than a regular GMM-based system; |
|---|
| 0:08:34 | uh, |
|---|
| 0:08:37 | it's better at the ML level, and it's especially better for small datasets; you get |
|---|
| 0:08:42 | about a twenty percent relative improvement |
|---|
| 0:08:45 | if you have a few hours of data, and maybe |
|---|
| 0:08:47 | ten percent |
|---|
| 0:08:48 | when you have tons of data, |
|---|
| 0:08:51 | like a thousand hours. |
|---|
| 0:08:53 | And, uh, |
|---|
| 0:08:54 | the improvement is somewhat less after discriminative training, |
|---|
| 0:08:57 | mainly due to bad interaction with the feature-space discriminative training. |
|---|
| 0:09:03 | I'm just summarizing previous work here. |
|---|
| 0:09:06 | So what this talk is about |
|---|
| 0:09:08 | is kind of fixing an asymmetry in the SGMM. |
|---|
| 0:09:12 | So, |
|---|
| 0:09:14 | let's go back one slide. |
|---|
| 0:09:16 | With the speaker adaptation stuff, you have this |
|---|
| 0:09:20 | M_i v_{jm} plus N_i v^(s); that's a kind of symmetrical equation, because |
|---|
| 0:09:25 | you have these vectors describing the phonetic space |
|---|
| 0:09:29 | and other vectors describing the speaker space, and we add them together. |
|---|
| 0:09:35 | That's nice and symmetric. But if you go down to the |
|---|
| 0:09:38 | equation for the weights, w_{jmi} equals et cetera, |
|---|
| 0:09:41 | we don't do the same thing with the speaker stuff in there. |
|---|
| 0:09:45 | That appears as an asymmetry in the model, because we're saying the weights depend on the |
|---|
| 0:09:49 | phonetic state but not the |
|---|
| 0:09:52 | speaker, and, you know, why shouldn't they depend on the speaker? |
|---|
| 0:09:55 | oh | 
|---|
| 0:09:56 | So what this paper is about is fixing that asymmetry, |
|---|
| 0:10:00 | and, uh, I'll go forward one slide and you'll see how we fixed it. |
|---|
| 0:10:06 | Look at that equation for the weights, the, uh, |
|---|
| 0:10:08 | last-but-one equation. |
|---|
| 0:10:10 | We've added that term for the, uh, |
|---|
| 0:10:13 | speaker, yeah. |
|---|
| 0:10:15 | That... |
|---|
| 0:10:16 | for that fraction, just look at the top, look at the numerator; |
|---|
| 0:10:19 | that's the, uh, unnormalized log weight, |
|---|
| 0:10:22 | well, the inside of the brackets is the unnormalized log weight. |
|---|
| 0:10:25 | So what this is saying is it's a linear function of the |
|---|
| 0:10:29 | phonetic, |
|---|
| 0:10:30 | uh, |
|---|
| 0:10:31 | state, and a linear function of the speaker vector, so it's almost the simplest thing you could do: |
|---|
| 0:10:37 | we just fix the asymmetry. The parameters we've added are these |
|---|
| 0:10:41 | u subscript i, |
|---|
| 0:10:43 | which are a kind of, |
|---|
| 0:10:44 | speaker, uh, |
|---|
| 0:10:45 | the, |
|---|
| 0:10:48 | the thing that tells you how the weights vary with the speaker. |
|---|
| 0:10:51 | It's just the speaker-space analogue of w subscript i. |
|---|
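As best I can reconstruct it, matching the symmetric-SGMM formulation described here, the speaker-dependent weights become

$$w_{jmi}^{(s)} = \frac{\exp\!\left(\mathbf{w}_i^{\top} \mathbf{v}_{jm} + \mathbf{u}_i^{\top} \mathbf{v}^{(s)}\right)}{\sum_{i'=1}^{I} \exp\!\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_{jm} + \mathbf{u}_{i'}^{\top} \mathbf{v}^{(s)}\right)}.$$

A minimal numerical sketch of this computation, with illustrative names and shapes (not the toolkit's actual code):

```python
import numpy as np

def symmetric_sgmm_weights(W, v_jm, U, v_spk):
    """Speaker-dependent weights w_{jmi}^{(s)} for one substate.

    W:     (I, S) array of phonetic weight-projection vectors w_i
    v_jm:  (S,)   substate vector v_{jm}
    U:     (I, T) array of speaker weight-projection vectors u_i
    v_spk: (T,)   speaker vector v^{(s)}
    """
    # Unnormalized log weights: w_i'v_jm + u_i'v^(s), one per Gaussian index i.
    log_w = W @ v_jm + U @ v_spk
    # Normalize with a numerically stable softmax.
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()
```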
| 0:10:56 | So now, |
|---|
| 0:10:57 | it wasn't hard to write down this equation, |
|---|
| 0:11:00 | so, you know, why didn't we do it before? |
|---|
| 0:11:03 | Well, the thing is, |
|---|
| 0:11:05 | uh, |
|---|
| 0:11:07 | you can't just write down an equation for something; you also have to be |
|---|
| 0:11:10 | able to efficiently, uh, evaluate it and, uh, |
|---|
| 0:11:13 | decode with it. |
|---|
| 0:11:15 | Now, |
|---|
| 0:11:16 | if you were to just |
|---|
| 0:11:18 | expand these SGMMs out into big Gaussian mixtures, that would be completely impractical, |
|---|
| 0:11:23 | because, |
|---|
| 0:11:24 | think about it: each state now has two thousand Gaussians or something. |
|---|
| 0:11:29 | So... |
|---|
| 0:11:30 | and they're full covariance; |
|---|
| 0:11:31 | I don't know if I mentioned that, but they are full covariance, |
|---|
| 0:11:35 | so you can't fit that in memory |
|---|
| 0:11:37 | on a normal machine. |
|---|
| 0:11:39 | so uh | 
|---|
| 0:11:42 | but | 
|---|
| 0:11:43 | We previously described ways that you can, uh, |
|---|
| 0:11:46 | efficiently evaluate the likelihoods, but it just wasn't one hundred percent obvious how to extend those methods |
|---|
| 0:11:52 | to the case where the weights depend on the speaker. |
|---|
| 0:11:55 | So that's what this paper is about, |
|---|
| 0:11:57 | and there's a separate tech report that describes the details; |
|---|
| 0:12:00 | it's about how you, |
|---|
| 0:12:03 | how you add in this, uh... |
|---|
| 0:12:05 | it's about how to efficiently evaluate the likelihoods |
|---|
| 0:12:08 | when you symmetrize it. |
|---|
| 0:12:10 | And, uh, |
|---|
| 0:12:12 | I'm not going into the details of that. |
|---|
| 0:12:15 | It's reasonable, though you need a bit more memory. |
|---|
| 0:12:18 | Just because it's necessary for understanding the results, I'll mention |
|---|
| 0:12:23 | that we describe two updates for the u's, |
|---|
| 0:12:27 | sorry, for the u |
|---|
| 0:12:29 | subscript i quantities: |
|---|
| 0:12:31 | an inexact one and an exact one, |
|---|
| 0:12:34 | but the difference really isn't that important; I'm just going to skip over that. |
|---|
| 0:12:39 | uh | 
|---|
| 0:12:40 | So, we have results on CallHome, and, uh, |
|---|
| 0:12:43 | how long do I have, by the way? |
|---|
| 0:12:46 | we hope | 
|---|
| 0:12:47 | Okay. |
|---|
| 0:12:47 | On CallHome and Switchboard. |
|---|
| 0:12:51 | yeah | 
|---|
| 0:12:51 | These are the CallHome results. |
|---|
| 0:12:54 | So, the second line, er... |
|---|
| 0:12:56 | the top line, the result is unadapted; |
|---|
| 0:12:59 | the second line... |
|---|
| 0:13:01 | and the word error rates there... |
|---|
| 0:13:03 | it's a really difficult task; |
|---|
| 0:13:05 | CallHome English doesn't have much training data, and it's messy speech. |
|---|
| 0:13:08 | The second line, |
|---|
| 0:13:10 | with the speaker vectors, that's just the kind of standard SGMM adaptation. |
|---|
| 0:13:15 | The bottom two lines are the new stuff. |
|---|
| 0:13:18 | The difference between the bottom two lines, |
|---|
| 0:13:20 | that difference is not important, so |
|---|
| 0:13:22 | let's focus on the difference between the second and third lines. |
|---|
| 0:13:25 | It's about |
|---|
| 0:13:26 | a one and a half percent absolute improvement, |
|---|
| 0:13:29 | going from forty-five point nine to forty-four point four. |
|---|
| 0:13:32 | So that seems like a very worthwhile improvement from |
|---|
| 0:13:35 | this, uh, symmetrization. |
|---|
| 0:13:38 | uh | 
|---|
| 0:13:39 | So we were pretty pleased about that. |
|---|
| 0:13:41 | Uh, |
|---|
| 0:13:41 | oh yeah, here's the, uh, |
|---|
| 0:13:44 | the same with constrained MLLR. |
|---|
| 0:13:46 | Just like you'd get the best result this way, you can combine the, |
|---|
| 0:13:50 | uh, special form of adaptation with the standard methods. |
|---|
| 0:13:53 | So again we get an improvement. |
|---|
| 0:13:55 | How much is it now... |
|---|
| 0:13:57 | the most improvement we get is about |
|---|
| 0:14:00 | two percent absolute, |
|---|
| 0:14:01 | pretty clear. |
|---|
| 0:14:03 | Unfortunately, it didn't seem to work on Switchboard. |
|---|
| 0:14:06 | So this, |
|---|
| 0:14:08 | this |
|---|
| 0:14:09 | table is a bit busy, but the key lines are the bottom two. |
|---|
| 0:14:12 | The, |
|---|
| 0:14:13 | the second-to-last line is the standard, |
|---|
| 0:14:16 | the standard setup; |
|---|
| 0:14:17 | the bottom one is the symmetrization. |
|---|
| 0:14:19 | We're seeing |
|---|
| 0:14:21 | between zero and zero point two percent |
|---|
| 0:14:24 | improvement, absolute, |
|---|
| 0:14:26 | which was a bit disappointing. |
|---|
| 0:14:28 | We thought maybe it was some interaction with VTLN, so |
|---|
| 0:14:32 | we did the experiment without VTLN, |
|---|
| 0:14:35 | and again we're seeing... |
|---|
| 0:14:37 | we see point one, point five, and point two in different, uh, |
|---|
| 0:14:42 | different configurations, |
|---|
| 0:14:44 | and it's a rather disappointing improvement. |
|---|
| 0:14:47 | uh | 
|---|
| 0:14:49 | So we tried to figure out why it wasn't working; we looked at the likelihoods at various |
|---|
| 0:14:53 | stages of decoding and stuff, and nothing was amiss, |
|---|
| 0:14:56 | nothing was different from the other setup. So |
|---|
| 0:14:59 | at this point we just really don't know why it worked on one setup and not the |
|---|
| 0:15:02 | other, |
|---|
| 0:15:03 | and we suspect the truth is probably somewhere in between, |
|---|
| 0:15:06 | so we're going to do further experiments. |
|---|
| 0:15:10 | uh | 
|---|
| 0:15:11 | Something we should do in future is to see whether... |
|---|
| 0:15:15 | I didn't mention it, but there's this thing called a universal background model involved; it's only used for |
|---|
| 0:15:20 | pre-pruning, |
|---|
| 0:15:21 | and one possibility is that you should train that in a matched way, |
|---|
| 0:15:25 | and that would help, uh, |
|---|
| 0:15:27 | get this stuff to work; it could be that the pre-pruning is stopping this from being effective. |
|---|
| 0:15:31 | That's just one idea. |
|---|
| 0:15:33 | Anyway, |
|---|
| 0:15:34 | the next thing is just a |
|---|
| 0:15:35 | plug for something: |
|---|
| 0:15:37 | we have a toolkit |
|---|
| 0:15:38 | that implements these SGMMs. |
|---|
| 0:15:41 | It's actually a complete speech toolkit, |
|---|
| 0:15:44 | uh, |
|---|
| 0:15:45 | and it's useful independently of the SGMM aspect. |
|---|
| 0:15:49 | You can run these systems with it; we have scripts for that, and, uh, |
|---|
| 0:15:53 | we have a presentation on Friday |
|---|
| 0:15:56 | about that. |
|---|
| 0:15:57 | It's not part of the official program; it'll be in the room here, |
|---|
| 0:16:00 | so if anyone's interested they can come along. |
|---|
| 0:16:04 | So I believe |
|---|
| 0:16:05 | I'm out of time. Thank you very much. |
|---|
| 0:16:12 | We have time for |
|---|
| 0:16:14 | three or four questions. |
|---|
| 0:16:15 | uh | 
|---|
| 0:16:16 | we | 
|---|
| 0:16:16 | yeah | 
|---|
| 0:16:19 | Uh, I also have, uh, a piece of a question: |
|---|
| 0:16:22 | you've changed the GMM into, uh, an SGMM. |
|---|
| 0:16:25 | Right, yeah, well, as we know, the GMM is a general tool to model any |
|---|
| 0:16:30 | uh, |
|---|
| 0:16:31 | distribution, as closely as you wish. |
|---|
| 0:16:34 | Uh, well, when you change to the |
|---|
| 0:16:35 | SGMM, um, |
|---|
| 0:16:37 | how do you know that you can still, uh, model things as generally, that your model is as general? |
|---|
| 0:16:41 | Can it model arbitrary, uh, distributions? |
|---|
| 0:16:46 | I mean, you could increase the number of |
|---|
| 0:16:48 | Gaussians in the UBM |
|---|
| 0:16:50 | and it would be general, but it's really about compressing the number of parameters you have to learn. |
|---|
| 0:16:56 | I mean, with infinite training data it wouldn't be any |
|---|
| 0:17:01 | better than a GMM, |
|---|
| 0:17:03 | but with finite training data it seems to be better. |
|---|
| 0:17:07 | oh yeah yeah yeah | 
|---|
| 0:17:12 | three | 
|---|
| 0:17:14 | yeah | 
|---|
| 0:17:14 | So I'm a little confused, because we... |
|---|
| 0:17:17 | so basically, the... |
|---|
| 0:17:19 | uh, you tie the variances |
|---|
| 0:17:22 | in some funny way, and, hmm, so |
|---|
| 0:17:24 | I can't work out in my mind how many more parameters or fewer parameters you wind up with. |
|---|
| 0:17:30 | You mean in a typical setup? Typically it's a little bit less, |
|---|
| 0:17:32 | but, |
|---|
| 0:17:33 | don't quote me on that, because I haven't checked in our distributed setup, but I feel, I |
|---|
| 0:17:37 | have a feeling it might be a little bit more; but when you have a lot of data it's |
|---|
| 0:17:41 | usually less, once you tune it. |
|---|
| 0:17:45 | Uh-huh. |
|---|
| 0:17:46 | Right. |
|---|
| 0:17:54 | Could the difference between the CallHome and the Switchboard |
|---|
| 0:17:57 | uh, |
|---|
| 0:17:58 | results for the speaker modeling have to do with the amount of data per speaker, and |
|---|
| 0:18:02 | so on? |
|---|
| 0:18:04 | Um, |
|---|
| 0:18:05 | no, I'm not one of these database gurus; I really don't know |
|---|
| 0:18:10 | how, |
|---|
| 0:18:11 | whether that differs. |
|---|
| 0:18:13 | so | 
|---|
| 0:18:14 | Yeah, I'd have to look into that. ... But also, the likelihood |
|---|
| 0:18:19 | computation, for the, uh... |
|---|
| 0:18:21 | when you symmetrize, when you bring in the, uh, |
|---|
| 0:18:25 | the speaker |
|---|
| 0:18:28 | subspace in the weights, |
|---|
| 0:18:30 | does that change a lot? Is it more complicated? |
|---|
| 0:18:33 | Well, it's very slightly more complicated, |
|---|
| 0:18:35 | uh, but |
|---|
| 0:18:36 | it's not significantly harder. |
|---|
| 0:18:38 | There's, like, an extra quantity that you have to precompute, and then, hmm, |
|---|
| 0:18:43 | then at the time when you |
|---|
| 0:18:45 | compute the speaker vector, there's a bunch of inner products that you have to compute, one for each |
|---|
| 0:18:50 | state or something, |
|---|
| 0:18:51 | or for each substate, but no, |
|---|
| 0:18:53 | that doesn't add significantly to the compute; it's just bookkeeping. ... Yeah, and I see |
|---|
| 0:18:58 | that it increases the memory, nearly doubles the memory required |
|---|
| 0:19:01 | for storing the model? |
|---|
| 0:19:03 | You mean in doing the likelihood computation, or in training as well? |
|---|
| 0:19:08 | Oh, it was in storing the model; does the model have that many more weights? |
|---|
| 0:19:12 | Oh, it's not like there are more weights, but |
|---|
| 0:19:15 | there's some quantity that's the same size as the expanded weights that you have to |
|---|
| 0:19:19 | store. Well... |
|---|
| 0:19:21 | yeah | 
|---|
| 0:19:24 | Let's thank the speaker again. |
|---|