0:00:13the talk is gonna be given by povey of microsoft corporation
0:00:23uh some of the material from this
0:00:26talk is a little bit redundant with the next speaker's
0:00:29because uh
0:00:31me and rick grow should have coordinated our talks a little bit we're
0:00:34giving talks on similar topics
0:00:36but i i am gonna go through the
0:00:38introductory material anyway because uh
0:00:42it's necessary to understand my talk
0:00:45yeah i'm kind of assuming that people in this audience may or may not have heard of
0:00:50sgmms
0:00:52and will probably benefit from me going through them again
0:00:58this is a technique that
0:00:59we introduced fairly recently
0:01:01it's a kind of
0:01:03a factored form of the gaussian mixture model based system
0:01:10i'm gonna get to it in stages starting from something that everyone knows
0:01:14now first imagine you have
0:01:16a full covariance system
0:01:20and i've just written down the equations for that
0:01:25this is just a full covariance mixture of gaussians in each state
0:01:28at the bottom i've just enumerated what the parameters are
0:01:32it's the weights the means and the variances
0:01:36next we just make a very trivial change we stipulate that
0:01:40the number of gaussians in each
0:01:41state is the same
0:01:44and it would be a large number let's say two thousand
0:01:47this is obviously an impractical system at this point
0:01:50but uh
0:01:51i'm just making as small a change as possible each time
0:01:54so the same number of gaussians in each state and
0:01:57when i list the parameters i'm really just listing the continuous ones
0:02:01so those are unchanged from before
0:02:04the next thing we do
0:02:06is we say that the covariances are shared across states
0:02:10but not shared across gaussians
0:02:12so the equation hasn't changed much all that happened is we dropped one index from the sigma i'll just go back
0:02:18i mean you see it's sigma j i now we just have sigma i
0:02:22so i is like the gaussian index it goes from let's say one to
0:02:26two thousand or one to a thousand or something
0:02:31next thing we do
0:02:32the next stage is a slightly more complicated stage and is the kind of key stage
0:02:37we constrain the means to a subspace
0:02:40the means are now no longer parameters
0:02:45mu j i is a vector
0:02:47and j is the state i is the gaussian index
0:02:50so now i'm saying mu j i is m i v j
0:02:55you can interpret these quantities in various ways but i just you know
0:02:59m is a matrix v is a vector uh i don't really give it much interpretation
0:03:03but each state
0:03:05each state j now has a vector v
0:03:08of dimension let's say forty or fifty
0:03:10and each gaussian index uh i has this matrix
0:03:14let's say it might be thirty nine by forty or thirty nine by fifty
0:03:18a matrix that says
0:03:21how the mean of that state varies with the vector and uh
0:03:26how the mean of that gaussian index varies when the vector of that state changes
0:03:32what changed here is that we used to have uh i'll go back one
0:03:37we used to have the mu j i down there in the parameter list now it's
0:03:41v j and m i
0:03:43and of course then mu j i is
0:03:45the kind of product of the two
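The subspace means described above can be sketched in a few lines of plain Python. This is only an illustration under assumed toy sizes (3-dimensional features, a 2-dimensional subspace, two Gaussian indices); the names `mat_vec`, `state_means`, `M` and `v` are made up for the example and are not from the talk.

```python
# Sketch only: toy dimensions, illustrative names.

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# One D x S matrix M_i per Gaussian index i (here D=3, S=2; the talk
# mentions sizes on the order of 39 x 40 or 39 x 50).
M = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # M_0
    [[2.0, 0.0], [0.0, -1.0], [0.5, 0.5]],  # M_1
]

# One S-dimensional vector v_j per state j.
v = [[1.0, 2.0], [0.0, -1.0]]

def state_means(M, v_j):
    """Return mu_ji = M_i v_j for every Gaussian index i of one state."""
    return [mat_vec(M_i, v_j) for M_i in M]

means_state0 = state_means(M, v[0])
means_state1 = state_means(M, v[1])
```

The point of the sketch is that the per-state storage is just `v_j`; the means are derived on demand from the shared matrices.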
0:03:48now so that's the most important change uh
0:03:52from a regular system
0:03:53and and
0:03:54there's a few more changes
0:03:57next thing is that the weights
0:03:58are no longer parameters
0:04:00there's a lot of weights
0:04:04suppose there's a thousand or two thousand gaussians that's a
0:04:07lot of parameters and we don't want most of the parameters to be in the weights because
0:04:12we've got accustomed to the weights being a very small subset of the parameters
0:04:16so we say now the weights
0:04:18the weights depend on these vectors v
0:04:21and what we do is make the weights
0:04:25so we make the unnormalized log weights a linear function of these v's
0:04:29so you see on the top the exp of w i transpose v j w i transpose v j is
0:04:35a scalar that we can interpret as an unnormalized log weight
0:04:39all this equation is doing is just normalizing it
0:04:42people ask me so why the log weights why not just the weights well
0:04:47you can't make the weights depend linearly on the vector because then
0:04:52it would be hard to force the number to be positive
0:04:56also uh
0:04:58i think the whole optimisation problem becomes non-convex if you choose any other formula apart from this
0:05:04you know uh up to scaling and stuff
0:05:06so okay i'll just show you what changed here
0:05:09if i go back
0:05:11the parameters were w j i v j et cetera
0:05:14now it's
0:05:15w uh i in bold and so on
0:05:18instead of having the weights as parameters we have these vectors
0:05:23the vector w i one for each gaussian index so there's two thousand of these vectors or one thousand
0:05:28of these vectors
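The weight formula described above (exponentiate a linear function of the state vector, then normalize) can be sketched as follows. This is a minimal illustration with assumed toy values; `dot`, `state_weights`, `w` and `v_j` are illustrative names. Subtracting the largest log weight before exponentiating is a standard numerical precaution added for the example, not something mentioned in the talk.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def state_weights(w, v_j):
    """w_ji = exp(w_i . v_j) / sum over i' of exp(w_i' . v_j)."""
    log_unnorm = [dot(w_i, v_j) for w_i in w]   # unnormalized log weights
    shift = max(log_unnorm)                     # cancels in the ratio
    unnorm = [math.exp(l - shift) for l in log_unnorm]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# One vector w_i per Gaussian index (three indices here instead of thousands).
w = [[0.5, -1.0], [0.0, 0.0], [-0.5, 1.0]]
v_j = [1.0, 2.0]

weights = state_weights(w, v_j)
```

Because of the exponential, every weight is positive and the weights of a state sum to one regardless of the values in `w` and `v_j`, which is the property a purely linear dependence would not give.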
0:05:31the next thing is speaker adaptation
0:05:37uh
0:05:37no not the next thing the next thing is substates
0:05:42we just add another layer of mixture
0:05:44now you know you can always add another layer of mixture right
0:05:47it just happens to help in this particular
0:05:50circumstance and my intuition is that
0:05:53there might be a particular
0:05:56kind of phonetic state that can be realized in two very distinct ways
0:06:00like you might pronounce the t or you might not pronounce it
0:06:05it just seems more natural to have like a mixture of two
0:06:09of these vectors v one to represent that t and one to just represent uh
0:06:14otherwise you force the kind of subspace to learn things that it really shouldn't have to learn
0:06:19so okay we've introduced these substates and i'll just go back
0:06:24and look at the parameters at the bottom
0:06:27it was w i v j now we have
0:06:29c j m w i v j m
0:06:33the new parameters here are the mixture weights
0:06:37and also we added a new subscript on the v's now it's v j m
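Written out, the model as described up to this point would look roughly like the following. This is a reconstruction from the spoken description, not a copy of the slide; the symbols for the number of substates per state and the number of gaussians, written here as M_j and I, are assumptions.

```latex
p(\mathbf{x} \mid j) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi}\,
    \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_{jmi},\, \boldsymbol{\Sigma}_i\right)

\boldsymbol{\mu}_{jmi} = \mathbf{M}_i \mathbf{v}_{jm}

w_{jmi} = \frac{\exp\!\left(\mathbf{w}_i^{\top} \mathbf{v}_{jm}\right)}
               {\sum_{i'=1}^{I} \exp\!\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_{jm}\right)}
```

The parameters are then the substate weights c_{jm}, the vectors v_{jm}, and the globally shared M_i, w_i and full covariances Sigma_i.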
0:06:43the next
0:06:44the next
0:06:45stage is
0:06:46speaker adaptation
0:06:48you can do the normal things like fmllr and vtln
0:06:52but there's a kind of special speaker adaptation specific to this model
0:06:57you see there's this plus n i v s i'll go back one so you can see the change
0:07:02that was
0:07:03this is the new thing
0:07:06what it is is we introduce a speaker specific vector v superscript s
0:07:11we just put the s on top because sometimes we have both of them on certain quantities and
0:07:15it becomes a mess otherwise
0:07:20so that v superscript s is the speaker-specific vector that
0:07:24it just you know
0:07:25captures the information about that speaker
0:07:28so what we do is we train
0:07:30a kind of speaker subspace and these n i quantities tell you how each mean
0:07:36varies with the speaker
0:07:38typically the speaker subspace is of dimension
0:07:41around forty
0:07:42the same dimension as the uh phonetic one
0:07:45so you have quite a few parameters to describe the speaker subspace
0:07:49and
0:07:50to decode you'd have to
0:07:52do a first pass decoding
0:07:54estimate this v superscript s
0:07:57and uh
0:07:59decode again
0:08:01so we added the parameters n i
0:08:04and there's also these v superscript s but these are speaker-specific they're not really part of the model
0:08:09they're a little bit like
0:08:10an fmllr transform or something like that
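The speaker-adapted mean described above, m i v j m plus n i v s, can be sketched like this. Toy sizes and values are assumed, and `mat_vec` and `adapted_mean` are illustrative names rather than anything from the talk or a toolkit.

```python
# Sketch only: one Gaussian index, one substate, one speaker.

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def adapted_mean(M_i, v_jm, N_i, v_s):
    """mu_jmi^(s) = M_i v_jm + N_i v^(s)."""
    phonetic_part = mat_vec(M_i, v_jm)  # depends on the substate only
    speaker_part = mat_vec(N_i, v_s)    # depends on the speaker only
    return [p + s for p, s in zip(phonetic_part, speaker_part)]

M_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # phonetic subspace matrix
N_i = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]   # speaker subspace matrix
v_jm = [1.0, 2.0]                            # substate vector
v_s = [2.0, -2.0]                            # speaker vector

mean_with_speaker = adapted_mean(M_i, v_jm, N_i, v_s)
mean_no_speaker = adapted_mean(M_i, v_jm, N_i, [0.0, 0.0])
```

With a zero speaker vector the adapted mean falls back to the unadapted one, which is one way to picture the first decoding pass before the speaker vector has been estimated.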
0:08:16i think we've come to the end of describing the sgmm so that means we uh
0:08:20but this is uh
0:08:22all i've described up to now is stuff that we've already published
0:08:25and i'll just give the punch line of what we already described in case you haven't seen that
0:08:30it's better than a regular gmm based system
0:08:34uh
0:08:37it's better and it's especially better for small amounts of data you could get
0:08:42a twenty percent relative improvement
0:08:45if you have a few hours of data and maybe
0:08:47ten percent
0:08:48when you have tons of data
0:08:51if you have a thousand hours
0:08:53and uh
0:08:54the improvement is somewhat less after discriminative training
0:08:57mainly due to bad interaction with the feature space discriminative training
0:09:03i'm just summarizing previous work here
0:09:06so what this talk is about
0:09:08is kind of fixing an asymmetry in the sgmm
0:09:14i'll go back one slide
0:09:16so with the speaker adaptation stuff you have this
0:09:20m i v j m plus n i v s that's i think a kind of symmetrical equation because
0:09:25you have these vectors describing the phonetic space
0:09:29and other vectors describing the speaker space and we add them together
0:09:35that's nice and symmetric but if you look down to the
0:09:38the equation for the weights w j m i equals blah blah
0:09:41we don't do the same thing with the speaker stuff in there
0:09:45so there's an asymmetry in the model because we're saying the weights depend on the
0:09:49phonetic state but not the
0:09:52speaker and you know why shouldn't they depend on the speaker
0:09:56so what this paper is about is fixing that asymmetry
0:10:00and uh i'll go forward one slide and you'll see how we fixed it
0:10:06look at that equation for the weights the uh
0:10:08the last but one equation
0:10:10we've added a term for the uh
0:10:13speaker
0:10:16in that fraction just look at the top look at the numerator
0:10:19that's the unnormalized weight
0:10:22well inside the brackets is the unnormalized log weight
0:10:25so what this is saying is it's a linear function of the
0:10:31phonetic state and a linear function of the speaker state so it's almost the simplest thing you could do
0:10:37it just fixes the asymmetry and the parameters we add are this
0:10:41u subscript i
0:10:43which is a kind of
0:10:44speaker uh
0:10:45the uh
0:10:48the thing that tells you how the weights vary with the speaker
0:10:51it's just the speaker space analogue of w subscript i
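The symmetrized weights described above add a speaker term inside the exponent. A minimal sketch, assuming toy values and the illustrative names `dot` and `speaker_dependent_weights` (the max-subtraction is again just a numerical precaution added for the example):

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def speaker_dependent_weights(w, u, v_jm, v_s):
    """w_jmi^(s) proportional to exp(w_i . v_jm + u_i . v^(s))."""
    log_unnorm = [dot(w_i, v_jm) + dot(u_i, v_s) for w_i, u_i in zip(w, u)]
    shift = max(log_unnorm)
    unnorm = [math.exp(l - shift) for l in log_unnorm]
    total = sum(unnorm)
    return [x / total for x in unnorm]

w = [[0.5, -1.0], [0.0, 0.0], [-0.5, 1.0]]   # phonetic weight projections w_i
u = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # speaker weight projections u_i
v_jm = [1.0, 2.0]                            # substate vector
v_s = [0.3, -0.2]                            # speaker vector

adapted = speaker_dependent_weights(w, u, v_jm, v_s)
unadapted = speaker_dependent_weights(w, u, v_jm, [0.0, 0.0])
```

Setting the speaker vector to zero recovers the speaker-independent weights, so the original model is a special case of the symmetrized one.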
0:10:56so now
0:10:57it wasn't hard to write down this equation
0:11:00so you know why didn't we do it before
0:11:03well uh
0:11:07you can't just write down an equation you also have to be
0:11:10able to efficiently uh evaluate it and uh
0:11:13decode with it
0:11:16if you were to just
0:11:18expand these sgmms into big gaussian mixtures that'd be completely impractical
0:11:24i mean think about it each state now has two thousand gaussians or something
0:11:30and they're full covariance
0:11:31i don't know if i mentioned that but they're full covariance
0:11:35so you can't fit that in memory
0:11:37on a normal machine
0:11:39so uh
0:11:43we previously described the ways that you can uh
0:11:46efficiently evaluate the likelihoods but it just wasn't one hundred percent obvious how to extend those methods
0:11:52to the case where the weights depend on the speaker
0:11:55so what this paper is about
0:11:57there's a separate tech report that describes the details
0:12:00is it's about how do you
0:12:03how do you uh
0:12:05it's about how to efficiently evaluate the likelihoods
0:12:08when you symmetrize it
0:12:10and uh
0:12:12i'm not going to go into the details of that
0:12:15it was reasonably efficient but you need a bit more memory
0:12:18just because this is necessary for understanding the results i'm just mentioning that
0:12:23we describe two updates for the u's
0:12:27sorry for the u subscript
0:12:29subscript i quantities
0:12:31there's an inexact one and an exact one
0:12:34but the difference really isn't that important i'm just gonna skip over that
0:12:40so we've got results on callhome and uh
0:12:43how long do i have by the way
0:12:46we hope
0:12:47on callhome and switchboard
0:12:51the callhome results uh
0:12:54so the second line or
0:12:56the top line result is unadapted
0:12:59the second line
0:13:01and the word error rates uh
0:13:03it's a really difficult task
0:13:05callhome english doesn't have much training data it's messy speech
0:13:08the second one is
0:13:10with the speaker vectors that's just the kind of standard sgmm with adaptation
0:13:15the bottom two lines are the new stuff
0:13:18there's a difference between the bottom two lines
0:13:20but the difference is not important so
0:13:22let's focus on the difference between the second and third line
0:13:25it's about
0:13:26one and a half percent absolute improvement
0:13:29going from forty five point nine to forty four point four
0:13:32so that seems like a very worthwhile improvement from
0:13:35this uh symmetrization
0:13:39so we were pleased about that
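Since the talk quotes both absolute and relative improvements in different places, here is a small sketch of the arithmetic for the callhome figures quoted above (forty five point nine going to forty four point four). The variable names are illustrative, and the relative figure is derived here rather than quoted from the talk.

```python
# Error rates in percent, as quoted in the talk for the second and third lines.
baseline_error = 45.9
symmetrized_error = 44.4

# Absolute improvement: plain difference in percentage points.
absolute_improvement = baseline_error - symmetrized_error

# Relative improvement: the difference as a fraction of the baseline.
relative_improvement = 100.0 * absolute_improvement / baseline_error

print(f"absolute: {absolute_improvement:.1f} points")
print(f"relative: {relative_improvement:.1f} percent")
```

So the one and a half percent absolute corresponds to a bit over three percent relative on this task.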
0:13:41oh yeah here is the uh
0:13:44the same with constrained mllr
0:13:46so you can get the best result this way you can combine the
0:13:50the uh special form of adaptation with the standard method
0:13:53so again we get improvement
0:13:55how much is it now
0:13:57the most improvement we get is about
0:14:00two percent absolute
0:14:01it's pretty clear
0:14:03unfortunately it doesn't seem to work on switchboard
0:14:06so this
0:14:09table is a bit busy but the key lines are the bottom two
0:14:13the second to last line is the standard
0:14:16the standard one
0:14:17the bottom one is the symmetrization
0:14:19and we're seeing
0:14:21between zero and zero point two percent
0:14:24improvement absolute
0:14:26which was a bit disappointing
0:14:28we thought maybe it was some interaction with vtln and so
0:14:32we did the experiment without vtln
0:14:35and again we're seeing
0:14:37oh we see point one point five and point two in different uh
0:14:42different configurations and
0:14:44it's a rather disappointing improvement
0:14:49so we tried to figure out why it wasn't working we looked at the likelihoods at various
0:14:53stages of decoding and stuff and nothing was amiss
0:14:56nothing was different from the other setup so
0:14:59at this point we just really don't know why it worked on one setup and not the other
0:15:03and we suspect that the truth is probably somewhere in between
0:15:06so we can do further experiments
0:15:11something we should do in future is to see whether
0:15:15i didn't mention but there's something called a universal background model involved it's only used for
0:15:20pre-pruning
0:15:21but one possibility is that you should train that in a matched way
0:15:25and that would help uh
0:15:27get this stuff to work it could be that the pre-pruning is stopping this from being effective
0:15:31that's just one idea
0:15:33anyway
0:15:34so next thing is just a
0:15:35plug for something
0:15:37we're working on
0:15:38that implements these sgmms
0:15:41it's actually a complete speech toolkit
0:15:45and it's useful independently of the sgmm aspect but
0:15:49it can run the systems we have we have scripts uh
0:15:53for that we have a presentation on friday
0:15:56about that
0:15:57it's not part of the official program but it's in a room here
0:16:00so if anyone's interested they can come along
0:16:04so i believe
0:16:05i'm out of time thank you very much
0:16:12we have time for
0:16:14three or four questions
0:16:19uh are also uh a piece of the question
0:16:22you change a gmm a tool as uh a gmm M
0:16:25right yeah well as we know gmm is that generally we now tool all model and T
0:16:30a i
0:16:31hmmm is used to stick it is you wish
0:16:34uh the uh well you change twice
0:16:35gmm um
0:16:37hmmm have you told that that you can do those uh you could you are model is and we now
0:16:41oh a map model i do uh user oceans
0:16:46i mean you could increase the number of
0:16:48gaussians in the ubm
0:16:50and it would be general but it's really about compressing the number of parameters you have to learn
0:16:56i mean with infinite training data it wouldn't be any
0:17:01better than a gmm
0:17:03but with finite training data it seems to be better
0:17:07oh yeah yeah yeah
0:17:14so a little used about because we
0:17:17so the basic the
0:17:19uh of the variances
0:17:22a in some funny way and hmmm so
0:17:24a lot of mind how many more parameters or less parameters that well a U eight and have it is
0:17:30you mean input to do that a little bit less
0:17:33for callhome i haven't checked but i
0:17:37have a feeling it might be a little bit more but but when you have a lot of data it's
0:17:41usually less to you to unit
0:17:54does the difference between callhome and switchboard
0:17:58for the speaker modeling have to do with the amount of data per speaker
0:18:05i'm not one of these database gurus i really don't know
0:18:11whether that differs
0:18:14yeah i'd have to look into that but also the likelihood
0:18:19computation for the uh
0:18:21when you symmetrize when you uh
0:18:25use the speaker
0:18:28subspace in the weights
0:18:30does that change a lot is it more complicated
0:18:33well it's very slightly more complicated
0:18:35uh but
0:18:36it's not significantly harder so
0:18:38there's like an extra quantity that you have to precompute and
0:18:43then at the time when you
0:18:45compute the speaker vector there's a bunch of inner products that you have to compute one for each
0:18:50state or something
0:18:51one for each substate but
0:18:53that doesn't add significantly to the compute it's just bookkeeping yeah and i see
0:18:58that it increases the memory nearly doubles the memory required for
0:19:01storing a model
0:19:03you mean in the likelihood computation or in training as well
0:19:08oh i mean in storing the model are there any more weights
0:19:12oh it's not like there's more weights but
0:19:15there's some quantities that are the same size as the expanded weights that you have to
0:19:19store as well
0:19:24let's thank the speaker again