Our next talk is going to be given by our colleague from Microsoft Corporation.

Hello. So, some of the material in this talk is a little bit redundant with the next speaker's, because we've been giving talks on similar topics, but I'm going to go through the introductory material anyway, because it's necessary to understand my talk. I'm assuming that people in this audience may or may not have heard of SGMMs, and will probably benefit from me going through them again.

This is a technique that we introduced fairly recently; it's a kind of factored form of the Gaussian mixture model. I'm going to get to it in stages, starting from something that everyone knows. First, imagine you have a full-covariance system. I've just written down the equations for that: it's a full-covariance mixture of Gaussians in each state, and at the bottom I've enumerated the parameters — the weights, the means, the variances.

Next we make a very trivial change: we stipulate that the number of Gaussians in each state is the same, and that it's a large number, let's say two thousand. That's obviously impractical in a conventional system at this point, but I'm making as small a change as possible at each step. So, the same number of Gaussians in each state; when I list the parameters I'm really just listing the continuous ones, and those are unchanged from before.

The next thing we do is say that the covariances are shared across states but not across Gaussians. The equations don't change much; all that happens is that we drop one index from the sigma. If I go back you can see it was Σ_{ji}; now we just have Σ_i, where i indexes the Gaussian and goes from one to, say, two thousand.

The next stage is slightly more complicated, and it's the key stage: we tie the means to a subspace. The means are no longer parameters; instead μ_{ji} = M_i v_j, where j is the state and i is the Gaussian index. You can interpret these quantities in various ways — M_i is a matrix and v_j is a vector — and I won't give much of an interpretation, but each state j now has a vector v_j of dimension, say, forty or fifty, and each Gaussian index i has a matrix M_i, say thirty-nine by forty or thirty-nine by fifty, that says how the mean for that Gaussian index varies as the vector of the state changes. So what changed here: if I go back one slide, we used to have μ_{ji} down there in the parameter list; now the parameters are v_j and M_i, and μ_{ji} is just the product of the two. That's the most important change from a regular system, and there are a few more changes.

The next thing is that the weights are no longer parameters either. If there are a thousand or two thousand Gaussians per state, that's a lot of weight parameters, and we don't want most of the parameters to be in the weights, because we're accustomed to the weights being a small subset of the parameters. So we say the weights also depend on these vectors v, and what we do is make the unnormalized log weights a linear function of the v's. You see exp(w_i^T v_j) at the top; w_i^T v_j is a scalar that you can interpret as an unnormalized log weight, and all the rest of the equation is doing is normalizing it: w_{ji} = exp(w_i^T v_j) / Σ_{i'} exp(w_{i'}^T v_j). People ask me why a log weight — why not just make the weights depend linearly on the vector? Well, then it would be hard to force them to be positive, and I also think the whole optimization problem becomes non-convex if you choose any other formula apart from this one, up to scaling and so on.
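To make the factored structure concrete, here is a minimal numpy sketch of how the per-state means and weights are derived from the shared quantities. The sizes roughly follow the numbers mentioned above, but the variable names and dimensions are illustrative assumptions, not quantities taken from the slides.

```python
import numpy as np

# Illustrative sizes, roughly following the numbers mentioned in the talk.
D, S, I = 39, 40, 2000    # feature dim, state-vector dim, shared Gaussian indices

# Globally shared parameters, one per Gaussian index i (shared across all states).
M = np.random.randn(I, D, S)            # mean-projection matrices M_i
w = np.random.randn(I, S)               # weight-projection vectors w_i
Sigma = np.stack([np.eye(D)] * I)       # full covariances Sigma_i

# The only per-state parameter is a small vector v_j.
v_j = np.random.randn(S)

# Means are derived, not stored:  mu_{ji} = M_i v_j
mu_j = M @ v_j                          # shape (I, D)

# Weights are derived too:  w_{ji} = exp(w_i^T v_j) / sum_i' exp(w_i'^T v_j)
log_w = w @ v_j                         # unnormalized log weights, shape (I,)
weights_j = np.exp(log_w - log_w.max()) # subtract max for numerical stability
weights_j /= weights_j.sum()
```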
So, just to show you what changed here: if I go back, the parameters were w_{ji}, v_j, et cetera; now it's w_i. Instead of having the weights as parameters we have these vectors w_i, one for each Gaussian index, so two thousand of these vectors, or one thousand.

The next thing — not speaker adaptation yet — the next thing is sub-states. We just add another layer of mixture. You can always add another layer of mixture; it just happens to help in this particular circumstance. My intuition is that there might be a particular kind of phonetic state that can be realized in two very distinct ways — you might pronounce the 't', or you might not pronounce it — and it just seems more natural to have a mixture of two of these vectors v, one to represent the 't' and one to represent its absence; otherwise you force the subspace to learn things it really shouldn't have to learn. So we've introduced these sub-states, and if I go back and look at the parameters at the bottom — w_i, v_j — now we have c_{jm}, w_i, v_{jm}. The parameters we've added are the sub-state mixture weights c_{jm}, and we've also added a new subscript on the v's: it's now v_{jm}.

The next stage is speaker adaptation. You can do the normal things like constrained MLLR, but there's a kind of speaker adaptation that's specific to this model. You see this N_i v^(s) term — if I go back one slide you can see the change; this is the new thing. We introduce a speaker-specific vector v^(s); we just put the s on top as a superscript, because sometimes we have both state and speaker indices on certain quantities and it becomes a mess otherwise. That speaker-specific vector contains the information about that speaker. So what we do is train a kind of speaker subspace, and these N_i quantities tell you how each mean varies with the speaker: μ_{jmi}^(s) = M_i v_{jm} + N_i v^(s). Typically the speaker subspace has a dimension of around forty, the same dimension as the phonetic one, so you have quite a few parameters to describe the speaker subspace. To decode you'd have to do a first-pass decoding, estimate these v^(s), and decode again. So we add the N_i parameters, and also these v^(s), but those are speaker-specific, so they're not really part of the model — they're a little bit like an fMLLR transform or something like that.
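Continuing the earlier sketch, this is one way the speaker-adapted, sub-state means described above could be computed; again, the sizes and names (num_substates, N, v_s) are assumptions for illustration.

```python
import numpy as np

# Illustrative sizes (assumptions, not numbers from the slides).
D, S, T, I = 39, 40, 40, 500   # feature dim, phonetic dim, speaker dim, Gaussian indices
num_substates = 2              # sub-states in this particular state

M = np.random.randn(I, D, S)   # phonetic-subspace projections M_i
N = np.random.randn(I, D, T)   # speaker-subspace projections N_i
v_jm = np.random.randn(num_substates, S)  # sub-state vectors v_{jm}
v_s = np.random.randn(T)       # speaker vector v^(s), estimated in a first decoding pass

# Speaker-adapted means:  mu_{jmi}^(s) = M_i v_{jm} + N_i v^(s)
speaker_offset = N @ v_s                                   # (I, D): same for every state
mu = np.einsum('ids,ms->mid', M, v_jm) + speaker_offset    # (num_substates, I, D)
```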
I think that comes to the end of describing the SGMM as it stood — that's the stuff we've already published. Let me just give the punch line of what we already described, in case you haven't seen it: compared with a regular GMM-based system, with maximum-likelihood training it does quite a bit better, and it's especially better for small amounts of data. Typically you get a twenty percent relative improvement if you have a few hours of data, and maybe ten percent when you have tons of data, like a thousand hours. The advantage is somewhat smaller after discriminative training, mainly due to a bad interaction with feature-space discriminative training; that's just summarizing previous work.

So what this talk is about is fixing an asymmetry in the SGMM. If I go back one slide to the speaker adaptation stuff: you have M_i v_{jm} + N_i v^(s). That's a kind of symmetrical equation, because you have one vector describing the phonetic space and another vector describing the speaker space, and we just add them together — that's nice and symmetric. But if you go down to the equation for the weights, w_{jmi} = exp(w_i^T v_{jm}) / Σ_{i'} exp(w_{i'}^T v_{jm}), we don't do the analogous thing with the speaker stuff, and there's an asymmetry in the model, because we're saying the weights depend on the phonetic state but not on the speaker — and why shouldn't they depend on the speaker?

So this paper is about fixing that asymmetry, and if I go forward one slide you'll see how we fix it. Look at the equation for the weights — the term we've added is for the speaker. Look at the numerator: inside the exponential is the unnormalized log weight, and what this is saying is that the unnormalized log weight is a function of the phonetic state plus a linear function of the speaker vector: w_{jmi}^(s) = exp(w_i^T v_{jm} + u_i^T v^(s)) / Σ_{i'} exp(w_{i'}^T v_{jm} + u_{i'}^T v^(s)). It's almost the simplest thing you could do; we just fix the asymmetry. The parameters we've added are these u_i quantities, vectors that tell you how the weights vary with the speaker — a kind of speaker-space analogue of the w_i.

Now, it wasn't hard to write down this equation, so you might ask why we didn't do it before. Well, you can always write down an equation for something; the question is whether you can efficiently evaluate it and decode with it. If you were to just expand these SGMMs into big Gaussian mixtures it would be completely impractical, because each state now has two thousand Gaussians, and they're full covariance — I don't think I mentioned that, but they are full covariance — so you can't fit that in memory on an ordinary machine. We previously described ways to evaluate the likelihoods efficiently, but it just wasn't one hundred percent obvious how to extend those methods to the case where the weights depend on the speaker. So part of what this paper is about — there's a separate tech report that describes the details — is how to efficiently evaluate the likelihoods when you symmetrize the model like this. I'm not going to go into the details; it does require a bit more memory. One thing that is necessary for understanding the results: we describe two updates for the u_i quantities, an inexact one and an exact one, but the difference really isn't that important, so I'm just going to skip over that.

So, on to the results. (How long do I have, by the way? OK.)
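For concreteness, here is a minimal sketch of the symmetrized weight formula described above, with the speaker-dependent term factored out to hint at why only a modest amount of extra precomputation is needed; the sizes and variable names are assumptions, and this is not the paper's actual efficient likelihood-evaluation scheme.

```python
import numpy as np

# Illustrative sizes (assumptions).
S, T, I = 40, 40, 500          # phonetic dim, speaker dim, Gaussian indices

w = np.random.randn(I, S)      # phonetic weight projections w_i
u = np.random.randn(I, T)      # speaker weight projections u_i (the new parameters)
v_jm = np.random.randn(S)      # one sub-state vector
v_s = np.random.randn(T)       # the speaker vector

# Symmetric weights:
#   w_{jmi}^(s) = exp(w_i^T v_{jm} + u_i^T v^(s)) / sum_i' exp(w_i'^T v_{jm} + u_i'^T v^(s))
speaker_term = u @ v_s         # (I,): depends only on the speaker, so it can be
                               # computed once per speaker and added to every state's
                               # unnormalized log weights
log_w = w @ v_jm + speaker_term
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()
```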
The results are on CallHome and Switchboard. These are the CallHome results. The top line is the unadapted result; CallHome English is a really difficult task — it doesn't have much training data and it's messy. The second line is with the speaker vectors; that's just the standard SGMM adaptation. The bottom two lines are the new stuff, and the difference between those two lines (the exact versus inexact update) isn't important. So let's focus on the difference between the second and third lines: it's about a one-and-a-half percent absolute improvement, going from 45.9 to 44.4. That seems like a very worthwhile improvement from this symmetrization.

Here's the same thing with constrained MLLR, just to show that you can get the best result this way — you can combine this special form of adaptation with the standard method. Again we get an improvement; the most improvement we get is about two percent absolute, which is pretty clear.

Unfortunately it didn't seem to work on Switchboard. This table is a bit busy, but the key lines are the bottom two: the second-to-last line is the standard system and the bottom one is the symmetrization, and I'm seeing between zero and 0.2 percent absolute improvement, which was a bit disappointing. We thought maybe it was some interaction with VTLN, so we did the experiment without VTLN, and again we see improvements of around 0.1 to 0.5 in the different configurations — a rather disappointing improvement. So we tried to figure out why it wasn't working; we looked at the likelihoods at various stages of decoding and so on, and nothing appeared different from the other setup. At this point we really don't know why it worked on one setup and not the other, and we suspect the truth is probably somewhere in between, so we may do further experiments. Something we should do in the future — I didn't mention this, but there's a so-called universal background model involved, which is only used for pruning — is to train that model in the matched way; it could be that the pruning is stopping this from being effective. That's just one idea, anyway.

The next thing is just to advertise something: we have software that implements these SGMMs. It's actually a complete speech toolkit, and it's useful independently of the SGMM aspect, but it can run the systems we've described, and we have scripts for that. We have a presentation on Friday about it — not part of the official program — in this room, so if anyone's interested, please come along. I believe I'm out of time; thank you very much.

We have time for three or four questions.

Question: You changed the GMM into this SGMM, but as we know, a GMM is a general tool that can model essentially any distribution you wish. Couldn't you achieve the same thing with an ordinary GMM?

Answer: You could increase the number of Gaussians in the UBM and it would be just as general, but this is really about compressing the number of parameters you have to learn. With infinite training data it wouldn't be any better than a GMM, but with finite training data it seems to be better.

Question: You share the variances in this particular way — so overall, how many more or fewer parameters does the model have?
Answer: Typically a little bit less, but I haven't checked for the setups we distribute — I have a feeling it might be a little bit more there — but when you have a lot of data it's usually fewer parameters.

Question: Could the difference between CallHome and Switchboard for the speaker modeling have to do with the amount of data per speaker?

Answer: I'm not one of these database gurus; I really don't know whether that differs between them — I'd have to look into it.

Question: Also, about the likelihood computation: when you symmetrize, when you put the speaker subspace into the weights, does that change a lot — is it much more complicated?

Answer: It's very slightly more complicated, but not significantly harder. There's an extra quantity that you have to precompute, and then when you estimate the speaker vector there's a bunch of inner products you have to compute, one for each sub-state or something like that, but that doesn't add significantly to the compute — it's just bookkeeping.

Question: And does it increase the memory — nearly double the memory required for storing the model?

Answer: Do you mean in the likelihood computation, or in training as well?

Question: In storing the model — are there more weights?

Answer: It's not that there are more weights, but there are some quantities the same size as the expanded weights that you have to store.

OK — let's thank the speaker again.
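Regarding the earlier question about the overall number of parameters, the following is a rough, hypothetical back-of-the-envelope comparison. All of the configuration sizes here (number of states, sub-states, Gaussian indices) are assumptions chosen only to illustrate how the counts scale, not figures quoted in the talk.

```python
# Back-of-the-envelope parameter counts; all sizes are assumptions for illustration.
D = 39                         # feature dimension
J = 4000                       # number of tied HMM states (assumed)

# Conventional diagonal-covariance GMM system, ~16 Gaussians per state (assumed):
gmm_params = J * 16 * (D + D + 1)            # means + diagonal variances + weights

# SGMM: large shared part plus a small vector (and sub-state weight) per sub-state.
I, S = 500, 40                               # shared Gaussian indices, state-vector dim
shared = I * (D * S + S + D * (D + 1) // 2)  # M_i, w_i, full covariances Sigma_i
per_state = J * 2 * (S + 1)                  # ~2 sub-states per state: v_{jm} and c_{jm}
sgmm_params = shared + per_state

print(f"GMM:  ~{gmm_params:,} parameters")
print(f"SGMM: ~{sgmm_params:,} parameters")
```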