0:00:13 | Our next talk is going to be given by Dan Povey of Microsoft Corporation.

0:00:17 | hello |

0:00:20 | Some of the material from this talk is a little bit redundant with the next speaker's,

0:00:29 | because we have been giving talks on similar topics,

0:00:36 | but I am going to go through the introductory material anyway

0:00:42 | because it's necessary to understand my talk.

0:00:45 | I'm assuming that people in this audience may or may not have heard of SGMMs,

0:00:52 | and will probably benefit from me going through the basics again.

0:00:58 | This is a technique that we introduced fairly recently;

0:01:03 | it's a kind of factored form of the Gaussian mixture model.

0:01:10 | I'm going to get to it in stages, starting from something that everyone knows.

0:01:14 | Now, first imagine you have a full-covariance system,

0:01:20 | and I've just written down the equations for that.

0:01:25 | This is just a full-covariance mixture of Gaussians in each state;

0:01:28 | at the bottom I've enumerated what the parameters are:

0:01:32 | the weights, the means, and the variances.
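Spelling that out: the state likelihood is p(x|j) = Σ_i c_ji N(x; μ_ji, Σ_ji). Here is a minimal numpy sketch of that computation; the function names and dimensions are my own illustration, not from the talk.

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def log_gauss_full(x, mu, sigma):
    """Log-density of a full-covariance Gaussian N(x; mu, sigma)."""
    d = len(x)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

def state_loglike(x, weights, means, covars):
    """log p(x | state j) = log sum_i c_ji N(x; mu_ji, Sigma_ji)."""
    comps = np.array([log_gauss_full(x, m, s) for m, s in zip(means, covars)])
    return logsumexp(np.log(np.array(weights)) + comps)
```

Working in the log domain, as above, is the standard way to avoid underflow when the per-Gaussian densities are tiny.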

0:01:36 | Next we make a very trivial change: we stipulate that

0:01:40 | the number of Gaussians in each state is the same,

0:01:44 | and it would be a large number, let's say two thousand.

0:01:47 | This is obviously an impractical system at this point,

0:01:51 | but I'm just making as small a change as possible each time.

0:01:54 | So we have the same number of Gaussians in each state, and

0:01:57 | when we list the parameters (really just the continuous ones),

0:02:01 | those are unchanged from before.

0:02:04 | The next thing we do

0:02:06 | is say that the covariances are shared across states

0:02:10 | but not shared across Gaussians.

0:02:12 | So the equations don't change much; all that happens is we drop one index from the sigma. If I just go back,

0:02:18 | you can see it was Sigma_ji; now we just have Sigma_i.

0:02:22 | So i is the Gaussian index; it goes from, let's say, one to

0:02:26 | one or two thousand or something.

0:02:31 | The next stage

0:02:32 | is slightly more complicated, and it's the key stage:

0:02:37 | we limit the means to a subspace.

0:02:40 | So the means are now no longer parameters.

0:02:45 | mu_ji is a vector,

0:02:47 | where j is the state and i is the Gaussian index,

0:02:50 | and I'm saying mu_ji = M_i v_j.

0:02:55 | You can interpret these quantities in various ways;

0:02:59 | M_i is a matrix and v_j a vector, and I don't really give them much interpretation.

0:03:03 | But each state —

0:03:05 | each state j now has a vector v_j

0:03:08 | of dimension, let's say, forty or fifty,

0:03:10 | and each Gaussian index i has this matrix M_i;

0:03:14 | it might be thirty-nine by forty or thirty-nine by fifty:

0:03:18 | a matrix that says

0:03:21 | how the mean of that state — sorry,

0:03:26 | how the mean of that Gaussian index varies when the vector of that state changes.

0:03:32 | So what changed here: if I go back one slide,

0:03:37 | we used to have mu_ji down there in the parameter list; now it's

0:03:41 | v_j and M_i,

0:03:43 | and of course mu_ji is

0:03:45 | the product of the two.
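So each mean is derived rather than stored: μ_ji = M_i v_j. A tiny numpy sketch of the factorization, using the subspace dimensions mentioned in the talk but made-up (small) counts of states and Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
D, S = 39, 40      # feature dim, phonetic-subspace dim (as in the talk)
I, J = 4, 3        # Gaussian indices and states (tiny, for illustration)

M = rng.standard_normal((I, D, S))   # one D x S matrix per Gaussian index i
v = rng.standard_normal((J, S))      # one S-dim vector per state j

# mu[j, i] = M_i @ v_j: all J*I means are generated from I matrices + J vectors
mu = np.einsum('ids,js->jid', M, v)
```

The point of the factoring is visible in the parameter counts: with realistic sizes (thousands of states, a few hundred Gaussian indices) the matrices and vectors are far fewer numbers than the J·I·D means they generate.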

0:03:48 | Now, that's the most important change

0:03:52 | from a regular system,

0:03:54 | and there are a few more changes.

0:03:57 | The next thing is that the weights

0:03:58 | are no longer parameters.

0:04:00 | There are a lot of weights:

0:04:04 | suppose there are a thousand or two thousand Gaussians per state; that's

0:04:07 | a lot of parameters, and we don't want most of the parameters just to be in the weights, because

0:04:12 | we've grown accustomed to the weights being a relatively small subset of the parameters.

0:04:16 | So we say the weights

0:04:18 | now depend on these vectors v,

0:04:21 | and what we do is make the

0:04:25 | unnormalized log weights a linear function of these v's.

0:04:29 | So you see on the top exp(w_i^T v_j); w_i^T v_j

0:04:35 | is a scalar that we can interpret as an unnormalized log weight,

0:04:39 | and all this equation is doing is normalizing it.

0:04:42 | People ask me: why the log weights and not just the weights? Well,

0:04:47 | you can't make the weights themselves depend linearly on the vector, because then

0:04:52 | it would be hard to force the numbers to be positive.

0:04:56 | Also,

0:04:58 | I think the whole optimization problem becomes non-convex if you choose any other formula apart from this,

0:05:04 | up to scaling and so on.
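In symbols, w_ji = exp(w_i^T v_j) / Σ_{i'} exp(w_{i'}^T v_j) — a softmax over Gaussian indices. A small sketch (variable names are my own):

```python
import numpy as np

def sgmm_weights(W, v_j):
    """Mixture weights w_ji = softmax_i(w_i . v_j).

    W   : (I, S) matrix whose rows are the vectors w_i
    v_j : (S,) state vector
    """
    a = W @ v_j          # unnormalized log weights, one per Gaussian index
    a = a - a.max()      # subtract the max for numerical stability
    w = np.exp(a)
    return w / w.sum()   # positive by construction, sums to one
```

Exponentiating guarantees positivity, and the normalization guarantees the weights sum to one, which is exactly the point made above about why the *log* weights are made linear in v.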

0:05:06 | Okay, so I'll just show you what changed here;

0:05:09 | if I go back,

0:05:11 | the parameters were w_ji, v_j, et cetera;

0:05:14 | now it's

0:05:15 | w_i (in bold: it's a vector).

0:05:18 | So instead of having the weights as parameters, we have these vectors:

0:05:23 | one vector w_i for each Gaussian index, so there are one or two thousand

0:05:28 | of these vectors.

0:05:31 | The next thing —

0:05:37 | not speaker adaptation;

0:05:37 | the next thing is sub-states.

0:05:42 | We just add another layer of mixture.

0:05:44 | Now, you can always add another layer of mixture, right?

0:05:47 | It just happens to help in this particular

0:05:50 | circumstance, and my intuition is that

0:05:53 | there might be a particular

0:05:56 | kind of phonetic state that can be realized in two very distinct ways:

0:06:00 | you might pronounce the 't', or you might not pronounce it.

0:06:05 | It just seems more natural to have a mixture of two

0:06:09 | of these vectors v, one to represent each way;

0:06:14 | otherwise you force the subspace to learn things that it really shouldn't have to learn.

0:06:19 | Okay, so we've introduced these sub-states; if I just go back

0:06:24 | and look at the parameters at the bottom:

0:06:27 | it was w_i, v_j, etc.; now we have

0:06:29 | c_jm, w_i, v_jm.

0:06:33 | The new parameter here is c_jm, the sub-state mixture weight,

0:06:37 | and we also added a new subscript m on the vectors, so now it's v_jm.

0:06:42 | okay |
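With sub-states the state likelihood becomes p(x|j) = Σ_m c_jm Σ_i w_jmi N(x; μ_jmi, Σ_i). A sketch of that two-layer mixture in the log domain, assuming the per-Gaussian log-densities have already been evaluated (the layout and names are mine, not the talk's notation):

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def substate_loglike(c_jm, log_w, log_N):
    """log p(x|j) = log( sum_m c_jm * sum_i w_jmi * N_i(x) ).

    c_jm  : (M,) sub-state mixture weights
    log_w : (M, I) log-weights of each sub-state over Gaussian indices
    log_N : (I,) log-density of each shared Gaussian evaluated at x
    """
    per_sub = np.array([logsumexp(lw + log_N) for lw in log_w])
    return logsumexp(np.log(c_jm) + per_sub)
```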

0:06:44 | The next

0:06:45 | stage is

0:06:46 | speaker adaptation.

0:06:48 | We can do normal things like fMLLR and so on,

0:06:52 | but there's a kind of special speaker adaptation specific to this model.

0:06:57 | You see there's this extra term with a superscript s; if I go back one slide you can see the change.

0:07:03 | This is the new thing:

0:07:06 | we introduce a speaker-specific vector v^(s), v superscript s.

0:07:11 | We just put the s on top because sometimes we have both indices on certain quantities, and

0:07:15 | it becomes a mess otherwise.

0:07:20 | So that v superscript s is the speaker-specific vector;

0:07:25 | it encodes the information about that speaker.

0:07:28 | What we do is train

0:07:30 | a kind of speaker subspace, and these N_i quantities tell you how each mean

0:07:36 | varies with the speaker.

0:07:38 | Typically the speaker subspace is of dimension

0:07:41 | around forty,

0:07:42 | the same dimension as the phonetic one,

0:07:45 | so you have quite a few parameters to describe the speaker subspace.

0:07:50 | And to decode, you'd have to

0:07:52 | do a first-pass decoding,

0:07:54 | estimate these v superscript s,

0:07:59 | and decode again.

0:08:01 | So we add the parameters N_i,

0:08:04 | and there are also these v^(s), but those are speaker-specific and not really part of the model;

0:08:09 | they're a little bit like

0:08:10 | an fMLLR transform or something like that.
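The adapted mean is μ^(s)_jmi = M_i v_jm + N_i v^(s): the phonetic projection plus a speaker projection into the same feature space. A sketch with the dimensions mentioned in the talk (the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
D, S, T = 39, 40, 40   # feature dim; phonetic and speaker subspace dims

M_i  = rng.standard_normal((D, S))  # phonetic projection for Gaussian index i
N_i  = rng.standard_normal((D, T))  # speaker projection for Gaussian index i
v_jm = rng.standard_normal(S)       # sub-state vector
v_s  = rng.standard_normal(T)       # speaker vector, estimated in a first pass

mu_s = M_i @ v_jm + N_i @ v_s       # speaker-adapted mean
```

Note that the speaker term N_i v^(s) depends only on the Gaussian index and the speaker, not on the state, which is what makes the additive form cheap to apply.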

0:08:16 | I think we've come to the end of describing the SGMM.

0:08:22 | What I've described up to now is stuff that we've already published,

0:08:25 | and I'll just give the punch line of what we already described, in case you haven't seen it.

0:08:30 | It's better than a regular GMM-based system.

0:08:37 | It's better at the maximum-likelihood level, and it's especially better for small amounts of data:

0:08:42 | about a twenty percent relative improvement

0:08:45 | if you have a few hours of data, and maybe

0:08:47 | ten percent

0:08:48 | when you have tons of data,

0:08:51 | like a thousand hours.

0:08:54 | The improvement is somewhat less after discriminative training,

0:08:57 | mainly due to a bad interaction with the feature-space discriminative training.

0:09:03 | That just summarizes the previous work.

0:09:06 | So what this talk is about

0:09:08 | is fixing an asymmetry in the SGMM.

0:09:14 | Let's go back one slide.

0:09:16 | With the speaker adaptation stuff you have this

0:09:20 | M_i v_jm plus N_i v^(s) — a nice, kind of symmetrical equation, because

0:09:25 | you have these vectors describing the phonetic space

0:09:29 | and other vectors describing the speaker space, and we add them together.

0:09:35 | That's nice and symmetric. But if you go down to

0:09:38 | the equation for the weights, w_jmi equals blah-blah,

0:09:41 | we don't do the same thing with the speaker stuff there,

0:09:45 | so there's an asymmetry in the model, because we're saying the weights depend on the

0:09:49 | phonetic state but not the

0:09:52 | speaker — and why shouldn't they depend on the speaker?

0:09:56 | So what this paper is about is fixing that asymmetry,

0:10:00 | and I'll go forward one slide and you'll see how we fix it.

0:10:06 | Look at that equation for the weights,

0:10:08 | the last-but-one equation.

0:10:10 | We've added a term for the

0:10:13 | speaker.

0:10:16 | Just look at the numerator:

0:10:19 | that's the unnormalized weight —

0:10:22 | well, the inside of the brackets is the unnormalized log weight.

0:10:25 | What this is saying is that it's a linear function of the

0:10:29 | phonetic

0:10:31 | state and a linear function of the speaker vector, so it's almost the simplest thing you could do

0:10:37 | to fix the asymmetry. The parameters we've added are these

0:10:41 | u subscript i,

0:10:43 | which is

0:10:48 | the thing that tells you how the weights vary with the speaker;

0:10:51 | it's just the speaker-space analogue of w subscript i.
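So the symmetrized weights are w^(s)_jmi = exp(w_i^T v_jm + u_i^T v^(s)) / Σ_{i'} exp(w_{i'}^T v_jm + u_{i'}^T v^(s)): the same softmax as before, with one extra speaker-dependent term in the exponent. A sketch (names are my own):

```python
import numpy as np

def symmetric_weights(W, U, v_jm, v_s):
    """Speaker-dependent weights: softmax_i(w_i . v_jm + u_i . v_s).

    W, U : (I, S) and (I, T) matrices whose rows are w_i and u_i
    """
    a = W @ v_jm + U @ v_s   # unnormalized log weights
    a = a - a.max()          # stabilize before exponentiating
    w = np.exp(a)
    return w / w.sum()
```

Setting U to zero recovers the original speaker-independent weights, so the symmetric model strictly generalizes the old one.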

0:10:56 | Now,

0:10:57 | it wasn't hard to write down this equation,

0:11:00 | so you might ask why we didn't do it before.

0:11:03 | Well,

0:11:07 | you can't just write down an equation; you also have to

0:11:10 | be able to efficiently evaluate it and

0:11:13 | decode with it.

0:11:16 | If you were to just

0:11:18 | expand these SGMMs into big Gaussian mixtures, that would be completely impractical,

0:11:23 | because —

0:11:24 | think about it — each state now has two thousand Gaussians or something,

0:11:30 | and they're full covariance —

0:11:31 | I don't know if I mentioned that, but they are full covariance —

0:11:35 | so you can't fit that in memory

0:11:37 | on a normal machine.

0:11:43 | We previously described ways that you can

0:11:46 | efficiently evaluate the likelihoods, but it just wasn't one hundred percent obvious how to extend those methods

0:11:52 | to the case where the weights depend on the speaker.

0:11:55 | So that's what this paper is about —

0:11:57 | there's a separate tech report that describes the details —

0:12:05 | it's about how to efficiently evaluate the likelihoods

0:12:08 | when you symmetrize it.

0:12:12 | I'm not going to go into the details of that.

0:12:15 | It's reasonably efficient, though you need a bit more memory.

0:12:18 | Just because this is necessary for understanding the results, I'll mention that

0:12:23 | we describe two updates for the u's —

0:12:27 | sorry, for the u

0:12:29 | subscript i quantities —

0:12:31 | an inexact one and an exact one,

0:12:34 | but the difference really isn't that important; I'm just going to skip over that.

0:12:40 | So here are the results, on CallHome and —

0:12:43 | how long do I have, by the way?

0:12:47 | Okay.

0:12:47 | On CallHome and Switchboard.

0:12:51 | These are the CallHome results.

0:12:56 | The top line of results is unadapted;

0:12:59 | the second line —

0:13:03 | well, it's a really difficult task:

0:13:05 | CallHome English doesn't have much training data, and it's messy speech.

0:13:08 | The second line

0:13:10 | is with the speaker vectors; that's just the standard SGMM adaptation.

0:13:15 | The bottom two lines are the new stuff,

0:13:18 | and the difference between the bottom two lines

0:13:20 | is not important,

0:13:22 | so let's focus on the difference between the second and third lines.

0:13:25 | It's about

0:13:26 | a one and a half percent absolute improvement,

0:13:29 | going from forty-five point nine to forty-four point four.

0:13:32 | So that seems like a very worthwhile improvement from

0:13:35 | this symmetrization.

0:13:39 | So we were pretty happy about that.

0:13:41 | And here is the

0:13:44 | same with constrained MLLR;

0:13:46 | to get the best result this way, you can combine

0:13:50 | the special form of adaptation with the standard method.

0:13:53 | So again we get an improvement —

0:13:55 | how much is it now? —

0:13:57 | the most improvement we get is about

0:14:00 | two percent absolute.

0:14:01 | Pretty clear.

0:14:03 | But it didn't seem to work on Switchboard.

0:14:09 | This table is a bit busy, but the key lines are the bottom two.

0:14:13 | The second-to-last line is the standard

0:14:16 | setup;

0:14:17 | the bottom one is the symmetrization.

0:14:19 | We're seeing

0:14:21 | between zero and zero point two percent

0:14:24 | absolute improvement,

0:14:26 | which was a bit disappointing.

0:14:28 | We thought maybe it was some interaction with VTLN, so

0:14:32 | we did the experiment without VTLN,

0:14:35 | and again we're seeing

0:14:37 | point one, point five, and point two in

0:14:42 | different configurations,

0:14:44 | and it's a rather disappointing improvement.

0:14:49 | So we tried to figure out why it wasn't working; we looked at the likelihoods at various

0:14:53 | stages of decoding and so on, and nothing was amiss —

0:14:56 | nothing was different from the other setup.

0:14:59 | At this point we just really don't know why it worked on one setup and not the

0:15:02 | other,

0:15:03 | and we suspect the truth is probably somewhere in between,

0:15:06 | so we'll have to do further experiments.

0:15:11 | Something we should do in future is to see whether —

0:15:15 | I didn't mention this, but there's a so-called universal background model involved; it's only used for

0:15:20 | pre-pruning —

0:15:21 | one possibility is that you should train that in the matched way,

0:15:25 | and that would help;

0:15:27 | it could be that the pre-pruning is stopping this from being effective.

0:15:31 | That's just one idea,

0:15:33 | anyway.

0:15:34 | The next thing is just a

0:15:35 | plug for something:

0:15:37 | we have a toolkit

0:15:38 | that implements these SGMMs.

0:15:41 | It's actually a complete speech recognition toolkit,

0:15:45 | and it's useful independently of the SGMM aspect.

0:15:49 | You can run the systems we have — we have scripts

0:15:53 | for that — and we have a presentation on Friday

0:15:56 | about it;

0:15:57 | it's not part of the official program, but it will be in this room,

0:16:00 | so if anyone's interested they can come along.

0:16:04 | I believe

0:16:05 | I'm out of time. Thank you very much.

0:16:12 | We have time for

0:16:14 | three or four questions.

0:16:19 | I also have a piece of a question:

0:16:22 | you changed the GMM into an SGMM,

0:16:25 | and as we know the GMM is a general tool that can model almost anything.

0:16:34 | Once you change the

0:16:37 | GMM,

0:16:41 | is your model still that general — can it model any distribution?

0:16:46 | I mean, you could increase the number of

0:16:48 | Gaussians in the UBM

0:16:50 | and it would be general, but it's really about compressing the number of parameters you have to learn.

0:16:56 | With infinite training data it wouldn't be any

0:17:01 | better than a GMM,

0:17:03 | but with finite training data it seems to be better.


0:17:14 | Another question: because you

0:17:17 | tie the

0:17:19 | variances

0:17:22 | in some funny way,

0:17:24 | I lost track of how many more parameters, or fewer parameters, it has.

0:17:30 | In a typical setup it would be a little bit less,

0:17:33 | but it depends on the configuration; I haven't checked in the setup we distribute, but I

0:17:37 | have a feeling it might be a little bit more; when you have a lot of data, though, it's

0:17:41 | usually fewer.


0:17:54 | Could the difference between the CallHome and the Switchboard

0:17:58 | results for the speaker modeling have to do with the amount of data per speaker on the

0:18:04 | two?

0:18:05 | I'm not one of these database gurus; I really don't know

0:18:11 | whether that differs.

0:18:14 | Yeah, I'd have to look into that too. But also, the likelihood

0:18:19 | computation: when you

0:18:21 | symmetrize, when you add in the

0:18:25 | speaker

0:18:28 | subspace on the weights,

0:18:30 | does that change a lot? Is it more complicated?

0:18:33 | Well, it's very slightly more complicated,

0:18:36 | but it's not significantly harder.

0:18:38 | There's an extra quantity that you have to precompute, and

0:18:43 | then, at the time when you

0:18:45 | compute the speaker vector, there's a bunch of inner products you have to compute, one for each

0:18:51 | state — no, for each sub-state —

0:18:53 | but that doesn't add significantly to the computation; it's just bookkeeping.

0:18:58 | But doesn't it increase the memory — nearly double the memory required for

0:19:01 | storing the model?

0:19:03 | You mean in the likelihood computation, or in training as well?

0:19:08 | In storing the model — doesn't the model have more weights?

0:19:12 | It's not like there are more weights, but

0:19:15 | there is a quantity of the same size as the expanded weights that you have to

0:19:19 | store, well...

0:19:21 | Yeah.

0:19:24 | Let's thank the speaker again.