Our next talk is going to be given by Daniel Povey of Microsoft Corporation.

Hello.

Some of the material in this talk is a little bit redundant with the next speaker's, because the next speaker and I have been giving talks on similar topics. But I am going to go through the introductory material anyway, because it's necessary to understand my talk. I'm assuming that people in this audience may or may not have heard of the SGMM, and will probably benefit from me going through the basics again.

The SGMM is a technique that we introduced fairly recently. It's a kind of factored form of a Gaussian mixture model based system.

I'm going to get to it in stages, starting from something that everyone knows. First, imagine you have a full-covariance system; I've written down the equations for that. This is just a full-covariance mixture of Gaussians in each state. At the bottom I've enumerated what the parameters are: the weights, the means, and the variances.
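The equations on the slide aren't captured in the transcript, but the standard full-covariance mixture being described has the form

$$ p(\mathbf{x} \mid j) \;=\; \sum_{i=1}^{M_j} w_{ji}\, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_{ji},\, \boldsymbol{\Sigma}_{ji}\right), $$

with weights $w_{ji}$, means $\boldsymbol{\mu}_{ji}$, and covariances $\boldsymbol{\Sigma}_{ji}$ for each state $j$ and Gaussian index $i$.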

Next we make a very trivial change: we stipulate that the number of Gaussians in each state is the same, and that it's a large number, let's say two thousand. This is obviously not a practical system at this point, but I'm making as small a change as possible at each step. So there's the same number of Gaussians in each state, and if we look at the parameters (I'm really just listing the continuous ones), those are unchanged from before.

The next thing we do is say that the covariances are shared across states, but not shared across Gaussian indices. The equations don't change much; all that happens is that we drop one index from the sigma. If I just go back: it was Σ_ji, and now we just have Σ_i, where i is the Gaussian index, going from one to, say, one or two thousand.
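In other words, the per-state mixture now uses a shared pool of covariances:

$$ p(\mathbf{x} \mid j) \;=\; \sum_{i=1}^{I} w_{ji}\, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_{ji},\, \boldsymbol{\Sigma}_{i}\right), \qquad I \approx 1000\text{--}2000. $$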

The next stage is slightly more complicated, and it's the key stage: we restrict the means to a subspace. The means are now no longer parameters. Each μ_ji is a vector, where j is the state and i is the Gaussian index, and we say that μ_ji equals M_i times v_j. You can interpret these quantities in various ways; M_i is a matrix and v_j is a vector, and I don't really give them much interpretation. But each state j now has a vector v_j of dimension, let's say, forty or fifty, and each Gaussian index i has this matrix M_i, which might be thirty-nine by forty or thirty-nine by fifty: a matrix that says how the mean for that Gaussian index varies when the vector of the state changes.

What changed here is that we used to have μ_ji down there in the parameter list (let me go back one); now it's v_j and M_i, and of course μ_ji is the product of the two. That's the most important change from a regular system.
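That is, the means are generated from the subspace as

$$ \boldsymbol{\mu}_{ji} \;=\; \mathbf{M}_i \mathbf{v}_j, \qquad \mathbf{M}_i \in \mathbb{R}^{39 \times 40}\ (\text{or } 39 \times 50), \quad \mathbf{v}_j \in \mathbb{R}^{40}\ (\text{or } \mathbb{R}^{50}). $$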

There are a few more changes. The next one is that the weights are no longer parameters. Suppose there are a thousand or two thousand Gaussians per state: that's a lot of weights, and we don't want most of the parameters of the model to be in the weights, because we've become accustomed to the weights being a relatively small subset of the parameters. So we now say the weights depend on these vectors v: we make the unnormalized log weights a linear function of the v's. You see at the top the exp of w_i transpose v_j; w_i transpose v_j is a scalar that we can interpret as an unnormalized log weight, and all the rest of the equation is doing is normalizing it. People ask me: why the log weight, why not just the weight? Well, you can't make the weights themselves depend linearly on the vector, because then it would be hard to force the numbers to be positive. Also, I think the whole optimization problem becomes non-convex if you choose any formula apart from this one (up to scaling and so on).
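The weight equation being described on the slide is the softmax

$$ w_{ji} \;=\; \frac{\exp\!\left(\mathbf{w}_i^{\top} \mathbf{v}_j\right)}{\sum_{i'=1}^{I} \exp\!\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_j\right)}. $$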

OK, so let me just show you what changed here. If I go back: the parameters were w_ji, v_j, et cetera; now they're the w_i vectors, and so on. Instead of having the weights as parameters, we have these vectors: one vector w_i for each Gaussian index, so one or two thousand of these vectors.

The next thing — no, not speaker adaptation yet; the next thing is sub-states. We just add another layer of mixture. Now, you can always add another layer of mixture, right? It just happens to help in this particular circumstance, and my intuition is that there might be a particular kind of phonetic state that can be realized in two very distinct ways — say, you might pronounce the 't' or you might not pronounce it — and it just seems more natural to have a mixture of two of these vectors v, one to represent the 't' and one to represent its absence. Otherwise you force the subspace to learn things that it really shouldn't have to learn.

OK, so we've introduced these sub-states. If I just go back and look at the parameters at the bottom: it was w_i, v_j and so on; now we have c_jm, w_i, v_jm and so on. The new parameter here is c_jm, the sub-state mixture weight, and we've also added a new subscript on the v's: it's now v_jm.
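Putting the pieces together, the per-state distribution with sub-states is

$$ p(\mathbf{x} \mid j) \;=\; \sum_{m} c_{jm} \sum_{i=1}^{I} w_{jmi}\, \mathcal{N}\!\left(\mathbf{x};\, \mathbf{M}_i \mathbf{v}_{jm},\, \boldsymbol{\Sigma}_i\right), $$

with the weights $w_{jmi}$ given by the same softmax as before, applied to $\mathbf{v}_{jm}$.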

The next stage is speaker adaptation. We can do the normal things like constrained MLLR, but there's also a kind of speaker adaptation that's specific to this model. You see there's this N_i v^(s) term; if I go back one slide you can see the change — this is the new thing. We introduce a speaker-specific vector v with superscript s. (We put the s on top because sometimes we have both indices on certain quantities, and it becomes a mess otherwise.) That v superscript s is the speaker-specific vector that encodes the information about that speaker. So what we do is train a kind of speaker subspace, and these N_i quantities tell you how each mean varies with the speaker. Typically the speaker subspace has a dimension of around forty, the same dimension as the phonetic one, so you have quite a few parameters to describe the speaker subspace. To decode, you'd have to do a first-pass decoding, estimate these v superscript s vectors, and then decode again. So we add the N_i parameters; and there are also these v superscript s, but those are speaker-specific and not really part of the model — they're a little bit like an fMLLR transform or something like that.

i i think we can to the end of describing the sgmm so that means we K

but it is uh

oh i described that to now it's is stuff that we've already published

and i just maybe the punch line of what we already described in case you haven't seen that

but it bad so than a regular gmm based system

uh uh i four

it can better at the M a mobile and that's a special better for small data to the core

a twenty percent relative improvement

if you have a few hours of data and maybe

ten percent

if you like when you have tons of data

you have a thousand dollars a

and uh

the problems a somewhat less up to the scrimmage of training

mainly due to bad interaction with the feature space discriminative training

i just some in previous work here

So what this talk is about is fixing an asymmetry in the SGMM. Let me go back one slide. With the speaker adaptation stuff you have this M_i v_jm plus N_i v^(s) — a nice, symmetrical-looking equation, because you have one vector describing the phonetic space and another vector describing the speaker space, and we add them together. That's nice and symmetric. But if you go down to the equation for the weights, w_jmi equals the softmax expression, we don't do anything with the speaker stuff there. That appears as an asymmetry in the model: we're saying the weights depend on the phonetic state but not on the speaker — and why shouldn't they depend on the speaker? So what this paper is about is fixing that asymmetry.

a look at that equation for the weights the uh

the last but one equation

we we've added that um is for for the uh

speaker yeah

that

that for action just look at the top of a look at the new numerator

that's the uh normalized what weight

well the the inside the brackets of the uh normalized log way

so but this is saying is it's a a function of the

phonetic

uh

state and is a linear function of the speaker state so it's almost the simplest thing you could do

we just fix the asymmetry had the parameters we have is this

you use subscript i

which is a kind of

peak uh

the of the

the thing that tells you how the weights very with the speaker

just the speaker space on a log of W subscript script i
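The symmetrized weight equation being described is

$$ w_{jmi}^{(s)} \;=\; \frac{\exp\!\left(\mathbf{w}_i^{\top} \mathbf{v}_{jm} + \mathbf{u}_i^{\top} \mathbf{v}^{(s)}\right)}{\sum_{i'=1}^{I} \exp\!\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_{jm} + \mathbf{u}_{i'}^{\top} \mathbf{v}^{(s)}\right)}. $$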

Now, it wasn't hard to write down this equation — so why didn't we do it before? Well, you can't just write down an equation; you also have to be able to efficiently estimate it and decode with it.

If you were to just expand these SGMMs into big Gaussian mixtures, that would be completely impractical, because think about it: each state now has two thousand Gaussians or whatever, and they're full covariance — I don't know if I mentioned that, but they end up full covariance — so you couldn't fit that in the memory of a normal machine.
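To see why, with illustrative numbers (say four thousand states, which is an assumption, not a figure from the talk):

$$ 4000 \text{ states} \times 2000 \text{ Gaussians} \times \Big(\underbrace{39}_{\text{mean}} + \underbrace{\tfrac{39 \cdot 40}{2}}_{\text{full cov.}}\Big) \;\approx\; 6.5 \times 10^{9} \text{ floats} \;\approx\; 26\,\text{GB in single precision}. $$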

We had previously described ways in which you can efficiently evaluate the likelihoods, but it just wasn't one hundred percent obvious how to extend those methods to the case where the weights depend on the speaker. So that's what this paper is about — there's a separate tech report that describes the details — it's about how to efficiently evaluate the likelihoods when you symmetrize the model in this way.

I'm not going to go into the details of that; it's reasonably fast, although you need a bit more memory. One thing I'll mention, just because it's necessary for understanding the results: we describe two updates for the u subscript i quantities, an inexact one and an exact one. The difference really isn't that important, so I'm just going to skip over it.
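To make the flavor of the efficiency issue concrete, here is a minimal numpy sketch — my own illustration under assumed shapes, not the scheme from the tech report. The speaker-dependent offsets u_i·v^(s) are computed once per speaker, so the weights can be re-normalized per speaker rather than per frame:

```python
import numpy as np

def substate_log_weights(W, U, V_sub, v_spk):
    """Speaker-dependent sub-state log-weights (hypothetical shapes).

    W     : (I, S_phn)  weight-projection vectors w_i, one row per Gaussian
    U     : (I, S_spk)  speaker weight-projection vectors u_i
    V_sub : (N, S_phn)  sub-state vectors v_jm, stacked over all sub-states
    v_spk : (S_spk,)    the speaker vector v^(s)

    Returns log w_jmi^(s) with shape (N, I).
    """
    # Done once per speaker, not once per frame:
    spk_offset = U @ v_spk                        # (I,)
    # Unnormalized log-weights  w_i . v_jm + u_i . v^(s):
    logits = V_sub @ W.T + spk_offset[None, :]    # (N, I)
    # Normalize over the Gaussian index i within each sub-state:
    logits -= np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return logits
```

The point of the sketch is that updating the weights for a new speaker is a cheap per-speaker pass over the sub-states (plus the extra storage for the results), leaving the per-frame Gaussian evaluation essentially unchanged.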

So, on to the results. How long do I have, by the way? OK.

The results are on CallHome and Switchboard. Here are the CallHome results. The top line of the table is the unadapted result. The word error rates are high; this is a really difficult task — CallHome English doesn't have much training data, and it's messy speech. The second line is with the speaker vectors; that's the standard SGMM, without the new stuff. The bottom two lines are the new stuff; the difference between those two lines is not important here, so let's focus on the difference between the second and third lines. That's about a one and a half percent absolute improvement, going from 45.9% to 44.4%. So that seems like a very worthwhile improvement from this symmetrization.

so we put is about that

uh

oh yeah here is the uh

the same with constrained mllr a

just like you can get the best result this way you can combined the

the uh special form of adaptation with the standard method

so again we get improvement

how much is it now

most improvement we get is about

a two percent absolute

pretty clear

Unfortunately, this didn't seem to work on Switchboard. This table is a bit busy, but the key lines are the bottom two: the second-to-last line is the standard setup, and the bottom one is the symmetrization. We're seeing between zero and 0.2 percent absolute improvement, which was a bit disappointing. We thought maybe it was some interaction with VTLN, so we did the experiment without VTLN, and again we're seeing 0.1, 0.5, and 0.2 in different configurations — a rather disappointing improvement.

We tried to figure out why it wasn't working: we looked at the likelihoods at various stages of decoding and so on, and nothing was amiss; nothing was different from the other setup. So at this point we really don't know why it worked on one setup and not the other, and we suspect the truth is probably somewhere in between, so we need to do further experiments. One thing we should do in future — I didn't mention it, but there's a thing called the universal background model involved, which is only used for pruning — one possibility is that you should train that in the matched way, and that would help. It could be that the pruning is stopping this from being effective. That's just one idea, anyway.

The next thing is just a plug for something: we have a toolkit that implements these SGMMs. It's actually a complete speech recognition toolkit, and it's useful independently of the SGMM aspect; it can run the systems we have, and we have scripts for that. We have a presentation about it on Friday — not part of the official program, but it'll be in a room here — so if anyone's interested, they can come along.

I believe I'm out of time, so thank you very much.

We have time for three or four questions.

Q: I also have a question. You changed the GMM into an SGMM. As we know, the GMM is a general tool that can model any distribution you wish. Since you've changed it, is the model still that general — can it still model arbitrary distributions?

A: You could increase the number of Gaussians in the UBM and it would be general; but it's really about compressing the number of parameters you have to learn. With infinite training data it wouldn't be any better than a GMM, but with finite training data it seems to be better.

Q: I'm a little confused, because you tie the variances in some funny way, so I can't get my head around it: does it end up with more parameters or fewer parameters?

A: It can be a little bit less. Actually, I haven't checked in the setup we distribute — I have a feeling it might be a little bit more there — but when you have a lot of data it's usually less.
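As a rough illustration of why it's usually less with a lot of data, here is a back-of-the-envelope count under purely assumed, illustrative sizes (not the configurations from the paper):

```python
# All sizes below are illustrative assumptions, not the paper's setups.
D, S = 39, 40                 # feature dim, phonetic-subspace dim
I = 1000                      # shared full-covariance Gaussians
J, M = 4000, 4                # HMM states, average sub-states per state

sgmm_params = (I * D * (D + 1) // 2   # covariances Sigma_i (symmetric)
               + I * D * S            # mean projections M_i
               + I * S                # weight projections w_i
               + J * M * S            # sub-state vectors v_jm
               + J * M)               # sub-state weights c_jm

# A baseline diagonal-covariance GMM with, say, 16 Gaussians per state:
N = 16
gmm_params = J * N * (2 * D + 1)      # means + variances + weights

print(sgmm_params, gmm_params)        # ~3.0M vs ~5.1M with these choices
```

With these made-up numbers the SGMM is smaller, because the bulk of its parameters are shared across all states; with fewer states or more sub-states the comparison can flip.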

Q: Could the difference between CallHome and Switchboard, for the speaker modeling, have to do with the amount of data per speaker?

A: I'm not one of these database gurus; I really don't know whether that differs between the two. I'd have to look into it.

Q: Also, about the likelihood computation: when you symmetrize, when you add the speaker subspace into the weights, does that change a lot? Does it get more complicated?

A: It's very slightly more complicated, but not significantly harder. There's an extra quantity that you have to precompute, and then at the time when you compute the speaker vector there's a bunch of inner products you have to compute — one for each state, or rather for each sub-state. But it doesn't add significantly to the compute; it's just bookkeeping.

Q: Does it increase the memory? Does it nearly double the memory required for storing the model?

A: You mean in doing the likelihood computation, or in training as well?

Q: In storing the model — does the model have more weights?

A: It's not that there are more weights as such, but there's a quantity, the same size as the expanded-out weights, that you have to store.

Q: OK, thanks.

Let's thank the speaker again.