Hi, everyone.

I'm not going to apologise for the quality of my slides, because I just don't think it matters that much. If you make your slides too fancy, people will think you're trying to compensate for something. That's my philosophy.

Okay. So this talk is actually very similar to some of the others in this session. It's a strange session in that sense, because almost all of these talks are about some way of getting fMLLR from less data, with some kind of factorization.

Now, before I get into it: should we call it CMLLR, or should we call it fMLLR? For some reason I've recently gone over to the CMLLR side, but I'm hedging my bets in the actual talk. I think everyone understands that they're the same thing.

I'm not going to go through this slide in detail, because it's the same as many slides that have been presented already.

The notation is a little bit different, though. This notation with a little plus is kind of my personal notation. I use it because I'm not comfortable with the Greek letter; it's hard for me to remember that the Greek letter is supposed to be the same as x with a one appended. I just think it's easier to remember that the little plus means "append a one". In some people's work it's written differently, and that's another slightly confusing difference between people's notations: sometimes people put the one at the beginning of the vector and sometimes at the end. I think the IBM convention is to put it at the end, and that's what I'm doing.
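To make the notation concrete, here is a minimal sketch of the transform under this convention (the dimensions and variable names are my own illustration, not from the slides):

```python
import numpy as np

# Hypothetical but typical setup: 39-dimensional features, so W = [A ; b]
# is 39 x 40, with the bias b in the LAST column to match putting the
# appended 1 at the END of the extended feature vector.
dim = 39
rng = np.random.default_rng(0)

A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
b = 0.1 * rng.standard_normal(dim)
W = np.hstack([A, b[:, None]])

x = rng.standard_normal(dim)
x_plus = np.append(x, 1.0)      # the "little plus": x with a 1 appended
y = W @ x_plus                  # adapted feature: identical to A @ x + b
```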

This is another kind of introductory slide, but since we've had so many introductions to CMLLR (or fMLLR) in this session, I don't really think I need to go through it. One point I did put in here is that fMLLR works with relatively little adaptation data.

Now, what "a little adaptation data" means here is about thirty seconds or so; someone else in this session actually mentioned that figure. After about thirty seconds or so, you get almost all of the improvement that you're going to get.

But thirty seconds is a little bit too much for many practical applications. If you have some telephone service where someone is going to, say, request a stock quote or do a web search or something, you might only have two or three seconds, or maybe five seconds, of audio, and that's not really enough for fMLLR to work. In fact, below about five seconds, in my experience, it's not going to give you an improvement, and it can actually make things worse, so you might as well turn it off.

So that's the problem this talk is addressing, and actually it's the same problem that many previous talks in the session have been addressing; I think all of the previous talks in the session have been addressing this problem.

So, this slide summarises some of the prior approaches, and I should emphasise that I'm talking about prior approaches to somehow regularising CMLLR. Obviously there are many other things you can do, like eigenvoices, the stuff that the previous speaker mentioned, or adapting other sets of parameters, but here I'm talking about CMLLR regularization.

A simple thing you can do is just make the A matrix diagonal. That's an option in HTK, and it's a reasonable approach, you get a lot of improvement, but it's very ad hoc.

You can also make it block diagonal. This approach had its origins in the cepstral, delta, and delta-delta type features, so you would have three blocks of thirteen by thirteen. Nobody really uses those features unprocessed anymore; I don't think any serious site uses them without some kind of transformation. But you can still use the block-diagonal structure, and in fact one of the baselines I'll be presenting uses these blocks, even though they've lost their original meaning, and it still seems to work.
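As an illustration of that block structure (my own sketch, not code from the talk), with three 13-by-13 blocks corresponding to the static, delta, and delta-delta parts:

```python
import numpy as np

def block_diagonal_A(blocks):
    """Assemble a block-diagonal A matrix from a list of square blocks."""
    dim = sum(b.shape[0] for b in blocks)
    A = np.zeros((dim, dim))
    offset = 0
    for b in blocks:
        n = b.shape[0]
        A[offset:offset + n, offset:offset + n] = b
        offset += n
    return A

rng = np.random.default_rng(1)
# Three 13x13 blocks -> a 39x39 A with many fewer free parameters.
blocks = [np.eye(13) + 0.01 * rng.standard_normal((13, 13)) for _ in range(3)]
A = block_diagonal_A(blocks)
print(A.shape)    # (39, 39)
```

Three blocks of 13x13 give 3 * 169 = 507 free parameters instead of 39 * 39 = 1521 for the full matrix, which is why the constraint helps when adaptation data is scarce.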

Another class of approaches is Bayesian approaches. There have been a couple of different papers, both called fMAPLR. By fMAPLR we mean: we do fMLLR, but we have a prior, and we pick the MAP estimate. There was a paper from Microsoft and one from IBM; they had slightly different priors, but it was the same basic idea. This is one of the baselines we're going to use in our experiments.
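The MAP idea can be shown in a deliberately simplified setting (my own toy sketch: one transform row with a quadratic auxiliary function and a Gaussian prior; the real fMAPLR objective also involves fMLLR's log-determinant term, which I drop here):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
M = rng.standard_normal((d, d))
G = M @ M.T + d * np.eye(d)     # positive-definite data statistics
k = rng.standard_normal(d)      # linear statistics
w0 = np.zeros(d)
w0[0] = 1.0                     # prior mean: an "identity-like" row
tau = 10.0                      # prior precision (strength of the prior)

# ML: maximize k.w - 0.5 * w' G w
w_ml = np.linalg.solve(G, k)
# MAP: add the log-prior -0.5 * tau * ||w - w0||^2 before maximizing
w_map = np.linalg.solve(G + tau * np.eye(d), k + tau * w0)
```

With little data (G small relative to tau) the MAP estimate stays near the prior mean w0; with lots of data it approaches the ML estimate, which is exactly the behavior you want from a regularizer.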

yeah

a an issue with these approaches is that

you you you probably like to have

a prior that tells you how all of the rows of the transform colour late with each other

but in practice that's not really uh do able

so

people we generally see priors

the are ride the row by row or even completely diagonal

so it's the prior over each individual parameter

yeah

The approach that we're using is parameter reduction using a basis, where the fMLLR matrix is some kind of weighted sum of basis matrices, or prototype matrices.

The basic idea is similar to some of the previous talks. Sometimes people have a factorization, and then a basis on, you know, the upper and lower parts or something like that, but we're talking about a basis expansion of just the raw transform. The basic form of it is given here, where the W_n are the kind of prototype, or basis, fMLLR matrices, and they're computed in advance somehow. For a given speaker, you have to estimate the coefficients. This is not really a convex problem, but it's solvable in a local sense, and I don't think that's a practical issue.
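The basis expansion can be sketched as follows (the names W0, basis, and the coefficient list are my own labels for the default transform, the prototype matrices, and the per-speaker coefficients):

```python
import numpy as np

dim, N = 39, 4                  # a tiny basis, just for illustration
rng = np.random.default_rng(3)

# Default transform: y = x (identity A, zero bias), as a 39 x 40 matrix.
W0 = np.hstack([np.eye(dim), np.zeros((dim, 1))])
# Prototype/basis matrices, computed in advance (random here).
basis = [rng.standard_normal((dim, dim + 1)) for _ in range(N)]

def speaker_transform(coeffs):
    """W(s) = W0 + sum_n coeffs[n] * W_n, estimated per speaker."""
    W = W0.copy()
    for c, W_n in zip(coeffs, basis):
        W += c * W_n
    return W
```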

So, in the previous work in this area, you decide the basis size in advance; say, you decide you're going to make it two hundred. The number of parameters in the actual matrix, which is thirty-nine by forty, is about fifteen hundred. If you decide in advance to use two hundred coefficients, that does pretty well for typical configurations, if you have between ten and thirty seconds of speech. But you're going to get a degradation once you have a lot of data, because you're not estimating all of the parameters that you could estimate; it eventually gets a bit worse when you have a lot of adaptation data.

So that's the closest prior work.

There are a couple of differences between what we're describing here and that prior work, which was done at IBM. First, we allow the basis size to vary per speaker, and we have a very simple rule: the more data we have, the more coefficients we can estimate, and we just make the number of coefficients proportional to the amount of data. You could do all kinds of fancy stuff with information criteria and so on, but I think this technique is easily complicated enough already without introducing new aspects, so we just picked a very simple rule.
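The rule itself is trivial to state in code (the proportionality constant below is a placeholder of mine; the talk doesn't give the exact value used):

```python
def num_coefficients(num_frames, basis_size, coeffs_per_frame=0.2):
    """Number of basis coefficients: proportional to the amount of
    adaptation data, capped by the basis size, and at least one."""
    return min(basis_size, max(1, int(coeffs_per_frame * num_frames)))

# Roughly 3 seconds at 100 frames/second:
print(num_coefficients(300, 200))     # 60
# Plenty of data: use the whole basis.
print(num_coefficients(5000, 200))    # 200
```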

The other aspect is that we have a cheap way of estimating these basis matrices W_n that's a little bit more clever than just doing PCA.

And finally, we're just trying to popularise this type of method. We have a journal version of this paper in which we tried to explain very clearly how to implement it, because it really does work, and it's very robust; so it's something that I do recommend to you.

This next slide probably covers material that I've already discussed, so I'm going to go on to this one.

So, as I said, making it diagonal or block diagonal is all very well and good, but it's just a bit ad hoc, and it doesn't give you that much improvement. Also, you always have to decide: how much data do we have, can we afford to do the full transform, or are we going to make it diagonal? There's this trade-off, and you can get into having count cutoffs and stuff, but it's a bit of a mess.

As for the Bayesian methods: if we could have done this in a Bayesian way, that would probably be more optimal, because by picking a basis size you're making a hard decision, whereas with a Bayesian method you could do it in a soft way, and I think making a soft decision is always better than making a hard decision.

But the problem with the Bayesian approach is, first, that it's very hard to estimate the prior, because the whole reason we're in this situation is that we don't have a ton of data per speaker. Assuming the training data is matched to the testing condition, you're not going to have a lot of data per speaker in the training data to estimate the W matrices either. So how are you going to estimate a prior, when you don't have good estimates of the things you're trying to put a prior on? Of course, you can do all of these empirical Bayes schemes where you integrate things out, but it just becomes a big headache, and there are always so many choices to make. What I like about this basis type of method is that you just say we're going to use a basis, and everything just falls out; it's obvious what to do.

Now I want to talk about how to estimate the basis. Because we're going to decide the number of coefficients at test time, we need a kind of ordered basis, where you have the most important elements first and the least important elements last. If we were just going to say we'll have N equal to two hundred and that's it, then it wouldn't really matter what order they were in, even if they were all mixed up; but because we're going to decide the number of coefficients at test time, we need this ordering.

It turns out that PCA, or SVD, or those kinds of approaches, do actually give you this ordering. But I'm not very comfortable just saying we're going to do PCA, because, you know, who's to say that makes sense? One obvious argument for why it doesn't make sense is that if you were to rescale the different dimensions of your feature vector, that would change the solution that PCA gives you, and not in a trivial way: it would change the subspace, so basically it's going to affect your decoding. And to me that says it's not the right thing to do.

The framework that I think is most natural is maximum likelihood: we're going to try to pick the basis that maximises the likelihood on test data. I don't think I have time to go through the whole argument about what we're doing, but basically we end up using PCA, but in a slightly preconditioned space.

So, W is a thirty-nine by forty matrix, typically. But we want to consider the correlations between the rows, and then it's not really convenient to think of it as a matrix; let's think of it as one big vector, of size thirty-nine times forty, by concatenating the rows.

I don't know if I can easily explain how well this argument works, but think about the objective function for each speaker. If that objective function were a quadratic function, and if we could somehow do a change of variables so that the quadratic part of that function is just proportional to the unit matrix, then it's possible to show that the right solution is just doing weighted PCA. There's some kind of derivation behind that (it might be obvious to some people how you derive it), but let's just take it as given for now that that's true.

So what we'd like to do is this change of variables, so that the objective function is quadratic for each speaker. But it's not quite possible to do that, for a couple of reasons. First, the objective function is not really quadratic: it has this log determinant, and it's fully nonlinear. Secondly, you can take a Taylor series approximation around the default matrix (the identity with a zero offset), and look at the quadratic term in that Taylor series. Remember, we're treating the transform as a big vector, so the quadratic term is a big matrix, about two thousand by two thousand, and it depends on the data; it's not just a constant. But I don't think either of these issues affects things in an important way.

So it's possible to make an argument that once we work out the average of these quadratic terms, and then precondition so that the average becomes the unit matrix, it's reasonable to treat each speaker's matrix as approximately unit. And this is a situation where, even if it's not quite unit, it's not going to make a big difference, because all we're doing is pre-transforming everything and then doing PCA. If that pre-transform isn't exactly right, if the rotations aren't quite correct, it's not going to totally change the result of the PCA; what's going to happen is that, say, the first eigenvector gets mixed up a little bit with the second, and so on. So it's pretty close to maximum likelihood.

I think I've mostly covered this material. The training-time computation basically involves computing big matrices like this and doing an SVD and so on; it's all described in the journal paper.
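In outline, that training-time computation could be sketched like this (a toy version under my own naming; in a real system the vector dimension would be 39 * 40 = 1560, and the precise statistics are in the journal paper):

```python
import numpy as np

rng = np.random.default_rng(4)
D = 12                                 # stands in for 39 * 40 = 1560
S = 50                                 # number of training speakers
M = rng.standard_normal((D, D))
H = M @ M.T + D * np.eye(D)            # averaged quadratic (Hessian-like) term
W_vecs = rng.standard_normal((S, D))   # row-vectorized speaker transforms

# Precondition: with H = C C^T (Cholesky), change variables v = C^T w,
# so the averaged quadratic term becomes the identity in v-space.
C = np.linalg.cholesky(H)
V = W_vecs @ C

# PCA in the preconditioned space, largest eigenvalue (most important) first.
eigvals, eigvecs = np.linalg.eigh(V.T @ V)
order = np.argsort(eigvals)[::-1]
# Map the basis back to the original space: w = C^{-T} v.
basis_vecs = np.linalg.solve(C.T, eigvecs[:, order]).T
print(basis_vecs.shape)                # (12, 12): row n is the n-th basis element
```

The resulting basis rows are orthonormal in the metric defined by H, which is what makes the later truncation to the first n elements well behaved.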

At test time, there's an iterative method to compute the coefficients, the ones with the subscript. It's a non-convex problem, but it's pretty easy to get a local optimum: you just do steepest ascent, with a pretty exact line search. It's all described in the paper.
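As a sketch of that test-time optimization (my own toy stand-in: I use a smooth concave quadratic in place of the real fMLLR auxiliary function, whose log-determinant term is handled in the paper):

```python
import numpy as np

def steepest_ascent(G, k, num_iters=50):
    """Maximize f(d) = k.d - 0.5 * d' G d by steepest ascent with an
    exact line search (exact for this quadratic toy objective)."""
    d = np.zeros_like(k)
    for _ in range(num_iters):
        g = k - G @ d                 # gradient = ascent direction
        denom = g @ G @ g
        if denom <= 1e-12:
            break
        step = (g @ g) / denom        # exact maximizer along the gradient
        d = d + step * g
    return d

rng = np.random.default_rng(5)
n = 6
M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)           # positive definite -> concave objective
k = rng.standard_normal(n)
d = steepest_ascent(G, k)             # converges toward the optimum, G d = k
```

For the real, non-quadratic objective the line search is only approximately exact, but the structure of the loop is the same.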

This results slide requires a little bit of explanation. We have two sets of test data: one is short utterances and one is long. One is a digits type of task, with stocks and stuff, and one is voice search. We divide each of those into four subsets based on the length. The x-axis is the length of the utterance on a log scale, so ten to the zero is one second and ten to the one is ten seconds, and each of the points on these lines corresponds to one bin of the test data. The bins on the left-hand side are the short utterances, and on the right they're long.

Each point is a relative improvement. The triangles, the line at the bottom, that's just regular fMLLR, and it's actually making things worse for the first three bins; then it helps a bit. The absolute word error rate kind of jumps up and down a bit, because it's different types of data, so this is maybe not the ideal data to test this on; you've got to look at the relative improvements. The very top line is our method, which is doing the best; it's giving a lot more improvement than the other methods. The three-block one, the diagonal one, and fMAPLR are a bit better than doing regular CMLLR; they get some improvement for the shorter amounts of data, but we get more improvement from our method.

so

and i i mean the the story is that

if you have let's say between about three and ten seconds of data

i think this method will be a big improvement versus

doing of map a lower diagonal

or whatever

so uh

but but if you have a you know let's a more than thirty seconds they really doesn't make it from

so i think

i have pretty much covered all of this

I'm being asked to wrap up. I think we've covered all of this. So, I recommend the journal paper if you want to implement this, because I do describe very clearly how to do it, and I think this stuff really does work.

Are there any questions?

Everyone looks pretty stunned. Okay, well, we'll close the session there then. Okay, thanks.