
I'm the session chair. One second for an advertisement: if you see people wearing a pin, you may ask them about ASR.

Okay, we're going to start off. The first paper is

front-end feature transforms with context filtering for speaker adaptation

I'll hand it over now to the presenter.

Okay. So the topic is front-end feature transforms with context filtering for speaker adaptation.

Here's the outline of the talk. First I'll briefly motivate, cover other work, and explain basically what we're trying to accomplish. Then I'll give an overview of the new technique, called maximum likelihood context filtering. And then we'll move straight into some experiments and results to see how it works.

So, the topic: front-end transforms for speaker adaptation.

In terms of front-end transforms, we usually do linear transforms or conditionally linear transforms. Perhaps the most popular technique is feature-space MLLR, maybe more popularly named constrained MLLR. And of course there are discriminative techniques that have been developed, and nonlinear transformations. Some variants of fMLLR have been developed as well; there has been work in recent years on quick fMLLR and full-covariance fMLLR.

So today I'll tell you about another variant of fMLLR. But first, let's review fMLLR. The idea is that you're given a set of adaptation data, and you want to estimate a linear transformation A and a bias b, which can be concatenated into the linear matrix W. The key point about fMLLR, for the purposes of this talk, is that the A matrix is square (D by D in the notation used here), and that makes it particularly easy to learn.
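In symbols, the setup just described is presumably the following (a reconstruction from the talk, with D the feature dimension):

```latex
y_t = A x_t + b = W \begin{bmatrix} x_t \\ 1 \end{bmatrix},
\qquad W = [\,A \;\; b\,], \qquad A \in \mathbb{R}^{D \times D}
```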

Of course, the main thing you need to deal with when you apply these transforms is the volume-change compensation. In the case of a linear transformation, it's just the log determinant of A, shown in red in our objective function Q. The second term there is just the typical term: it has the posterior probability of all the components of the acoustic model you're evaluating; that's gamma, subscripted by j for each Gaussian.
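A minimal sketch of the objective being described, assuming the standard fMLLR auxiliary function, with the log-determinant compensation first, gamma_j(t) the posterior of Gaussian j at frame t, and beta the total occupancy count:

```latex
Q(W) \;=\; \beta \log \lvert \det A \rvert
\;+\; \sum_{t}\sum_{j} \gamma_j(t)\, \log \mathcal{N}\!\big(W \bar{x}_t;\, \mu_j, \Sigma_j\big),
\qquad \bar{x}_t = \begin{bmatrix} x_t \\ 1 \end{bmatrix}
```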

Okay, so when we start to think about the non-square case, what do we need to do? First, let's set up the notation. We use the notation x-hat of t to denote a vector with context; in this case the context size is one, so x at t minus one, x at t, and x at t plus one are concatenated to make x-hat of t. So the model is y equals A times x-hat. We can condense this notation to form W, and the main difference here is that A is not square: in this case it is D by 3D, because the output y has the original dimension D of the input x, but x-hat is 3D.
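In symbols, the context-expanded model for context size one would look like this (reconstructed from the description above):

```latex
\hat{x}_t = \begin{bmatrix} x_{t-1} \\ x_t \\ x_{t+1} \end{bmatrix} \in \mathbb{R}^{3D},
\qquad y_t = A \hat{x}_t + b \in \mathbb{R}^{D},
\qquad A \in \mathbb{R}^{D \times 3D}
```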

So how do we estimate a non-square matrix by maximum likelihood? An important point is that there's no direct, obvious way to do this, and that's because you're changing the dimension of the space, so there is no determinant volume term that you can use in a straightforward manner to accomplish this.

So let's go back and look at how we get that term. Basically, what you say is that the log likelihood of the transformed variable Y is equal to the log likelihood of the input variable up to a constant; that constant is your Jacobian term. In the case where you assume A is square, you can readily confirm that the term is half the log ratio of the determinants of the input and output models, assuming Gaussians. This slide is just showing how you would derive that: there's L_X, a Gaussian, and L_Y, the likelihood of the linearly transformed data, and essentially you equate them to find what C is, and you find that it is the log ratio, as we stated before. On the bottom line, in red: if you break down that ratio, you see that the covariance of the variable Y (this is just a known identity) is A Sigma_X A-transpose.
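Putting that derivation into symbols (a reconstruction, assuming Gaussian likelihoods L_X and L_Y as on the slide):

```latex
L_Y(y_t) = L_X(x_t) + C, \qquad
C = \tfrac{1}{2} \log \frac{\lvert \Sigma_X \rvert}{\lvert \Sigma_Y \rvert},
\qquad \Sigma_Y = A\, \Sigma_X A^{\top}
```

For square A, the identity |Sigma_Y| = |det A|^2 |Sigma_X| gives C = -log|det A|, which recovers the usual compensation term.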

So the compensation term ends up being the log determinant of A Sigma_X A-transpose. In our case, we're going to assume that the compensation term keeps the same form. We drop the log determinant of Sigma x-hat term, because it does not depend on A, and we're left with the A Sigma x-hat A-transpose term, which reduces to what we had in the case where A was square. So the modified objective becomes the following. One more point: what is this Sigma x-hat that was used? What they did is use a full-covariance approximation over all the speech features to come up with a full-covariance Sigma x-hat, and they used it in this objective to learn A.
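As a sketch, the modified objective presumably takes this form, up to constants that do not depend on A:

```latex
Q(W) \;=\; \frac{\beta}{2} \log \big\lvert A\, \Sigma_{\hat{x}}\, A^{\top} \big\rvert
\;+\; \sum_{t}\sum_{j} \gamma_j(t)\, \log \mathcal{N}\!\big(A \hat{x}_t + b;\, \mu_j, \Sigma_j\big)
```

When A is square, the first term reduces to beta log|det A| plus a constant, so the standard fMLLR objective is recovered as a special case.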

Okay, so in terms of optimizing this modified objective, the statistics that you need are of the same form as in the square case; of course, the sizes are different. The two main quantities you need in order to optimize the objective are the ability to evaluate the objective Q and the derivative of the objective. The row-by-row iterative update that people normally use cannot be applied here, at least it's not obvious how to do it. We're looking at that now, and there are some ways to do something very similar. But for the purposes of this paper, an off-the-shelf gradient-based optimization package was used, the HCL package. As I mentioned before, you just have to make the function and its gradient available at any point the optimizer wants to evaluate.
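A rough sketch of this recipe in code: here scipy's L-BFGS stands in for the HCL package, and a single simplified pair of quadratic statistics (K, G) stands in for the exact accumulated fMLLR statistics; the bias term is omitted. All names and sizes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def neg_q_and_grad(a_flat, K, G, sigma_xhat, beta, D, E):
    """Negative modified objective and its gradient for a non-square A (D x E).

    Q(A) = (beta/2) log|det(A Sigma_xhat A^T)| + tr(K A^T) - 1/2 tr(A G A^T),
    i.e. the generalized Jacobian compensation plus a quadratic data term.
    """
    A = a_flat.reshape(D, E)
    AS = A @ sigma_xhat                      # D x E
    M = AS @ A.T                             # D x D, plays the role of |det A|^2
    _, logdet = np.linalg.slogdet(M)
    q = 0.5 * beta * logdet + np.sum(K * A) - 0.5 * np.sum((A @ G) * A)
    grad = beta * np.linalg.solve(M, AS) + K - A @ G
    return -q, -grad.ravel()

D, E = 40, 120                               # 40-dim features, +/-1 frame of context
rng = np.random.default_rng(0)
Z = rng.standard_normal((1000, E))
sigma_xhat = Z.T @ Z / 1000                  # stand-in full-covariance Sigma x-hat
G = sigma_xhat + 0.1 * np.eye(E)             # stand-in accumulated statistics
K = rng.standard_normal((D, E))
beta = 1000.0

# Identity on the current-frame block, zeros on the side frames
# (the initialization described later in the talk).
A0 = np.hstack([np.zeros((D, D)), np.eye(D), np.zeros((D, D))])

res = minimize(neg_q_and_grad, A0.ravel(), args=(K, G, sigma_xhat, beta, D, E),
               method="L-BFGS-B", jac=True)
A_hat = res.x.reshape(D, E)
```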

Okay. So that's essentially the method. I'll try to leave some time for questions at the end, in case any more details are requested.

So, moving right on to training data and models. The training data for the task we evaluated this technique on was collected in stationary noise; there's about eight hundred hours of it. A word-internal model with quinphone context, eight hundred thirty context-dependent states, and ten K Gaussians was trained. The technique was tested on LDA forty-dimensional features, on models built using maximum likelihood and MMI, and on a model with an fMMI transformation applied before we apply context filtering. In terms of test data, it was recorded in a car at three different speeds: zero, thirty, and sixty miles per hour. There were four tasks (addresses, digits, commands, and radio control), about twenty-six K utterances, and a total of a hundred and thirty thousand words.

Here is the SNR distribution of this data, broken down by speed. You can see that the most noise is obtained for the sixty-mile-per-hour data, and for that data basically half of the data is below, say, twelve and a half dB. We estimate the SNR using a forced alignment.

Okay, so for the experiments, context filtering was tried for speaker adaptation, training speaker-dependent transforms on top of the canonical model. A little nomenclature here: MLCF-N is maximum likelihood context filtering with context size N, so N equal to one means the frames at plus and minus one are included in the context when computing the transform.
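As a small illustration of what MLCF-N's context stacking implies (a hypothetical helper; the edge handling is an assumption, since the talk doesn't specify it):

```python
import numpy as np

def stack_context(X, n):
    """Concatenate +/- n neighboring frames: (T, D) -> (T, (2n+1)*D)."""
    T, D = X.shape
    # Repeat the first/last frame at the boundaries (an assumed convention).
    padded = np.vstack([X[:1]] * n + [X] + [X[-1:]] * n)
    return np.hstack([padded[i:i + T] for i in range(2 * n + 1)])

# MLCF-1: plus or minus one frame of context around each 40-dim frame.
frames = np.random.randn(100, 40)
xhat = stack_context(frames, 1)   # shape (100, 120)
```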

For all the experiments, the transform was initialized to identity with respect to the current frame's parameters, and the side frames were initialized to zeros.

Just for reference, they also tried using, for the center part of the matrix, the fMLLR transform that was estimated using the usual technique.

Okay, so in terms of results, let me skip ahead to this. Clearly fMLLR brings a lot over the baseline on this data, and when you turn on context filtering, you actually get some significant gains in the sixty-mile-per-hour column; you can see them highlighted in red. That is actually a twenty-three percent relative gain in word error rate, and thirty percent in sentence error rate, over fMLLR.

The other point here is that starting with fMLLR and then adapting actually doesn't give you any advantage over starting with an identity matrix.

This plot is just showing how performance varies with the amount of data you provide to the transform estimation. We can see that the relative degradation in performance when you have less data (in this case ten utterances versus all utterances; I believe "all" is a hundred here) is actually smaller for context filtering. I think the argument is that because you're using context, you can do some averaging of the data you see, and that effectively regularizes the estimation to some extent, even though there are more parameters to estimate. So it's kind of counterintuitive, I think.

Okay, this is just a picture of a typical fMLLR transform estimated using our system; it's for the most part diagonal. And this is the corresponding one-frame-of-context context filtering transform. You can see that, interestingly, it's not symmetric: the mapping from the previous frame to the current frame is almost diagonal, and so is the current-frame-to-current-frame mapping, but the part that comes from the future frame looks kind of random. One thing to keep in mind is that there's a whole subspace of almost-equivalent solutions to this problem, so it's not clear whether this is an artifact of the optimization package, perhaps of the order in which it optimizes within that subspace, and whatnot.

Okay, so here are more results, collected using an MMI model. Again we're seeing significant gains over fMLLR, about ten percent relative improvement on the sixty-mile-per-hour data. And once again, when we train an fMMI transform and then apply context filtering, we're still actually getting some gains: about nine percent relative sentence error rate reduction over fMLLR.

Okay, so to summarize: MLCF extends the full-rank square-matrix technique called fMLLR to non-square matrices, and there are some very nice gains on some pretty good systems, ones that use LDA, MMI, and fMMI, when we apply this technique. In terms of future work, trying a discriminative objective function is something that I think they're looking at, of course. Another question is how this technique interacts with traditional noise robustness methods like spectral subtraction, dynamic noise adaptation, et cetera.

Okay, so that's all I have. Hopefully this leaves some time for questions.

Yes, please use the mic.

So, for the plot you have for improvement, where you had ten utterances for each speed: how do you do that in a practical sense? Are you going to keep track?

Yeah.

Let me just go back to this slide. This is just investigating the amount of data that's needed for the transform to be effective. It's useful in the sense that if you need to enroll a speaker, for example on a cell phone, he only needs to speak ten utterances. And by the way, that's a good point: each utterance is only about three seconds, so we're talking about thirty seconds of data before we're already almost completely adapted to the speaker, as opposed to fMLLR, which actually seems to need about thirty utterances to be at that stage.

A microphone for the third person right there, please.

So from this chart, you're working, excuse me, with utterances collected at sixty miles per hour. In a real scenario, people drive slow or on the highway, so the sequence of SNRs you get is not this scenario. Have you tested that scenario?

Yeah, they didn't consider that in this work; this is a block optimization of the matrix. We just take a section of speaker data and see how many utterances are required to get decent gains. But that's certainly an important problem.

Two more questions?

I have a quick one.

I was actually kind of interested, when you were looking at the results for one and two frames of context, whether they'd be two different... did they actually do a visualization of the two contexts? I found the visual interesting, and I was wondering whether they'd look different.

Oh, I see. Right, I think this is one of the only ones they actually looked at, okay? I was very curious about that myself; they only put it in at the last moment, actually. Yeah, that's very true. For the experiments they did, they found that performance was saturating at about a left and right context of two. So I think that asymmetry is a next step for future investigation and understanding.

Let's thank the speaker. We're going to need a minute to set up the next speaker, so please bear with us.