Good morning, everyone, and welcome to my talk.

I am very pleased to present our recent work.

The title is "Speaker and noise factorisation on the AURORA4 task".

This is joint work with my supervisor, Mark Gales.

Here is the outline of the talk.

First I will say something about the model-based approach to robust speech recognition.

A lot of schemes have been developed over the years that handle a specific acoustic distortion, including speaker adaptation and noise robustness.

Then we will discuss the options we have for handling multiple acoustic factors.

In this context, the concept of acoustic factorisation is introduced.

As an example, we derive a new adaptation scheme, which we call Joint, that handles speaker and noise distortions.

Then there are just the experiments and the conclusions.

So we start from the acoustic environment. As we all know, the speech signal can be influenced by a number of factors.

As in this diagram, we have speaker differences, channel mismatch, and also some sort of background noise and room reverberation.

All of these factors can affect the speech, adding unwanted variations that degrade the speech signal.

So this makes robust speech recognition a very challenging task.

In this work we consider using the model-based approach to handle multiple acoustic factors.

In this framework we have a canonical acoustic model, which models the desired variations,

and we use a set of transforms to adapt the canonical model to different acoustic conditions.

Over the years, different transforms have been developed to handle specific single acoustic factors, including speaker adaptation and noise compensation schemes.

So how to combine these transforms to handle multiple acoustic factors in an effective and efficient way is the central topic of this talk.

Let's first look at speaker adaptation. As we all know, linear transforms are widely used to adapt acoustic models.

Here we focus on mean transforms.

The linear transform is very simple but very effective in practice.

One limitation of the linear transform is that there is a relatively large number of parameters to estimate, so we cannot do robust estimation on a single utterance.

So the linear transform is not suitable for very rapid adaptation.

An interesting point to make is that the linear transform was originally designed for speaker adaptation, but it is a generic transform,

so it can also be extended to environment adaptation.
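
A minimal sketch of the kind of mean transform described above, written in standard MLLR notation. The symbols are assumptions introduced for illustration and are not taken from the slides: $\mu$ is a Gaussian mean of the canonical model, and $A^{(s)}$, $b^{(s)}$ are the matrix and bias estimated for speaker $s$.

```latex
\hat{\mu}^{(s)} = A^{(s)} \mu + b^{(s)}
```

With a full matrix and $d$-dimensional features, one such transform has $d(d+1)$ free parameters, which is why a single utterance is usually too little data to estimate it robustly.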

Next we look at noise compensation schemes.

Normally a mismatch function is defined for the impact of the environment.

The first equation is the nonlinear mismatch function,

which describes how the channel distortion and additive noise affect the clean speech.

Based on this mismatch function, model-based approaches modify the acoustic models to better represent the noisy speech distributions.

VTS is used here.

The second equation shows how we can adapt the acoustic models using VTS-based model compensation.

You can see from the equations that, because we use the mismatch function, this transform is highly constrained and nonlinear.

So there is a relatively small number of parameters to estimate,

and we can do very rapid adaptation, since the noise transform can be estimated from a single utterance.
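
The mismatch function and the VTS mean update can be illustrated with a small sketch. It assumes the commonly used log-spectral form of the mismatch function, y = x + h + log(1 + exp(n - x - h)), evaluated at the means; the DCT needed for cepstral features, the covariance update and the estimation of the noise parameters are all left out. The function name and the example values are hypothetical.

```python
import math


def vts_compensate_mean(mu_x, mu_h, mu_n):
    """Compensate a clean-speech mean for channel and additive noise.

    Per log-spectral bin this evaluates
        y = x + h + log(1 + exp(n - x - h)) = log(exp(x + h) + exp(n))
    at the means. The DCT for cepstral features is omitted (assumption).
    """
    noisy = []
    for x, h, n in zip(mu_x, mu_h, mu_n):
        speech = x + h            # channel-distorted clean speech
        top = max(speech, n)      # factor out the larger term for stability
        noisy.append(top + math.log1p(math.exp(-abs(speech - n))))
    return noisy


# Hypothetical two-bin example: the "noise transform" is just (mu_h, mu_n).
clean_mean = [1.0, 4.0]
channel_mean = [0.5, 0.0]
noise_mean = [1.5, -20.0]
noisy_mean = vts_compensate_mean(clean_mean, channel_mean, noise_mean)
```

In the first bin the speech and the noise have equal energy, so the compensated mean rises by log 2; in the second bin the noise is negligible and the mean is left essentially unchanged. The only free parameters are the channel and noise statistics, which is what makes single-utterance estimation feasible.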

So I have now talked about speaker adaptation and noise compensation schemes.

How do we combine them?

In practice there is a very simple, straightforward combination scheme, which we call VTS-MLLR.

The equation here describes how we do it:

we first adapt the acoustic models using the VTS transform,

and following that we apply a linear transform to reduce the remaining mismatch.

The diagram shows how we do this:

given an acoustic condition, we estimate both the noise transform and the speaker transform,

and if either the speaker or the noise changes, both transforms need to be re-estimated.

So the limitation is obvious: the linear transform has to be estimated on a block of data,

so this kind of combination cannot do very rapid adaptation;

it requires a block of adaptation data.
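
As a hedged sketch of the VTS-MLLR ordering just described, the compensated mean can be written as follows. The notation is assumed for illustration: $f(\cdot)$ stands for the mismatch-function-based VTS mean compensation, $\mu_x$ is the clean-speech mean, $\mu_h$ and $\mu_n$ are the channel and noise means, and $A$, $b$ form the linear transform.

```latex
\mu_y = A \, f(\mu_x, \mu_h, \mu_n) + b
```

Because $A$ and $b$ act on means that have already been compensated for one particular noise condition, they are tied to that condition as well as to the speaker.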

Alternatively, we can do it in another way, which we call

acoustic factorisation.

Here we decompose the transform,

and constrain each transform to model just

a specific acoustic factor.

In this case we have a speaker transform and a noise transform,

which are orthogonal to each other.

This gives us some freedom in how we use the transforms. For example, if we know that the same speaker is speaking

in changing noise conditions,

and we want to move from noise condition one to noise condition two, we just

keep the speaker transform, as we know the speaker has not changed,

and only do noise adaptation.

In a similar way,

if the environment has not changed

but the speaker has changed to another speaker, we can

update the speaker transform without updating

the noise transform.

So with this kind of acoustic factorisation,

the speaker transform can be used in a range of noise conditions,

and similarly for the noise transform.

One issue with this approach is that,

to use the transforms in a factorised fashion,

we need to jointly estimate both the speaker and the noise transform, since

the data are affected by the

two acoustic factors simultaneously.

Based on this concept we derive a new adaptation scheme, which we call Joint.

The diagram on the right-hand side shows how we manipulate the transforms.

In contrast to the previous scheme, VTS-MLLR,

this approach uses the reverse order of transforms: we apply the linear transform first,

and then map the clean speech to noisy speech using VTS. By doing so,

the linear transform is acting on the speaker-independent clean speech,

and the VTS transform is acting on the speaker-dependent clean speech.

These are the properties we

expect in speaker adaptation and

noise compensation, so we expect

these two transforms to be, to some extent, factorised, or orthogonal to each other, so that we can apply

them independently.
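
The difference between the two orderings can be sketched in a few lines. This is an illustration under stated assumptions, not the system from the talk: the VTS step uses the log-spectral mismatch function without the DCT, the transforms are given rather than estimated by maximum likelihood, and all names and values are hypothetical.

```python
import math


def affine(A, b, mu):
    """Apply a linear (MLLR-style) mean transform: A @ mu + b."""
    return [sum(a * m for a, m in zip(row, mu)) + bias
            for row, bias in zip(A, b)]


def vts(mu, mu_h, mu_n):
    """Log-spectral VTS mean compensation: log(exp(mu + mu_h) + exp(mu_n))."""
    out = []
    for x, h, n in zip(mu, mu_h, mu_n):
        s = x + h
        out.append(max(s, n) + math.log1p(math.exp(-abs(s - n))))
    return out


def joint_mean(mu_x, speaker, noise):
    """Joint ordering: speaker transform on clean speech, then VTS."""
    A, b = speaker
    mu_h, mu_n = noise
    return vts(affine(A, b, mu_x), mu_h, mu_n)


def vts_mllr_mean(mu_x, speaker, noise):
    """VTS-MLLR ordering: VTS first, then the linear transform."""
    A, b = speaker
    mu_h, mu_n = noise
    return affine(A, b, vts(mu_x, mu_h, mu_n))


# Hypothetical values: one clean mean, one speaker transform, two noise conditions.
mu_x = [1.0, 2.0]
speaker = ([[2.0, 0.0], [0.0, 1.0]], [1.0, -1.0])
quiet = ([0.0, 0.0], [-50.0, -50.0])
noisy = ([0.0, 0.0], [3.0, 3.0])

# Factorised use under Joint: keep the speaker transform, swap only the noise.
joint_quiet = joint_mean(mu_x, speaker, quiet)
joint_noisy = joint_mean(mu_x, speaker, noisy)
```

In the quiet condition both orderings give the same mean, since the VTS step is then close to the identity. In the noisy condition they differ, because in the VTS-MLLR ordering the linear transform is applied to a mean that already contains the noise.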

This diagram shows how we evaluate the factorisation property of the

Joint transform in the

experiments.

Given some

adaptation data from noise condition one,

we estimate the noise transform and the speaker transform jointly.

Then, for the same speaker in another noise condition,

we just update the noise transform, and the speaker transform is the one we obtained

in the previous estimation,

and then we

adapt the acoustic models.

This gives us some freedom:

since only the noise transform

needs to be

updated,

this can be done on a single utterance. So we can do

joint speaker and noise

adaptation

on a single utterance,

which is very flexible.

So let's go to the experiments.

We ran the experiments on the AURORA4 task.

This is derived from the Wall Street Journal task,

and there are four test sets defined.

Set A is test set 01, which is the clean set.

In set B we have

six different types of noise added.

Set C and set D

come from the far-field microphones.

For the acoustic model training we do some pretty standard stuff.

These are the experiments in batch mode.

In batch mode,

the speaker and noise transforms are estimated

for each speaker in each test set,

so there is no sharing of speaker transforms.

We can see that,

by combining speaker and noise adaptation,

we achieve significant gains over

noise adaptation only.

We can also see that, since Joint is just the reverse order of the VTS and MLLR transforms,

the ordering does not impact performance too much.

But we want to emphasise that these are batch-mode experiments,

which require a block of data to estimate the transforms,

so it is not very flexible to use.

So what is more interesting is the factorisation experiments.

In these experiments

we estimate the speaker transform from the clean set, which is test set 01,

and we apply the speaker transform in all the noise conditions.

We can see from the table

that with Joint

the speaker transform estimated from clean speech also helps on sets B, C and D, the

noisy sets.

The alternative, VTS-MLLR, does not

generalise; it actually decreases performance,

because the MLLR transform in this case

is acting on the VTS-adapted means, so it

is specific to a noise condition and

cannot be used in other noise conditions.

What is more interesting is that if we estimate the speaker transform from a noisy

set,

test set 04, which is restaurant noise,

the Joint adaptation scheme actually

gets somewhat better results.

This is interesting,

and it might indicate that

the linear transform in Joint

may model something that should be modelled

by the VTS transform, which means that our factorisation may not be perfect.

But the numbers for adaptation using the speaker transform estimated

from this set are

very close to the batch-mode performance.

This demonstrates that

we can factorise the speaker transform and use the speaker transform in

very different noise conditions.

So here I arrive at my conclusions.

In this talk we argued that

handling multiple acoustic factors is important in

very complex, realistic acoustic environments,

and we presented a powerful and flexible solution based on

acoustic factorisation. We derived a new adaptation scheme, called Joint,

which allows very rapid speaker and noise adaptation;

the speaker transform can be used across multiple acoustic conditions.

And just a little bit about some new experiments:

we have compared our approach

with feature-enhancement-style

noise robustness schemes

combined with MLLR

speaker adaptation.

We observe that Joint outperforms the feature-enhancement scheme with MLLR

in factorisation mode.

This demonstrates

the power of

the model-based framework,

and inside this framework we have a very powerful and flexible tool,

that is, acoustic factorisation.

Thank you.

We do have time for a couple of questions. The microphones are

both behind the projectors.

So I have a question

about the speaker and noise factorisation:

to what extent does the assumption hold?

Yes, I think you are right that this actually is

not a perfect factorisation.

As you can see,

we have a bias in the linear transform, and the channel distortion is actually

also a bias,

so those two we cannot separate.

But for the main part,

the speaker transform is a linear transform and the noise transform is a nonlinear transform,

so when these two different types of transform are combined,

they can actually be

factorised,

and the experiments demonstrated that

the factorisation

property is quite good.

We cannot say mathematically

that they are orthogonal to each other, but we can see

from this that we can use the speaker transform in various conditions.

So that is

a kind of factorisation.

Are there any other

questions?

Then let's thank the speaker.