Good morning, and welcome to my talk.
I am very pleased to present our recent work, titled "Speaker and Noise Factorization on the AURORA4 Task".
This is joint work with my supervisor, Mark Gales.
So here is the outline of the talk.
First I will say something about the model-based approach for robust speech recognition.
A lot of schemes have been developed over the years that each handle a specific acoustic distortion, including speaker adaptation and noise robustness.
Then we will discuss the options we have to handle multiple acoustic factors.
In this context, the concept of acoustic factorisation is introduced, and as an example we derive a new adaptation scheme, which we call Joint, that handles speaker and noise distortions.
And then experiments and conclusions.
So we start from the acoustic environment. As we all know, the speech signal can be influenced by many factors.
As in this diagram, we have speaker differences, channel mismatch, and also some sort of background noise and room reverberation.
All of these factors can affect the speech and add unwanted variations to the speech signals.
So this makes robust speech recognition a very challenging task.
In this work we consider using the model-based approach to handle multiple acoustic factors.
In this framework we have a canonical model, which is able to model the desired variations, and we use a set of transforms to adapt the canonical model to different acoustic conditions.
Over the years, different transforms have been developed to handle specific single acoustic factors, including speaker adaptation and noise compensation schemes.
So how to combine these transforms to handle multiple acoustic factors in an effective and efficient way is the central topic of this talk.
Let's first look at speaker adaptation. As we all know, linear transforms are widely used to adapt acoustic models.
This one is the MLLR mean transform.
The linear transform is very simple but very effective in practice.
But a limitation of the linear transform is that we have a relatively large number of parameters to estimate, so we cannot do robust estimation on a single utterance.
So the linear transform is not suitable for very rapid adaptation.
Another interesting point to make is that the linear transform was originally designed for speaker adaptation, but since it is a general linear transform, it can also be extended to environment adaptation.
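As a rough illustration of the mean transform mentioned here, the sketch below applies a linear (MLLR-style) transform, mu_hat = A * mu + b, to a Gaussian mean and counts its free parameters. The function names, the toy values and the 39-dimensional feature size are assumptions for illustration, not details from the talk.

```python
def mllr_adapt_mean(A, b, mu):
    """Apply a linear mean transform: mu_hat = A @ mu + b (pure Python)."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]


def mllr_param_count(dim):
    """Free parameters of one full transform: a dim x dim matrix plus a bias."""
    return dim * dim + dim


# Toy 2-dimensional example (values are made up).
A = [[2.0, 0.0],
     [0.0, 1.0]]
b = [1.0, -1.0]
print(mllr_adapt_mean(A, b, [1.0, 2.0]))  # [3.0, 1.0]
print(mllr_param_count(39))               # 1560
```

With that many parameters per transform, a single utterance rarely provides enough frames for a robust estimate, which is the limitation described above.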
Next we look at noise compensation schemes.
Normally a mismatch function is defined for the impact of the environment.
The first equation is a commonly used nonlinear mismatch function, which describes how the channel distortion and additive noise affect the clean speech.
Based on this mismatch function, model-based approaches modify the models to better represent the noisy speech distributions.
VTS is used here.
The second equation shows how we can adapt acoustic models using VTS-based compensation.
You can see from the equations that, due to the use of the mismatch function, this transform is highly constrained and nonlinear.
So there is a relatively small number of parameters to estimate, and we can do very rapid adaptation, since the noise transform can be estimated from a single utterance.
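The slide equations are not reproduced in this transcript. As a sketch only, the code below uses a commonly cited form of the mismatch function in the log-spectral domain, y = x + h + log(1 + exp(n - x - h)), and compensates a clean-speech mean by evaluating that function at the means (a zeroth-order, VTS-style approximation). Working in the log-spectral rather than the cepstral domain, ignoring variances, and all names and values are assumptions for illustration.

```python
import math


def mismatch(x, h, n):
    """Noisy log-spectrum from clean speech x, channel h and additive noise n."""
    return [x_i + h_i + math.log1p(math.exp(n_i - x_i - h_i))
            for x_i, h_i, n_i in zip(x, h, n)]


def vts_compensate_mean(mu_x, mu_h, mu_n):
    """Approximate the noisy-speech mean by expanding around the means."""
    return mismatch(mu_x, mu_h, mu_n)


def speech_jacobian(mu_x, mu_h, mu_n):
    """Diagonal of dy/dx at the expansion point; used for first-order terms."""
    return [1.0 / (1.0 + math.exp(n_i - x_i - h_i))
            for x_i, h_i, n_i in zip(mu_x, mu_h, mu_n)]


mu_x = [10.0, 2.0]   # made-up clean-speech mean
mu_h = [0.5, 0.5]    # made-up channel mean
mu_n = [3.0, 6.0]    # made-up noise mean
print(vts_compensate_mean(mu_x, mu_h, mu_n))
print(speech_jacobian(mu_x, mu_h, mu_n))
```

Only the noise and channel statistics are free parameters here, which is why such a transform can be estimated from very little data. Where the noise is far below the speech (first dimension) the mean is almost unchanged; where the noise dominates (second dimension) the compensated mean moves towards the noise mean.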
So far I have talked about speaker adaptation and noise compensation schemes.
So how do we combine them?
In practice we have a very simple, straightforward combination scheme, which we call the VTS-MLLR combination.
The equation here describes how we first adapt the acoustic models using VTS transforms, and following that we apply a linear transform to reduce the remaining mismatch.
The diagram shows how we do this: given an acoustic condition, we estimate the noise transform and the speaker transform, and if either the speaker or the noise changes, the transforms have to be re-estimated.
So the limitation is obvious: the linear transform has to be estimated on a block of data, so this kind of combination cannot do very rapid adaptation. It requires a block of adaptation data.
An alternative is to do this in another way, which we call acoustic factorisation.
Here we decompose the transform and constrain each transform to model just one specific acoustic factor.
In this case we have a speaker transform and a noise transform, which are orthogonal to each other.
This gives us some freedom in how to use these transforms.
For example, if we know that the same speaker is speaking in changing noise conditions, and we want to update for the noise condition from one utterance to the next, we just keep the speaker transform, as we know the speaker has not changed, and for the noise update we just do noise adaptation.
In a similar way, if the environment has not changed but the speaker has changed to another speaker, we can update the speaker transform without updating the noise transform.
So this kind of acoustic factorization allows the speaker transform to be used in a range of noise conditions, and similarly for the noise transform.
One issue with this approach is that, if the transforms are to be used in a factorized fashion, we need to estimate the speaker and noise transforms jointly, since the data are affected by the two acoustic factors simultaneously.
Based on this concept we derive a new adaptation scheme, which we call Joint adaptation.
The diagram on the right-hand side shows how we manipulate the transforms.
In contrast to the previous scheme, VTS-MLLR, this approach uses the reversed order: the linear transform is applied first, and then the modified clean speech is mapped to noisy speech using VTS.
By doing so, the linear transform is acting on the speaker-independent clean speech, and the VTS transform is acting on the speaker-dependent clean speech.
These are the roles we expect of speaker adaptation and noise compensation, so we expect the two transforms to be, to some extent, factorized, or orthogonal to each other, so that we can apply them independently.
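To make the difference in ordering concrete, the sketch below contrasts the two combinations on a Gaussian mean: VTS-MLLR applies the linear transform to the noise-compensated mean, while Joint applies it to the clean-speech mean before noise compensation. The log-spectral mismatch function, the function names and the numbers are assumptions for illustration; the talk's exact equations are on the slides and are not reproduced here.

```python
import math


def mismatch(x, h, n):
    """Assumed log-spectral mismatch: y = x + h + log(1 + exp(n - x - h))."""
    return [x_i + h_i + math.log1p(math.exp(n_i - x_i - h_i))
            for x_i, h_i, n_i in zip(x, h, n)]


def linear(A, b, mu):
    """Linear mean transform A @ mu + b."""
    return [sum(a * m for a, m in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]


def vts_mllr_mean(mu_x, A, b, mu_h, mu_n):
    """VTS first, then the linear transform acts on the noisy-speech mean."""
    return linear(A, b, mismatch(mu_x, mu_h, mu_n))


def joint_mean(mu_x, A, b, mu_h, mu_n):
    """Linear transform acts on the clean mean, then VTS adds the noise."""
    return mismatch(linear(A, b, mu_x), mu_h, mu_n)


# One speaker transform (made-up values), reused under two noise conditions.
A, b = [[1.0, 0.0], [0.0, 1.0]], [1.0, -0.5]
mu_x, mu_h = [4.0, 4.0], [0.0, 0.0]
for mu_n in ([0.0, 0.0], [5.0, 5.0]):
    print(joint_mean(mu_x, A, b, mu_h, mu_n),
          vts_mllr_mean(mu_x, A, b, mu_h, mu_n))
```

In the Joint ordering, the pair (A, b) only ever sees clean-speech statistics, so the same pair can be kept while the noise parameters change between conditions. In the VTS-MLLR ordering, (A, b) is applied to an already noise-dependent mean, so it is tied to the condition it was estimated in.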
So how do we evaluate the Joint transform in the experiments?
We have some adaptation data from noise condition one and speaker k, and we estimate the noise transform and the speaker transform jointly.
Then, for the same speaker in another noise condition, we just estimate the noise transform, and we use the speaker transform obtained in the previous estimation to adapt the acoustic models.
This gives us freedom, since only the noise transform needs to be updated, and this can be done on a single utterance.
So we can do joint speaker and noise adaptation on a single utterance, which is very flexible.
So on to the experiments.
We evaluated this on the AURORA4 task, which is derived from the Wall Street Journal task.
We have four test sets defined there.
Set A is test set 01, which is the clean set.
In set B we have six different types of noise added.
Sets C and D come from the far-field microphones.
For the acoustic model training we do some pretty standard stuff.
These are the experiments in batch mode.
In batch mode, the speaker and noise transforms are estimated for each speaker in each test set, so there is no sharing of speaker transforms.
We can see that by combining speaker and noise adaptation, we achieve significant gains over noise adaptation only.
We can also see that, since Joint just reverses the order of the VTS and MLLR transforms, the result is not very sensitive to the order; it does not impact performance too much.
But we want to emphasise that these are batch-mode experiments, which require a block of data to estimate the transforms, so this is not very flexible to use.
What is more interesting is the factorization experiments.
In these experiments we estimate the speaker transform from the clean set, which is test set 01, and we apply that speaker transform in all the noise conditions.
We can see from the table that the speaker transform estimated from clean speech does help for sets B, C and D, the noisy sets.
In contrast, VTS-MLLR does not generalize; it actually decreases performance.
That is because the MLLR transform in this case is acting on the VTS-adapted mean, so it is specific to one noise condition and cannot be used in other noise conditions.
What is more interesting is that if we estimate the speaker transform from a noisy set, test set 04, which is restaurant noise, the Joint adaptation scheme actually gets somewhat better results.
This is an interesting result, and it might indicate that the linear transform in Joint may be modelling something that should be modelled by the VTS transform, which means that our factorization may not be perfect.
But the numbers for adaptation using the speaker transform estimated from test set 04 are very close to the batch-mode performance.
This demonstrates that we can factorize the speaker transform and use the same speaker transform in very different noise conditions.
So here I arrive at my conclusions.
In this talk we argue that handling multiple acoustic factors is important in very complex, realistic acoustic environments.
We present a powerful and flexible approach based on acoustic factorisation, from which we derive a new adaptation scheme called Joint.
This allows very rapid speaker and noise adaptation, and the speaker transform can be used across multiple acoustic conditions.
And just a little bit about some new experiments: we have also compared our approach with feature-enhancement-style noise robustness schemes combined with MLLR speaker adaptation.
We observe that Joint outperforms feature enhancement with MLLR in both batch and factorization modes.
This demonstrates the power of the model-based framework, and inside this framework we have a very powerful and flexible tool, that is, acoustic factorisation.
Thank you.
We do have time for a couple of questions. The microphones are both behind the projectors.
So, any questions?
In the factorization, the speaker and noise transforms are assumed to be orthogonal to each other. To what extent does that assumption hold?
Yes, I think you are right that this is actually not a perfect factorization.
As you can see, we have a bias in the linear transform, and we also have the channel distortion, which is actually also a bias, so we cannot fully separate those.
But for the main part, the speaker transform is a linear transform and the noise transform is a nonlinear transform, so when these two different types of transform are combined, they can actually be factorized.
And the experiments demonstrated that the factorization property is quite good.
We cannot say mathematically that they are orthogonal to each other, but we can see from this that we can use the speaker transform in various conditions.
So there is this kind of factorization, or orthogonality.
Any other questions?
Let's thank the speaker.