0:00:29 Good morning everyone. I am very pleased to present our recent work, titled "Speaker and Noise Factorisation on the AURORA 4 Task". This is joint work with my supervisor, Mark Gales.
0:00:50 Here is the outline of the talk. First I will say something about the model-based approach for robust speech recognition, and the schemes that have been developed over the years to address specific acoustic distortions, including speaker adaptation and noise robustness.
0:01:16 Then we discuss the options we have to handle multiple acoustic factors; in this context the concept of acoustic factorisation is introduced. As an example, we derive a new adaptation scheme, which we call joint adaptation, that handles speaker and noise distortions. Then come experiments and conclusions.
0:01:39 We start from the environment. As we all know, the speech signal can be influenced by many factors: as in this diagram, we have speaker differences, channel mismatch, and also background noise and room reverberation. All of these factors affect the speech, introduce unwanted variations, and degrade the recorded speech signal. This makes robust speech recognition a challenging task. In this work we consider using the model-based approach to handle multiple acoustic factors.
0:02:22 In this framework we have a canonical model to represent the desired distributions, and a set of transforms used to adapt the canonical model to different acoustic conditions. Different transforms have been developed to handle specific, single acoustic factors, including speaker adaptation and noise compensation schemes. How to combine these transforms to handle multiple acoustic factors in an effective and efficient way is the central topic of this talk.
0:03:07 Let us first look at speaker adaptation. As we all know, a linear transform is used to adapt the acoustic models; this slide shows the mean transform. This linear transform is very simple but very effective in practice. A limitation of the linear transform is that it has a relatively large number of parameters to estimate, so we cannot do robust estimation on a single utterance; it is therefore not suitable for very rapid adaptation. An interesting point to make is that this linear transform was not designed only for speaker adaptation: being a generic transform, it can also be extended to environment adaptation.
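The parameter-count point above can be made concrete with a small sketch (my own illustrative code, not the paper's implementation; the function name is hypothetical):

```python
def linear_adapt_mean(mu, A, b):
    """MLLR-style mean adaptation, mu_hat = A mu + b.
    For a d-dimensional mean, A is d x d and b is d-dimensional,
    so the transform has d * (d + 1) free parameters to estimate."""
    return [sum(a * m for a, m in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]

# A typical 39-dimensional front end gives 39 * 40 = 1560 transform
# parameters -- too many to estimate robustly from one utterance,
# which is why this transform needs a block of adaptation data.
d = 39
print(d * (d + 1))  # 1560
```

The transform itself is a plain affine map; it is the number of free parameters, not the functional form, that rules out single-utterance estimation.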
0:04:01 Next we look at noise compensation schemes. Normally a mismatch function is defined for the impact of the environment. The first equation is a nonlinear function that describes how the channel distortion and the additive noise affect the clean speech. Based on this mismatch function, model-based approaches modify the models to better represent the noisy-speech distributions. The second equation shows how we can adapt the acoustic models using VTS-based model compensation schemes. You can see from the equations that, due to the mismatch function, this transform is highly constrained and nonlinear, so it has a relatively small number of parameters to estimate. Therefore we can do very rapid adaptation, since the noise transform can be estimated from a single utterance.
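As a rough illustration of the mismatch function just described, here is a minimal sketch of static-mean compensation in the log-spectral domain (my own code, assuming the DCT matrix is the identity; the full cepstral-domain scheme wraps the nonlinearity in a DCT and its inverse):

```python
import math

def vts_compensate_mean(mu_x, mu_h, mu_n):
    """Map a clean-speech mean mu_x to a noisy-speech mean via the
    standard mismatch function, per dimension:
        mu_y = mu_x + mu_h + log(1 + exp(mu_n - mu_x - mu_h))
    where mu_h is the convolutional (channel) mean and mu_n the
    additive-noise mean.  Only the noise parameters are free, which is
    why the transform can be estimated from very little data."""
    return [x + h + math.log1p(math.exp(n - x - h))
            for x, h, n in zip(mu_x, mu_h, mu_n)]
```

The two limiting cases behave as expected: when the noise mean is far below the speech mean, the compensated mean collapses back to the clean mean; when noise dominates, it approaches the noise mean.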
0:05:15 So far I have talked about speaker adaptation and noise compensation schemes; how do we combine them? In practice there is a very simple, straightforward combination scheme, a cascade. The equation here describes it: we first adapt the acoustic models using the VTS transform, and following that we apply a linear transform to reduce the residual mismatch. The diagram shows how we do this: given an acoustic condition, the noise transform and the speaker transform are updated alternately, each re-estimated after the other. A limitation is obvious: the linear transform must be estimated on a block of data, so this kind of combination cannot achieve very rapid adaptation, since it requires a block of adaptation data.
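This combination order can be written schematically (an illustrative sketch under my own naming; `noise_fn` and `speaker_fn` stand in for the VTS compensation and the linear speaker transform):

```python
def cascade_adapt_mean(mu_x, noise_fn, speaker_fn):
    """Cascade combination: noise-compensate the clean mean first,
    then apply the speaker transform on top.  Because speaker_fn acts
    on the already noise-adapted mean, it is tied to that specific
    noise condition and must be re-estimated on a block of data
    whenever the noise changes."""
    return speaker_fn(noise_fn(mu_x))

# Toy stand-ins: a bias-only "noise" shift and a scaling "speaker" map.
noise_fn = lambda mu: [m + 3.0 for m in mu]
speaker_fn = lambda mu: [2.0 * m for m in mu]
print(cascade_adapt_mean([1.0, 1.0], noise_fn, speaker_fn))  # [8.0, 8.0]
```

The key property (and limitation) is visible in the composition: the speaker transform sees noise-dependent input, so it cannot be carried over to a different noise condition.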
0:06:28 Alternatively, we can do it in another way, which we call acoustic factorisation. Here we decompose the transforms and constrain each transform to model, as well as possible, a single acoustic factor: in this case we have a speaker transform and a noise transform, which are orthogonal to each other. This gives us the freedom to reuse the transforms. For example, if we know the same speaker is speaking in changing noise conditions, and we want to update the noise condition from one utterance to the next, we just keep the speaker transform, since the speaker has not changed, and do the noise adaptation only.
0:07:17 In a similar way, if the environment has not changed but the speaker has changed to another speaker, we can update the speaker transform without updating the noise transform. So with this kind of acoustic factorisation, the speaker transform can be used in a range of noise conditions, and similarly for the noise transform.
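The reuse logic just described can be sketched as a small cache (an illustrative skeleton of my own, not the paper's code; the estimator callbacks are placeholders for the actual transform estimation):

```python
class FactorizedAdapter:
    """Keeps the speaker and noise transforms separately, so only the
    transform whose acoustic factor has changed is re-estimated."""

    def __init__(self, estimate_speaker, estimate_noise):
        self.estimate_speaker = estimate_speaker
        self.estimate_noise = estimate_noise
        self.speaker_tf = None
        self.noise_tf = None

    def update(self, utterance, speaker_changed, noise_changed):
        # Re-estimate the speaker transform only on a speaker change.
        if speaker_changed or self.speaker_tf is None:
            self.speaker_tf = self.estimate_speaker(utterance, self.noise_tf)
        # Re-estimate the noise transform only on a noise change.
        if noise_changed or self.noise_tf is None:
            self.noise_tf = self.estimate_noise(utterance, self.speaker_tf)
        return self.speaker_tf, self.noise_tf
```

The point of the factorisation is exactly that these two update branches are independent: a cheap noise-only update per utterance, with the speaker transform carried over unchanged.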
0:07:43 One issue with this approach is that, if the transforms are to be used in a factorised fashion, then to estimate them we need to estimate both the speaker and the noise transform jointly, since the data are affected by the two acoustic factors simultaneously.
0:08:10 Based on this concept, we derive a new adaptation scheme which we call joint adaptation. The diagram on the right-hand side shows how we manipulate the transforms. In contrast to the previous scheme, this approach uses the reversed order: the linear transform is applied to the clean speech first, and the modified clean speech is then mapped to noisy speech by the mismatch function. Crucially, by doing so, the linear transform acts on the speaker-independent clean speech, and the VTS transform acts on the speaker-dependent clean speech. These are exactly the assumptions we make in speaker adaptation and noise compensation, so we expect the two transforms to be, in some sense, factorised, orthogonal to each other, so that we can apply them independently.
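The reversed order can be written schematically as follows (same toy naming as before, my own illustrative code; `noise_fn` stands for the VTS mapping from clean to noisy speech and `speaker_fn` for the linear speaker transform):

```python
def factorized_adapt_mean(mu_x, noise_fn, speaker_fn):
    """Joint (factorised) composition: the speaker transform acts on
    the speaker-independent clean-speech mean, and the noise transform
    then maps the speaker-dependent clean mean into the noisy domain.
    The speaker transform never sees the noise, so it can be reused
    across noise conditions."""
    return noise_fn(speaker_fn(mu_x))

# Toy stand-ins: a bias-only "noise" shift and a scaling "speaker" map.
noise_fn = lambda mu: [m + 3.0 for m in mu]
speaker_fn = lambda mu: [2.0 * m for m in mu]
print(factorized_adapt_mean([1.0, 1.0], noise_fn, speaker_fn))  # [5.0, 5.0]
```

Comparing this with the cascade order shows the design choice directly: swapping the composition order is what decouples the speaker transform from the noise condition.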
0:09:20 The diagram on the left shows how we evaluate this joint transform scheme in the experiments. Given adaptation data from noise condition one, the noise transform and the speaker transform are estimated jointly. Then, for the same speaker in a different noise condition, we update only the noise transform; the speaker transform is the one obtained in the previous estimation, and we generate the adapted acoustic models. The benefit of this scheme is that, since only the noise transform needs to be updated, this can be done on a single utterance. So we can do joint speaker and noise adaptation on a single utterance, which is very flexible.
0:10:18 Now let us come to the experiments. We evaluate this approach on the AURORA 4 task, which is derived from the Wall Street Journal 5k task. We have four test sets. Set A is test01, which is the clean set; in set B we have six different types of added noise; and sets C and D come from a far-field microphone. For the acoustic model training we do some pretty standard stuff.
0:10:54 These are the results from the batch-mode experiments. In batch mode, the speaker and noise transforms are estimated for each speaker in each test set, so there is no sharing of speaker transforms. We can see that by combining speaker and noise adaptation we achieve a significant gain over noise adaptation only. We can also see that, since the joint scheme just reverses the order of the linear transform and the noise transform, the order is not very sensitive here: it does not impact performance too much.
0:11:44 But we want to emphasise that these are batch-mode experiments, which require adaptation data to estimate the transforms, so this setup is not very flexible to use.
0:12:00 What is more interesting are the factorisation experiments. In these experiments we estimate the speaker transform from the clean set, which is test01, and apply this speaker transform in other noise conditions. We can see from the second row of the table that the speaker transform estimated from clean speech does not generalise to sets B, C and D, the noisy sets: in the cascaded scheme it actually decreases performance, because the MLLR transform in this case is acting on the VTS-adapted means, so it is associated with a specific noise condition and cannot be used in other, unseen noise conditions.
0:12:58 What is more interesting is that, if we estimate the speaker transform from another noisy set, test04, which contains restaurant noise, the joint adaptation scheme actually gets somewhat better results.
0:13:22 This is an interesting result. It might indicate that the linear transform in the joint scheme captures something that should be modelled by the VTS transform, which means our factorisation may not be perfect. But the bottom line is that the adaptation using the speaker transform estimated from the clean set achieved 14.4, which is very close to the matched batch-mode experiments. This demonstrates that we can factorise out the speaker transform and, by reusing this speaker transform, achieve very good noise adaptation.
0:14:13 So here I arrive at my conclusions. In this talk we argued that handling multiple acoustic factors is important in very complex, realistic acoustic environments. We presented a powerful and flexible approach based on acoustic factorisation, with which we derived a new adaptation scheme called joint adaptation. This allows very rapid speaker and noise adaptation: the speaker transform can be used across multiple acoustic conditions.
0:14:45 A little bit about our newer experiments: we have compared our approach with a feature-enhancement style noise robustness scheme combined with MLLR speaker adaptation. We observed that the joint scheme outperformed it even when the feature-enhancement scheme was employed in such a factorisation mode. This demonstrates the power of the model-based framework, and the insight is that we have a very powerful and flexible tool, which is acoustic factorisation. Thank you.
0:15:36 [Session chair] We have time for a couple of questions; there are microphones behind the projectors.
0:16:04 [Audience question, partly inaudible] I have a question about the factorisation: to what extent does the assumption hold that the two transforms are orthogonal?
0:16:25 Yes, I think you are right that this is actually not a perfect factorisation. As you can see, we have a bias on the linear transform, and the channel distortion is also a bias, so we cannot separate the two entirely. But for the main part, the speaker transform is a linear transform while the noise transform is a nonlinear transform, so these two different types of transform, when combined, can actually be factorised. And as the experiments demonstrated, we can reuse the speaker transform, because the factorisation property is quite good.
0:17:21 It is hard to say mathematically that they are completely orthogonal to each other, but we can see from the experiments that the speaker transform can be used in unseen noise conditions, so the factorisation holds, at least empirically.
0:17:40 [Session chair] Any more questions? Then let us thank the speaker again.