Speech Transcript - SPEAKER AND NOISE FACTORISATION ON THE AURORA4 TASK

[unintelligible] Good morning, everyone. I am very pleased to present our recent work, titled "Speaker and noise factorisation on the AURORA4 task" [unintelligible]. This is joint work with my supervisor, Mark Gales.

Here is the outline of the talk. First I will say something about the model-based approach for robust speech recognition. There are a lot of schemes that have been developed over the years that address a specific acoustic distortion, including speaker adaptation and noise robustness. Then we will discuss the options we have to handle multiple acoustic factors, and in this context the concept of acoustic factorisation is introduced. As an example, we derive a new adaptation scheme, which we call Joint, that handles speaker and noise distortions. Then come the experiments and conclusions.

So we start from the acoustic environment. As we all know, the speech signal can be influenced by several factors. As in this diagram, we have speaker differences, channel mismatch, and also some sort of background noise and room reverberation. All of these factors can affect the speech and add unwanted variations to the speech signal, so this makes robust speech recognition a very challenging task.

In this work we consider using the model-based approach to handle multiple acoustic factors. In this framework we have a canonical acoustic model to model the desired variations, and we use a set of transforms to adapt the canonical model to different acoustic conditions. Over the years, different transforms have been developed to handle
specific single acoustic factors, including speaker adaptation and noise compensation schemes. So how to combine these transforms to handle multiple acoustic factors in an effective and efficient way is the central topic of this talk.

Let's first look at speaker adaptation. As we all know, linear transforms are widely used to adapt acoustic models; this is the MLLR mean transform. This linear transform is very simple but very effective in practice. But there is a limitation: there is a relatively large number of parameters to estimate, so we cannot do robust estimation on a single utterance. So this linear transform is not suitable for very rapid adaptation. Another interesting point to make is that this linear transform was originally designed for speaker adaptation, but as it is a general linear transform, it can also be extended to environment adaptation.

Next we look at noise compensation schemes. Normally a mismatch function is defined for the impact of the environment. This is the first equation; it is a nonlinear function that describes how the channel distortion and additive noise affect the clean speech. Based on this mismatch function, model-based approaches modify the acoustic models to better represent the noisy speech distributions. VTS is used here; the second equation shows how we can adapt acoustic models using VTS-based model compensation schemes. You can see from the equations that, because we use the mismatch function, this transform is highly constrained and nonlinear. There is a relatively small number of parameters to estimate, so we can do very rapid adaptation, since the noise
transform can be estimated from a single utterance.

So we have talked about speaker adaptation and noise compensation schemes; how do we combine them? In practice we have a very simple, very straightforward combination scheme, which we call the VTS-MLLR combination. The equation here describes how we do it: we first adapt the acoustic models using VTS transforms, and following that we apply a linear transform to reduce the mismatch. The diagram shows how we do this: given an acoustic condition, we estimate the noise transform and the speaker transform on a block of data, and if either the speaker or the noise changes, the transform has to be re-estimated. So the limitation is obvious: the linear transform has to be estimated on a block of data, so this kind of combination cannot do very rapid adaptation, since it requires a block of data.

Alternatively, we can do this in another way, which we call acoustic factorisation. We decompose the transform and constrain each transform to model a specific acoustic factor. In this case we have a speaker transform and a noise transform, which are orthogonal to each other. This gives us some freedom in using these transforms. For example, if we know that the same speaker is speaking in changing noise conditions, and we want to move from noise condition one to noise condition two, we can keep the speaker transform, as we know the speaker has not changed, and just do a noise adaptation. In a similar way, if the environment has not changed but the speaker has changed to another speaker, we can update the
speaker transform without updating the noise transform. So this kind of acoustic factorisation allows the speaker transform to be used in a range of noise conditions, and similarly for the noise transform.

One issue with this approach is that, to use the transforms in a factorised fashion, we need to jointly estimate both the speaker and noise transforms, since the data are affected by the two acoustic factors simultaneously.

Based on this concept, we derive a new adaptation scheme, which we call Joint, and the diagram on the right-hand side shows how we manipulate the transforms. In contrast to the previous VTS-MLLR scheme, this approach uses the transforms in the reverse order: we apply the linear transform first, and then map the modified clean speech to the noisy speech distribution. By doing so, the linear transform is acting on the speaker-independent clean speech, and the VTS transform is acting on the speaker-dependent clean speech distribution. These are the properties we expect in speaker adaptation and noise compensation, so we expect these two transforms can be, in some sense, factorised, or orthogonal to each other, so we can apply them independently.

This is how we use the Joint transforms in the experiments. We have some data from one acoustic condition, say noise condition one and a given speaker, and we estimate the noise transform and the speaker transform jointly. Then, for the same speaker in a new noise condition, we just update the noise transform, and use the speaker transform we obtained in the previous estimation
to generate the adapted acoustic models. So this gives us the freedom that, since only the noise transform needs to be updated, this can be done on a single utterance. So we can do this joint speaker and noise adaptation on a single utterance, which is very flexible.

So let's go to the experiments. We ran the experiments on the AURORA4 task. This is derived from the Wall Street Journal task, and we have four test sets defined there. Set A is test 01, which is the clean set. In set B we have six different types of noise added. Sets C and D come from the far-field microphones. For the acoustic model training we do some pretty standard stuff.

These are the experiments in batch mode. By batch mode I mean the speaker and noise transforms are estimated for each speaker in each test set, so there is no sharing of speaker transforms. We can see that by combining speaker and noise adaptation, we achieve significant gains over noise adaptation only. We can also see that, since Joint just reverses the order of the VTS and MLLR transforms, the order does not impact performance too much. But we want to emphasise that this is a batch-mode experiment, which requires a block of data to estimate the transforms, so it is not very flexible to use.

What is more interesting is the factorisation experiments. In these experiments we estimate the speaker transform from the clean set, which is test 01, and we apply the speaker transform in all the noise conditions. We can see from the table that when the speaker transform estimated from clean
speech is applied to the noisy sets, B, C and so on, the VTS-MLLR transform does not generalise; it actually decreases performance, because the MLLR transform in this case is acting on the VTS-adapted mean, so it is associated with a specific noise condition and cannot be used in other noise conditions.

What is more interesting is that if we estimate the speaker transform from a noisy set, test 04, which I think is restaurant noise, the Joint adaptation scheme actually gets somewhat better results. This is interesting, and it might indicate that the MLLR transform in Joint may be modelling something that should be modelled by the VTS transform, which means that our factorisation may not be perfect. But the number for adaptation using the speaker transform estimated from test 04 is very close to the batch-mode experiments. This demonstrates that we can factorise the speaker transform and use the speaker transform in a wide range of noise conditions.

So here I arrive at my conclusions. In this talk we argue that handling multiple acoustic factors is important in very complex, realistic acoustic environments. We present a powerful and flexible approach based on acoustic factorisation, and we derive a new adaptation scheme called Joint. This allows very rapid speaker and noise adaptation, and the speaker transform can be used across multiple acoustic conditions.

Just a little bit about some new experiments: we have compared our approach with feature
enhancement style noise robustness schemes combined with speaker adaptation. We observe that Joint outperforms these feature-based schemes in both batch and factorisation modes. This demonstrates the power of the model-based framework, and inside this framework we have a very powerful and flexible tool, that is, acoustic factorisation.

Thank you. We have time for a couple of questions. [unintelligible]

So I have a question about the speaker and noise factorisation. [Largely unintelligible; the question concerns the extent to which the factorisation assumption holds.]

Yes, I think you are right that this is actually not a perfect factorisation. As you can see, we have a bias in the linear transform, and we also have the channel distortion, which is actually also a bias. But for the main part, the speaker transform is a linear transform and the noise transform is a nonlinear transform, so these two different types of transform, when combined, can actually be factorised. As the experiments demonstrated, the factorisation property is quite good. We cannot say, mathematically, that they are orthogonal to each other, but we can see from this that we can use the speaker transform in a wide range of conditions, so that is a kind of factorisation.

Any other questions? Let's thank the speaker.
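
Supplementary sketch: the Joint ordering described in the talk (linear speaker transform on the clean-speech mean first, then the noise mismatch function) can be illustrated with a few lines of Python. This is only a minimal sketch under loud assumptions, not the authors' implementation: it works per log-spectral bin without the cepstral DCT, compensates static means only (no variances, no first-order VTS expansion), and does not show the joint estimation of the transforms. All function names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def speaker_transform(mean: Sequence[float], A: Matrix, b: Sequence[float]) -> Vector:
    """MLLR-style linear mean transform: A @ mean + b."""
    return [sum(a * m for a, m in zip(row, mean)) + bias for row, bias in zip(A, b)]


def noise_transform(clean: Sequence[float], noise: Sequence[float],
                    channel: Sequence[float]) -> Vector:
    """Mismatch function per log-spectral bin: y = log(exp(x + h) + exp(n)).

    Equivalent to y = x + h + log(1 + exp(n - x - h)), written in a
    numerically stable form. Assumption: no DCT, static means only.
    """
    out = []
    for x, n, h in zip(clean, noise, channel):
        s = x + h  # channel-distorted clean speech
        out.append(max(s, n) + math.log1p(math.exp(-abs(s - n))))
    return out


def joint_adapt(canonical_mean: Sequence[float],
                speaker: Tuple[Matrix, Sequence[float]],
                noise: Tuple[Sequence[float], Sequence[float]]) -> Vector:
    """Joint ordering: speaker transform first, then the noise mismatch function."""
    A, b = speaker
    n, h = noise
    return noise_transform(speaker_transform(canonical_mean, A, b), n, h)


# Hypothetical numbers, for illustration only.
canonical = [10.0, 8.0]
speaker_k = ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5])
noise_1 = ([6.0, 6.0], [0.0, 0.0])  # (additive noise mean, channel offset)
noise_2 = ([9.0, 9.0], [0.0, 0.0])

adapted_1 = joint_adapt(canonical, speaker_k, noise_1)
# New noise condition, same speaker: only the noise parameters change.
adapted_2 = joint_adapt(canonical, speaker_k, noise_2)
print(adapted_1, adapted_2)
```

The point of the sketch is the factorised usage: `speaker_k` is passed unchanged to both calls, and only the noise parameters are swapped when the noise condition changes.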

SPEAKER AND NOISE FACTORISATION ON THE AURORA4 TASK

Robust ASR

Presented by: Yongqiang Wang, Author(s): Yongqiang Wang, Mark Gales, University of Cambridge, United Kingdom