I'm going to be talking about robust speech recognition using Dynamic Noise Adaptation, and I guess we'll go right into it. In outline, I'll start by giving a brief overview of the area and of DNA, review the DNA model, briefly go over how we do inference, and then jump right to experiments and results.

Okay, so model-based robust ASR. It's a well-established paradigm: basically, you use explicit models of noise, channel distortion, and their interaction with speech to model noisy speech. Many interesting modeling and inference techniques have been developed over the last, I'd say, twenty-five years, including interaction models; there are many references, and I'll have to refer you to the paper. One thing that is conjecture to some degree, but which I do believe, is basically this: the relevance to commercial-grade ASR has not been definitively established. Why is that? Well, it's because we have promising word error rate reductions on less sophisticated ASR systems, due to small training or test sets, simple pipelines, and/or artificially mixed data. So we don't know, as of this moment, whether these techniques actually improve state-of-the-art ASR. But I'll give you some evidence of at least one model that does improve a state-of-the-art ASR system.

Okay, so Dynamic Noise Adaptation, just as a review. It's a model-based approach: essentially, you use a GMM of one sort or another to model speech, plus a model of noise, in this case a Gaussian process model. Some features of the approach: it's a mismatch model, there's no training required for the noise model, there's no system retraining required, and the technique models uncertainty in the noise estimate; I'll show you briefly why that's important for good speech and noise tracking. In terms of previous results: DNA has actually been kicking around for a while now; we dust it off every couple of years and try it on more realistic data.
Previous publications have shown that it significantly outperforms techniques like the AFE and fMLLR on Aurora 2 and on the DNA-plus-Aurora-2 tests. These are artificially mixed data sets, though; we had never tried it on real data. So in this work, which I'll talk about today, I'll review some new results on real data using the best embedded speech recognizer that we have, I'll discuss how DNA has been adapted for low-latency deployment, and I'll show you some results with nice gains, for example a twenty-two percent word error rate reduction below six dB SNR on a system that includes spectral subtraction, fMLLR, and fMPE.

First, briefly, here's the DNA generative model. I'll get into the components right away, but basically the idea is that this is a general, modular model: there's a channel model, a speech model, and a noise model, and they're combined with an interaction model. You can play with how the pieces are structured and put together within the general framework, which is very useful if you want to extend the model, or change the inference algorithm to make it stronger.

Okay, so the interaction model. In the time domain, this is the standard distortion-plus-noise model. In the frequency domain, and I think most of us in this room are familiar with this, we approximate the term with the phase, the third term in the sum, in this paper as in most other papers. There are some better interaction models out there, by Li Deng and others; they're a little more computationally intensive, but they are more accurate. So our probability model for the data we observe, y, given the speech x, the channel h, and the noise n, in the log mel power spectral domain, is a normal distribution with the interaction function f as shown.
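In code, the log-mel interaction function and the Gaussian observation model might look like the following minimal sketch. This is my illustration, not the paper's implementation: the phase term is simply dropped, and the variance symbol psi2 is my own notation.

```python
import numpy as np

def interaction(x, h, n):
    """Standard log-mel mismatch function f(x, h, n).

    x, h, n are log mel power spectra of clean speech, channel, and
    noise; the cross (phase) term is neglected in this sketch.
    """
    # log(exp(x + h) + exp(n)), computed elementwise and stably
    return np.logaddexp(x + h, n)

def log_likelihood(y, x, h, n, psi2):
    """log p(y | x, h, n): a Gaussian around f with variance psi2."""
    d = y - interaction(x, h, n)
    return -0.5 * (d * d / psi2 + np.log(2.0 * np.pi * psi2))
```

Using np.logaddexp keeps log(exp(x + h) + exp(n)) numerically stable even when the log powers are large in magnitude.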
For the speech model, one thing we changed is that we use a bank-quantized GMM instead of a regular GMM. Essentially, for each acoustic state you have a map to a reduced number of Gaussians at each frequency: a map takes the acoustic state s to a Gaussian index for a given frequency band. The idea is that the number of Gaussians in each band is a lot smaller than the number of states, so this model can be evaluated very efficiently.

Next, the noise model for DNA. It's a simple Gaussian process model: essentially, we model the noise level as a slowly evolving process, so the prediction for the next frame, given the previous frame's noise level, is that same value, and there's some propagation noise, gamma, in the model to capture how the noise evolves over time. This is combined with a simple transient noise model, which is just a Gaussian. The idea here is that the noise level is changing slowly with respect to the frame rate, and that the transient noise actually ends up dominating the observed variation. Introducing this additional layer makes it possible to track something that doesn't look smooth at all when you look at it in the log power spectrum; with this extra layer in the model, which in essence filters out the transient noise, it's a very good model.

As for the channel model: in the past we didn't have one, because with artificial data we didn't need one. For this work we used just a stochastically adapted parameter vector. This is actually the same model that we had used in the past for modeling noise, and it works quite well.

Okay, so now we're back to the generative model I showed you a few slides ago, and now I can explain the pieces briefly. For the channel I drew a little square for the variable, because it's not actually a random variable; it's a parameter, and the arrows show that it adapts over time.
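The two-layer noise model just described can be sketched as a tiny per-band tracker. This is an illustrative sketch under my own naming (gamma2 for the propagation noise, tau2 for the transient layer), with made-up default values, not the system's.

```python
import numpy as np

class TwoLayerNoiseTracker:
    """Per-band noise-level tracker, a sketch of the DNA noise prior.

    Slowly varying level: l_t = l_{t-1} + eps,   eps   ~ N(0, gamma2)
    Transient layer:      n_t = l_t + delta,     delta ~ N(0, tau2)
    """

    def __init__(self, l0, v0, gamma2=0.01, tau2=1.0):
        self.mean, self.var = l0, v0
        self.gamma2, self.tau2 = gamma2, tau2

    def step(self, n_obs):
        # Predict: random walk on the slowly varying level.
        self.var += self.gamma2
        # Update: the transient layer acts as observation noise, so a
        # spiky frame pulls the smooth level estimate only weakly.
        k = self.var / (self.var + self.tau2)  # Kalman gain
        self.mean += k * (n_obs - self.mean)
        self.var *= (1.0 - k)
        return self.mean
```

Because the transient layer enters as observation noise, a brief spike barely moves the level estimate, which is exactly the filtering behavior described above.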
The speech model, the part inside the big grey box, consists essentially of the variables k_{f,t}, which are the Gaussian indices for a given time and frequency, and the clean speech that is generated, x_{f,t}. And of course there's the noise model that I just described, which has two layers to facilitate robust noise tracking, and the interaction model I described. One interesting thing: the box is a plate, in graphical-model notation, with that little F in the corner, meaning that this structure is duplicated for all frequencies. Actually, the only thing that binds the estimates across frequency is the speech state, s_t: given s_t, the entire model factors. More precisely, for a given segment of data over time and frequency, given all the states of the speech model, the model factors over time and frequency.

Okay, so how do we do inference in this model? Very quickly, since this has been shown before. Consider the exact noise posterior: because the noise is a Gaussian process, every time you evaluate the speech GMM you multiply the number of components in the noise posterior by the number of speech components, so the number of components in the noise posterior grows exponentially with time. In this work we therefore approximate it by a single Gaussian. We have since done some investigation of more complex methods, and so far we haven't seen any advantage, but that was on synthetic data; it still needs to be tried on real data. This slide is just showing how you compute the mean and the variance of that Gaussian; I'll skip right over that.
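The collapse itself is just moment matching: replace the mixture posterior over the noise with one Gaussian that has the same mean and variance. A one-dimensional sketch, with my own function and variable names:

```python
import numpy as np

def collapse_to_gaussian(weights, means, variances):
    """Moment-match a 1-D Gaussian mixture to a single Gaussian.

    After each frame, the mixture posterior over the noise level is
    replaced by one Gaussian with the mixture's mean and variance,
    which stops the exponential growth in components.
    """
    w = np.asarray(weights, float)
    w = w / w.sum()
    m = np.asarray(means, float)
    v = np.asarray(variances, float)
    mean = np.dot(w, m)
    # Law of total variance: expected component variance plus the
    # variance of the component means.
    var = np.dot(w, v + (m - mean) ** 2)
    return mean, var
```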
Okay, so the likelihood approximation. I showed you the interaction function before; it's nonlinear, so basically we iteratively linearize it using Algonquin, which is itself a variant of iterative VTS. One interesting thing is that the weighting of the two factors in the linearization is basically the ratio of speech and noise: it actually ends up being just the power ratio of the distorted speech over the total power under the model.

That's kind of difficult to understand in the abstract, so here's a picture. For a given time and frequency you have a joint prior for speech and noise, shown in the first image as a diagonal-covariance Gaussian, and the likelihood function at a given frequency; in this case the observation has an intensity of ten dB, and that's what the likelihood looks like. The exact posterior of the model looks like the third image. So what do we do? We linearize and compute the posterior under that approximation, and in this case you can see that it's actually nothing like the true posterior. But if we iterate the process, we get a result that is very faithful to the true posterior shown in the previous slide. So there are two lessons here. Iteration is important. And we know that modeling the uncertainty in the noise, when the noise is adapting, is clearly very important: if I didn't have uncertainty in the noise estimate, then this would have been my answer, which is wrong.

Okay, so to reconstruct the speech: this is a front-end technique, at least that's how we've been using it in this work, so we run DNA, then reconstruct and feed the result to the back-end for recognition. The reconstruction is just the expected value under a Gaussian, or rather under a mixture of Gaussians, a highly structured mixture of Gaussians at a given time step, given the priors for that frame.
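In spirit, that expected-value reconstruction is one line. Here's a sketch in which the per-component posterior weights and the posterior means of the clean speech are assumed to be already computed by the inference step:

```python
import numpy as np

def reconstruct(post_weights, post_means):
    """MMSE reconstruction of the log-mel clean speech at one frame.

    post_weights[k] is the posterior probability of mixture component
    k at this frame; post_means[k] is the posterior mean of the clean
    speech under component k. The value fed to the back-end is the
    expected clean speech under the mixture.
    """
    w = np.asarray(post_weights, float)
    return np.dot(w / w.sum(), np.asarray(post_means, float))
```

So the front-end hands the recognizer a posterior-weighted average of the per-component clean-speech estimates, rather than a hard decision from any single component.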
Okay, so let's jump straight into some results. The data we tested on was US English in-car speech recorded in various noise conditions, mostly, you know, cars speeding by, acceleration of the motor, and whatnot. The training data is about eight hundred thousand utterances, which is about seven hundred and eighty-six hours. The test data is a held-out set of one hundred twenty-eight speakers, approximately forty hours of data. There are forty-seven tasks that span a few domains (navigation, command and control, radio, digits, dialling) and seven regional US accents.

The models we used: in the back-end we have a pretty standard word-internal, plus-or-minus-two phonetic context acoustic model with 8,865 context-dependent states. We built three back-end models: an ML model, a "clean" ML model trained only on data at twenty dB SNR and above, and an fMPE model. Once trained, the models are compressed using hierarchical bank quantization for efficiency. The DNA speech model, which runs in the front-end, is its own little GMM: it's set in the log-mel domain and actually has just two hundred fifty-six speech components and sixteen silence components. We compress this model as well, to a BQ-GMM as I described, so there are only a few Gaussians per dimension that you actually need to evaluate; in terms of the number of Gaussians it's similar to what a speech detector would use.

The adaptation algorithms that we tried in conjunction with the testing include spectral subtraction and DNA alone (mean normalization was always in the pipeline), and we tried switching fMLLR and fMPE on and off to see how the techniques interact. For spectral subtraction we used a standard spectral subtraction module: a model-based speech detector, the noise estimated from the speech-free frames, and a floor value based on the current noise estimate to avoid musical noise. The fMLLR is an online version, stochastically adapted as frames come in. The fMPE model implements the feature transformation using five hundred and twelve Gaussians, with a seventeen-frame inner context and a nine-frame outer context window.
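The spectral subtraction recipe described above (noise estimated from speech-free frames, with a floor tied to the current noise estimate) might be sketched like this; the alpha and beta values are illustrative, not the system's.

```python
import numpy as np

def spectral_subtract(frames, speech_free, alpha=0.98, beta=0.1):
    """Baseline spectral subtraction in the power domain (a sketch).

    frames:      (T, F) array of noisy power spectra
    speech_free: length-T boolean mask from a speech detector
    alpha:       forgetting factor for the running noise estimate
    beta:        floor, as a fraction of the current noise estimate,
                 which suppresses musical noise
    """
    # Initialize the noise estimate from early speech-free frames.
    noise = frames[speech_free][:3].mean(axis=0)
    out = np.empty_like(frames)
    for t, frame in enumerate(frames):
        if speech_free[t]:
            # Update the noise estimate only where there is no speech.
            noise = alpha * noise + (1.0 - alpha) * frame
        out[t] = np.maximum(frame - noise, beta * noise)
    return out
```

The floor matters: without it, near-zero and negative differences turn into isolated spectral peaks, which is what produces the characteristic musical noise.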
Okay, so this is a pretty busy graph; we'll spend a few minutes taking a look at it. The first thing I want you to look at: basically, compare the red lines to the blue lines. This compares turning DNA off to turning DNA on. Each curve shows the word error rate as a function of SNR; there's also a histogram in the background showing how much data is in each SNR bin, with the number of words on the right-hand side, on the second y-axis rather. Anyhow, you can see the general trend: forget the green curves for now, but going from the red to the blue curves, we're getting significant reductions in word error rate when we turn on DNA. A little more concretely, the red curve, dashed with diamonds, is with spectral subtraction only; when we turn on DNA together with spectral subtraction we get the dashed blue curve, and we're getting nice gains, particularly at low SNR. An interesting thing was that when we use the clean DNA model and a clean back-end model, the green curve with the upside-down triangle, we can see that curve tracks the spectral subtraction curve, which has a multi-condition back-end. So basically you're doing as well as spectral subtraction with multi-condition training, and if you keep in mind that all of that system's training data is clean, that's pretty impressive.

Okay, so that was with fMLLR off, and now you can see what happens when you turn fMLLR on. If you're doing noise robustness experiments without fMLLR, maybe you should rethink that: it has a huge impact on performance. And fortunately, in this case, DNA complements it; it helps to generate better alignments. Again, comparing the case where only spectral subtraction is on to the case where DNA and spectral subtraction are both on, actually in this case there was
not such a big difference at low SNR. But of course these are only maximum-likelihood system results, so we're still not using our best system. So then, when we turn on fMPE, we see that in general the word error rate just drops substantially everywhere except at low SNR; it's actually up a little bit in the lowest SNR bins. That makes quite a bit of sense: you don't get a lot of training data in those bins, and that puts a lot of stress on the fMPE model, because it has to memorize combinations of speech and noise to do well in that area, and that's tricky when the noise is modifying the features a lot. Here you can see, comparing spectral subtraction alone, the red diamonds, to the DNA results with and without spectral subtraction, that we're getting big gains at very low SNR; below about fifteen dB it's definitely better to turn on DNA.

One more thing. I'm not going to go through this huge table, which is kind of daunting; you can look in the paper for details. But one thing we noticed is that we're getting consistent gains at low SNR, yet because of the amount of data at higher SNR, and the fact that there seems to be just a little bit of degradation there from the version of DNA I'm describing today, it ends up not helping the fMPE system overall. Since then, though, we've actually figured out how to make the model more parsimonious. I can't get into the details here, but at this time the latest version we have improves on this database: it gives ten percent relative gains overall, and below ten dB it's heading towards twenty percent relative. So it's improving our system quite a bit.

Okay, so at this point I'll just wrap up. DNA is working, and this data is already a bit dated now; unfortunately it's working a lot better than this now, but you'll have to wait till the next conference for me to tell
you about the rest, because of patent issues. Turning to future work: as I hope many of you in the crowd are already aware, when you use a graphical modeling framework it's modular, so you can make parts of the model stronger, as well as your inference algorithms. Obviously the inference algorithm we have for the noise is very approximate; we have a huge GMM that we approximate by a single Gaussian at each time step. Another direction would be tighter back-end integration; of course there are lots of people working on that, in Mark Gales' group, at Microsoft Research, and elsewhere. And I guess that's pretty much it. So, good news for the model-based approach. I have time for a couple of questions.

Q: Sorry, but it seems like this works really great in the cases where your word error rate is very high, but around the thirty-five dB SNRs you start to get degradation with DNA. Here, the little red triangle: it looks like your word error rate has roughly doubled when DNA is on. Am I reading that right?

A: Yeah, that is pretty bad; at the time, that was the big problem with it.

Q: But you solved that problem, with the iteration?

A: In general we have, but unfortunately I can't tell you about it yet.

Q: Just a clarification question, since I couldn't quite see: do some of these numbers correspond to clean training, a hundred hours or so, and some to noisy training?

A: Yeah. In general, except for the green curves here, which have a clean back-end and a clean DNA model, everything else is using multi-condition data.

Q: Is this applied to speech recognition only, rather than, say, speech enhancement, reconstruction from the noise?

A: Well, actually, maybe I didn't make that clear: we actually do reconstruct.

Q: That's why I have
the question: when you reconstruct the spectrum, have you listened to the output?

A: Oh yeah, sure. There have been a few publications on it in the past, and they have spectrograms; I didn't put them in this time. I mean, it is in the mel domain, so I haven't gone to the trouble of regenerating the signal to listen to it.

Q: I have one quick question here. When you compare the performance, in the condition where DNA is on, you also use the SS, the spectral subtraction, right? And is the SS the standard approach?

A: It's standard spectral subtraction. I mean, VTS would be a step up from spectral subtraction, but this one line here pretty much sums it up: you use a speech detector, and then, based on your segmentation, you estimate the noise from the frames that are speech-free, and then you adapt it according to some forgetting factor.

Q: I think it would probably make more sense to compare this with VTS, where you use a single noise estimate from the start of the utterance, no?

A: Yeah, that's a fair point. Actually, this spectral subtraction routine is pretty well tuned, and when we ran it with the noise adaptation for DNA turned off, to my slight surprise the results were the same as with spectral subtraction; so we got quite a bit of gain by turning on the adaptation of the noise. In other words, if we just initialize the noise model as in VTS, which is what everybody does, initialize on the first ten frames and then run, that actually doesn't improve our system; you have to adapt it. The reason that was shocking to me is that on this database the utterances are two to five seconds long, which means that cars are passing and whatnot, and that's affecting the noise estimate significantly enough that you need to adapt it during the utterance. That's important.
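That closing point, that a static estimate from the first frames goes stale within a few-second utterance while an adapted one tracks, can be seen in a toy example. The numbers here are synthetic and purely illustrative.

```python
import numpy as np

# A drifting noise floor over a 100-frame utterance (e.g. a car passing).
noise = np.linspace(1.0, 4.0, 100)

# VTS-style static estimate: average of the first ten frames only.
static = noise[:10].mean()

# Forgetting-factor update: keeps tracking through the utterance.
adapted = noise[0]
for x in noise:
    adapted = 0.9 * adapted + 0.1 * x

# The adapted estimate ends up far closer to the final noise level
# than the static one, which is stuck near the utterance onset.
```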