0:00:13 I'm going to be talking about robust speech recognition using dynamic noise adaptation.
0:00:19 I guess we'll go right into it. As an outline of the talk, I'll start by giving a brief overview of the area and of DNA, then review the DNA model, briefly go over how we do inference, and then jump right to experiments and results.
0:00:40 Okay, so model-based robust ASR is a well-established paradigm: basically, you use explicit models of noise, channel distortion, and their interaction with speech to model noisy speech. Many interesting modeling and inference techniques have been developed over the last twenty-five years or so, including a range of interaction models; there are many references, so I'll have to refer you to the paper.
0:01:04 One thing that is conjecture to some degree, but that I believe, is that the relevance of these methods to commercial-grade systems has not been definitively established. Why is that? Because the promising word error rate reductions have been on less sophisticated ASR systems, due to small training or test sets, simple pipelines, and/or artificially mixed data. So as of this moment we don't know whether these techniques actually improve state-of-the-art ASR. But I'll give you some evidence for at least one model that does improve a state-of-the-art ASR system.
0:01:47 Okay, so dynamic noise adaptation, just as a review. It's a model-based approach: essentially you use a GMM of one sort or another to model speech, plus a model of noise, in this case a Gaussian process model. Some features of the approach: it's a mismatch model, so there is no training required for the noise model and no system retraining required. And the technique models uncertainty in the noise estimate; I'll show you briefly why that's important for good speech and noise tracking.
0:02:26 In terms of previous results, DNA has actually been kicking around for a while now; we dust it off every couple of years and try it on more realistic data. Previous publications have shown that it significantly outperforms techniques like the ETSI advanced front-end (AFE) and fMLLR on the Aurora 2 and DNA+Aurora 2 tests. These are artificially mixed data sets, though; we had never tried it on real data.
0:02:52 So in the work I'll talk about today, I'll review some new results on real data using the best embedded speech recognizer that we have. I'll discuss how DNA has been adapted for low-latency deployments, and I'll show you some results with significant gains: for example, a twenty-two percent word error rate reduction below 6 dB SNR on a system that includes spectral subtraction, fMLLR, and fMPE.
0:03:22 First, briefly, here is the DNA generative model. I'll get into the components right away, but basically the idea is that this is a generative model which is modular: there is a channel model, a speech model, and a noise model, combined through an interaction model. You can play with how the pieces are structured and put together within the general framework, which is very useful if you want to extend the model and change the inference algorithm to make it stronger.
0:03:54 Okay, so right into the interaction model. In the time domain, this is the standard distortion-plus-noise model. In the frequency domain (I think most of us in this room are familiar with this), we approximate away the term with the phase, the third term in the sum, in this paper as in most other papers. There are some better interaction models out there, by Li Deng and others; they are a little more computationally intensive, but they are more accurate. So our probability model for the data we observe, y, given the speech x, the channel h, and the noise n, in the log mel power spectral domain, is a normal distribution with the interaction function f as shown.
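The interaction function itself is only on the slide, but the standard phase-insensitive form consistent with the description here is the following, with the variance ψ² absorbing the error of dropping the phase term:

$$
f(x, h, n) = x + h + \log\!\left(1 + e^{\,n - x - h}\right),
\qquad
p(y \mid x, h, n) = \mathcal{N}\!\left(y;\; f(x, h, n),\; \psi^2\right).
$$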
0:04:49 Okay, so jumping right into the speech model. One thing we changed is that we use a band-quantized GMM (BQ-GMM) instead of a regular GMM. Essentially, each state is mapped to a reduced number of Gaussians at each frequency: there is a map from each acoustic state s to a Gaussian atom in each frequency band f. The idea is that the number of atoms in each band is a lot smaller than the number of states, so this model can be computed and stored very efficiently. I'll give a concrete example in a second.
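As a rough illustration of why this is cheap (not the paper's implementation; the array names, shapes, and the per-band scalar treatment are my assumptions), a band-quantized GMM evaluates each shared per-band atom once and then scores every state by table lookup:

```python
import numpy as np

def bq_gmm_loglik(y, atom_means, atom_vars, atom_map):
    """Per-state log-likelihoods of frame y (one value per band).

    atom_means, atom_vars : (F, A) parameters of the A shared 1-D
        Gaussian "atoms" per band, with A << number of states S.
    atom_map : (S, F) int array; atom_map[s, f] is the atom that
        state s uses in band f.
    """
    F, A = atom_means.shape
    y = np.asarray(y, float)
    # Evaluate each atom once per band: only F * A Gaussian evaluations.
    ll_atoms = -0.5 * (np.log(2 * np.pi * atom_vars)
                       + (y[:, None] - atom_means) ** 2 / atom_vars)  # (F, A)
    # Every state's log-likelihood is then a lookup plus a sum over bands.
    ll_states = ll_atoms[np.arange(F)[:, None], atom_map.T].sum(axis=0)  # (S,)
    return ll_states
```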
0:05:21 The noise model for DNA is a simple Gaussian process model, as you can see. Essentially we model the noise level as a stationary process, so the prediction for the next frame's noise level, given the previous frame's, is that same value, plus there is some propagation noise γ in the model to capture how the noise evolves over time. This is combined with a simple transient noise model, which is just a Gaussian. The idea here is that the noise level is changing slowly with respect to the frame rate, but the transient noise actually ends up dominating. So introducing this additional layer makes it possible to track something that doesn't look smooth at all when you look at it in the log power spectrum. With this extra layer in the model, which in essence filters out the transient noise, it is a very good model.
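To make the two layers concrete, here is a minimal per-band sketch in Kalman-filter form; the variable names and this particular formulation are my assumptions rather than the paper's equations. The level does a random walk with propagation variance γ², and the transient layer acts as observation noise, so bursts don't drag the level around:

```python
def noise_level_update(mu, var, n_obs, gamma2, trans_var):
    """One frame of tracking for the slowly varying noise level.

    mu, var   : Gaussian belief about the level after the last frame
    n_obs     : observed (or inferred) log-mel noise for this frame
    gamma2    : propagation-noise variance (random-walk step size)
    trans_var : transient-noise variance (the extra layer that absorbs
                non-smooth bursts so the level itself stays smooth)
    """
    # Predict: the level's mean carries over; only uncertainty grows.
    mu_pred, var_pred = mu, var + gamma2
    # Update: the transient layer plays the role of observation noise.
    k = var_pred / (var_pred + trans_var)      # Kalman gain
    mu_new = mu_pred + k * (n_obs - mu_pred)
    var_new = (1.0 - k) * var_pred
    return mu_new, var_new
```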
0:06:29 The channel model: in the past we didn't have a channel model, because with artificial data we didn't need one. So for the channel model in this work we used just a stochastically adapted parameter vector. This is actually the same model that we had used in the past for modeling noise, and it works quite well.
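The talk doesn't give the update rule; a stochastically adapted parameter vector is often maintained with an exponential-forgetting step per frame, so purely as an assumed sketch:

```python
def adapt_channel(h, h_frame, rho=0.99):
    """Blend the running channel estimate h with this frame's point
    estimate h_frame. Both rho (the forgetting factor) and the use of
    a per-frame point estimate are illustrative assumptions, not the
    paper's actual update."""
    return rho * h + (1.0 - rho) * h_frame
```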
0:06:53 Okay, so now we're back to the generative model I showed you a few slides ago, and I can explain it briefly. For the channel model, I drew a little square for that variable because it's not actually a random variable; it's a parameter, and the arrows show that it adapts over time. The speech model, the part inside the big grey box, consists of the variables k_{f,t}, which are the Gaussian atom indices for a given time and frequency, and the clean speech that gets generated, x_{f,t}. And of course there's the noise model I just described, which has two layers to facilitate robust noise tracking, and the interaction model I described. One interesting thing is that everything in the box is on a plate (graphical model notation, with that little f in the corner), meaning this structure is duplicated for every frequency. Actually, the only thing that binds the estimates across frequency is the speech component, s_t here: given s_t, the entire model factors. Well, more precisely, for a given segment of data over time and frequency, given all the states of the speech model, the model factors over time and frequency.
0:08:26 Okay, so how do we do inference in this model? Very quickly, since this has been shown before. Consider the exact noise posterior: because the noise is a Gaussian process, every time you evaluate a GMM against it, the number of components in the noise posterior is multiplied by the number of components in the GMM, so the number of components in the noise posterior grows exponentially with time. So in this work we simply approximate it as a Gaussian. We have done some investigation of more complex methods, and so far we haven't seen any advantage from them, but that was on synthetic data; it still needs to be tried on real data. This slide is just showing how you compute the mean and the variance of that Gaussian; I'll skip right over that.
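Skipped slide aside, the collapse itself is presumably just moment-matching the mixture posterior to a single Gaussian per band, along these lines (a sketch, not the paper's exact update):

```python
import numpy as np

def collapse_to_gaussian(weights, means, variances):
    """Moment-match a 1-D Gaussian mixture posterior over the noise to a
    single Gaussian, so the component count cannot grow over time."""
    w = np.asarray(weights, float)
    w /= w.sum()
    m = np.asarray(means, float)
    mu = np.sum(w * m)
    # Law of total variance: mean of variances + variance of means.
    var = np.sum(w * (np.asarray(variances, float) + (m - mu) ** 2))
    return mu, var
```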
0:09:15 Okay, so the likelihood approximation. I showed you the interaction function before, and it's nonlinear, so basically we iteratively linearize it using Gauss-Newton; this is essentially a variant of iterative VTS. One interesting thing is that the linearization weight α, which weights the speech and noise factors in the model, actually ends up being just the ratio of the distorted speech power to the total power under the model.

0:09:53 That's kind of difficult to understand, so here's a picture. For a given time and frequency, you have a joint prior for speech and noise, shown in the first image as a diagonal-covariance Gaussian. The likelihood function at a given frequency, in this case for an observation with an intensity of ten dB, is shown next, and the exact posterior of the model looks like the third image. So what do we do? We linearize and approximate. In this case you can see that if we linearize once and compute the posterior, it's actually nothing like the true posterior. But if we iterate the process, we get a result that is very faithful to the true posterior shown on the previous slide. So there are two messages here. Iteration is important; we know that. And modeling the uncertainty in the noise when the noise is adapting is clearly very important: if I didn't have uncertainty in the noise estimate, then this would have been my answer, which is wrong.
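Here is a compact sketch of that iterated linearization for a single time-frequency cell, written as an iterated-EKF-style update under my own simplifications (channel folded into x, scalar observation), so it illustrates the idea rather than reproducing the paper's algorithm. Note that the Jacobian weights on speech and noise are exactly the power ratio α mentioned above:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_interact(x, n):
    # Log-mel interaction y ~= log(exp(x) + exp(n)), channel folded into x.
    return x + np.log1p(np.exp(n - x))

def iterated_linearized_posterior(y, mu, P, psi2, n_iter=5):
    """Gaussian posterior over z = (x, n) by iterative linearization.

    mu (2,), P (2,2): joint Gaussian prior over clean speech and noise.
    y, psi2:          observed log-mel value and interaction variance.
    """
    z = np.asarray(mu, float).copy()
    for _ in range(n_iter):
        x, n = z
        alpha = sigmoid(x - n)                 # speech fraction of total power
        J = np.array([[alpha, 1.0 - alpha]])   # Jacobian of f at current z
        S = float(J @ P @ J.T) + psi2          # innovation variance
        K = (P @ J.T) / S                      # gain, shape (2, 1)
        # Iterated-EKF mean update: relinearize about the refreshed estimate.
        innov = y - f_interact(x, n) + float(J @ (z - mu))
        z = mu + (K * innov).ravel()
    P_post = P - K @ J @ P                     # posterior covariance
    return z, P_post
```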
0:11:04 Okay, so to reconstruct the speech: this is a front-end technique, at least that's how we've been using it in this work. We run DNA, reconstruct, and feed the result to the back end for recognition. The reconstruction is just the expected value under a Gaussian, or rather a mixture of Gaussians, a highly structured mixture of Gaussians, at a given time step, given the prior for the noise.
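In other words, an MMSE point estimate. As a one-line sketch (my notation), with posterior component weights and per-component clean-speech means:

```python
import numpy as np

def reconstruct_clean_speech(post_weights, post_means):
    """Expected clean log-mel speech under a mixture-of-Gaussians
    posterior: the posterior-weighted average of component means.
    post_weights: (K,) responsibilities; post_means: (K, F) means."""
    w = np.asarray(post_weights, float)
    return (w[:, None] * np.asarray(post_means, float)).sum(axis=0)
```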
0:11:31 Okay, so let's just jump straight into some results. The data we tested on was US English in-car speech recorded in various noise conditions: mostly, you know, cars speeding by, acceleration of the motor, and whatnot, the typical in-car noise. The training data is eight hundred thousand utterances, about seven hundred and eighty-six hours. The test data comes from one hundred and twenty-eight held-out speakers, approximately forty hours of data. There are forty-seven tasks spanning a few domains (navigation, command and control, radio, digits, dialing) and seven regional US accents.
0:12:17 The models that we used in the back end: a pretty standard word-internal system with plus or minus two phones of context, 865 context-dependent states, and forty-dimensional features. We built three models on this setup: an ML model; a clean ML model trained only on data at twenty dB SNR or above; and an fMPE model. Once trained, we compressed them using hierarchical band quantization for efficiency.
0:12:49 The DNA speech model runs in the front end; DNA has its own speech model, a little GMM. It's set in the log mel domain and actually has just two hundred fifty-six speech and sixteen silence components. We compress this model as well into a BQ-GMM, as I described, so there are actually only a handful of atoms per dimension that you need to evaluate to evaluate the model. So in terms of the number of Gaussians, it's similar to what a speech detector would use.
0:13:23 So the adaptation algorithms that we tried in conjunction with this include spectral subtraction and DNA alone; mean normalization was always in the pipeline; and we tried switching fMLLR and fMPE on and off to see how the techniques interact.
0:13:40 For spectral subtraction we basically used a standard spectral subtraction module: a model-based speech detector, a noise estimate computed from the speech-free frames, and a floor value based on the current noise estimate to avoid musical noise.
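As a minimal sketch of that flooring idea (the flooring constant is illustrative, not the module's actual tuning):

```python
import numpy as np

def spectral_subtract(power, noise_est, beta=0.1):
    """Power-domain spectral subtraction with a noise-dependent floor:
    never let the output drop below beta * noise_est, which is what
    suppresses musical noise. beta is an assumed value."""
    return np.maximum(power - noise_est, beta * noise_est)
```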
0:13:55 The fMLLR here is an online version: it is stochastically adapted every twenty-five frames.
0:14:04 The fMPE model implements the feature transformation using five hundred and twelve Gaussians, with seventeen frames of inner context and nine frames of outer context.
0:14:14 Okay, so this is a pretty busy graph; we'll spend a few minutes taking a look at it. The first thing I want you to do is compare the red lines to the blue lines: this compares having DNA off to having DNA on. The red curves show the word error rate as a function of SNR. In the background there is also a histogram of how much data is in each SNR bin; you can see the number of words on the right-hand side, on the y-axis rather. Anyhow, you can see the general trend: going from the red to the blue curves (forget the green curves for now), we're getting significant reductions in word error rate when we turn on DNA. A little more concretely, the dashed red curve with diamonds is with spectral subtraction only, and when we turn on DNA together with spectral subtraction we get the dashed blue curve; we're getting big gains, particularly at low SNR.
0:15:31 An interesting thing was that when we use the clean DNA model and a clean back-end model, looking at the green curve with the upside-down triangles, we can see that the curve tracks the spectral subtraction curve that has a multi-condition back end. So basically you're doing as well as spectral subtraction, but if you remember that this system never saw the noisy training data, that's far more impressive.
0:16:06 Okay, so that was with fMLLR off; now you can see what happens when you turn fMLLR on. If you're doing noise-robustness experiments without fMLLR, maybe you should rethink that: it has a huge impact on performance, and fortunately in this case DNA complements it, helping to generate better alignments. Again you can compare, in this case, spectral subtraction alone against DNA plus spectral subtraction: actually here there was not a big difference at low SNR. But of course these are only maximum-likelihood system results, so we're still not using the best system.
0:16:51 So then, when we turn on fMPE, we see again that in general the word error rate just drops substantially everywhere, except at low SNR; actually it gets a little worse in the lowest SNR bins. That makes quite a bit of sense: you don't get a lot of training data for that bin, and it puts a lot of stress on the fMPE model, which has to memorize combinations of speech and noise to do well in that region; that's tricky when the noise is modifying the features a lot. So here you can see that, comparing spectral subtraction on (the red diamonds) to the DNA results with and without spectral subtraction, we're getting big gains at very low SNR, and below about fifteen dB it's definitely better to turn on DNA.
0:17:52 One thing: I'm not going to go through this huge table, it's kind of daunting, but you can look in the paper for details. One thing we noticed is that we're getting consistent gains at low SNR; but because of the amount of data at high SNR, and the fact that there seems to be just a little bit of degradation there from the version of DNA I'm describing today, it ends up not helping the fMPE system overall. Since then, though, we've actually figured out how to make the model more parsimonious. I can't get into the details here, but at this time the latest version we have improves on this database: it gives ten percent relative gains overall, and below ten dB it's heading towards twenty percent relative. So it's improving our system quite a bit.
0:18:54 Okay, at this point I'll just wrap up. The main message: DNA is working, and on this data, which is pretty dated now, it's actually working a lot better than what I've shown; unfortunately you'll have to wait until the next conference for me to tell you about the rest, because of patent issues.
0:19:11 Turning to the future work: as many of you in the crowd are already aware, when you use a graphical modeling framework, it's modular, so you can make parts of the model stronger, and your inference algorithms as well. Obviously, the inference algorithm we have for the noise is very approximate: we have a huge GMM that we approximate by a Gaussian at each time step. Another direction would be tighter back-end integration; of course there are a lot of people working on that, in Mark Gales' group and at Microsoft Research. And I guess that's pretty much it: it's good news for the model-based approach.
0:20:03 Yeah, we have time for a couple of questions.
0:20:29 That's a great slide, but it seems like it works really great in the cases where your word error rate is very high; at the high SNRs, though, you start to get degradation from DNA. Here, the little red triangle: it looks like your word error rate has nearly doubled when you turn DNA on. Am I reading that right?

0:20:57 Yeah, that is pretty bad; at the time, that was the big problem with it.

0:21:05 But you solved that problem, with the iteration? You solved that problem?

0:21:09 In general we solved that, but unfortunately I can't tell you how yet.
0:21:16 Just a clarification question; I didn't quite see the numbers. Which numbers correspond to clean training, and which to noisy?

0:21:30 Oh yeah: in general, except for the green curves here, which have a clean back end and a clean DNA model, everything else is using multi-condition data.
0:21:53 A question: you went to speech recognition rather than, let's say, speech enhancement, reconstruction from the noise?

0:22:07 Well, actually, maybe I didn't make that clear: we actually do reconstruction. That's why I took the question.

0:22:16 (Largely inaudible) So could you show the reconstructed spectra?

0:22:29 Oh yeah, sure; there have been a few publications on it in the past, and they have spectrograms; I didn't put them in this time. And it is in the mel domain, so I haven't gone to the trouble of regenerating the signal to listen to it.
0:22:47 One more question?

0:22:51 Just one quick question here. When you compared the performance, you compared DNA on versus off, and you also used SS, that is, spectral subtraction, right?

0:23:04 Yes.

0:23:05 And what is the difference to the standard VTS approach? Is this a standard VTS?

0:23:15 DNA would be a step up, let's say, from spectral subtraction. This one line here pretty much sums up the spectral subtraction: you use a speech detector, then based on your segmentation you estimate the noise from the frames that are speech-free, and then you adapt it according to some forgetting factor.

0:23:42 I think it would probably make more sense to compare this with VTS, because with SS you use a single estimate.

0:23:47 Yeah, that's a fair point. Actually, this spectral subtraction routine is pretty well tuned, so when we ran DNA with its noise adaptation turned off, the results were the same as with spectral subtraction. So we got quite a bit of gain by turning on the adaptation of the noise. In other words, if we just initialize the noise model, which is what everybody does with VTS (initialize on the first ten frames and then run VTS), that actually doesn't improve our system. You have to adapt it. The reason that was shocking to me is that in this database the utterances are two to five seconds long. That means cars are passing and whatnot, and that's affecting the noise estimate significantly enough that you need to adapt it during the utterance. That's important.