0:00:13 I'm going to be talking about robust speech recognition using dynamic noise adaptation.
0:00:19 I guess we'll go right into it. As an outline of the talk, I'll start by giving a brief overview of the area and of DNA, then review the DNA model, briefly go over how we do inference, and then jump right to experiments and results.
0:00:40 Okay, so model-based robust ASR is a well-established paradigm: basically, you use explicit models of noise, channel distortion, and their interaction with speech to model noisy speech. Many interesting modeling and inference techniques have been developed over the last twenty-five years or so, including a range of interaction models; there are many references, so I'll have to refer you to the paper.
0:01:04 One thing that is conjecture to some degree, but that I believe, is that the relevance of these methods to commercial-grade systems has not been definitively established. Why is that? Because the promising word error rate reductions have been on less sophisticated ASR systems, due to small training or test sets, simple pipelines, and/or artificially mixed data. So as of this moment we don't know whether these techniques actually improve state-of-the-art ASR. But I'll give you some evidence for at least one model that does improve a state-of-the-art ASR system.
0:01:47 Okay, so dynamic noise adaptation, just as a review. It's a model-based approach: essentially you use a GMM of one sort or another to model speech, plus a model of noise, in this case a Gaussian process model. Some features of the approach: it's a mismatch model, so there is no training required for the noise model and no system retraining required. And the technique models uncertainty in the noise estimate; I'll show you briefly why that's important for good speech and noise tracking.
0:02:26 In terms of previous results, DNA has actually been kicking around for a while now; we dust it off every couple of years and try it on more realistic data. Previous publications have shown that it significantly outperforms techniques like the ETSI advanced front-end (AFE) and fMLLR on the Aurora 2 and DNA+Aurora 2 tests. These are artificially mixed data sets, though; we had never tried it on real data.
0:02:52 So in the work I'll talk about today, I'll review some new results on real data using the best embedded speech recognizer that we have. I'll discuss how DNA has been adapted for low-latency deployments, and I'll show you some results with significant gains: for example, a twenty-two percent word error rate reduction below 6 dB SNR on a system that includes spectral subtraction, fMLLR, and fMPE.
0:03:22 First, briefly, here is the DNA generative model. I'll get into the components right away, but basically the idea is that this is a generative model which is modular: there is a channel model, a speech model, and a noise model, combined through an interaction model. You can play with how the pieces are structured and put together within the general framework, which is very useful if you want to extend the model and change the inference algorithm to make it stronger.
0:03:54 Okay, so right into the interaction model. In the time domain, this is the standard distortion-plus-noise model. In the frequency domain (I think most of us in this room are familiar with this), we approximate away the term with the phase, the third term in the sum, in this paper as in most other papers. There are some better interaction models out there, by Li Deng and others; they are a little more computationally intensive, but they are more accurate. So our probability model for the data we observe, y, given the speech x, the channel h, and the noise n, in the log mel power spectral domain, is a normal distribution with the interaction function f as shown.
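The interaction function itself is only on the slide, but the standard phase-insensitive form consistent with the description here is the following, with the variance ψ² absorbing the error of dropping the phase term:

$$
f(x, h, n) = x + h + \log\!\left(1 + e^{\,n - x - h}\right),
\qquad
p(y \mid x, h, n) = \mathcal{N}\!\left(y;\; f(x, h, n),\; \psi^2\right).
$$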
0:04:49 Okay, so jumping right into the speech model. One thing we changed is that we use a band-quantized GMM (BQ-GMM) instead of a regular GMM. Essentially, each state is mapped to a reduced number of Gaussians at each frequency: there is a map from each acoustic state s to a Gaussian atom in each frequency band f. The idea is that the number of atoms in each band is a lot smaller than the number of states, so this model can be computed and stored very efficiently. I'll give a concrete example in a second.
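As a rough illustration of why this is cheap (not the paper's implementation; the array names, shapes, and the per-band scalar treatment are my assumptions), a band-quantized GMM evaluates each shared per-band atom once and then scores every state by table lookup:

```python
import numpy as np

def bq_gmm_loglik(y, atom_means, atom_vars, atom_map):
    """Per-state log-likelihoods of frame y (one value per band).

    atom_means, atom_vars : (F, A) parameters of the A shared 1-D
        Gaussian "atoms" per band, with A << number of states S.
    atom_map : (S, F) int array; atom_map[s, f] is the atom that
        state s uses in band f.
    """
    F, A = atom_means.shape
    y = np.asarray(y, float)
    # Evaluate each atom once per band: only F * A Gaussian evaluations.
    ll_atoms = -0.5 * (np.log(2 * np.pi * atom_vars)
                       + (y[:, None] - atom_means) ** 2 / atom_vars)  # (F, A)
    # Every state's log-likelihood is then a lookup plus a sum over bands.
    ll_states = ll_atoms[np.arange(F)[:, None], atom_map.T].sum(axis=0)  # (S,)
    return ll_states
```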
0:05:21 The noise model for DNA is a simple Gaussian process model, as you can see. Essentially we model the noise level as a stationary process, so the prediction for the next frame's noise level, given the previous frame's, is that same value, plus there is some propagation noise γ in the model to capture how the noise evolves over time. This is combined with a simple transient noise model, which is just a Gaussian. The idea here is that the noise level is changing slowly with respect to the frame rate, but the transient noise actually ends up dominating. So introducing this additional layer makes it possible to track something that doesn't look smooth at all when you look at it in the log power spectrum. With this extra layer in the model, which in essence filters out the transient noise, it is a very good model.
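To make the two layers concrete, here is a minimal per-band sketch in Kalman-filter form; the variable names and this particular formulation are my assumptions rather than the paper's equations. The level does a random walk with propagation variance γ², and the transient layer acts as observation noise, so bursts don't drag the level around:

```python
def noise_level_update(mu, var, n_obs, gamma2, trans_var):
    """One frame of tracking for the slowly varying noise level.

    mu, var   : Gaussian belief about the level after the last frame
    n_obs     : observed (or inferred) log-mel noise for this frame
    gamma2    : propagation-noise variance (random-walk step size)
    trans_var : transient-noise variance (the extra layer that absorbs
                non-smooth bursts so the level itself stays smooth)
    """
    # Predict: the level's mean carries over; only uncertainty grows.
    mu_pred, var_pred = mu, var + gamma2
    # Update: the transient layer plays the role of observation noise.
    k = var_pred / (var_pred + trans_var)      # Kalman gain
    mu_new = mu_pred + k * (n_obs - mu_pred)
    var_new = (1.0 - k) * var_pred
    return mu_new, var_new
```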
0:06:29 The channel model: in the past we didn't have a channel model, because with artificial data we didn't need one. So for the channel model in this work we used just a stochastically adapted parameter vector. This is actually the same model that we had used in the past for modeling noise, and it works quite well.
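The talk doesn't give the update rule; a stochastically adapted parameter vector is often maintained with an exponential-forgetting step per frame, so purely as an assumed sketch:

```python
def adapt_channel(h, h_frame, rho=0.99):
    """Blend the running channel estimate h with this frame's point
    estimate h_frame. Both rho (the forgetting factor) and the use of
    a per-frame point estimate are illustrative assumptions, not the
    paper's actual update."""
    return rho * h + (1.0 - rho) * h_frame
```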
0:06:53 Okay, so now we're back to the generative model I showed you a few slides ago, and I can explain it briefly. For the channel model, I drew a little square for that variable because it's not actually a random variable; it's a parameter, and the arrows show that it adapts over time. The speech model, the part inside the big grey box, consists of the variables k_{f,t}, which are the Gaussian atom indices for a given time and frequency, and the clean speech that gets generated, x_{f,t}. And of course there's the noise model I just described, which has two layers to facilitate robust noise tracking, and the interaction model I described. One interesting thing is that everything in the box is on a plate (graphical model notation, with that little f in the corner), meaning this structure is duplicated for every frequency. Actually, the only thing that binds the estimates across frequency is the speech component, s_t here: given s_t, the entire model factors. Well, more precisely, for a given segment of data over time and frequency, given all the states of the speech model, the model factors over time and frequency.
0:08:26 Okay, so how do we do inference in this model? Very quickly, since this has been shown before. Consider the exact noise posterior: because the noise is a Gaussian process, every time you evaluate a GMM against it, the number of components in the noise posterior is multiplied by the number of components in the GMM, so the number of components in the noise posterior grows exponentially with time. So in this work we simply approximate it as a Gaussian. We have done some investigation of more complex methods, and so far we haven't seen any advantage from them, but that was on synthetic data; it still needs to be tried on real data. This slide is just showing how you compute the mean and the variance of that Gaussian; I'll skip right over that.
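Skipped slide aside, the collapse itself is presumably just moment-matching the mixture posterior to a single Gaussian per band, along these lines (a sketch, not the paper's exact update):

```python
import numpy as np

def collapse_to_gaussian(weights, means, variances):
    """Moment-match a 1-D Gaussian mixture posterior over the noise to a
    single Gaussian, so the component count cannot grow over time."""
    w = np.asarray(weights, float)
    w /= w.sum()
    m = np.asarray(means, float)
    mu = np.sum(w * m)
    # Law of total variance: mean of variances + variance of means.
    var = np.sum(w * (np.asarray(variances, float) + (m - mu) ** 2))
    return mu, var
```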
0:09:15 Okay, so the likelihood approximation. I showed you the interaction function before, and it's nonlinear, so basically we iteratively linearize it using Gauss-Newton; this is essentially a variant of iterative VTS. One interesting thing is that the linearization weight α, which weights the speech and noise factors in the model, actually ends up being just the ratio of the distorted speech power to the total power under the model.

0:09:53 That's kind of difficult to understand, so here's a picture. For a given time and frequency, you have a joint prior for speech and noise, shown in the first image as a diagonal-covariance Gaussian. The likelihood function at a given frequency, in this case for an observation with an intensity of ten dB, is shown next, and the exact posterior of the model looks like the third image. So what do we do? We linearize and approximate. In this case you can see that if we linearize once and compute the posterior, it's actually nothing like the true posterior. But if we iterate the process, we get a result that is very faithful to the true posterior shown on the previous slide. So there are two messages here. Iteration is important; we know that. And modeling the uncertainty in the noise when the noise is adapting is clearly very important: if I didn't have uncertainty in the noise estimate, then this would have been my answer, which is wrong.
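Here is a compact sketch of that iterated linearization for a single time-frequency cell, written as an iterated-EKF-style update under my own simplifications (channel folded into x, scalar observation), so it illustrates the idea rather than reproducing the paper's algorithm. Note that the Jacobian weights on speech and noise are exactly the power ratio α mentioned above:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_interact(x, n):
    # Log-mel interaction y ~= log(exp(x) + exp(n)), channel folded into x.
    return x + np.log1p(np.exp(n - x))

def iterated_linearized_posterior(y, mu, P, psi2, n_iter=5):
    """Gaussian posterior over z = (x, n) by iterative linearization.

    mu (2,), P (2,2): joint Gaussian prior over clean speech and noise.
    y, psi2:          observed log-mel value and interaction variance.
    """
    z = np.asarray(mu, float).copy()
    for _ in range(n_iter):
        x, n = z
        alpha = sigmoid(x - n)                 # speech fraction of total power
        J = np.array([[alpha, 1.0 - alpha]])   # Jacobian of f at current z
        S = float(J @ P @ J.T) + psi2          # innovation variance
        K = (P @ J.T) / S                      # gain, shape (2, 1)
        # Iterated-EKF mean update: relinearize about the refreshed estimate.
        innov = y - f_interact(x, n) + float(J @ (z - mu))
        z = mu + (K * innov).ravel()
    P_post = P - K @ J @ P                     # posterior covariance
    return z, P_post
```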
0:11:04 Okay, so to reconstruct the speech: this is a front-end technique, at least that's how we've been using it in this work. We run DNA, reconstruct, and feed the result to the back end for recognition. The reconstruction is just the expected value under a Gaussian, or rather a mixture of Gaussians, a highly structured mixture of Gaussians, at a given time step, given the prior for the noise.
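In other words, an MMSE point estimate. As a one-line sketch (my notation), with posterior component weights and per-component clean-speech means:

```python
import numpy as np

def reconstruct_clean_speech(post_weights, post_means):
    """Expected clean log-mel speech under a mixture-of-Gaussians
    posterior: the posterior-weighted average of component means.
    post_weights: (K,) responsibilities; post_means: (K, F) means."""
    w = np.asarray(post_weights, float)
    return (w[:, None] * np.asarray(post_means, float)).sum(axis=0)
```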
0:11:31 Okay, so let's just jump straight into some results. The data we tested on was US English in-car speech recorded in various noise conditions: mostly, you know, cars speeding by, acceleration of the motor, and whatnot, the typical in-car noise. The training data is eight hundred thousand utterances, about seven hundred and eighty-six hours. The test data comes from one hundred and twenty-eight held-out speakers, approximately forty hours of data. There are forty-seven tasks spanning a few domains (navigation, command and control, radio, digits, dialing) and seven regional US accents.
0:12:17 The models that we used in the back end: a pretty standard word-internal system with plus or minus two phones of context, 865 context-dependent states, and forty-dimensional features. We built three models on this setup: an ML model; a clean ML model trained only on data at twenty dB SNR or above; and an fMPE model. Once trained, we compressed them using hierarchical band quantization for efficiency.
0:12:49 The DNA speech model runs in the front end; DNA has its own speech model, a little GMM. It's set in the log mel domain and actually has just two hundred fifty-six speech and sixteen silence components. We compress this model as well into a BQ-GMM, as I described, so there are actually only a handful of atoms per dimension that you need to evaluate to evaluate the model. So in terms of the number of Gaussians, it's similar to what a speech detector would use.
0:13:23 So the adaptation algorithms that we tried in conjunction with this include spectral subtraction and DNA alone; mean normalization was always in the pipeline; and we tried switching fMLLR and fMPE on and off to see how the techniques interact.
0:13:40 For spectral subtraction we basically used a standard spectral subtraction module: a model-based speech detector, a noise estimate computed from the speech-free frames, and a floor value based on the current noise estimate to avoid musical noise.
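As a minimal sketch of that flooring idea (the flooring constant is illustrative, not the module's actual tuning):

```python
import numpy as np

def spectral_subtract(power, noise_est, beta=0.1):
    """Power-domain spectral subtraction with a noise-dependent floor:
    never let the output drop below beta * noise_est, which is what
    suppresses musical noise. beta is an assumed value."""
    return np.maximum(power - noise_est, beta * noise_est)
```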
0:13:55 The fMLLR here is an online version: it is stochastically adapted every twenty-five frames.
0:14:04 The fMPE model implements the feature transformation using five hundred and twelve Gaussians, with seventeen frames of inner context and nine frames of outer context.
0:14:14 Okay, so this is a pretty busy graph; we'll spend a few minutes taking a look at it. The first thing I want you to do is compare the red lines to the blue lines: this compares having DNA off to having DNA on. The red curves show the word error rate as a function of SNR. In the background there is also a histogram of how much data is in each SNR bin; you can see the number of words on the right-hand side, on the y-axis rather. Anyhow, you can see the general trend: going from the red to the blue curves (forget the green curves for now), we're getting significant reductions in word error rate when we turn on DNA. A little more concretely, the dashed red curve with diamonds is with spectral subtraction only, and when we turn on DNA together with spectral subtraction we get the dashed blue curve; we're getting big gains, particularly at low SNR.
0:15:31 An interesting thing was that when we use the clean DNA model and a clean back-end model, looking at the green curve with the upside-down triangles, we can see that the curve tracks the spectral subtraction curve that has a multi-condition back end. So basically you're doing as well as spectral subtraction, but if you remember that this system never saw the noisy training data, that's far more impressive.
0:16:06 Okay, so that was with fMLLR off; now you can see what happens when you turn fMLLR on. If you're doing noise-robustness experiments without fMLLR, maybe you should rethink that: it has a huge impact on performance, and fortunately in this case DNA complements it, helping to generate better alignments. Again you can compare, in this case, spectral subtraction alone against DNA plus spectral subtraction: actually here there was not a big difference at low SNR. But of course these are only maximum-likelihood system results, so we're still not using the best system.
0:16:51 So then, when we turn on fMPE, we see again that in general the word error rate just drops substantially everywhere, except at low SNR; actually it gets a little worse in the lowest SNR bins. That makes quite a bit of sense: you don't get a lot of training data for that bin, and it puts a lot of stress on the fMPE model, which has to memorize combinations of speech and noise to do well in that region; that's tricky when the noise is modifying the features a lot. So here you can see that, comparing spectral subtraction on (the red diamonds) to the DNA results with and without spectral subtraction, we're getting big gains at very low SNR, and below about fifteen dB it's definitely better to turn on DNA.
0:17:52 One thing: I'm not going to go through this huge table, it's kind of daunting, but you can look in the paper for details. One thing we noticed is that we're getting consistent gains at low SNR; but because of the amount of data at high SNR, and the fact that there seems to be just a little bit of degradation there from the version of DNA I'm describing today, it ends up not helping the fMPE system overall. Since then, though, we've actually figured out how to make the model more parsimonious. I can't get into the details here, but at this time the latest version we have improves on this database: it gives ten percent relative gains overall, and below ten dB it's heading towards twenty percent relative. So it's improving our system quite a bit.
0:18:54 Okay, at this point I'll just wrap up. The main message: DNA is working, and on this data, which is pretty dated now, it's actually working a lot better than what I've shown; unfortunately you'll have to wait until the next conference for me to tell you about the rest, because of patent issues.
0:19:11 Turning to the future work: as many of you in the crowd are already aware, when you use a graphical modeling framework, it's modular, so you can make parts of the model stronger, and your inference algorithms as well. Obviously, the inference algorithm we have for the noise is very approximate: we have a huge GMM that we approximate by a Gaussian at each time step. Another direction would be tighter back-end integration; of course there are a lot of people working on that, in Mark Gales' group and at Microsoft Research. And I guess that's pretty much it: it's good news for the model-based approach.
0:20:03 Yeah, we have time for a couple of questions.
0:20:29 That's a great slide, but it seems like it works really great in the cases where your word error rate is very high; at the high SNRs, though, you start to get degradation from DNA. Here, the little red triangle: it looks like your word error rate has nearly doubled when you turn DNA on. Am I reading that right?

0:20:57 Yeah, that is pretty bad; at the time, that was the big problem with it.

0:21:05 But you solved that problem, with the iteration? You solved that problem?

0:21:09 In general we solved that, but unfortunately I can't tell you how yet.
0:21:16 Just a clarification question; I didn't quite see the numbers. Which numbers correspond to clean training, and which to noisy?

0:21:30 Oh yeah: in general, except for the green curves here, which have a clean back end and a clean DNA model, everything else is using multi-condition data.
0:21:53 A question: you went to speech recognition rather than, let's say, speech enhancement, reconstruction from the noise?

0:22:07 Well, actually, maybe I didn't make that clear: we actually do reconstruction. That's why I took the question.

0:22:16 (Largely inaudible) So could you show the reconstructed spectra?

0:22:29 Oh yeah, sure; there have been a few publications on it in the past, and they have spectrograms; I didn't put them in this time. And it is in the mel domain, so I haven't gone to the trouble of regenerating the signal to listen to it.
0:22:47 One more question?

0:22:51 Just one quick question here. When you compared the performance, you compared DNA on versus off, and you also used SS, that is, spectral subtraction, right?

0:23:04 Yes.

0:23:05 And what is the difference to the standard VTS approach? Is this a standard VTS?

0:23:15 DNA would be a step up, let's say, from spectral subtraction. This one line here pretty much sums up the spectral subtraction: you use a speech detector, then based on your segmentation you estimate the noise from the frames that are speech-free, and then you adapt it according to some forgetting factor.

0:23:42 I think it would probably make more sense to compare this with VTS, because with SS you use a single estimate.

0:23:47 Yeah, that's a fair point. Actually, this spectral subtraction routine is pretty well tuned, so when we ran DNA with its noise adaptation turned off, the results were the same as with spectral subtraction. So we got quite a bit of gain by turning on the adaptation of the noise. In other words, if we just initialize the noise model, which is what everybody does with VTS (initialize on the first ten frames and then run VTS), that actually doesn't improve our system. You have to adapt it. The reason that was shocking to me is that in this database the utterances are two to five seconds long. That means cars are passing and whatnot, and that's affecting the noise estimate significantly enough that you need to adapt it during the utterance. That's important.