Speech transcript - GAMMATONE SUB-BAND MAGNITUDE-DOMAIN DEREVERBERATION FOR ASR

0:00:15	Q
0:00:16	and uh so far
0:00:19	thank you for being
0:00:21	are
0:00:22	through the session and reading flat annotation
0:00:24	this paper was authored by Kshitiz Kumar, who i think was upstairs manning
0:00:29	a poster,
0:00:30	myself and my co-authors
0:00:32	and Kshitiz couldn't make it, so i will be presenting
0:00:36	so the problem we're talking about today is robustness to reverberation
0:00:42	speech is a natural medium of communication for humans
0:00:45	and we've been applying speech technologies everywhere, and these work great in controlled lab conditions
0:00:51	but when we get to real-world conditions, things sort of break down
0:00:55	and one of the reasons for this is reverberation
0:00:58	reverberation is what happens when we have reflections, so
0:01:01	in getting from me to you
0:01:03	the sound not only takes the direct path from me to you, it also bounces off walls
0:01:08	there are reflections, reflections of reflections, and so on
0:01:11	so if you actually look at a plot to the right
0:01:13	that just shows a typical impulse response from a source to a listener
0:01:18	and you can see the direct path; each of the spikes represents one of the reflections
0:01:23	they die off
0:01:24	but
0:01:25	these things continue for some time
0:01:28	so the sound that gets from the source to the listener can be thought of as being
0:01:33	linearly filtered by the impulse
0:01:34	response of the room, which is h of n here
0:01:37	now, reverberation is typically characterised through the
0:01:42	RT60 time, which indicates
0:01:44	how much time
0:01:46	a sound takes to die off by sixty dB
0:01:49	yeah if
0:01:50	after all the reflections
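As an illustrative sketch of the RT60 definition just given (not something shown in the talk), the decay time can be estimated from an impulse response with Schroeder backward integration. The impulse response here is synthetic exponentially decaying noise, which is an assumption standing in for the room on the slide, and the names `synth_rir` and `estimate_rt60` are hypothetical.

```python
import numpy as np

def synth_rir(rt60, fs=16000, length_s=1.0, seed=0):
    """Synthetic room response: white noise with an exponential envelope (assumed model)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(length_s * fs)) / fs
    # Amplitude envelope chosen so that the level has dropped by 60 dB at t = rt60.
    decay = np.exp(-3.0 * np.log(10.0) * t / rt60)
    return rng.standard_normal(t.size) * decay

def estimate_rt60(h, fs=16000):
    """Estimate RT60 from the slope of the backward-integrated energy decay curve."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]          # energy remaining after each sample
    edc_db = 10.0 * np.log10(energy / energy[0])    # decay curve in dB, starts at 0 dB
    t = np.arange(h.size) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)     # fit on the -5 dB to -35 dB range
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second (negative)
    return -60.0 / slope                            # time needed to fall by 60 dB

h = synth_rir(0.3)
rt60_est = estimate_rt60(h)
print(f"estimated RT60: {rt60_est:.3f} s")
```

Fitting over a 30 dB range and extrapolating to 60 dB is a common practical choice because the tail of a measured response is usually buried in noise.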
0:01:52	and uh, reverberation
0:01:54	ah
0:01:55	here is what reverberation does to a speech signal
0:01:59	the left
0:01:59	top panel shows a spectrogram of a signal
0:02:03	from the Resource Management database
0:02:05	we have reverberated that using an artificial room response for a room that had an
0:02:11	a
0:02:11	RT60 time of about three hundred milliseconds, and you can see what happens
0:02:15	to the spectrogram
0:02:16	it's smeared
0:02:18	and
0:02:20	you can actually note that this still looks like the spectrogram, and the entire spectrogram is smeared
0:02:25	and it actually looks as if the spectrogram itself has been passed through a linear filter
0:02:30	and here is what
0:02:31	happens
0:02:33	to
0:02:34	recognition accuracy because of reverberation
0:02:37	in this experiment we
0:02:39	trained our models with clean data from the Resource Management database
0:02:43	and we simulated room responses for a five by four by three room with the image method
0:02:49	we had reverberation times of three hundred and five hundred milliseconds
0:02:53	if we recognise clean speech you get an error of less than ten percent, which is the leftmost bar
0:02:58	but with a reverberation time of only about three hundred milliseconds, which is fairly standard
0:03:02	for a room, we can get that
0:03:07	hmmm
0:03:08	i don't see that is not audio so
0:03:11	if you that it great
0:03:13	and no
0:03:14	so we
0:03:15	it's a fairly standard room
0:03:17	and you can see that the error has immediately gone up to around fifty percent
0:03:20	and when the room response is about half a second it's
0:03:23	well over seventy percent, so it
0:03:24	degrades,
0:03:26	it degrades very rapidly with reverberation
0:03:28	so in order to deal with that, we begin by modeling the effect of reverberation
0:03:33	itself. now
0:03:33	consider
0:03:35	how we compute features
0:03:37	for speech recognition
0:03:38	you have the speech signal; look only at the grey blocks for now
0:03:42	the speech signal goes through a bank of filters, such as mel-frequency filters
0:03:46	and then at the output we compute
0:03:48	the power at the output of each of these; you compress the power using a log function
0:03:53	and then eventually compute a DCT, which gives you the features
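The grey-block pipeline described above (filter bank, power, log compression, DCT) can be sketched roughly as follows. This is a minimal frame-based version under stated assumptions: it applies triangular mel weights to a short-time power spectrum rather than running time-domain filters, and the frame sizes, filter count and the names `mel_filterbank`, `dct2_matrix` and `cepstral_features` are all hypothetical choices, not values from the talk.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale (assumed design)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct2_matrix(n_out, n_in):
    """Unnormalised DCT-II basis, one row per cepstral coefficient."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def cepstral_features(x, fs=16000, n_fft=512, hop=160, n_filters=40, n_ceps=13):
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(n_fft)                      # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # power spectrum per frame
    mel_power = power @ mel_filterbank(n_filters, n_fft, fs).T  # filter-bank energies
    log_mel = np.log(mel_power + 1e-10)                      # log compression
    return log_mel @ dct2_matrix(n_ceps, n_filters).T        # DCT gives the features

x = np.random.default_rng(0).standard_normal(16000)          # one second of noise as input
feats = cepstral_features(x)
print(feats.shape)
```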
0:03:57	now, reverberation affects
0:03:59	the input to each of these filters; it actually affects the signal, which is the equivalent
0:04:04	of
0:04:05	affecting the input to each of these filters, so you can actually model
0:04:08	reverberation in this manner, by the red blocks
0:04:12	and because of the linearity of the
0:04:14	convolution, the filtering that goes on over here,
0:04:18	you can swap
0:04:20	the initial analysis filters, which would be the mel-frequency filters, and the room response
0:04:25	so these two
0:04:27	are
0:04:28	strictly equivalent in terms of the effect on the features that are computed
0:04:34	from the signal
0:04:36	oh
0:04:37	here we introduce this
0:04:39	minor approximation
0:04:41	we say that
0:04:42	computing the power
0:04:45	of the reverberated signal is equivalent,
0:04:48	roughly,
0:04:49	to reverberating the sequence of power values that you get in every channel
0:04:54	and these filters over here, these H one to H M,
0:04:58	are simply the filters that you'd get if you
0:05:02	essentially, sample by sample, squared the
0:05:06	impulse response of the room
0:05:09	and
0:05:10	approximating it in this manner
0:05:12	by
0:05:13	flipping this order
0:05:15	is of course not perfect; it gives you some error, and the error is dependent on the autocorrelation of the signal
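The approximation just described (power of the reverberated signal is roughly the power sequence filtered by the squared room response) can be checked numerically with a small sketch. The signal here is gated white noise and the room response is synthetic decaying noise; both are assumptions chosen for illustration, as are the sample rate and smoothing window.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(2 * fs) / fs
envelope = ((t % 0.5) < 0.2).astype(float)        # 0.2 s bursts every 0.5 s
x = envelope * rng.standard_normal(t.size)         # stand-in for one clean sub-band signal

rt60 = 0.3
th = np.arange(int(0.5 * fs)) / fs
h = rng.standard_normal(th.size) * np.exp(-3.0 * np.log(10.0) * th / rt60)
h /= np.sqrt(np.sum(h ** 2))                       # unit-energy synthetic room response

def smooth(p, win=400):
    """Moving average, standing in for frame-level power computation."""
    return np.convolve(p, np.ones(win) / win, mode="same")

# Exact order: reverberate the signal, then compute power.
y = np.convolve(x, h)[: x.size]
power_true = smooth(y ** 2)

# Flipped order: compute power, then filter with the sample-by-sample squared response.
power_approx = smooth(np.convolve(x ** 2, h ** 2)[: x.size])

corr = np.corrcoef(power_true, power_approx)[0, 1]
rel_err = np.linalg.norm(power_true - power_approx) / np.linalg.norm(power_true)
print(f"correlation={corr:.3f}, relative error={rel_err:.3f}")
```

The two curves differ by the cross terms between different delays, which is where the dependence on the signal's autocorrelation comes from; for white noise those terms average out.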
0:05:20	here we have a plot which actually shows what kind of error it makes
0:05:25	the red line is the spectrum of a signal; this is actually the output of a
0:05:29	uh
0:05:30	mel-frequency filter centered at five hundred-odd hertz
0:05:35	now, in the room, we have actually reverberated the signal; in this case we applied,
0:05:39	i believe, uh,
0:05:41	a three hundred millisecond RT
0:05:43	reverberation
0:05:45	the output of the filter is shown by the green line
0:05:48	that's just
0:05:49	what you should get in this case
0:05:50	this is what you get
0:05:52	out
0:05:53	oh
0:05:55	using this approximate model, what we get out here
0:05:58	is shown by the blue line
0:06:00	and you can see that this approximation, which we get from flipping
0:06:04	the order of the reverberation and the power,
0:06:07	doesn't introduce very much error; in fact we have more quantitative results on that
0:06:12	you know
0:06:13	it turns out that applying this filter
0:06:16	to the power is different from applying this filter to the magnitude,
0:06:21	where you'd actually be taking the
0:06:22	square root of these terms at all points
0:06:25	and when you apply the filter to the power you introduce an error
0:06:29	which
0:06:30	results in some distortion in the output of the
0:06:36	uh, in the output of the
0:06:40	reverberation model
0:06:42	whereas if you apply it to the magnitude, the error is much smaller
0:06:46	so what we actually do in our model
0:06:48	is assume that reverberation is a filtering
0:06:52	that occurs on the magnitude
0:06:54	at the output
0:06:56	of each of your mel filters
0:06:59	so the process can be thought of like so: you have an
0:07:03	M-channel mel filter bank or its equivalent, you have a power or magnitude computation
0:07:08	and then you have the spectral filtering, which actually applies on that
0:07:12	to impose the effect of reverberation; we've expanded on it here
0:07:17	we have the
0:07:19	magnitude or power spectrum going into the room response to get the reverberated
0:07:24	magnitude or power, and then of course you have the log and the DCT
0:07:28	so what we have done is we have effectively
0:07:31	converted
0:07:32	a convolution on the signal, which is the room response,
0:07:36	to a convolution on the magnitude or power spectrum
0:07:40	and we only observe
0:07:42	the
0:07:43	reverberated sequence of power or magnitude values
0:07:46	and then, given just this,
0:07:48	the problem is to determine
0:07:50	both the room response
0:07:52	as well
0:07:54	as the clean signal itself
0:07:57	this is obviously an under-constrained problem, so we have to impose some constraints
0:08:03	and the constraint we're going to impose is going to say that
0:08:06	because we are dealing with magnitudes, all terms are nonnegative
0:08:10	in addition
0:08:12	you don't merely observe the
0:08:14	reverberant signal; you actually observe a noise-corrupted version of the reverberant signal
0:08:19	so what we will do is try to estimate the signal
0:08:23	and the room response
0:08:25	such that the error between the output of our model
0:08:29	and what we actually observe is minimised
0:08:33	with some sparsity constraints
0:08:35	on the spectrum
0:08:37	now, because
0:08:38	there is a scaling ambiguity in there, we also impose an additional constraint that these room response terms
0:08:44	sum to one
0:08:46	and this, it turns out, is simply a standard nonnegative matrix factorization problem
0:08:52	i won't actually go through the derivation here
0:08:54	but if you do, you'd find that you get update rules; it's an iterative solution which gives you
0:08:59	updates that are very similar to
0:09:01	nonnegative matrix factorization: you start off with an estimate
0:09:05	and at each iteration you get a multiplicative update
0:09:08	to these terms,
0:09:09	which ensures that they always stay
0:09:11	positive
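The constrained estimation problem described above can be sketched for a single channel as follows. This is a minimal sketch, assuming a squared-error cost with an L1 sparsity penalty on the clean sequence; the exact cost, update rules and parameter values used in the paper are not given in the talk, so the function name `nmf_dereverb_channel`, the tap count, the iteration count and the penalty weight are all hypothetical.

```python
import numpy as np

def nmf_dereverb_channel(y, k_taps=20, n_iter=200, sparsity=1e-3, eps=1e-12):
    """Estimate nonnegative x and h such that y[n] ~ sum_k h[k] x[n-k], with sum(h) = 1."""
    n = y.size
    x = y.copy() + eps                               # initial clean estimate: the observation
    h = np.exp(-np.arange(k_taps) / (k_taps / 4.0))  # initial response: decaying exponential
    h /= h.sum()
    for _ in range(n_iter):
        # Multiplicative update for x: ratio of correlations with the current h.
        y_hat = np.convolve(x, h)[:n]
        num = np.convolve(y, h[::-1])[k_taps - 1 : k_taps - 1 + n]
        den = np.convolve(y_hat, h[::-1])[k_taps - 1 : k_taps - 1 + n]
        x *= num / (den + sparsity + eps)            # sparsity term shrinks x toward zero
        # Multiplicative update for h: ratio of correlations with the current x.
        y_hat = np.convolve(x, h)[:n]
        num = np.convolve(y, x[::-1])[n - 1 : n - 1 + k_taps]
        den = np.convolve(y_hat, x[::-1])[n - 1 : n - 1 + k_taps]
        h *= num / (den + eps)
        # Resolve the scaling ambiguity: h sums to one, x absorbs the scale.
        scale = h.sum()
        h /= scale
        x *= scale
    return x, h

rng = np.random.default_rng(0)
x_true = rng.random(300) * (rng.random(300) < 0.2)   # sparse nonnegative "clean" sequence
h_true = np.exp(-np.arange(20) / 3.0)
h_true /= h_true.sum()
y = np.convolve(x_true, h_true)[:300]                # observed reverberated magnitudes
x_est, h_est = nmf_dereverb_channel(y)
print(np.linalg.norm(y - np.convolve(x_est, h_est)[:300]) / np.linalg.norm(y))
```

Because every update multiplies the current value by a ratio of nonnegative quantities, the estimates can never become negative, which is the property the speaker points out.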
0:09:13	apropos of this, the formulation we have here is not something that we're introducing in this paper
0:09:18	it has been proposed by Kameoka, and we also proposed it separately in a paper in, uh,
0:09:24	i believe, last year
0:09:26	this is the basic form of what we proposed
0:09:29	here is what we do: we have the standard short-time Fourier transform, then you compute the power
0:09:34	and the NMF decomposition, which is what we have here,
0:09:37	gives you an estimate of the clean spectrum
0:09:40	from which you can perform an overlap-add and
0:09:43	estimate the
0:09:44	clean signal
0:09:47	our contribution over here is that we're not going to work directly on this
0:09:51	power
0:09:52	instead
0:09:53	we actually apply a gammatone filter bank
0:09:56	so basically
0:09:57	the filter bank here is going to be a gammatone filter bank
0:10:01	and after having applied the gammatone filter bank we compute the
0:10:05	NMF decomposition
0:10:07	and then invert the filter bank and then perform the overlap-add
0:10:11	so the gammatone filter bank can be thought of as a dimensionality-reducing
0:10:14	linear operation
0:10:16	on the
0:10:17	power or the magnitude
0:10:19	and inverting it is simply going to be the equivalent of multiplying the output of the NMF by
0:10:23	the pseudo-inverse of this
0:10:25	gammatone filter bank matrix
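The idea of treating the gammatone filter bank as a dimensionality-reducing matrix and undoing it with a pseudo-inverse can be sketched as below. The weighting matrix uses a standard closed-form approximation of a fourth-order gammatone magnitude response with ERB-spaced centre frequencies; the channel count, frequency range and the name `gammatone_weights` are assumptions, and the per-channel NMF step is replaced by a placeholder.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_weights(n_channels=40, n_fft=512, fs=16000, f_lo=100.0, order=4):
    """Approximate gammatone magnitude responses sampled on the FFT bins."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    to_erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    from_erb_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    centres = from_erb_rate(
        np.linspace(to_erb_rate(f_lo), to_erb_rate(0.9 * fs / 2.0), n_channels)
    )
    b = 1.019 * erb(centres)                          # bandwidth parameter per channel
    detune = (freqs[None, :] - centres[:, None]) / b[:, None]
    return (1.0 + detune ** 2) ** (-order / 2.0)      # shape (n_channels, n_bins)

rng = np.random.default_rng(0)
spec = rng.random((257, 50))          # stand-in magnitude spectrogram, bins x frames

G = gammatone_weights()               # 257 frequency bins reduced to 40 channels
sub = G @ spec                        # sub-band magnitudes, 40 x 50
sub_processed = sub                   # placeholder: per-channel NMF would go here

G_pinv = np.linalg.pinv(G)            # pseudo-inverse maps channels back to bins
spec_back = np.maximum(G_pinv @ sub_processed, 0.0)   # keep magnitudes nonnegative
print(G.shape, sub.shape, spec_back.shape)
```

The reconstructed spectrogram would then be combined with the original phase and passed through overlap-add, which is not shown here.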
0:10:27	so as an example of what we get, this is a reverberated signal; again, i don't have
0:10:31	audio so
0:10:33	yeah
0:10:34	i this sort of
0:10:36	uh
0:10:38	a a a a a
0:10:43	yeah
0:10:44	it's a lot of maybe
0:10:47	uh by this is what we hard with the
0:10:54	H
0:10:55	she
0:10:56	so i don't know yeah
0:10:58	the the a signal
0:10:59	yeah
0:11:01	okay
0:11:03	yeah
0:11:04	take my word for it, the reverberation is very much reduced
0:11:07	uh
0:11:08	believe me
0:11:10	okay
0:11:11	so
0:11:12	given that we can actually do this,
0:11:14	when you actually listen to it, perception is a great thing, you can hear all
0:11:19	sorts of nice stuff
0:11:20	but when you put this signal into a recogniser,
0:11:24	do
0:11:25	the improvements show?
0:11:27	so here we ran some experiments on the Resource Management database
0:11:31	this is a model trained on clean speech, and here is what you get
0:11:34	when the signal is reverberated with a room response with a three hundred millisecond reverberation time
0:11:40	and the error goes down if you actually try to dereverberate it using the basic NMF
0:11:45	mechanism
0:11:46	proposed by Kameoka
0:11:49	and if this is applied on the power
0:11:52	it goes down a bit, but if you apply it on the magnitude it goes down a lot more
0:11:55	so
0:11:56	which shows that you're better off working on the magnitude
0:12:00	and then
0:12:02	oh
0:12:03	here
0:12:05	i
0:12:06	G she M S R be don't and of for nmf variance
0:12:09	again, when we apply the gammatone
0:12:12	NMF, so
0:12:13	in these cases
0:12:15	we assumed that the room response is the same in every
0:12:19	channel
0:12:20	of the processing, which is really the truth
0:12:23	but what happens is that because you are observing a corrupted version
0:12:27	of the signal
0:12:28	it does not make sense to assume that the room response is the same in every channel; that actually gives
0:12:32	you a bad estimate
0:12:34	so if you estimate a different room response at each channel
0:12:37	then you get some improvements, which are shown by these guys
0:12:41	and then if you actually apply the gammatone filtering
0:12:45	here is what you get when you work on the power, but if you work on the magnitude
0:12:49	you can see that if you assume that the room response is the same in all channels you get
0:12:54	this performance
0:12:55	if you allow it to be different for different channels
0:12:58	it improves performance again
0:12:59	so the gist of it is that
0:13:01	dereverberating the signal
0:13:03	after gammatone filtering, and then
0:13:05	inverse filtering it
0:13:07	and performing the overlap-add
0:13:09	but
0:13:10	yeah
0:13:12	gives a dereverberated signal that results in error rates which are less than half of what you'd get
0:13:16	when there is no enhancement
0:13:19	we ran this on a few other
0:13:21	test sets; this was where we had a three hundred millisecond reverberation time
0:13:26	this is with
0:13:26	three hundred and five hundred
0:13:28	and we compared it with a bunch of other techniques, which i won't bother to explain
0:13:32	as it would take time
0:13:34	again, the gist is that it was better
0:13:38	now
0:13:40	we are actually making a modelling assumption over here, namely that you can flip
0:13:44	i
0:13:45	the power computation and the room response
0:13:48	and then perform the dereverberation
0:13:51	how would this procedure
0:13:53	behave
0:13:54	if that were not an approximation but were truly what happened?
0:14:00	so in this experiment we actually sort of
0:14:02	faked reverberation by
0:14:04	applying the reverberation to a sequence of power values, and then
0:14:08	tried to dereverberate the signal, and you can see
0:14:11	the ones you're interested in are not
0:14:13	these, uh, kind of
0:14:15	spurious values, which are there because
0:14:17	the plot came out of Kshitiz's thesis
0:14:19	and you can see that when the
0:14:22	model actually holds, the improvements you get can be very very large
0:14:28	all of the stuff we showed earlier was on fake room responses, where the
0:14:33	room response was computed using the image method
0:14:36	so we also applied some true room responses obtained from ATR; this is a room response with
0:14:41	a four hundred and seventy millisecond response time, and this is six hundred milliseconds, and again
0:14:45	the improvements are there
0:14:47	now, one of the things that everybody knows
0:14:49	is that the
0:14:51	setup i have shown so far
0:14:53	is unrealistic:
0:14:55	a speech recognition system is never trained on clean data; you actually train it on matched data,
0:15:01	the kind of data that you actually expect to recognise
0:15:05	so
0:15:06	does all of this hold when you perform matched-condition training?
0:15:10	sure enough, you observe that the improvements you get
0:15:13	from dereverberating the signal
0:15:16	if you train the recogniser on clean speech
0:15:20	do not even begin to get to the kind of performance you get if you simply train the recogniser on
0:15:26	reverberated speech
0:15:27	this is the performance you get, the yellow bars in each case
0:15:31	but even here,
0:15:32	if you dereverberate both the training and test data using our technique, you get an additional
0:15:36	improvement which is about
0:15:38	twenty to forty percent relative, i believe
0:15:40	if i can find my at all
0:15:43	it helps in every case
0:15:46	the truth of the matter is you don't merely have reverberation
0:15:50	you also have additive noise
0:15:51	so the reverberated signal gets corrupted by additive noise, as explained here
0:15:56	so we can modify this whole process that i've just shown
0:16:00	with
0:16:00	some additional processing to compensate for the noise
0:16:04	in other work that was presented yesterday
0:16:08	we presented something called
0:16:10	delta-spectral
0:16:12	cepstral coefficients, DSCC
0:16:15	and uh
0:16:16	so this is a procedure that,
0:16:18	in a in a in a
0:16:20	in summary:
0:16:21	in the DSCC computation
0:16:23	instead of directly working with the magnitude spectra and compressing the magnitude spectra,
0:16:28	you compute differences between adjacent magnitude spectra; what this does is that
0:16:33	stationary signals get cancelled out
0:16:36	that things that maybe even a little bit
0:16:38	the
0:16:39	as it turns out, speech is
0:16:41	nowhere near as stationary as the noise that corrupts it
0:16:44	so as a result, simply performing this
0:16:47	difference operation and working only on the differenced magnitude spectra
0:16:53	gives you robust performance
0:16:54	the differenced spectra now have both positive and negative values, so we actually have to sort
0:16:58	of normalise their distribution
0:17:00	and then, without any compression whatsoever,
0:17:03	just apply a DCT
0:17:05	and then perform additional operations as before
0:17:08	and this output is what we used for recognition
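The delta-spectral computation summarised above (difference adjacent spectra, normalise the distribution, apply a DCT with no compression) might look roughly like the sketch below. The talk only says the distribution is "normalised", so the rank-based Gaussianization used here is an assumption, as are the one-frame lag, the number of coefficients, and the names `gaussianize` and `delta_spectral_features`.

```python
import numpy as np
from statistics import NormalDist

def gaussianize(v):
    """Map values to a standard normal shape by rank (assumed normalisation)."""
    ranks = np.argsort(np.argsort(v))             # rank of each value, 0 .. n-1
    probs = (ranks + 0.5) / v.size                # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])

def delta_spectral_features(mel_spec, lag=1, n_ceps=13):
    """mel_spec: (n_frames, n_channels) uncompressed mel spectra."""
    # Temporal difference: components that do not change over time cancel out.
    delta = mel_spec[lag:] - mel_spec[:-lag]
    # Differences are signed, so normalise each channel's distribution instead of taking a log.
    normed = np.apply_along_axis(gaussianize, 0, delta)
    # DCT-II across channels, with no compression step beforehand.
    n_ch = mel_spec.shape[1]
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_ch)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_ch))
    return normed @ dct.T

rng = np.random.default_rng(0)
mel = rng.random((200, 40))                       # stand-in mel spectra
feats_clean = delta_spectral_features(mel)
feats_offset = delta_spectral_features(mel + 8.0) # add a perfectly stationary component
print(feats_clean.shape, np.allclose(feats_clean, feats_offset))
```

The last line illustrates the point made in the talk: a component that is constant over time has no effect on the features, because it disappears in the difference.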
0:17:11	and we showed in our poster that using that
0:17:14	feature
0:17:15	you get much more robust recognition than what you get with just regular
0:17:19	mel-frequency cepstra
0:17:21	so it turns out that the dereverberation that i just talked about can be
0:17:25	combined
0:17:27	with this delta-spectral cepstral coefficient computation
0:17:31	so you actually dereverberate the signal first and then
0:17:34	you compute the mel spectra (these are not the log mel spectra), compute differences, and then compute features
0:17:41	and then you perform recognition on those features, and sure enough you can see that
0:17:46	firstly
0:17:47	when all you have is a reverberated signal, it makes things just marginally worse
0:17:51	but the moment you begin adding noise
0:17:54	the blue has a lot of improvement; so this is all on
0:17:57	a a a a a room
0:17:59	but millisecond room response
0:18:01	and the blue line shows the performance you get
0:18:05	when you dereverberate the signal and
0:18:07	then use the DSCC features
0:18:10	in each of these we've got noise of two different levels
0:18:13	corrupting the signal
0:18:15	and you can see that the best performance by far is what you get
0:18:19	when you dereverberate the signal and
0:18:22	then perform the DSCC computation
0:18:25	so in summary
0:18:26	we modelled reverberation
0:18:28	on speech spectra
0:18:30	we modelled it as
0:18:32	a phenomenon that affects the sequence of magnitude spectra
0:18:36	we used NMF
0:18:37	to factorise this to perform dereverberation
0:18:41	we also used gammatone sub-band non-negative matrix factorization
0:18:44	so that you have a perceptual weighting on this
0:18:47	in perceptual weighting
0:18:49	and we compared magnitude and power domain processing here
0:18:53	and studied the joint
0:18:55	noise and reverberation problem by integrating a noise-robust feature along with
0:19:00	dereverberation, and got significant improvements
0:19:03	thank you
0:19:09	well
0:19:10	the is to last talk and do used to have time to
0:19:14	change
0:19:17	yes
0:19:40	hmmm
0:19:47	i
0:19:48	a
0:19:49	oh
0:19:51	a
0:19:52	oh
0:19:53	i
0:19:54	i
0:19:56	i
0:20:00	oh
0:20:02	one
0:20:03	i
0:20:04	a
0:20:06	oh
0:20:07	and
0:20:08	yeah
0:20:16	uh
0:20:36	yeah
0:20:38	oh
0:20:59	yeah
0:21:00	okay
0:21:01	i
0:21:04	i
0:21:04	yeah
0:21:06	yeah
0:21:10	yeah
0:21:16	a question
0:21:19	and yeah
0:21:21	yes
0:21:30	a
0:21:47	a
0:21:49	yeah
0:21:50	okay
0:21:51	and
0:21:54	yeah
0:21:54	my
0:21:54	okay
0:21:56	i
0:22:01	i
0:22:05	oh
0:22:06	oh
0:22:10	oh
0:22:10	if
0:22:18	a
0:22:24	so
0:22:25	and
0:22:26	to i have the last question i have to prove
0:22:29	uh talk about
0:22:32	some sort of has to X and H
0:22:37	and
0:22:38	just just what don't have it is but have a bribe and was once had to it
0:22:44	that's perhaps
0:22:47	the so the optimized is not have to do that
0:22:52	situation
0:22:54	oh
0:22:55	i
0:22:55	i
0:22:56	vol
0:22:59	a
0:23:01	yeah
0:23:04	yeah
0:23:09	oh
0:23:10	a
0:23:14	you
0:23:16	yeah
0:23:16	for
0:23:18	i
0:23:20	i
0:23:23	uh_huh
0:23:24	and
0:23:25	i
0:23:26	i how to do this and the this one
0:23:29	experimental to have
0:23:33	yeah
0:23:36	well
0:23:36	the questions
0:23:40	well i guess so
0:23:41	people a a a a i want to
0:23:46	as the i'm to say of this paper is and in particular
0:23:50	do
0:23:51	so two and and sent much

GAMMATONE SUB-BAND MAGNITUDE-DOMAIN DEREVERBERATION FOR ASR

Robust ASR

Presenter: Bhiksha Raj, Authors: Kshitiz Kumar, Rita Singh, Carnegie Mellon University, United States; Bhiksha Raj, Disney Research, United States; Richard Stern, Carnegie Mellon University, United States