Speech transcript - MODEL-BASED SPEECH ENHANCEMENT USING SNR DEPENDENT MMSE ESTIMATION

0:00:14	Okay.
0:00:15	Thank you.
0:00:16	And so I come to this very last presentation,
0:00:20	on model-based
0:00:21	speech enhancement.
0:00:22	At first
0:00:24	I would like to give the outline of the talk.
0:00:27	I will start with a short
0:00:28	introduction.
0:00:29	After that I will give a brief overview of the
0:00:33	model-based noise reduction
0:00:35	used here
0:00:36	for speech enhancement,
0:00:38	and I will also present our SNR-dependent
0:00:41	MMSE estimator,
0:00:42	where SNR-dependent means that we have different MMSE estimators and the input SNR
0:00:48	decides which one
0:00:50	we choose.
0:00:51	Then I will show some test results, give a short demonstration, and finish with the summary.
0:00:59	Okay, first let me introduce
0:01:00	the notation we use
0:01:02	in this presentation.
0:01:04	We are considering, for example, a scenario where we record
0:01:08	the noisy microphone signal
0:01:10	y,
0:01:11	which consists
0:01:12	of a clean
0:01:14	speech signal s that is additively
0:01:16	disturbed
0:01:17	by a noise signal n.
0:01:20	In the frequency domain we use this representation here, where we use capital letters,
0:01:24	frame index lambda, and
0:01:26	frequency index
0:01:27	mu.
0:01:28	All estimates in the following are denoted
0:01:31	by a hat,
0:01:32	for example here
0:01:33	at the output of our noise suppression,
0:01:36	the enhanced
0:01:37	speech signal.
0:01:39	Okay.
0:01:41	In the literature, so-called statistical noise reduction
0:01:45	approaches are often used for the purpose of
0:01:48	speech enhancement,
0:01:49	among them, for example, the Wiener filter or the weighting rules by Ephraim
0:01:55	and Malah.
0:01:56	These techniques usually assume a certain distribution for the speech and the noise signal, for example a Gaussian
0:02:02	or Laplace PDF,
0:02:04	and apply mathematical criteria like MMSE, maximum likelihood, or
0:02:09	MAP in order to
0:02:11	estimate the speech signal.
0:02:13	So the estimation here is based on memoryless a priori knowledge.
0:02:17	In contrast,
0:02:18	the advantage of model-based approaches is that they can additionally consider correlation
0:02:24	across time and/or frequency,
0:02:26	for example by using a specific model of the speech signal.
0:02:32	So here
0:02:32	we can exploit a priori information of higher order.
0:02:38	One example of such a model-based approach is the modified
0:02:41	Kalman filter that I will present in the following.
0:02:45	The system consists of
0:02:46	two steps. In the first
0:02:47	step, called propagation step, we try to exploit
0:02:51	the temporal correlation of
0:02:52	speech DFT coefficients.
0:02:54	This is illustrated here:
0:02:56	we are working in the frequency domain and use the previous
0:03:02	N_K enhanced
0:03:03	speech coefficients
0:03:05	for one specific frequency bin
0:03:08	in order to predict
0:03:09	the current speech
0:03:10	coefficient.
0:03:12	For this we use conventional linear prediction techniques based on an AR model of order N_K
0:03:18	and the AR coefficients a, which have to be known or estimated.
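A minimal sketch of the propagation step for a single frequency bin, assuming the prediction is a weighted sum of the previous enhanced coefficients. The function name `propagate`, the history ordering (most recent frame first), and the sign convention of the AR coefficients are illustrative assumptions and are not taken from the talk.

```python
def propagate(prev_enhanced, ar_coeffs):
    """Predict the current DFT coefficient of one frequency bin.

    prev_enhanced: previous enhanced (complex) coefficients of this bin,
                   most recent frame first (assumed ordering).
    ar_coeffs:     AR coefficients a_1..a_p, same length as prev_enhanced.
    """
    if len(prev_enhanced) != len(ar_coeffs):
        raise ValueError("history length must equal the model order")
    # Linear prediction: weighted sum of the past enhanced coefficients.
    return sum(a * s for a, s in zip(ar_coeffs, prev_enhanced))


# Example with model order 3 (the speech model order quoted later in the talk).
history = [1 + 1j, 2 + 0j, 0j]
coeffs = [0.5, 0.25, 0.1]
prediction = propagate(history, coeffs)
```

In a full system this would be evaluated independently for every frequency bin of the current frame.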
0:03:24	In the second step, called update step, we then
0:03:27	only have to estimate the prediction error that we made in the first
0:03:31	step.
0:03:32	This prediction error is denoted
0:03:34	in the following as
0:03:35	E_S.
0:03:37	And in order to
0:03:38	estimate this prediction error, we consider the difference signal D, which is
0:03:43	the noisy input coefficient
0:03:45	minus the first
0:03:46	prediction.
0:03:47	And as we will see later,
0:03:49	we perform here,
0:03:50	in order to estimate E_S, spectral weighting of the difference signal
0:03:56	D
0:03:59	by a weighting gain G.
0:04:03	Once we have estimated E_S, we can then update
0:04:06	our first
0:04:07	prediction
0:04:09	and finally get the enhanced
0:04:11	speech coefficient.
0:04:13	Such a
0:04:14	Kalman filter system is
0:04:17	applied
0:04:18	separately for each frequency
0:04:21	bin, and then finally we
0:04:23	transform
0:04:25	the whole frame back into the time domain.
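The speech-only update step described above can be sketched as follows. This is an illustration under assumptions: the names `update` and `update_frame` are hypothetical, and the weighting gains are simply passed in rather than computed.

```python
def update(y, s_pred, gain):
    """Update step for one frequency bin (speech-only variant).

    y:      noisy input DFT coefficient
    s_pred: first speech prediction from the propagation step
    gain:   weighting gain G, assumed to lie in [0, 1]
    """
    d = y - s_pred          # difference signal D
    e_s_hat = gain * d      # estimated speech prediction error E_S
    return s_pred + e_s_hat  # updated (enhanced) speech coefficient


def update_frame(noisy_bins, predicted_bins, gains):
    """Apply the update separately to every frequency bin of one frame."""
    return [update(y, p, g) for y, p, g in zip(noisy_bins, predicted_bins, gains)]


enhanced = update_frame([2 + 0j, 1j], [1 + 0j, 0j], [0.5, 1.0])
```

With G = 1 the output equals the noisy input, and with G = 0 it equals the pure prediction, which matches the role of G as a weighting between the two.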
0:04:28	Our system can be extended to noise signals in order to also exploit possible correlation of
0:04:35	noise signals.
0:04:37	Therefore we apply the propagation step
0:04:40	also to the
0:04:42	noise signal,
0:04:43	where we use the previous M_K
0:04:46	enhanced
0:04:47	noise estimates from the
0:04:50	past in order to
0:04:52	predict
0:04:53	the current noise
0:04:54	coefficient.
0:04:55	In the update step, then,
0:04:57	we have to estimate two prediction errors:
0:04:59	that of the speech signal, E_S, and that of the
0:05:02	noise signal,
0:05:03	E_N.
0:05:04	So let's have a
0:05:05	closer
0:05:06	look at this problem.
0:05:09	So the objective here in the update step, as I just mentioned, is
0:05:12	to estimate E_S and E_N
0:05:16	based on the difference signal D.
0:05:18	In this case D is given as the noisy input coefficient,
0:05:22	S plus N, minus
0:05:24	the first
0:05:24	speech prediction
0:05:26	minus
0:05:26	the first noise prediction.
0:05:28	And this expression can also be
0:05:31	stated as the sum of the two prediction errors, E_S
0:05:34	plus E_N.
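Written out, the step from the difference signal to the sum of the two prediction errors looks as follows. The tilde notation for the first speech and noise predictions is an assumption made here for illustration; the frame and frequency indices are omitted.

```latex
\begin{align}
Y &= S + N \\
D &= Y - \tilde{S} - \tilde{N} \\
  &= (S - \tilde{S}) + (N - \tilde{N}) \\
  &= E_S + E_N
\end{align}
```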
0:05:36	So we have
0:05:39	a classical noise reduction problem now in the update step:
0:05:42	we have a target signal E_S that we want to estimate,
0:05:44	which is
0:05:45	disturbed by an additive noise signal E_N, and we have access only to the
0:05:49	noisy difference signal D.
0:05:52	And this allows us to use here any conventional statistical estimator which is adapted
0:05:57	to the statistics of E_S
0:05:59	and E_N.
0:06:00	And we can now perform the spectral weighting
0:06:03	of the
0:06:04	difference signal
0:06:05	by a weighting gain G in order to
0:06:07	estimate E_S,
0:06:08	or
0:06:09	by one minus G in order to estimate E_N.
0:06:13	In order to obtain these
0:06:14	weighting gains G,
0:06:16	the original
0:06:17	Kalman filter
0:06:19	approach assumes
0:06:21	a Gaussian PDF
0:06:22	for E_S and E_N and minimizes the mean square error between E_S and its estimate,
0:06:27	which leads to the well-known
0:06:29	Wiener solution for the weighting gain G, as can be seen here.
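The slide with the gain formula is not part of the transcript. As a hedged sketch, the standard Wiener gain for two additive, uncorrelated components is the ratio of the target variance to the total variance. The function names and the variance arguments `var_es` and `var_en` are assumptions for illustration.

```python
def wiener_gain(var_es, var_en):
    """Standard Wiener gain for a target E_S disturbed by additive E_N.

    var_es, var_en: variances (powers) of the speech and noise
                    prediction errors, assumed to be known or estimated.
    """
    total = var_es + var_en
    return var_es / total if total > 0.0 else 0.0


def split_difference(d, gain):
    """Weight D by G to estimate E_S and by (1 - G) to estimate E_N."""
    return gain * d, (1.0 - gain) * d


g = wiener_gain(3.0, 1.0)
e_s_hat, e_n_hat = split_difference(2 + 2j, g)
```

Because the two weights sum to one, the two estimates always add up to the difference signal D again.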
0:06:34	However,
0:06:35	we measured the statistics of the speech prediction error signal E_S,
0:06:39	and the
0:06:40	distribution of E_S
0:06:41	is not Gaussian
0:06:42	but
0:06:44	supergaussian,
0:06:45	as we showed at ICASSP in 2008.
0:06:49	And this fact can also be exploited in the update step if we do not
0:06:53	use the Wiener filter but a
0:06:55	statistical estimator which can be adapted to the measured statistics,
0:06:59	for example the MMSE estimator
0:07:01	by Erkelens and his
0:07:03	colleagues,
0:07:04	which assumes a generalized gamma distribution for
0:07:09	the target signal.
0:07:11	So far we measured the
0:07:13	PDF of E_S
0:07:15	over an SNR range and averaged the results at the end,
0:07:19	so at the end we had
0:07:21	one single histogram.
0:07:23	In this contribution we have now performed an SNR-dependent
0:07:26	measurement
0:07:27	of the
0:07:28	statistics.
0:07:29	Therefore we disturbed our speech signals by white Gaussian
0:07:35	noise at different input SNR values and measured
0:07:38	the histograms.
0:07:40	The result can be seen here: the normalized
0:07:42	PDF
0:07:43	for the magnitude of E_S depending on the input SNR, which varies from minus
0:07:49	twenty to thirty-five
0:07:50	dB.
0:07:52	And you can clearly see that
0:07:53	the
0:07:54	input SNR
0:07:56	has influence on the histograms:
0:07:59	the higher the input SNR, the higher the probability that only small
0:08:04	prediction errors
0:08:05	occur.
0:08:06	And
0:08:07	this fact can now also be
0:08:09	exploited in our system
0:08:11	if
0:08:11	we use an SNR-dependent MMSE estimator
0:08:14	in the update step.
0:08:16	For this
0:08:17	we use the MMSE estimator by Erkelens mentioned before, which is now adapted
0:08:22	to each of the histograms we have just
0:08:26	seen.
0:08:26	So
0:08:27	for each quantized SNR value, with a step size of five dB, we
0:08:33	use here a different MMSE estimator,
0:08:36	so our gain
0:08:37	now also depends
0:08:40	on the input SNR.
0:08:42	And
0:08:43	in order to estimate this input SNR in our system, we simply use
0:08:47	enhanced speech and noise
0:08:49	coefficients from
0:08:50	previous frames.
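A sketch of how the SNR-dependent selection could work. Only the 5 dB step size and the measured range from minus twenty to thirty-five dB come from the talk. The power-ratio SNR estimate, the clipping at the range limits, the function names, and the idea of one stored table per SNR class are assumptions for illustration.

```python
import math

SNR_MIN_DB, SNR_MAX_DB, SNR_STEP_DB = -20.0, 35.0, 5.0
NUM_CLASSES = int((SNR_MAX_DB - SNR_MIN_DB) / SNR_STEP_DB) + 1  # 12 classes


def estimate_input_snr_db(prev_speech, prev_noise, floor=1e-12):
    """Input SNR in dB from enhanced speech and noise coefficients
    of previous frames (simple power ratio, assumed here)."""
    p_s = sum(abs(s) ** 2 for s in prev_speech)
    p_n = sum(abs(n) ** 2 for n in prev_noise)
    return 10.0 * math.log10(max(p_s, floor) / max(p_n, floor))


def snr_class(snr_db):
    """Quantize an SNR value to the nearest 5 dB step, clipped to the range."""
    clipped = min(max(snr_db, SNR_MIN_DB), SNR_MAX_DB)
    return int(math.floor((clipped - SNR_MIN_DB) / SNR_STEP_DB + 0.5))


def select_gain_table(tables, snr_db):
    """Pick the lookup table of the estimator adapted to this SNR class.
    `tables` is a hypothetical list with one entry per SNR class."""
    return tables[snr_class(snr_db)]


snr = estimate_input_snr_db([1 + 0j, 1j], [0.1 + 0j, 0.1j])
table = select_gain_table(list(range(NUM_CLASSES)), snr)
```

Storing one table per class is what drives the additional memory requirement mentioned in the talk.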
0:08:52	With
0:08:53	such a system we of course increase
0:08:55	the computational complexity and the memory requirements compared to a
0:09:00	conventional statistical estimator.
0:09:02	Compared to the Wiener filter, for example, we increase the complexity by a factor of six,
0:09:08	roughly.
0:09:09	And additionally we of course have to store previous frames for the prediction part
0:09:15	and a lookup table for each
0:09:20	MMSE estimator.
0:09:22	Okay, before I come
0:09:23	to the results, some more
0:09:25	system settings.
0:09:26	We use here relatively low model orders: a model order of
0:09:29	three for the speech signal
0:09:31	and a model order of two for the noise signal.
0:09:35	The AR coefficients are estimated in each frame using the Levinson-Durbin algorithm, which is applied to
0:09:40	estimates from
0:09:42	previous frames,
0:09:43	and we use minimum statistics in the update step for
0:09:46	the noise power.
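For reference, a compact sketch of the Levinson-Durbin recursion, written so that it also accepts complex autocorrelation values. It assumes the autocorrelation sequence `r` has already been estimated from previous enhanced coefficients, which is not shown. The function name and the convention that the predictor is the sum of a_i times the i-th past value are assumptions.

```python
def levinson_durbin(r, order):
    """Solve the normal equations for the AR coefficients a_1..a_order.

    r: autocorrelation values r[0..order], with r[0] real and positive
       and r[k] = E[x(n) * conj(x(n - k))] (assumed convention).
    Returns (coefficients, final prediction error power).
    """
    a = []
    err = abs(r[0])
    for m in range(1, order + 1):
        # Reflection coefficient of stage m.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients and append the new one.
        a = [a[i] - k * a[m - 2 - i].conjugate() for i in range(m - 1)] + [k]
        err *= 1.0 - abs(k) ** 2
    return a, err


# Autocorrelation of a first-order process with coefficient 0.5:
# the second coefficient should come out as zero.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

With the low model orders quoted in the talk (three and two), the recursion costs only a handful of operations per bin and frame.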
0:09:49	What can we achieve with
0:09:50	such a system? At first,
0:09:52	objective measurements averaged over five
0:09:55	different noise signals.
0:09:57	You see the segmental speech SNR
0:09:59	plotted over the noise attenuation, with the input SNR
0:10:03	varying from minus
0:10:06	ten to thirty-five
0:10:07	dB.
0:10:09	The objective here is to achieve a high noise attenuation and a high
0:10:13	segmental speech SNR, so the more these curves are placed in the upper right corner,
0:10:17	the better the performance.
0:10:19	In
0:10:21	blue and red you see the results of two purely statistical estimators,
0:10:25	the Wiener filter and a Laplace MMSE estimator,
0:10:28	which assumes a Laplace distribution for the speech signal.
0:10:32	And in green and
0:10:34	light blue you see the
0:10:36	two proposed
0:10:37	Kalman filter approaches:
0:10:39	in green, where we use the
0:10:42	SNR-independent MMSE estimator in the update step,
0:10:45	and in
0:10:47	light
0:10:48	blue the new approach with the
0:10:50	SNR-dependent MMSE estimator.
0:10:53	And overall you can see here that
0:10:55	with the
0:10:56	two Kalman filter approaches we outperform
0:11:00	the two statistical estimators.
0:11:03	Look here, for example, at an input SNR of
0:11:05	five dB: if we keep here
0:11:08	the segmental speech SNR
0:11:10	constant, we achieve a
0:11:12	much higher
0:11:13	noise attenuation
0:11:15	with the two model-based approaches,
0:11:17	and we gain here
0:11:19	two to three
0:11:20	dB of noise attenuation
0:11:22	if we compare the Wiener filter and the new SNR-dependent
0:11:25	Kalman filter.
0:11:27	I would also like to give
0:11:28	you a short demonstration
0:11:30	of these four
0:11:33	investigated techniques.
0:11:34	At first
0:11:35	I will play the noisy signal,
0:11:37	then
0:11:37	the enhanced signals by the Wiener filter and the Laplace
0:11:41	MMSE estimator,
0:11:43	then the two Kalman filter approaches, and at last once again
0:11:46	[unclear]
0:11:53	[audio demonstration]
0:12:43	So you could hear that we achieve with the new proposed
0:12:47	SNR-dependent Kalman filter the highest noise attenuation
0:12:50	while achieving almost the
0:12:51	same speech quality as
0:12:53	the others.
0:12:55	Additional objective measurements showing
0:12:58	similar behavior can be found in the paper.
0:13:02	In the meantime we also conducted
0:13:04	an informal listening test,
0:13:06	which cannot be found in the paper.
0:13:10	On the left side we compared two estimators which were not
0:13:13	adapted to the measured statistics, that is, the SNR-independent Kalman filter
0:13:19	with the
0:13:20	Wiener filter.
0:13:22	And on the right side we compared two estimators which are explicitly
0:13:27	adapted to measured
0:13:28	statistics:
0:13:30	here we compared
0:13:31	the SNR-dependent Kalman filter with the supergaussian MMSE estimator.
0:13:36	So we had nineteen test persons
0:13:39	who judged the overall quality of the
0:13:42	respective
0:13:43	techniques.
0:13:45	And in both figures
0:13:48	you can see a clear preference for the new proposed
0:13:52	Kalman filter.
0:13:55	Okay, just to summarize:
0:13:56	we presented here a modified Kalman filter approach which is
0:13:59	able to
0:14:00	exploit temporal correlation of
0:14:02	speech and noise
0:14:03	signals.
0:14:05	We showed that in the update step the input SNR has influence
0:14:08	on the statistics of the speech prediction error signal.
0:14:12	This fact can be exploited
0:14:14	by using an SNR-dependent MMSE estimator which is adapted to the measured
0:14:20	histograms
0:14:21	in the update step.
0:14:23	And we showed in objective and subjective
0:14:25	evaluations that
0:14:27	we can improve
0:14:28	the results of the statistical estimators.
0:14:32	Thank you.
0:14:37	First question?
0:14:46	It's hard to hear, but I thought I detected an increased amount of musical noise in
0:14:53	the last examples that you played. Can you comment, please? Yeah, that's true.
0:14:59	I mean, there is a trade-off between noise attenuation, speech distortion, and musical tones,
0:15:03	and
0:15:04	in the first two aspects we are better,
0:15:07	but we unfortunately slightly increase
0:15:10	the amount of musical
0:15:11	noise there. Well, you could use some post-processing
0:15:15	techniques to reduce
0:15:16	the remaining musical noise.
0:15:23	John?
0:15:26	In the plot that you showed, I think you had four different types of noise that you were
0:15:31	looking at.
0:15:32	[unclear]
0:15:40	Five different types of noise,
0:15:41	averaged across all five.
0:15:44	Yeah.
0:15:44	And what you played, was that supposed to be the white Gaussian noise example? That was the factory noise.
0:15:50	Okay.
0:15:50	Oh, just a quick question.
0:15:53	The order term, you set that equal to three for speech and two for noise.
0:15:58	Did you try training that with different types of speech,
0:16:01	like vowels or consonants, or was that averaged over a large database of speech? Yeah,
0:16:08	[unclear]
0:16:11	The values you set for this,
0:16:14	does that make a difference for the different types of noise?
0:16:18	Yeah, it depends on the noise,
0:16:21	of course, on
0:16:22	the type of noise:
0:16:24	less for white Gaussian noise
0:16:26	and more, of course, for babble noise.
0:16:33	Okay, I
0:16:34	don't see further comments,
0:16:36	so I would like to thank
0:16:38	all
0:16:39	the speakers of the session.

Speech Enhancement

Speaker: Thomas Esch, Authors: Thomas Esch, Peter Vary, RWTH Aachen University, Germany