Speech transcript - A FLEXIBLE SPEECH DISTORTION WEIGHTED MULTI-CHANNEL WIENER FILTER FOR NOISE REDUCTION IN HEARING AIDS

0:00:13	thank you for the introduction
0:00:15	so
0:00:16	this is right away um the outline
0:00:19	um i'm going to make a short introduction
0:00:22	giving a problem statement
0:00:23	um and then i want to
0:00:25	introduce the uh
0:00:27	the speech distortion weighted multichannel wiener filter
0:00:30	and then
0:00:32	introduce
0:00:33	also very shortly the conditional speech presence probability
0:00:36	which um
0:00:38	is the basis for the solution that we are going to propose
0:00:41	and finally move on to the work
0:00:43	just to give a
0:00:44	short note on the
0:00:46	hearing loss problem
0:00:48	so some common causes of hearing loss can be
0:00:51	age related or
0:00:52	exposure to noise
0:00:54	or listening to loud music for
0:00:57	a long time
0:00:58	so these are
0:00:59	effects that can
0:01:00	affect all of us
0:01:02	but more on the consequences
0:01:04	if you have a hearing loss
0:01:06	is uh
0:01:08	you have a reduced frequency resolution and temporal resolution so
0:01:12	you have difficulty distinguishing between
0:01:15	different sounds
0:01:16	at different frequencies
0:01:18	you also have
0:01:19	problems with localizing sounds
0:01:21	and it's a problem of course um
0:01:23	when a
0:01:24	when
0:01:25	hearing aid user is
0:01:27	is in a
0:01:28	noisy environment
0:01:29	possibly with multiple speakers or any kind of noise
0:01:33	and also
0:01:34	a problem can be
0:01:35	reverberation
0:01:37	so for this
0:01:38	reason there have
0:01:39	in the past been
0:01:40	many multi-microphone structures proposed
0:01:42	uh such as directional microphones
0:01:45	various adaptive beamformers
0:01:47	however this work will focus on the multichannel wiener filter
0:01:51	so
0:01:52	basically the idea of all approaches is to
0:01:54	find a set of filter coefficients
0:01:57	so that
0:01:59	you can reduce the noise and minimize the speech distortion
0:02:03	and the overall goal of course is to
0:02:05	improve the um
0:02:06	intelligibility
0:02:08	so
0:02:09	we start by defining the uh microphone signals so you have a
0:02:13	speech signal
0:02:14	and uh
0:02:15	additive noise contribution where
0:02:18	is the frequency index and
0:02:19	it is the frame index
0:02:21	in this case we will model the uh two microphone setup
0:02:25	so yeah um
0:02:28	the MSE criterion we formulate like this so we want to find a set of
0:02:31	filter coefficients that minimize
0:02:33	the difference between the
0:02:35	desired speech component
0:02:37	and the filtered
0:02:38	version of the
0:02:40	noisy signals
0:02:42	so basically we choose to estimate the
0:02:44	the speech component in the first microphone so that would be the front microphone of the hearing aid
0:02:49	so
0:02:50	an extension of this is if we assume that the speech and noise
0:02:53	uh are statistically independent
0:02:55	we can formulate
0:02:57	the MSE criterion in this way so the first term corresponds to
0:03:00	a speech distortion term
0:03:02	and the second term corresponds to the
0:03:04	the residual noise
0:03:07	and
0:03:08	then the formulation
0:03:10	can be like this
0:03:11	so basically you have the estimated
0:03:13	speech correlation matrix and then
0:03:15	the noise only correlation matrix
0:03:18	weighted by a certain factor
0:03:19	which corresponds to um
0:03:21	a trade-off factor
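For reference, the criterion and filter described in this passage are commonly written as below in the SDW-MWF literature. The slides are not part of this transcript, so all symbols are assumptions: X^s and X^n are the stacked speech and noise components of the microphone signals, X_1^s is the speech component in the first (front) microphone, W is the filter, mu is the weighting factor, R_s and R_n are the speech and noise-only correlation matrices, and e_1 selects the first microphone.

```latex
J(\mathbf{W}) = E\{ |X_1^s - \mathbf{W}^H \mathbf{X}^s|^2 \}
              + \mu \, E\{ |\mathbf{W}^H \mathbf{X}^n|^2 \}

\mathbf{W} = (\mathbf{R}_s + \mu \mathbf{R}_n)^{-1} \mathbf{R}_s \mathbf{e}_1
```

Setting the gradient of J to zero gives R_s W + mu R_n W = R_s e_1, which is where the second line comes from. A larger mu puts more emphasis on the residual noise term, at the price of more speech distortion.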
0:03:23	so at this point we can
0:03:24	see that the MWF basically
0:03:28	is based on the correlation matrices
0:03:31	so we will show of course some details and
0:03:32	the problems involved in
0:03:34	estimating these
0:03:35	contributions
0:03:37	so
0:03:38	in general to estimate the uh
0:03:41	the basic idea is to estimate the noise
0:03:43	only correlation matrix
0:03:45	and the speech plus noise
0:03:47	correlation matrix
0:03:48	so they're the
0:03:50	a speech
0:03:51	so user
0:03:52	basically to get a clean speech correlation matrix
0:03:54	you can do that by for instance
0:03:56	using a voice activity detector to estimate the
0:04:00	noisy speech correlation matrix
0:04:02	during speech plus noise periods
0:04:04	and noise only during noise only periods and then you make this
0:04:08	subtraction here
0:04:10	so basically
0:04:11	in the end we have um
0:04:13	these contributions
0:04:14	kept fixed during
0:04:15	different periods
0:04:17	so of course if you have speech present here
0:04:21	the update of the noise only correlation matrix would be kept fixed
0:04:24	and the speech plus noise correlation matrix will
0:04:27	be updated
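The update scheme just described can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function names, the exponential averaging rule, and the smoothing constant `lam` are all hypothetical, and the voice activity decision `speech_active` is taken as given.

```python
def outer(x):
    """Return the 2x2 outer product x x^H for a two-microphone vector x."""
    return [[x[i] * x[j].conjugate() for j in range(2)] for i in range(2)]


def smooth(R, x, lam):
    """Exponentially averaged update R <- lam * R + (1 - lam) * x x^H."""
    xxH = outer(x)
    return [[lam * R[i][j] + (1.0 - lam) * xxH[i][j] for j in range(2)]
            for i in range(2)]


def update_correlations(R_x, R_n, x, speech_active, lam=0.99):
    """VAD-gated update for one frequency bin and one frame.

    R_x is the speech-plus-noise matrix, R_n the noise-only matrix.
    lam is an assumed smoothing constant, not a value from the talk.
    """
    if speech_active:
        R_x = smooth(R_x, x, lam)   # R_n is kept fixed during speech
    else:
        R_n = smooth(R_n, x, lam)   # R_x is kept fixed during noise only
    return R_x, R_n


def speech_correlation(R_x, R_n):
    """Clean-speech estimate by subtraction: R_s = R_x - R_n."""
    return [[R_x[i][j] - R_n[i][j] for j in range(2)] for i in range(2)]
```

Because only one of the two matrices moves at any time, the noise estimate cannot follow changes that happen while speech is active.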
0:04:28	so of course there is
0:04:29	also a limitation of the
0:04:32	tracking of the
0:04:33	noise correlation matrix because
0:04:35	you can imagine if
0:04:36	the noise
0:04:37	prior to the speech period
0:04:39	is higher than
0:04:41	in the speech plus noise period
0:04:42	and if we stop adapting
0:04:45	the noise correlation matrix
0:04:47	we basically have a
0:04:48	let's say a reduced
0:04:49	noise tracking capability
0:04:52	furthermore the estimation of the correlation matrices
0:04:56	is typically
0:04:57	done with a high averaging
0:04:59	typically in the area of two to three seconds
0:05:01	so somehow this also limits the um
0:05:04	tracking capability
0:05:05	spectral
0:05:08	so if we look at the motivation for our work we started with
0:05:11	um um
0:05:12	since the SDW-MWF depends on the long term average
0:05:16	uh
0:05:18	basically this averaging to some extent
0:05:20	kind of eliminates um
0:05:22	short time effects such as musical noise
0:05:25	and other
0:05:26	artifacts that are present in single channel noise reduction
0:05:29	another issue that we looked at here is this
0:05:32	a
0:05:32	weighting factor here
0:05:34	which in general is used as a fixed weighting factor
0:05:37	for
0:05:38	all frequencies and all frames
0:05:40	and this is what we kind of say well
0:05:42	this
0:05:43	the basis of our work is to find an optimal weighting factor
0:05:47	because
0:05:48	in general you can say that the speech and noise
0:05:51	will be non-stationary and in general you can say that when speaking you will have a lot of silence
0:05:56	periods in between that we can exploit
0:05:58	in the noise reduction
0:06:00	process
0:06:01	while
0:06:03	the noise
0:06:03	in general could be
0:06:05	continuously present
0:06:06	so what we propose is that
0:06:09	we want to apply a different weight to the
0:06:12	speech dominant segments and to the noise
0:06:14	uh
0:06:15	dominant segments
0:06:17	to do that
0:06:18	we took inspiration from uh
0:06:20	single channel noise reduction approaches where there has
0:06:23	been a lot of work done on
0:06:25	spectral tracking
0:06:27	so
0:06:28	so basically we took inspiration from
0:06:30	the speech presence probability
0:06:33	basically
0:06:34	they do that by defining a two state model
0:06:36	so you have one state where you have
0:06:39	noise only and then you have
0:06:40	one state with speech plus noise
0:06:42	whereas the standard approach basically assumes that we have
0:06:46	speech present all the time
0:06:48	so by
0:06:49	exploiting a two state model
0:06:52	we can improve the noise reduction
0:06:55	so basically just to very shortly introduce the speech presence probability
0:06:58	it's estimated for each frequency and for each frame
0:07:01	it is based on uh
0:07:04	an estimate of the
0:07:05	the probability of
0:07:07	speech being absent
0:07:08	and then you have various contributions of
0:07:10	different
0:07:11	signal to noise ratio measures
0:07:13	so an example can be shown here
0:07:15	where you can see here that
0:07:18	in the low frequency area you have
0:07:20	high probability of speech and then
0:07:22	at a certain point you have a lower probability
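The talk does not state which estimator is used. As an illustration only, the sketch below uses the formula that is widely used in single channel noise reduction under a Gaussian signal model, which combines an a priori speech absence probability with a priori and a posteriori SNR estimates. The function name and argument names are hypothetical.

```python
import math


def speech_presence_probability(xi, gamma, q):
    """Conditional speech presence probability for one time-frequency bin.

    xi    : a priori SNR estimate (linear scale, >= 0)
    gamma : a posteriori SNR estimate (linear scale, >= 0)
    q     : a priori speech absence probability, between 0 and 1
    """
    if q <= 0.0:
        return 1.0          # speech assumed always present
    if q >= 1.0:
        return 0.0          # speech assumed always absent
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * math.exp(-v))
```

With both SNR measures high the value approaches one, and with a zero a priori SNR it falls back to `1 - q`, which matches the behaviour described for the example figure: high probability where speech dominates and a lower one elsewhere.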
0:07:24	so the question was
0:07:25	how can we
0:07:26	exploit this in
0:07:28	in the multichannel wiener filter
0:07:31	we start by kind of redefining the uh objective function so
0:07:36	we first have a first term
0:07:38	which is the H one state weighted by the P
0:07:40	and we have a second term
0:07:42	which is the H zero state weighted by the
0:07:44	one minus P so basically
0:07:46	we take into account that we also have a
0:07:49	a state with
0:07:50	noise only so we
0:07:51	can be
0:07:52	more aggressive in this state in terms of noise reduction
0:07:56	when we derive it of course the
0:07:58	now we have
0:07:59	you end up with a term
0:08:01	one over P
0:08:02	which basically
0:08:03	um
0:08:04	kind of like um
0:08:06	is now changing for each
0:08:07	frequency and for each frame versus the fixed weighting factor
0:08:11	so basically if you have a high probability of speech
0:08:13	you go back to kind of like preserving the speech and
0:08:16	if you have a low probability
0:08:17	you go
0:08:19	to more aggressive noise reduction
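One way to write out the step from the two weighted terms to the one over P factor is sketched below. The notation is an assumption, since the slides are not in the transcript: P is the speech presence probability for a given frequency and frame, X = X^s + X^n is the noisy microphone vector, and R_s, R_n, e_1 are the speech correlation matrix, the noise-only correlation matrix, and the selector of the first microphone.

```latex
J(\mathbf{W}) = P \, E\{ |X_1^s - \mathbf{W}^H \mathbf{X}|^2 \mid H_1 \}
              + (1 - P) \, E\{ |\mathbf{W}^H \mathbf{X}^n|^2 \mid H_0 \}

% With independent speech and noise, the gradient condition is
P (\mathbf{R}_s + \mathbf{R}_n) \mathbf{W} - P \mathbf{R}_s \mathbf{e}_1
  + (1 - P) \mathbf{R}_n \mathbf{W} = 0

\mathbf{W} = \left( \mathbf{R}_s + \frac{1}{P} \mathbf{R}_n \right)^{-1}
             \mathbf{R}_s \mathbf{e}_1
```

As P approaches one the weight on the noise term approaches one and speech is preserved. As P becomes small the weight grows and the filter reduces noise more aggressively.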
0:08:20	the problem here however is that
0:08:23	as you saw before
0:08:24	the uh
0:08:26	this speech presence probability kind of varies a lot for each frequency so of course when we applied it in
0:08:30	in this setup
0:08:32	we had a lot of distortion a lot of artifacts basically
0:08:36	some artifacts that were related to
0:08:39	single channel noise reduction
0:08:40	the fact is that
0:08:43	this filter here doesn't really distinguish between the
0:08:45	H zero and the H one state
0:08:48	so we
0:08:48	went a little further
0:08:50	i mean looked and we kind of thought
0:08:52	what happens if we could actually
0:08:54	detect the H zero or H one state
0:08:56	so what we had was so we propose a simple method to do this
0:08:59	we already have
0:09:00	the information
0:09:02	per frequency
0:09:03	so we kind of just said okay we look at for each
0:09:06	each frame we take the average
0:09:08	and if the average is higher than a certain
0:09:11	threshold
0:09:14	we select it as
0:09:15	H one state and
0:09:16	otherwise as H zero
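The frame-level decision described here can be sketched in a few lines. The function name and the default threshold value are hypothetical, since the talk does not give a number.

```python
def detect_frame_state(spp_frame, threshold=0.5):
    """Return 1 (H1, speech plus noise) or 0 (H0, noise only) for one frame.

    spp_frame : speech presence probabilities, one per frequency bin
    threshold : assumed decision threshold on the frame average
    """
    mean_spp = sum(spp_frame) / len(spp_frame)
    return 1 if mean_spp > threshold else 0
```

For example, `detect_frame_state([0.9, 0.8, 0.4])` classifies the frame as H1 because the average of 0.7 exceeds the assumed threshold.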
0:09:18	here's an example this is a clean speech signal but of course it was estimated on the
0:09:23	noisy signal
0:09:24	and here you can see that for certain
0:09:27	threshold values here we
0:09:29	detect as H one state and all the rest as H zero state
0:09:33	so the rationale behind having this
0:09:35	information is that
0:09:37	in the H
0:09:38	zero state
0:09:39	the noise reduction performed there can be
0:09:41	weighted differently because there's no speech present so we can be
0:09:44	much more aggressive without
0:09:46	compromising the
0:09:48	this
0:09:48	or increasing the speech distortion
0:09:51	in the H one state of course
0:09:52	we
0:09:53	we also want to reduce some noise but we want to do it a bit more carefully
0:09:57	so this is the idea of why we want to apply a certain
0:09:59	flexible weighting
0:10:02	to do that in a similar way
0:10:04	what you can see here is that
0:10:05	if we have detected a
0:10:07	H zero state we apply a much higher
0:10:10	weighting factor
0:10:12	and
0:10:13	if it's a H one state
0:10:14	at some point
0:10:15	we will still apply a lower but
0:10:17	fixed weighting factor
0:10:19	and then when the probability gets higher it's kind of weighted
0:10:22	accordingly
0:10:23	in that way
0:10:24	uh
0:10:25	you can kind of preserve certain speech cues
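One possible reading of this weighting rule is sketched below. It is an interpretation, not the authors' published formula: the function name, the parameter names `mu_h0` and `mu_h1`, and their default values are all assumptions. In an H0 frame a large fixed weight is used, and in an H1 frame the weight is one over P, capped by a smaller fixed weight so that low-probability bins are not treated too aggressively.

```python
def flexible_weight(p, speech_frame, mu_h0=10.0, mu_h1=2.0):
    """Weight applied to the noise correlation matrix for one bin.

    p            : speech presence probability of the bin (0..1)
    speech_frame : True if the frame was detected as H1
    mu_h0        : assumed large fixed weight for noise-only frames
    mu_h1        : assumed smaller fixed cap for speech frames
    """
    if not speech_frame:
        return mu_h0                       # aggressive in noise-only frames
    # In speech frames follow 1 / p, but never exceed the fixed cap.
    return min(1.0 / max(p, 1e-12), mu_h1)
```

With these assumed defaults, a bin with `p = 0.1` gets weight 10 in an H0 frame but only 2 in an H1 frame, while a bin with `p = 0.8` in an H1 frame gets about 1.25.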
0:10:29	so to build that into the uh
0:10:32	the standard MWF
0:10:34	so basically we have a combination of soft values and a binary detection
0:10:39	so the first one
0:10:40	is uh a function of
0:10:42	H one state
0:10:43	which is a function of
0:10:45	a certain fixed threshold
0:10:47	and the speech presence probability
0:10:49	and the second term is basically
0:10:51	kind of using a fixed weighting factor
0:10:54	and when we derive it
0:10:55	a of course it all
0:10:57	a P here this is the
0:10:58	weighting factor
0:10:59	so
0:11:00	by exploiting both the
0:11:02	soft value and the hard value
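Whatever weight is chosen for a given frequency and frame, it enters the filter in the same place as the fixed factor of the standard method. A minimal sketch for the two microphone case follows, assuming the usual form w = (R_s + weight * R_n)^{-1} R_s e_1. The function name is hypothetical and the matrices are plain nested lists so that the example needs no external library.

```python
def mwf_filter(R_s, R_n, weight):
    """Return w = (R_s + weight * R_n)^{-1} R_s e_1 for two microphones.

    R_s, R_n : 2x2 speech and noise-only correlation matrices
    weight   : weighting factor for this frequency bin and frame
    """
    a = R_s[0][0] + weight * R_n[0][0]
    b = R_s[0][1] + weight * R_n[0][1]
    c = R_s[1][0] + weight * R_n[1][0]
    d = R_s[1][1] + weight * R_n[1][1]
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("R_s + weight * R_n is singular")
    r0, r1 = R_s[0][0], R_s[1][0]      # R_s e_1 is the first column of R_s
    # Closed-form 2x2 inverse applied to the vector (r0, r1).
    return [(d * r0 - b * r1) / det, (a * r1 - c * r0) / det]
```

With identity matrices for both inputs, a weight of 1 gives a first coefficient of 0.5 and a weight of 3 gives 0.25, which shows how a larger weight attenuates more.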
0:11:06	and then we did some
0:11:07	uh simulations as well uh
0:11:09	using a two microphone hearing aid
0:11:12	in a monaural set up
0:11:14	a
0:11:15	and we have a relatively low reverberation time
0:11:18	and multi-talker babble noise sources
0:11:22	and we used two objective quality measures uh
0:11:24	uh which is the
0:11:26	SNR and
0:11:29	the signal distortion
0:11:32	so
0:11:32	if we look at the results
0:11:34	we see that the standard method gives a much higher
0:11:38	signal to noise ratio
0:11:39	when we increase the weighting factor
0:11:42	but at the same time
0:11:43	the distortion also increases
0:11:46	when we use the one that we initially used with the one over P term
0:11:50	the problem was the high distortion
0:11:52	so
0:11:53	you would still get like quite a good
0:11:55	um
0:11:56	SNR performance but the distortion simply went very high
0:12:00	but with the flexible threshold
0:12:03	where we use the
0:12:05	different weighting factors here we can see that the um
0:12:09	SNR
0:12:10	improvement
0:12:11	is relatively high
0:12:12	and the distortion was also low
0:12:15	of course the question is like how do you choose this weighting factor
0:12:18	and that's of course still something that we're working on
0:12:23	so
0:12:23	just to summarise uh
0:12:25	we presented different extensions of the uh
0:12:28	SDW-MWF algorithm
0:12:30	we started to look at it with a fixed weighting factor
0:12:34	then we incorporated the
0:12:35	speech presence probability
0:12:37	and then at the end we ended up with a combined soft
0:12:40	and binary detection
0:12:42	in future work
0:12:43	um
0:12:45	we are aiming at performing some perceptual evaluation using
0:12:49	hearing impaired listeners
0:12:50	and
0:12:51	we will be further working on
0:12:53	finding a more
0:12:55	perceptually motivated weighting factor for instance where we could
0:12:58	uh exploit certain
0:13:00	masking properties or
0:13:02	even incorporating some
0:13:04	hearing models uh in the weighting process itself
0:13:08	thank you
0:13:11	i
0:13:13	any questions
0:13:15	yes please at the back
0:13:21	Q for for each intention so my question is that the he C so a P you for
0:13:26	uh uh speech do uh as
0:13:28	to each is possible to apply to or twenty five each the wiener filtering for a speech an action
0:13:34	for me just that you have to design speech and you have we include in speech
0:13:39	so was a time not can you don't P and in can be a you know can she do we're
0:13:42	still
0:13:43	uh and these guys piece all seas
0:13:45	so we yeah
0:13:47	each that do you C C's D not P cable and how do you choose a weighting factor
0:13:51	i i we use
0:13:53	should you know one oh can you can you use some ninety
0:13:57	i
0:13:57	can you repeat the question
0:13:59	so i can hear
0:14:01	yes okay now you apply a multichannel wiener filter for noise reduction
0:14:07	so my question is whether it is applicable to speech separation
0:14:12	for example you have a desired speech
0:14:15	and you have interfering speech
0:14:19	oh you mean like multiple speakers yeah yeah yeah
0:14:22	well i guess it would still be uh
0:14:24	i think you can apply it to that scenario but of course it's going to be more difficult
0:14:29	estimating this conditional speech presence probability because
0:14:33	now the spectrum is
0:14:34	going to be much more similar to the
0:14:37	desired speech signal so of course
0:14:39	you have to be much more careful when estimating the weighting factor
0:14:42	and i think still that
0:14:43	you would be
0:14:45	if it was applied to a multi
0:14:46	speaker scenario
0:14:47	probably the results would be a little worse
0:14:50	uh_huh
0:14:51	okay thank you
0:14:52	more questions
0:14:54	comments yes
0:14:58	i mean my question is uh related to his question which is uh
0:15:03	uh when you apply the algorithm to
0:15:06	uh to this uh
0:15:07	do you have constraints or assumptions on the noise type
0:15:11	right
0:15:13	because if the noise is an impulsive noise
0:15:16	or
0:15:17	another type of noise like you know
0:15:19	as he said
0:15:20	if the noise is speech
0:15:22	or impulsive noise or another kind of noise um
0:15:25	can
0:15:26	can this algorithm deal with these
0:15:29	yeah well
0:15:30	at this point we don't make any assumption on the noise actually
0:15:33	it can work on
0:15:34	i would say that uh
0:15:36	the most difficult scenario would be the multi
0:15:39	speaker scenario but in terms of um
0:15:42	noise types i think you can apply it to any noise
0:15:45	there's no
0:15:46	assumption that we make that it has to be a
0:15:48	certain type of noise
0:16:02	a a user
0:16:05	um
0:16:06	so you mean that
0:16:08	is
0:16:08	this algorithm can be used for any type of uh
0:16:12	uh
0:16:13	noise
0:16:15	or given the noise use uh and the speech just top
0:16:18	inter speech
0:16:20	yeah okay so
0:16:21	well i
0:16:21	i i think that uh
0:16:23	in terms of choosing all these values for the threshold
0:16:26	of course uh if you have like multiple speaker scenarios
0:16:30	if you
0:16:31	because
0:16:32	well it depends on how well you can estimate all these uh spectral components like the
0:16:36	speech presence probability
0:16:38	and how well you make the binary decision
0:16:40	so of course if you have multiple talkers in there
0:16:44	you might have a
0:16:45	a large error on your estimation and then of course if you choose
0:16:49	then probably you want to choose a different value because
0:16:52	if you have a large error then you will be
0:16:55	subject to maybe
0:16:56	a higher speech distortion what your five
0:16:58	in this case if you have let's say
0:17:01	a really easy scenario like
0:17:03	maybe like car noise where you have like more stationary noise
0:17:06	then you can estimate
0:17:07	the speech presence probability with
0:17:10	high accuracy
0:17:11	so of course you can also apply a much more aggressive
0:17:13	threshold
0:17:14	but if you have multiple talkers in there you probably have to be much more careful you
0:17:18	can use them on there
0:17:19	yeah
0:17:20	oh
0:17:22	i mean i just want to ask have you tested
0:17:24	these types of scenarios
0:17:26	do you have any results
0:17:28	you mean on the um you mean the multiple speaker scenario
0:17:32	no we didn't test the multiple speaker scenario what we did test was uh
0:17:36	a much higher
0:17:37	uh
0:17:38	room reverberation
0:17:39	and then we saw that
0:17:41	the estimation needed to be
0:17:42	tuned a little bit
0:17:43	and some of the values had to be carefully chosen not to
0:17:47	increase the distortion
0:17:48	in that case
0:17:49	the estimation of the spectral components was much more inaccurate
0:17:52	so we kind of had to
0:17:55	choose different values
0:17:57	so of course
0:17:58	it all depends on how well you can
0:18:00	estimate these
0:18:01	components
0:18:02	and
0:18:03	and here just as a proof of concept we had like a low reverberation scenario and just
0:18:07	had a it's over
0:18:08	babble
0:18:11	my questions
0:18:12	yeah a things that is not what i mean hearing it's then it's of course a of from you they
0:18:17	can not only for speech right
0:18:18	how does it found yeah because you use a different i state
0:18:22	um depending on the frequency right and it depending on the frame
0:18:26	yeah right so if yes and that for example then people want to you of the music might be
0:18:32	yeah
0:18:32	honestly i don't know how well it will work with music in there because this is more like a
0:18:36	noise reduction process so i guess
0:18:39	if you this
0:18:40	besides
0:18:41	move
0:18:41	so but then you should split it off many uh yeah music yeah probably ones with stop
0:18:47	we only work with the
0:18:49	speech signal yeah but if yeah had do you and it's of course it's applicable to any
0:18:53	yeah of course i mean but of course in that terms them
0:18:56	of recess doing more like a
0:18:58	what a convex with between different the settings and so on that
0:19:02	or in this case it
0:19:04	it will not what well
0:19:06	uh_huh
0:19:07	and if it is used for speech then for example the start of the speech might
0:19:10	not be detected well because it's uh considered as noise
0:19:14	yes but one example you can see is that sometimes like if you have like a high frequency
0:19:19	component like he's
0:19:20	some something yeah
0:19:22	uh
0:19:23	you are
0:19:24	these because C but the colour by the noise
0:19:26	if the
0:19:27	speech presence probability gives in that case a very low probability of speech and you allow it to be
0:19:31	very aggressive
0:19:33	in these areas
0:19:34	sometimes you really miss this
0:19:35	yeah on the time so he's
0:19:37	what a since it ways as was like
0:19:39	he was saying like shoes
0:19:41	sometimes you
0:19:42	will not be able to hear it actually
0:19:44	if you allow
0:19:45	the noise reduction to be very aggressive
0:19:47	those
0:19:49	yeah
0:19:50	the of the techniques um a yeah and then uh well basically what we were can ours
0:19:55	we know that we could be pretty aggressive
0:19:57	but it would come at a cost
0:19:59	so right now we are
0:20:00	trying to kind of constrain this weighting factor by some
0:20:03	psychoacoustical properties
0:20:05	so we know exactly when and how much to apply
0:20:10	so basically if
0:20:11	if we know that at certain frequencies
0:20:13	there is a high probability of speech and
0:20:15	then you probably mask the noise at other frequencies
0:20:18	and then we may not have to remove that much noise
0:20:21	at the coming
0:20:22	in the following week
0:20:24	okay
0:20:25	a comments questions
0:20:28	okay thank you that's

A FLEXIBLE SPEECH DISTORTION WEIGHTED MULTI-CHANNEL WIENER FILTER FOR NOISE REDUCTION IN HEARING AIDS

Signal Separation

Speaker: Kim Ngo, Authors: Kim Ngo, Marc Moonen, Katholieke Universiteit Leuven, Belgium; Søren Holdt Jensen, Aalborg University, Denmark; Jan Wouters, Katholieke Universiteit Leuven, Belgium