Speech transcript - PHONEME SELECTIVE SPEECH ENHANCEMENT USING THE GENERALIZED PARAMETRIC SPECTRAL SUBTRACTION ESTIMATOR

0:00:13	Well, actually, this is the work of one of my former PhD students, or rather one of my former master's students, who is now working at a company, and because of that he wasn't able to travel here for the presentation.
0:00:31	The presentation is focused on phoneme selective speech enhancement with a generalized parametric spectral subtraction estimator.
0:00:41	So the approach we're looking at here is to try to balance the differences between voice structures in the articulatory domain.
0:00:54	Noise will impact speech differently depending on the speech class, and we believe that adapting enhancement strategies to these different domains will actually improve overall performance.
0:01:06	Regions of low signal-to-noise ratio are going to be more sensitive to different types of noise, babble or background-type fluctuations.
0:01:17	So it would make sense to track the signal-to-noise ratio, obviously, but also to look at it with respect to the types of phone class.
0:01:26	Noise characteristics obviously affect both quality and intelligibility.
0:01:30	So in the approach here we'd like to focus on a phoneme-class-selective strategy that adapts itself to phone classes over time.
0:01:41	So let's talk a little bit about the different approaches people have taken for phone-class-based enhancement.
0:01:50	Obviously, noise is going to impact different phone classes differently.
0:01:56	Based on the frequency content, the articulatory structure, the influence of noise on the phone, as well as the stationarity of the noise, you would expect quality to be impacted differently.
0:02:08	Going back to the Transactions paper from McAulay and Malpass, this was a soft-decision-based noise suppression strategy across different phone classes. It was a very nice, effective approach.
0:02:20	With one of my former students, Levent Arslan, we had a Transactions paper in the nineties that looked at a hidden Markov model based strategy to classify different phone classes and adapt an iterative Auto-LSP strategy to the different phone classes, and we found that worked well.
0:02:41	With Amit Das, our former student, in our Interspeech 2007 paper we had this class-constrained ROVER strategy. Here again, what we focused on was extracting pieces of enhanced speech from different types of constrained representations, to see whether we could arrive at an overall enhancement solution.
0:03:02	So this figure here shows the strategy; I'll try to illustrate it with the pointer.
0:03:11	This enhancement strategy was an older version from Interspeech 2007. We had a list of Auto-LSP constrained iterative enhancement methods, and in a sense we have a number of different enhancement solutions here.
0:03:26	For the input speech, you basically try a whole range of different approaches. If you look on the right here, you can think of this as starting off with a single degraded speech waveform, and what you end up with is actually a very, very large collection of enhanced waveforms.
0:03:43	You can think of this as a very large brick, for lack of anything else. The time domain is going along here, and across each of these levels, and across this space here, are variations of different parameters that are controlled by the enhancement strategy.
0:04:01	In a sense, you end up with a very large collection of enhanced waveforms. The approach is then to see whether you could use a strategy like a Gaussian mixture model to go through and select which phone classes you actually have, and identify which blocks would be most improved by a particular enhancement solution up here, based on the phone class.
0:04:25	In so doing, we use what's called a ROVER solution; this is something that's very common in the speech recognition community.
0:04:32	What's done is, looking at this big block, you go and pick out, with a particular enhancement configuration, this particular piece and drop it down here, pick up the next and drop it here. In a sense you're looking across all the different domains and piecing them together, hopefully coming up with a nice sequence of optimized enhanced blocks for each of the different phone classes or phone sequences.
0:04:58	And hopefully the overall enhanced signal is actually better.
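The block-picking step described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `piece_together`, the dictionary layout, and the configuration labels are hypothetical and do not come from the talk, and the mapping from phone class to preferred configuration is taken as given rather than learned.

```python
# Hypothetical sketch of ROVER-style piecing together of enhanced blocks.
# All names and the data layout are assumptions made for illustration.

def piece_together(enhanced, block_classes, best_config):
    """Assemble one output by picking, per block, the configuration
    preferred for that block's phone class.

    enhanced:      dict, config name -> list of enhanced blocks
                   (every config holds the same number of blocks)
    block_classes: list of phone-class labels, one per block
    best_config:   dict, phone-class label -> config name
    """
    output = []
    for i, phone_class in enumerate(block_classes):
        config = best_config[phone_class]
        output.append(enhanced[config][i])
    return output


# Toy example: two configurations, three blocks.
enhanced = {
    "cfg_a": ["a0", "a1", "a2"],
    "cfg_b": ["b0", "b1", "b2"],
}
block_classes = ["sonorant", "obstruent", "sonorant"]
best_config = {"sonorant": "cfg_a", "obstruent": "cfg_b"}
result = piece_together(enhanced, block_classes, best_config)
```

In the toy example the first and third blocks come from `cfg_a` and the second from `cfg_b`, which mirrors the idea of dropping the best piece for each phone class into the final sequence.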
0:05:02	So, just as a concept: when we look at traditional enhancement methods, like MMSE or spectral subtraction, what people typically see is that you have maybe several classes of phones; this could be class one, class two, class three, and these are different types of phone class.
0:05:23	One would argue that a particular enhancement method like MMSE, if you tune it properly, tries to give you good-sounding speech across all the classes for one configuration.
0:05:38	In a sense, what you'd like to do is migrate this solution over specifically to this type of class, maybe to the centroid of this space, with the idea that it has now been optimized for this particular phone class. The result, of course, is that if you pick each of these particular centroids, you'd end up with a better overall solution than simply keeping the enhancement strategy constant for the whole waveform.
0:06:03	So the approach we're looking at here is to use an alternative approach to the generalized spectral subtraction strategy, and that is to look at a weighted Euclidean distortion.
0:06:18	We believe this might be a better measure than using mean square error, because we feel it has a little more perceptually based criteria incorporated into it.
0:06:31	The idea is you have this vector of harmonic coefficients here from an FFT, and what we can do is emphasize the errors in the valleys, say, by decreasing this beta term that you have in the representation.
0:06:49	We can assume, for example, that when you're in the valleys the magnitude of X would be less than one, so if you allow beta to be small, it actually causes this estimator to increase.
0:07:05	On the other hand, if you're in the spectral peaks, say during voiced blocks, X will be greater than one at a particular frequency harmonic, and then allowing the value of beta to be greater than zero will allow this term to actually increase.
0:07:24	So that allows us to adapt the enhancement solution in a parametric way.
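The effect of beta on valleys and peaks can be illustrated numerically. The sketch below assumes the distortion has the commonly used form d = X^beta (X - X_hat)^2; the talk refers to a weighted Euclidean distortion with a beta term but does not spell out the formula, so this form, the function name, and the example magnitudes are assumptions.

```python
# Assumed form of the weighted Euclidean distortion (not stated in the talk):
#   d(x, x_hat) = x**beta * (x - x_hat)**2, for a spectral magnitude x > 0.

def weighted_euclidean(x, x_hat, beta):
    """Squared error weighted by x**beta."""
    return (x ** beta) * (x - x_hat) ** 2


valley, peak = 0.5, 2.0  # illustrative magnitudes below and above one
err = 0.1                # same absolute estimation error in both cases

# Valley (x < 1): a negative beta weights the error more than plain MSE.
d_valley_neg = weighted_euclidean(valley, valley - err, beta=-1.0)
d_valley_mse = weighted_euclidean(valley, valley - err, beta=0.0)

# Peak (x > 1): a positive beta weights the error more than plain MSE.
d_peak_pos = weighted_euclidean(peak, peak - err, beta=1.0)
d_peak_mse = weighted_euclidean(peak, peak - err, beta=0.0)
```

With beta set to zero the weight is one and the measure is the ordinary squared error, which is the comparison point in both cases.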
0:07:30	So the approach we're going to look at is to use the generalized spectral subtraction approach, which was introduced by Sim, Tong, Chang and Tan in their Transactions paper in ninety-eight.
0:07:45	This is the estimator here. It basically finds the best estimate between the term X hat and X from the original degraded speech signal.
0:07:59	The two components that we see here, the a and b terms, are frequency-dependent weighting coefficients that need to be estimated, and the alpha term here is the spectral exponent that you see in the terms here.
0:08:13	What we'd like to do is optimize the a and b terms. In the generalized spectral subtraction approach, this is done by minimizing the mean square error term you see here.
0:08:25	There are two solutions that they come up with. One is referred to as the unconstrained approach, which basically means that the a and b terms are not equal to each other, and the other is the constrained approach, which basically means that the two terms are in fact equal to each other.
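As a rough illustration of the estimator with its a, b and alpha terms, the sketch below assumes the usual generalized spectral subtraction form, X_hat^alpha = a |Y|^alpha - b E[|N|^alpha], applied to one frequency bin. The slide's exact expression is not reproduced in the transcript, so the formula, the zero floor, and all names here are assumptions, and the optimal a and b are simply passed in rather than derived.

```python
# Assumed generalized spectral subtraction form for a single frequency bin:
#   x_hat**alpha = a * |Y|**alpha - b * E[|N|**alpha]
# The floor at zero is an added safeguard, not something stated in the talk.

def gss_estimate(noisy_mag, noise_alpha_mean, a, b, alpha, floor=0.0):
    """Return the estimated clean magnitude for one bin.

    noisy_mag:        |Y|, magnitude of the degraded speech in this bin
    noise_alpha_mean: estimate of E[|N|**alpha] for this bin
    a, b:             weighting coefficients (a == b in the constrained case)
    alpha:            spectral exponent (2.0 gives power subtraction)
    """
    value = a * noisy_mag ** alpha - b * noise_alpha_mean
    return max(value, floor) ** (1.0 / alpha)


# Constrained case with a = b = 1 and alpha = 2: plain power subtraction.
x_hat = gss_estimate(noisy_mag=5.0, noise_alpha_mean=9.0, a=1.0, b=1.0, alpha=2.0)

# When the noise estimate exceeds the noisy power, the floor applies.
x_floor = gss_estimate(noisy_mag=1.0, noise_alpha_mean=4.0, a=1.0, b=1.0, alpha=2.0)
```

Passing different values for `a` and `b` corresponds to the unconstrained case described above; passing the same value for both corresponds to the constrained case.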
0:08:40	So how do we approach ours? What we do is optimize the a and b terms subject to the weighted Euclidean distortion.
0:08:53	In so doing, we end up with these particular solutions for the a and b terms. We can then take these estimates of a and b, substitute them back into the generalized spectral subtraction approach, and form new parametric estimators that we feel offer some greater flexibility for enhancement.
0:09:13	Just as a side note, the minimum mean square error optimized coefficients are really just a special case of this weighted Euclidean distortion approach: when you allow beta to equal zero, it actually falls back to the previous solution.
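The special case mentioned in the side note can be written out in one line. The cost below is an assumed form of the weighted Euclidean criterion, since the slide's own expression is not in the transcript; X is the clean spectral magnitude and X hat its estimate.

```latex
% Assumed form of the weighted Euclidean cost; the talk does not spell it out.
J_{\beta} = E\left[ X^{\beta} \, (X - \hat{X})^{2} \right],
\qquad
J_{0} = E\left[ (X - \hat{X})^{2} \right] \quad (\beta = 0).
```

Under this assumption, setting beta to zero removes the weighting, so minimizing the cost over a and b is the same as minimizing the mean square error.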
0:09:28	So this is a busy plot, but I will try to highlight the pieces.
0:09:35	Here we're first looking at fixing alpha and allowing the beta term to decrease, and on this side we're allowing alpha to increase, keeping beta fixed.
0:09:44	Basically there are four quadrants here. One is the speech region, which you see up here. This tends to be the case where you have high speech information, so you'd obviously like to suppress some of the noise, but you don't really want to touch or damage the speech signal as much.
0:10:02	The second region here, Q2, is actually an unlikely region; it's a spot you don't expect to be operating in.
0:10:10	Q3 is a noise-only region, and in this part you really would like to have greater suppression. If you look at the beta-based constrained and unconstrained solutions, we actually have greater suppression gain on this side, so that's desirable to have.
0:10:24	And quadrant four is the case where you typically see side harmonics popping up, and this is actually the most dangerous area, because in this part you really would like to have suppression, but you'd also like to ensure that you don't have musical tone artifacts popping up.
0:10:44	So this region is a spot where you'd like to be sure you have good performance.
0:10:49	There are quite a few different enhancement methods that we'll be comparing here; I'll try to highlight those, not on this slide but the next slide.
0:10:56	We will be going through a ROVER-type solution, and this is using what's called a MixMax solution. This is actually coming from a Transactions paper from Nádas, David Nahamoo and Michael Picheny back in eighty-nine, for speech recognition.
0:11:16	So in our approach here, basically, we assume we have degraded speech. We're going to have three estimators: one that we believe is a good estimator for sonorants, one that is a good estimator for obstruents, and another one that we believe would be good for silence.
0:11:31	If we have high energy, we assume it's a sonorant and we move forward with that.
0:11:36	If it's an obstruent, it may or may not be noise, and so we run it by a voice activity detector here.
0:11:42	If it is in fact noise, then what we'd like to do is move down and update our noise reference characteristics here.
0:11:50	If it is in fact speech, then we just use this in our model.
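The decision flow just described (high energy, then the voice activity detector, then a noise reference update) might look roughly like this. The energy threshold, the recursive-averaging update, and every name in the sketch are assumptions; the talk only describes the order of the decisions, not how each one is computed.

```python
# Hypothetical per-frame routing; threshold and smoothing are assumptions.

def route_frame(energy, vad_says_speech, noise_ref, frame_spectrum,
                high_energy_threshold=1.0, smoothing=0.9):
    """Return (label, noise_ref) for one frame.

    High-energy frames are taken to be sonorant. Lower-energy frames are
    checked with a voice activity decision: speech is passed on as an
    obstruent candidate, otherwise the frame is treated as noise and the
    noise reference is updated by recursive averaging.
    """
    if energy >= high_energy_threshold:
        return "sonorant", noise_ref
    if vad_says_speech:
        return "obstruent", noise_ref
    updated = [smoothing * n + (1.0 - smoothing) * s
               for n, s in zip(noise_ref, frame_spectrum)]
    return "noise", updated


label_hi, ref_hi = route_frame(2.0, False, [1.0], [3.0])
label_sp, ref_sp = route_frame(0.1, True, [1.0], [3.0])
label_no, ref_no = route_frame(0.1, False, [1.0], [3.0])
```

Only the third call changes the noise reference, which matches the description that the reference is updated when the detector decides the frame is noise.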
0:11:55	So we pull off the MFCC coefficients. These are used primarily for the Gaussian mixture models here, which basically try to classify whether we're sitting in obstruent, sonorant or silence blocks.
0:12:08	Once we have this knowledge, we feed it into the MixMax-type solution. What this does is set maximum-likelihood weights that we can then use to weight the solutions from the sonorant, obstruent and noise-based estimators that we see here, hopefully coming up with an integrated solution that will sound better than any of the individual solutions.
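The weighting of the three estimator outputs can be illustrated with a simple stand-in. The sketch below is not the MixMax model itself: it assumes the per-class GMM log-likelihoods are already available, turns them into normalized weights with a softmax under equal class priors, and forms a weighted sum of the per-class enhanced spectra. All names and numbers are hypothetical.

```python
import math

# Simplified stand-in for the soft combination, NOT the MixMax model:
# per-class log-likelihoods are normalized into weights (equal priors
# assumed), which then weight the per-class enhanced spectra.

def class_weights(log_likelihoods):
    """Map class -> log-likelihood into class -> weight, summing to one."""
    peak = max(log_likelihoods.values())  # subtract the max for stability
    unnorm = {c: math.exp(v - peak) for c, v in log_likelihoods.items()}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}


def soft_combine(estimates, weights):
    """Weighted sum, bin by bin, of the per-class enhanced spectra."""
    n_bins = len(next(iter(estimates.values())))
    return [sum(weights[c] * estimates[c][k] for c in estimates)
            for k in range(n_bins)]


log_likelihoods = {"sonorant": -10.0, "obstruent": -10.0, "silence": -10.0}
estimates = {
    "sonorant": [3.0, 3.0],
    "obstruent": [6.0, 0.0],
    "silence": [0.0, 0.0],
}
weights = class_weights(log_likelihoods)
combined = soft_combine(estimates, weights)
```

With equal log-likelihoods each class gets a weight of one third, so the combined spectrum is the plain average of the three estimates; a frame that is clearly sonorant would instead be dominated by the sonorant estimator's output.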
0:12:31	So, the categories: as I said, there are three broad phone class types here, sonorants, obstruents and silence.
0:12:38	We group what we believe to be the fricatives, affricates and stops into the obstruents.
0:12:44	Again, we're doing this in an unsupervised manner over time, so what we believe to be the stops are actually finding a way to, in fact, move into the silence class.
0:12:54	Again, the parametric beta estimators that we're using are going to be tuned to each of the broad phone classes, for sonorants and obstruents.
0:13:06	The outputs from these estimators are converted to MFCCs, and then the decision weights here are used to make a soft combined weight for each of the composite utterances, similar to the ROVER solution that we had back in Interspeech 2007.
0:13:21	Finally, the noisy speech can be modelled using this MixMax-type model, which also incorporates classification for the silence and the pauses.
0:13:32	In this MixMax model, the GMMs indicate we need to have two, one for the sonorants and one for the obstruents, so we have a set number of mixture components that are used to estimate those.
0:13:45	For the silence, we're using just one mixture right now. Of course, if you have multiple noise types, you can use more than one mixture to handle that.
0:13:53	In the MixMax model, as I pointed out, Nádas, David Nahamoo and Michael Picheny had this idea for modeling noise characteristics for speech recognition in eighty-nine; we're using it here to track the noise structure.
0:14:10	Next, a look at the enhancement and the experimental setup here. We used results from thirty-two individual sentences from TIMIT.
0:14:19	The metrics we used were segmental signal-to-noise ratio and Itakura-Saito distortion.
0:14:25	For the results, I'm going to show here just the segmental SNR; the paper has all the results for Itakura-Saito as well.
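For readers unfamiliar with the first metric, segmental SNR is usually computed as the average of frame-level SNRs in decibels. The sketch below follows that common definition; the frame length, the clamping range of -10 to 35 dB, and the handling of silent frames are conventional choices assumed here, not settings reported in the talk.

```python
import math

# Common textbook definition of segmental SNR; frame length and the
# clamping limits are assumptions, not values taken from the talk.

def segmental_snr(clean, processed, frame_len=4, lo_db=-10.0, hi_db=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo_db, hi_db]."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        p = processed[start:start + frame_len]
        signal = sum(x * x for x in c)
        noise = sum((x - y) ** 2 for x, y in zip(c, p))
        if signal == 0.0:
            continue  # skip silent frames, where the ratio is undefined
        if noise == 0.0:
            snr = hi_db  # perfect reconstruction: use the upper limit
        else:
            snr = 10.0 * math.log10(signal / noise)
        snrs.append(min(max(snr, lo_db), hi_db))
    return sum(snrs) / len(snrs) if snrs else float("nan")


clean = [1.0, -1.0, 1.0, -1.0, 0.5, -0.5, 0.5, -0.5]
noisy = [c + 0.5 for c in clean]      # toy degraded signal
enhanced = [c + 0.1 for c in clean]   # toy enhanced signal

# A positive difference means the enhancement improved segmental SNR.
improvement = segmental_snr(clean, enhanced) - segmental_snr(clean, noisy)
```

The difference between the enhanced and degraded scores is the kind of "segmental SNR increase" figure where any positive value indicates an improvement.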
0:14:34	For training the GMMs, we used [unclear] tokens with sixteen mixtures, and for the silence model just a single mixture.
0:14:41	For the noise types we have two: a flat communications channel noise that we had from an AT&T voice channel, and a large crowd noise, so this is multiple people speaking, but not babble; it's a broader kind of noise.
0:14:57	There are quite a few different enhancement strategies: the standard Ephraim-Malah MMSE; the joint MAP scheme from Patrick Wolfe and Simon Godsill, from their paper in two thousand nine, I believe;
0:15:15	from the Sim paper on generalized spectral subtraction, the unconstrained approach, where the a and b terms don't have to be equal to each other, and the constrained scheme, where they do have to be equal to each other;
0:15:26	and the parametric approaches for a weighted Euclidean distortion based approach.
0:15:32	For our ICASSP paper last year we had a chi-square prior for the amplitudes on the scheme, and that was reported last year, and we also had a chi-square prior for the JMAP solution, so this is what we had as of last year.
0:15:49	For what we're doing this year with the ROVER approach, this has the ROVER-based incorporation of the weighted Euclidean distortion with chi priors, and the same for the JMAP-type solution.
0:16:02	Then we take the beta-unconstrained and beta-constrained approaches here and also feed them into a ROVER solution.
0:16:08	In a sense, we have a lot of different enhancement methods that we really want to benchmark, a baseline against the parametric approaches.
0:16:18	This table shows the segmental signal-to-noise ratio increase; any positive value here shows an improvement in segmental signal-to-noise ratio.
0:16:31	The GSS term here is basically the generalized spectral subtraction approach, which is the baseline scheme from Sim's paper.
0:16:40	You can see the improvements here on the sonorants are quite good. The improvements on obstruents and silence are not as good, and the overall is about what you'd like to see.
0:16:51	Now, each of these three is actually optimized for sonorants, obstruents and the noise types. What we do is search across all the possible configurations for the terms here and find the best configuration for the sonorants, and that's the best improvement that we get.
0:17:10	At the same time, for the obstruents [unclear], and silence is not where we'd like it to be.
0:17:21	But if you look across the diagonal here, for the sonorants, the obstruents and the noise, when we optimize this we actually get a nice improvement across here, better than what we would have gotten with Sim's approach.
0:17:38	The goal then is to figure out how to take the best from each of these and put them together.
0:17:43	This approach here is a ROVER-based solution. It does not use the exact optimum solutions here; it actually goes and finds what it thinks is the best approach based on the phone classifier.
0:17:55	So this is what you would expect to see performance-wise if it's free running, not knowing what the best performance is.
0:18:01	You can see the improvement in segmental signal-to-noise ratio is quite nice for sonorants, obstruents and silence, and not too bad in the overall.
0:18:12	Keeping track of time here, so just quickly: these show signal-to-noise ratio increases for flat communications channel noise across all the different noise types, or noise levels, I'm sorry.
0:18:25	The main thing to see from here is that these approaches down here are the non-ROVER approaches and these are the ROVER solutions, and combining them in a nice automatic way allows you to get better performance for the flat communications channel noise.
0:18:41	Likewise, for the large crowd noise, you can see the performance here is quite nice for the obstruents and for the sonorants as well, and when you combine them they're actually much, much better than the individual ones.
0:18:55	If you look across these maybe sixteen or so different enhancement strategies for the best ones, these indicate the first and second best enhancement strategies.
0:19:06	You can pretty much see across all of our evaluations here that the ROVER-based solutions work quite well.
0:19:13	The parametric beta scheme for the generalized spectral subtraction was also a good candidate there, and the JMAP version was also a successful one.
0:19:28	So, in conclusion, we have considered a parametric generalized spectral subtraction approach here.
0:19:37	These parametric estimators can be fine-tuned for the different phone classes, and they may not perform as well across all the phone classes.
0:19:45	Incorporating a ROVER paradigm allows us to pick off some of the better segments and piece them together for an overall enhanced approach.
0:19:54	We compared these against the individual estimators without a ROVER solution and found that their combinations improve performance for flat communications channel noise and large crowd noise over different signal-to-noise ratios.
0:20:16	John, thank you very much.
0:20:22	So, any questions?
0:20:27	The alpha and the beta, are they constant in each group, for all the frequencies?
0:20:32	They are constant, but they can be different for each of the classes: sonorants, obstruents and silence; they can be different for those.
0:20:45	When we look at the prior ROVER solution, we actually had many, many more classes; here we're only looking at three. So we allow some flexibility; you can generalize it to more classes than three.
0:21:01	And how robust is it with respect to misclassification?
0:21:06	Yeah, that's a good question. We are running a test where we're intentionally putting in five and ten percent errors.
0:21:14	You're less likely to have an error between sonorant and obstruent, right, but you're more likely to have an error between obstruent and silence.
0:21:22	That was the issue when I pointed out the stops: sometimes the leading or trailing stops tend to get pulled into the silence side, and so you have too much suppression.
0:21:35	Further comments?
0:21:37	So thank you once more.


Speech Enhancement

Speaker: John Hansen, Authors: Amit Das, University of Colorado Boulder / University of Texas at Dallas, United States; John Hansen, The University of Texas at Dallas, United States