0:00:13 Well, actually, one of my former students, who did his master's with us, is now working at a company, and because of that he was unable to travel here for the presentation, so of course I'm presenting.
0:00:31 The presentation is focused on phoneme-selective speech enhancement using generalized parametric spectral subtraction.
0:00:41 So the approach we're looking at here is to try to balance the differences between voice structures, say in the acoustic and articulatory domains. Noise will impact speech differently depending on the speech class, and I believe that adapting the enhancement strategy to these different domains will actually improve your overall performance.
0:01:06 Regions of low signal-to-noise ratio are going to be more sensitive to different types of noise, babble or background-type fluctuations. So it would make sense to track not only the signal-to-noise ratio, obviously, but also to look at it with respect to the phone class, the noise characteristics, and of course both quality and intelligibility. So the approach here focuses on a phoneme-class selective strategy that adapts itself to the phone classes over time.
0:01:41 So let's talk a little bit about the different approaches people have taken to phone-class based enhancement. Obviously, noise is going to impact different phone classes differently, based on the frequency content, the articulatory structure, and the influence of the noise on the phone; and, depending on the stationarity of the noise, you would expect quality to be impacted differently as well.
0:02:07 Going back to a Transactions paper from McAulay: they had a soft-decision based noise suppression strategy across different phone classes, a very nice, effective approach. One of my former students, Levent Arslan, and I had a paper in Transactions in '99 that looked at a hidden Markov model based strategy to classify the different phone classes and adapt an iterative Auto-LSP strategy to each phone class, and we found that worked well.
0:02:41 Then another of our former students, in our Interspeech 2007 paper, had this class-constrained ROVER strategy. Here what we focused on was extracting pieces of enhanced speech from different types of constrained representations, to see whether we could improve the overall enhancement solution.
0:03:02 So this figure here shows the strategy; I'll try to illustrate it with the pointer.
0:03:10 So the enhancement strategy — this was an older version from Interspeech 2007 — used an iterative constrained Auto-LSP enhancement method, and the sense is that we have a number of different enhancement solutions here. For the input speech you basically try a whole range of different approaches, and if you look on the right here, you can think of this as starting off with a single degraded speech waveform, and what you end up with is actually a very, very large collection of enhanced waveforms. You can think of this as a very large brick, for lack of anything else: the time domain goes along here, and across each of these levels, and across this space here, are the variations of the different parameters that are controlled by the enhancement strategy.
0:04:01 And so, since you end up with a very large collection of enhanced waveforms, the approach is to use a strategy like a Gaussian mixture model to go through and select which phone classes you actually have, and identify which blocks would be most improved by a particular enhancement solution, based on the phone class. In so doing, we use what's called a ROVER solution — something that's very common in the speech recognition community. What's done is, looking at this big block, you go and pick out, with a particular enhancement configuration, this particular piece and drop it down here, then the next one and drop it here; and since you're looking across all the different domains and piecing them together, you hopefully come up with a nice sequence of optimized enhanced blocks for each of the different phone classes or phone sequences — and hopefully the overall enhanced signal is actually better.
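As a rough sketch of this piecing-together idea — all names, the segment bookkeeping, and the class-to-configuration preference table here are hypothetical illustrations, not taken from the paper — given a bank of enhanced waveforms and a per-segment phone-class decision, you copy out, for each segment, the block from the waveform whose configuration is preferred for that class:

```python
import numpy as np

def rover_select(enhanced_bank, segments, class_pref):
    """Piece together one output from a bank of enhanced waveforms.

    enhanced_bank : dict mapping config name -> waveform (all same length)
    segments      : list of (start, end, phone_class) tuples covering the signal
    class_pref    : dict mapping phone_class -> preferred config name
    """
    names = list(enhanced_bank)
    n = len(next(iter(enhanced_bank.values())))
    out = np.zeros(n)
    for start, end, cls in segments:
        cfg = class_pref.get(cls, names[0])   # fall back to the first config
        out[start:end] = enhanced_bank[cfg][start:end]
    return out
```

A real system would derive `class_pref` (or soft weights) from the classifier rather than a fixed table, but the copy-and-concatenate structure is the same.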
0:05:02 Just to give the concept: when we look at a traditional method like MMSE or spectral subtraction, what people typically see is that you have maybe several classes of phones — this could be class one, class two, class three — and these are different types of phone classes. One would argue that a particular enhancement method like MMSE, if you tune it properly, tries to give you good-sounding speech across all the classes for one configuration. In a sense, what you'd like to do instead is migrate that solution over specifically to each type of class, maybe to the centroid of its space, with the idea that it is now optimized for that particular phone class. The result, of course, is that if you pick each of these particular centroids, you end up with a better overall solution than simply keeping the enhancement strategy constant for the whole waveform.
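The intuition can be illustrated numerically. The score matrix below is invented for illustration: each row is one enhancement configuration, each column a phone class, and the entries stand in for per-class quality gains. Picking the best configuration per class always does at least as well as one global configuration:

```python
import numpy as np

# Rows: candidate enhancement configurations; columns: phone classes
# (e.g. sonorant, obstruent, silence).  Each entry is a hypothetical
# quality score of that configuration measured only on frames of that class.
scores = np.array([[4.0, 1.0, 0.5],    # config tuned toward sonorants
                   [1.5, 3.0, 0.8],    # config tuned toward obstruents
                   [0.5, 0.7, 2.5]])   # config tuned toward silence

# One-size-fits-all: the single configuration with the best average score.
global_cfg = scores.mean(axis=1).argmax()
global_total = scores[global_cfg].sum()

# Class-selective: pick the best configuration independently per class.
per_class_total = scores.max(axis=0).sum()

# Selecting per class can never lose to a single fixed configuration.
assert per_class_total >= global_total
```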
0:06:03 So the approach we look at here is to use an alternative within the generalized spectral subtraction strategy, and that is to look at a weighted Euclidean distortion. We believe this might be a better measure than using mean-square error, because we feel it has a somewhat more perceptually based criterion incorporated into it.
0:06:31 The idea is that you have this vector of harmonic coefficients from an FFT, and what we can do is emphasize the errors during the valleys by decreasing this beta term in the representation. We can assume, for example, that when you're in the valleys, the magnitude of X will be less than one; so if you allow beta to be small, that actually causes this estimator term to increase. On the other hand, if you're in the spectral peaks, say during voiced blocks, X will be greater than one at a particular frequency harmonic, and then allowing the value of beta to be greater than zero will allow this term to increase. And so that allows us to adapt, in a parametric way, the enhancement solution.
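A minimal sketch of the weighted Euclidean distortion being described, assuming the common form in which the squared spectral error at each harmonic is weighted by X raised to the power beta (the paper's exact formulation may differ in detail):

```python
import numpy as np

def weighted_euclidean(X, X_hat, beta):
    """Weighted Euclidean distortion between clean and estimated spectra.

    Assumed form: d = sum_k X_k**beta * (X_hat_k - X_k)**2.
    beta < 0 boosts the weight where X_k < 1 (spectral valleys);
    beta > 0 boosts the weight where X_k > 1 (spectral peaks);
    beta == 0 reduces to the plain squared-error criterion.
    """
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    return float(np.sum(X**beta * (X_hat - X) ** 2))
```

With X = [0.5, 2.0] (a valley bin and a peak bin) and equal absolute errors, a negative beta makes the valley bin dominate the distortion, while a positive beta makes the peak bin dominate — exactly the emphasis behavior described above.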
0:07:30 So the approach we're going to use is the generalized spectral subtraction approach, introduced by Sim, Tong, Chang, and Tan in their Transactions paper in '98. This is the estimator here: it basically finds the best estimate between the terms X-hat and X from the original degraded speech signal. The two components that we see here, the a and b terms, are frequency-dependent weighting coefficients that need to be estimated, and the further term here is the spectral exponent that you see applied to the terms. Okay — so what we'd like to do is be able to optimize the a and b terms here, and in so doing, the approach for the general spectral subtraction is basically to minimize the mean-square-error term you see here. There are two solutions that come out of this: one, referred to as the unconstrained approach, basically means that the a and b terms are not equal to each other; and the constrained approach basically means that the two terms are in fact equal to each other.
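A sketch of a generalized-spectral-subtraction estimator of the kind described, with frequency-dependent weights a and b and a spectral exponent. The half-wave rectification floor and the exact parameterization here are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def gss_estimate(Y, D, a, b, gamma=2.0, floor=1e-3):
    """Generalized spectral subtraction (illustrative form).

    Assumed estimator: |X_hat|**gamma = a*|Y|**gamma - b*|D|**gamma,
    half-wave rectified with a small spectral floor so the power stays
    positive.  a and b are the frequency-dependent weighting coefficients;
    the "constrained" variant of the method simply ties a == b.
    """
    Y = np.abs(np.asarray(Y, dtype=float))   # degraded-speech magnitude
    D = np.abs(np.asarray(D, dtype=float))   # noise-estimate magnitude
    diff = a * Y**gamma - b * D**gamma
    diff = np.maximum(diff, (floor * Y) ** gamma)   # spectral floor
    return diff ** (1.0 / gamma)
```

With gamma = 2, a = b = 1 this degenerates to ordinary power spectral subtraction; the point of the method is that a and b are optimized per frequency (and, here, per phone class) rather than fixed.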
0:08:40 So how does our approach differ? What we're going to do is optimize the a and b terms subject to the weighted Euclidean distortion. In so doing, we end up with these particular solutions for the a and b terms; we can then take these estimates of a and b and substitute them back into the generalized spectral subtraction approach, and form new parametric estimators that we feel offer some greater flexibility for enhancement. Just as a side note, the minimum mean-square-error optimized coefficients are really just a special case of this weighted Euclidean distortion approach: when you allow beta to equal zero, it actually falls back to the previous solution.
0:09:28 This is kind of a busy plot, but I'll try to highlight the pieces. Here we're first looking at fixing alpha and allowing the beta term to decrease, and on this side we're allowing alpha to increase while keeping beta fixed. There are basically four quadrants here. One is the speech region, which you see up here: this tends to be the case where you have high speech information, so you'd like to suppress some of the noise but not really touch or damage the speech signal as much. The second region, Q2, is actually an unlikely region — a spot you don't expect to be operating in. Q3 is a noise-only region, and in this part you really would like to have greater suppression; if you look at the beta-based constrained and unconstrained solutions, we actually get a greater suppression gain on that side, so that's desirable. And then there's quadrant four: this is actually the case where you typically see side harmonics popping up, and this is the most dangerous area, because in this part you really would like to have suppression, but you also want to ensure you don't get the musical-tone artifacts that might pop up. So this is the region where you'd like to be sure you have good performance.
0:10:49 There are quite a few different enhancement methods that we'll be comparing here; I'll highlight them not on this slide but the next. We'll be going through a ROVER-type solution, using what's called a MixMax solution — this actually comes from a Transactions paper by Nádas, David Nahamoo, and Michael Picheny back in '89, for speech recognition.
0:11:16 So the approach here: basically we assume we have degraded speech, and we're going to have three estimators — one that we believe is a good estimator for sonorants, one a good estimator for obstruents, and another one which we believe is good for silence. If we have high energy, we assume it's a sonorant and we move forward with that. If it's an obstruent, it may or may not be noise, and so we back it up with a voice activity detector here: if it is in fact noise, then what we'd like to do is move down and update our noise reference characteristics here; if it is in fact speech, then we're going to use it in our model. So we pull out the MFCC coefficients — these are used primarily for the Gaussian mixture models here, which basically try to classify whether we're sitting in an obstruent, sonorant, or silence block. Once we have this knowledge, we feed it into the MixMax-type solution, and what this does is set maximum-likelihood weights that we can then use to weight the solutions from the sonorant-, obstruent-, and noise-based estimators that we see along here — hopefully coming up with an integrated solution that sounds better than any of the individual solutions.
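The weighted-combination step just described can be sketched as a posterior-weighted average of the three class-tuned estimates. Here the per-class log-likelihoods are stand-ins for what the MFCC-domain GMMs would supply:

```python
import numpy as np

def soft_combine(estimates, log_likes):
    """Soft-combine class-specific enhanced spectra by class posteriors.

    estimates : (n_classes, n_bins) array, one enhanced spectrum per
                class-tuned estimator (sonorant / obstruent / silence).
    log_likes : (n_classes,) per-class log-likelihoods for the current
                frame, e.g. from the MFCC-domain GMMs (hypothetical here).
    """
    log_likes = np.asarray(log_likes, dtype=float)
    w = np.exp(log_likes - log_likes.max())   # stabilized exponentiation
    w /= w.sum()                              # posterior weights, sum to 1
    return w @ np.asarray(estimates, dtype=float)
```

When one class dominates the likelihood, the output approaches that class's dedicated estimator; when the classifier is uncertain, the estimates are blended, which is the soft-decision behavior the talk is after.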
0:12:31 So, as I said, there are three broad phone-class types here: sonorants, obstruents, and silence. We group what we believe to be the fricatives, affricates, and stops into the obstruents. Again, we're doing this in kind of an unsupervised manner over time, so what we believe to be stops may actually fade away and in fact move into the silence. Again, the parametric beta estimators we're using are going to be tuned to each of the broad phone classes — for sonorants and obstruents.
0:13:06 Now, the outputs from these estimators are converted to MFCCs, and then the decision weights here are used to make a soft combined weight for each of the composite utterances — similar to the ROVER solution from speech recognition back in '97. Finally, the noisy speech can be modeled using this MixMax-type model, which also incorporates the classification for the silence and the pauses. In this MixMax model, the GMMs — we need two, one for the sonorants and one for the obstruents — have a set number of mixture components used for the estimates. For the silence, we're using just one mixture right now; of course, if you have multiple noise types you could have more than one mixture to take care of that. In the MixMax model, as I pointed out, Nádas, Nahamoo, and Picheny had this idea for modeling noise characteristics for speech recognition in '89; we're using it here to track the noise structure.
0:14:10 Next, let's look at the enhancement results. For the experimental setup, we used results from thirty-two individual sentences from TIMIT. The metrics we used were segmental signal-to-noise ratio and Itakura-Saito distortion. For the results, I'm just going to show the segmental SNR here; the paper has all the Itakura-Saito results as well.
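For reference, a common definition of the segmental SNR metric mentioned here: frame-wise SNR in dB averaged over frames, with each frame clamped to a dB range so that silent frames do not dominate. The frame length and clamp limits below are typical values, not ones stated in the talk:

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Frame-averaged segmental SNR in dB (a common definition)."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n = (len(clean) // frame_len) * frame_len   # drop the partial tail frame
    snrs = []
    for i in range(0, n, frame_len):
        sig = np.sum(clean[i:i + frame_len] ** 2)
        err = np.sum((clean[i:i + frame_len] - enhanced[i:i + frame_len]) ** 2)
        snr = 10.0 * np.log10(sig / (err + 1e-12) + 1e-12)
        snrs.append(min(max(snr, lo), hi))      # clamp per-frame SNR
    return float(np.mean(snrs))
```

The tables that follow report the *increase* in this quantity, i.e. segmental SNR of the enhanced signal minus that of the unprocessed degraded signal, so any positive value is an improvement.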
0:14:34 The GMMs were trained with sixteen mixtures, and the silence model with just a single mixture. For the noise types we had two: a flat communications-channel noise that we had from an AT&T voice channel, and a large-crowd noise — so multiple people speaking, but not babble; it's kind of a broader noise type.
0:14:57 There are quite a few different enhancement strategies: the standard MMSE scheme; the joint-MAP scheme from Patrick Wolfe and Simon Godsill, from their paper in — I believe — 2009; the Sim paper on the generalized spectral subtraction, with the unconstrained approach, where the a and b terms don't have to be equal to each other, and the constrained scheme, where they do have to be equal. The parametric approaches are, first, the weighted-Euclidean-distortion based approach; for our ICASSP paper last year we had a chi-square prior for the amplitudes on that scheme, which was reported last year, and we also had a chi-square prior for the joint-MAP solution, which we also reported last year. What we're doing this year with the ROVER approach: this is the ROVER-based incorporation of the weighted Euclidean distortion with chi-square priors, and the same for the joint-MAP type solution; and then we take the beta-unconstrained and beta-constrained approaches here and also feed them into a ROVER-type solution. So in a sense we have quite a few different enhancement methods that we really want to benchmark as baselines against the parametric schemes.
0:16:18 So this table shows the segmental signal-to-noise ratio increase — any positive value here shows an improvement in segmental signal-to-noise ratio. The GSS term here is basically the generalized spectral subtraction approach, which is the baseline scheme from Sim's paper. You can see the improvement on the sonorants is quite good; the improvements on obstruents and silence are not as good, and the overall is not quite where we'd like it to be.
0:16:51 Now, each of these three is actually optimized for sonorants, obstruents, and the noise type. What we do is search across all the possible configurations for the terms, and we find the best configuration for the sonorants — that's the best improvement we can get. At the same time, for the obstruents, the improvement is actually quite good, and for silence it's not as bad as you might expect. And if you look across the diagonal here — the sonorants, the obstruents, and the noise — when we optimize this way, we actually get a nice improvement, better than what we would have gotten with the Sim approach, by tailoring across the phone classes.
0:17:38 The goal, then, is to figure out how to take the best from each of these and put them together. So this approach here is the ROVER-based solution. It does not use the exact optimum solutions here; it actually goes and finds what it thinks is the best approach, based on the phone classifier. So this is what you would expect to see, performance-wise, if it's free-running, not knowing what the best performance is — and you can see the improvement in segmental signal-to-noise ratio is quite nice, for sonorants, obstruents, and silence alike, and not too bad on the overall.
0:18:12 I'm running out of time here, so just quickly: these show the segmental signal-to-noise ratio increases for flat communications-channel noise across all the different noise levels. The main conclusion to take from here is that these approaches down here are the ROVER-type approaches, and using the ROVER solutions — combining the estimators in a nice automatic way — gets you better performance for the flat communications-channel noise. Likewise for the large-crowd noise: you can see the performance here is quite nice for the obstruents, and for the sonorants as well, and when you combine them they're actually much, much better than the individual ones.
0:18:55 So if you're looking at which of these maybe sixteen or so different enhancement strategies are the best — these indicate the first- and second-best strategies — you can pretty much see, across all of our evaluations here, that the ROVER-based solutions perform quite well. The parametric beta scheme for the generalized spectral subtraction was also a good candidate, and the joint-MAP version was also successful.
0:19:28 So, in conclusion, we've considered a parametric generalized spectral subtraction approach here. These parametric estimators can be pre-tuned for the different phone classes, and any one of them may not perform well across all the phone classes; incorporating a ROVER paradigm allows us to pick off some of the better segments and piece them together into an overall enhanced approach. We compared these against the individual estimators without the ROVER solution, and found that their combination improves performance for flat communications-channel noise and large-crowd noise over different signal-to-noise ratios.
0:20:16 John, thank you very much.
0:20:22 So, are there any questions?
0:20:27 Is the beta constant in each group, over all the frequencies?
0:20:32 They are constant, but they can be different for each of the classes — sonorants, obstruents, silence. In our prior ROVER solution we actually had many, many more classes; here we're only looking at three. So we allow some flexibility, and you could generalize it to more classes than three.
0:21:01 And how robust is it with respect to misclassification?
0:21:06 Yeah, that's a good question. We are running tests where we intentionally put in five and ten percent classification errors. You're less likely to have an error between a sonorant and an obstruent, but you're more likely to have an error between an obstruent and silence. So the issue — to point out the stops — is that sometimes the leading or trailing stops tend to get pulled over into the silence side, and then you get too much suppression.
0:21:35 Any further comments?
0:21:37 So, thank you once more.