Well, actually, one of my former PhD students, or rather one of my former master's students, is now working at a company, and because of that he was unable to travel here for the presentation, so I'm presenting it on his behalf. The talk is focused on phoneme-selective speech enhancement using generalized parametric spectral subtraction. The approach we're looking at here tries to balance the differences between voice structures seen in the articulatory domain. Noise will impact speech differently depending on the speech class, and I believe that adapting enhancement strategies to these different domains can actually improve your overall performance. Regions of low signal-to-noise ratio are going to be more sensitive to different types of noise, babble or background-type fluctuations, so it makes sense to track not only the signal-to-noise ratio but also the phone class and the noise characteristics, with respect to both quality and intelligibility. So the approach here focuses on a phoneme-class-selective strategy that adapts itself to the phone classes over time.

Let me first talk a little about the different approaches people have taken to phone-class-based enhancement. Noise is obviously going to impact different phone classes differently, based on their frequency content and articulatory structure and the influence of the noise on the phone; even for stationary noise you would expect quality to be impacted differently. Going back to an early Transactions paper, there was a soft-decision-based noise suppression strategy applied across different phone classes, which was a very nice, effective approach. One of my former students and I had a Transactions paper in '99 that looked at a hidden Markov model based strategy to classify the different phone classes and adapt an iterative enhancement strategy to each class, and we found that worked well. Then a former student, in our Interspeech 2007 paper, proposed this class-constrained ROVER strategy, and there again what we focused on was extracting pieces of enhanced speech from different types of constrained representations, to see whether we could improve the overall enhancement solution.

This figure shows the strategy; I'll try to illustrate it with the pointer. The enhancement strategy, in this older version from Interspeech 2007, used an iterative constrained enhancement method, and in essence we ran a number of different enhancement solutions on the input speech: you basically try a whole range of different approaches. If you look on the right here, you can think of this as starting off with a single degraded speech waveform, and what you end up with is a very large collection of enhanced waveforms. You can think of this as a very large brick, for lack of a better word: the time axis runs along here, and across these levels and across this space are the variations of the different parameters controlled by the enhancement strategy. Since you end up with such a large collection of enhanced waveforms, the approach is then to use a strategy like a Gaussian mixture model to go through, detect which phone classes you actually have, and identify which blocks would be most improved by a particular enhancement solution for that phone class. To do this we use what's called a ROVER solution, something that's very common in the speech recognition community.
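As a toy sketch of this pick-the-best-block idea (the `score_fn` below is a hypothetical stand-in for the GMM-based class scoring described in the talk, and fixed-length segments stand in for phone-class blocks):

```python
import numpy as np

def splice_best_segments(candidates, score_fn, seg_len):
    """Splice one output waveform from a bank of enhanced versions.

    candidates: (n_versions, n_samples) array of enhanced waveforms of
    the same utterance.  For each fixed-length segment, keep the
    version the scorer prefers, mimicking the ROVER-style selection.
    """
    n_versions, n_samples = candidates.shape
    out = np.empty(n_samples)
    for start in range(0, n_samples, seg_len):
        seg = candidates[:, start:start + seg_len]
        best = max(range(n_versions), key=lambda i: score_fn(seg[i]))
        out[start:start + seg_len] = seg[best]
    return out
```

In the actual system the segments follow phone-class boundaries and the scorer is a trained classifier, not a fixed-length heuristic.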
Looking at this big block, with a particular enhancement configuration you pick out this particular piece and drop it down here, pick out the next piece and drop it here, and since you're looking across all the different domains and piecing them together, you hopefully come up with a nice sequence of optimized enhanced blocks for each of the different phone classes or phone sequences, so that the overall enhanced signal is actually better.

That's just the concept. When we look at a traditional method like MMSE or spectral subtraction, what people typically see is that you have maybe several classes of phones, say class one, class two, class three. One would argue that a particular enhancement method like MMSE, if you tune it properly, tries to give you good-sounding speech across all the classes with one configuration. What you'd like instead is to migrate that solution over to a specific class, maybe to the centroid of that class's space, with the idea that it is now optimized for that particular phone class. The hope is that if you pick each of these per-class centroids, you end up with a better overall solution than simply keeping the enhancement strategy constant for the whole waveform.

So the approach we take here uses an alternative criterion within the generalized spectral subtraction strategy, namely a weighted Euclidean distortion. We believe this may be a better measure than mean square error, because we feel it incorporates a somewhat more perceptually based criterion. The idea is that you have this vector of harmonic coefficients from an FFT, and we can emphasize the errors in the spectral valleys by decreasing this beta term in the weighting.
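The talk doesn't spell out the distortion formula, but a common form of weighted Euclidean distortion uses |X_k|^beta as the per-bin weight, which behaves as described: negative beta inflates errors in valleys (|X| < 1), positive beta inflates errors at peaks. A minimal sketch under that assumption:

```python
import numpy as np

def weighted_euclidean_distortion(X, X_hat, beta):
    """Sum over bins of |X_k|**beta * (|X_k| - |X_hat_k|)**2.

    beta < 0 emphasises valley bins (|X| < 1 makes the weight large);
    beta > 0 emphasises peak bins.  Assumes nonzero magnitudes when
    beta is negative.
    """
    X = np.abs(np.asarray(X, dtype=float))
    X_hat = np.abs(np.asarray(X_hat, dtype=float))
    return float(np.sum(X ** beta * (X - X_hat) ** 2))

# Toy spectrum: one valley bin (0.2) and one peak bin (3.0).
X = np.array([0.2, 3.0])
X_hat = np.array([0.3, 2.5])
d_valley = weighted_euclidean_distortion(X, X_hat, beta=-1.0)
d_plain = weighted_euclidean_distortion(X, X_hat, beta=0.0)
d_peak = weighted_euclidean_distortion(X, X_hat, beta=1.0)
```

With beta = 0 this reduces to the plain squared error, consistent with the side note later in the talk that the MMSE-optimized coefficients are a special case.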
We can assume, for example, that in the valleys the magnitude of X is less than one, so allowing beta to be negative causes this weighting term in the estimator to increase. On the other hand, at spectral peaks, say during voiced blocks, X is greater than one at a particular frequency harmonic, and then allowing beta to be greater than zero makes this term increase instead. So this lets us adapt the enhancement solution in a parametric way.

The approach we build on is the generalized spectral subtraction introduced by Sim, Tong, Chang and Tan in their Transactions paper in '98. This is the estimator here: it basically finds the best estimate in terms of the error between the estimate X-hat and the X from the original degraded speech signal. The two components you see here, the a and b terms, are frequency-dependent weighting coefficients that need to be estimated, and the other term here is a kind of spectral exponent. What we'd like to do is optimize the a and b terms, and in the original generalized spectral subtraction approach this is done by minimizing the mean square error term you see here. Two solutions come out of that: one referred to as the unconstrained approach, which basically means the a and b terms need not equal each other, and the constrained approach, in which the two terms are in fact equal.

So how do we approach ours? We optimize the a and b terms subject to the weighted Euclidean distortion instead, and in so doing we end up with these particular solutions for the a and b terms.
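A minimal sketch of a generalized spectral subtraction magnitude estimator, assuming the standard form (a·|Y|^p − b·|D|^p)^(1/p) with noisy magnitude Y, noise estimate D, and a simple spectral floor; the closed-form a and b from the talk's weighted-Euclidean optimization are not reproduced here, so they are plain inputs:

```python
import numpy as np

def generalized_spectral_subtraction(Y, D, a, b, p=2.0, floor=1e-3):
    """Per-bin estimate (a*|Y|**p - b*|D|**p)**(1/p).

    a, b: frequency-dependent weighting coefficients (a == b gives the
    constrained scheme, a != b the unconstrained one).  The floor
    keeps the subtracted spectrum positive to limit musical noise.
    """
    Y = np.abs(np.asarray(Y, dtype=float))
    D = np.abs(np.asarray(D, dtype=float))
    sub = a * Y ** p - b * D ** p
    sub = np.maximum(sub, (floor * Y) ** p)  # spectral flooring
    return sub ** (1.0 / p)
```

Setting p = 2 corresponds to power spectral subtraction and p = 1 to magnitude subtraction.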
We can then take these estimates of a and b, substitute them back into the generalized spectral subtraction, and form new parametric estimators that we feel offer some greater flexibility for enhancement. Just as a side note, the minimum-mean-square-error-optimized coefficients are really just a special case of this weighted Euclidean distortion approach: if you set beta equal to zero, it falls back to the previous solution.

This is kind of a busy plot, but I'll try to walk through the pieces. Here we're first fixing alpha and allowing the beta term to decrease, and on this side we're allowing alpha to increase while keeping beta fixed. There are basically four quadrants. Q1 is a speech region, which you see up here; this tends to be the case where you have fine, high-energy speech information, and so you'd obviously like to suppress some of the noise, but you don't really want to touch or damage the speech signal much. The second region, Q2, is an unlikely region, the spot you don't expect to be operating in. Q3 is a noise-only region, and in this part you really would like greater suppression; if you look at the beta-based constrained and unconstrained solutions, we actually get a greater suppression gain on this side, which is desirable. And Q4 is actually the case where you typically see side harmonics popping up, and this is the most dangerous area, because in this part you would like to have suppression, but you also really want to ensure you don't introduce musical-tone artifacts. So this region is the spot where you'd like to make sure you have good performance.

There are quite a few different enhancement methods we'll be comparing; I'll try to highlight them not on this slide but on the next.
Next we go through a ROVER-type solution, and this uses what's called a MixMax model, which actually comes from a Transactions paper by Nádas, David Nahamoo and Michael Picheny back in '89, for speech recognition. In our approach we assume we have degraded speech, and we're going to have three estimators: one that we believe is a good estimator for sonorants, one that's good for obstruents, and another that we believe is good for silence. If we have high energy, we assume it's a sonorant and we move forward with that; if it's an obstruent, it may or may not be noise, so we back it up with a voice activity detector here. If it is in fact noise, then what we'd like to do is move down and update our noise reference characteristics; if it is in fact speech, then we just use it in our model. We pull out MFCC coefficients, which are used primarily for the Gaussian mixture models here; these basically try to classify whether we're sitting in obstruent, sonorant, or silence blocks. Once we have this knowledge, we feed it into the MixMax-type solution, and what this does is set maximum-likelihood weights that we can then use to weight the solutions from the sonorant, obstruent, and noise-based estimators you see here, hopefully coming up with an integrated solution that sounds better than any of the individual solutions.

As for the categories: as I said, there are three broad phone class types here, sonorants, obstruents, and silence. We group what we believe to be the fricatives, affricates, and stops into the obstruents. Again, we're doing this in kind of an unsupervised manner over time, so what we believe are stops may actually, in fact, move into the silence class.
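The MixMax machinery itself is more involved, but the final step, weighting the three per-class estimator outputs into one integrated solution, can be sketched as a posterior-weighted combination (a simplification: the real weights are the maximum-likelihood MixMax weights, and the hypothetical `posteriors` here stand in for the GMM classifier outputs):

```python
import numpy as np

def combine_class_estimates(estimates, posteriors):
    """Soft-combine per-class enhanced spectra.

    estimates: (n_classes, n_bins) outputs of the sonorant, obstruent
    and silence estimators.  posteriors: one weight per class; they
    are normalized to sum to one before combining.
    """
    w = np.asarray(posteriors, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(estimates, dtype=float)
```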
The parametric beta estimators we're using are tuned to each of the broad phone classes, the sonorants, obstruents, and silence. The outputs from these estimators are converted to MFCCs, and the decision weights here are used to make a soft combination weight for each part of the composite utterance, similar to the ROVER solution from back in '97. Finally, the noisy speech can be modeled using this MixMax-type model, which also incorporates classification for the silence and the obstruents. In this MixMax model we need two GMMs, one for the sonorants and one for the obstruents, each with a set number of mixture components used for the estimation; for the silence we're using just one mixture right now. Of course, if you have multiple noise types you can use more than one mixture there. And in the MixMax model, as I pointed out, Nádas, Nahamoo, and Picheny had this idea of modeling noise characteristics for speech recognition in '89; we're using it here to track the noise structure.

Next, let's look at the enhancement results. For the experimental setup, we used thirty-two individual sentences from TIMIT. The metrics we used were segmental signal-to-noise ratio and Itakura-Saito distortion; I'm going to show here just the, sorry, the segmental SNR, and the paper has all the Itakura-Saito results as well. The GMMs were trained using sixteen mixtures, and the silence model used just a single mixture. For the noise types we have two: a flat communications channel noise that we recorded from an AT&T voice channel, and a large crowd noise, so multiple people speaking, but not babble; it's kind of a broader noise. There are quite a few different enhancement strategies: the standard MMSE, and the joint MAP scheme from a paper in 2009, I believe.
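Of the two metrics, segmental SNR is simple enough to sketch: the per-frame SNR in dB is averaged over frames, with each frame conventionally clamped to a range such as [-10, 35] dB (the clamp bounds and frame length here are common choices, not necessarily those of the paper):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Mean over frames of 10*log10(frame energy / error energy), clamped."""
    n = min(len(clean), len(enhanced)) // frame_len * frame_len
    c = np.asarray(clean[:n], dtype=float).reshape(-1, frame_len)
    e = np.asarray(enhanced[:n], dtype=float).reshape(-1, frame_len)
    err = c - e
    num = np.sum(c ** 2, axis=1)
    den = np.sum(err ** 2, axis=1) + 1e-12  # avoid divide-by-zero
    snr = 10.0 * np.log10(num / den + 1e-12)
    return float(np.mean(np.clip(snr, lo, hi)))
```

The clamping keeps silent frames (where the ratio is meaningless) from dominating the average, which is why segmental SNR is preferred over global SNR for enhancement evaluation.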
Then there is Sim's paper on the generalized spectral subtraction, both the unconstrained approach, where the a and b terms don't have to equal each other, and the constrained scheme, where they do. There are also the parametric approaches: the weighted-Euclidean-distortion-based approach; a chi-square prior for the amplitudes on that scheme, from our ICASSP paper last year; and a chi-square prior for the joint MAP solution, which we also reported last year. What we're doing this year is the ROVER approach, so this includes the ROVER-based incorporation of the weighted Euclidean distortion chi priors, the same for the joint MAP type solution, and then we take the beta-unconstrained and beta-constrained approaches and also feed them into a ROVER solution. Since we have all these different enhancement methods, what we really want is to benchmark the parametric schemes against a baseline.

This table shows the segmental signal-to-noise ratio increase, so any positive value here shows an improvement in segmental SNR. The GSS term here is basically the generalized spectral subtraction, which is the baseline scheme from Sim's paper, and you can see the improvements on the sonorants are quite good, the improvements on obstruents and silence are not as good, and the overall is about what you'd expect. Now, each of these three columns is actually optimized for sonorants, obstruents, and silence respectively: we search across all the possible configurations of the terms and find the best configuration, for example for the sonorants, and that's the best improvement we get. At the same time, for the obstruents the improvement is quite considerable, and the silence is not as good as we'd like it to be.
But if you look across the diagonal here, the sonorants, the obstruents, and the noise, when we optimize each of these we actually get a nice improvement, better than what we would have gotten with Sim's approach, and we also see this across-class effect. The goal, then, is to figure out how to take the best from each of these and put them together.

This approach here is the ROVER-based solution. It does not use the exact optimal solutions; it goes and finds what it thinks is the best approach based on the phone classifier, so this is what you would expect to see, performance-wise, when it is free-running, not knowing what the best configuration is. And you can see the improvement in segmental signal-to-noise ratio is quite nice, both for sonorants and obstruents; silence is not too bad, and the overall tracks as well.

Since I'm short on time, just quickly: these figures show the segmental signal-to-noise ratio increases for flat communications channel noise across all the different noise levels. The main point to take from here is that these approaches down here are the ROVER-type approaches, and using the ROVER solutions, combining the estimators in a nice automatic way, allows you to get better performance for the flat communications channel noise. Likewise for the large crowd noise: you can see the performance here is quite nice for the obstruents, and for the sonorants as well, and when you combine them they're actually much, much better than the individual solutions. So if you look at which of these, maybe sixteen or so, different enhancement strategies are the best ones (these markers indicate the first- and second-best strategies), you can pretty much see across all of our evaluations that the ROVER-based solutions do quite well; the parametric beta scheme for the generalized spectral subtraction was also a good candidate, and the joint MAP version was also successful.

So in conclusion, we considered a parametric generalized spectral subtraction approach here. These parametric estimators can be pre-trained for the different phone classes; any one of them may not perform well across all the phone classes, but incorporating a ROVER paradigm allows us to pick off some of the better segments and put them together for an overall enhanced solution. We compared these estimators against the individual estimators without the ROVER solution, and found that their combination improves performance for flat communications channel noise and large crowd noise over different signal-to-noise ratios. Thank you very much. Any questions?

[Question] Is the beta constant in each group, across all the frequencies?

[Answer] They are constant across frequency, but they can be different for each of the classes, so sonorants, obstruents, and silence can each have their own. In a prior ROVER solution we actually had many, many more classes; here we're only looking at three, but there is some flexibility, and you can generalize it to more classes.

[Question] And how robust is it with respect to misclassification?

[Answer] That's a good question. We are running a test where we intentionally put in five and ten percent classification errors. You're less likely to have an error between a sonorant and silence, but you're more likely to have an error between an obstruent and silence, so that's where the issue would be. As I pointed out with the stops: sometimes the leading or trailing stops tend to get pulled into the silence side or the other side, and then you get too much suppression.

[Session chair] No further comments? So, thank you once more.