Thank you, and thank you for staying through the session. This paper is joint work with my co-authors, who could not all be here, so I will be presenting it. The problem we are talking about today is robustness to reverberation. Speech is a natural medium of communication for humans, and we have been applying speech technologies everywhere. These work great in controlled lab conditions, but when we get to real-world conditions things tend to break down, and one of the reasons for this is reverberation.

Reverberation is what happens when we have reflections. In getting from me to you, the sound not only takes the direct path, it also bounces off walls, giving reflections, reflections of reflections, and so on. If you look at the plot on the right, which shows a typical impulse response from a source to a listener, you can see the direct path, and each of the later spikes represents one of the reflections. These continue for some time, so the sound that gets from the source to the listener is the source signal passed through the impulse response of the room. Reverberation is usually characterized by the RT60 time, which indicates how much time the sound takes to die off by sixty dB.

Here is what reverberation does to a speech signal. The top left panel shows the spectrogram of a signal from the Resource Management database. We reverberated it using an artificial room response with an RT60 of about three hundred milliseconds, and you can see what happens: the entire spectrogram is smeared, and it actually looks as if the spectrogram itself has been passed through a linear filter.

Here is what happens to recognition accuracy because of reverberation. In this experiment we trained our models on clean data from the Resource Management database and simulated room responses for a five-by-four-by-three-meter room using the image method, with reverberation times of three hundred and five hundred milliseconds. If we recognize clean speech we get an error of less than ten percent, which is the leftmost bar. With a reverberation time of only about three hundred milliseconds, which is fairly standard for an ordinary room (I would play an example, but the audio does not seem to work), the error immediately goes up to around fifty percent, and when the room responses are about half a second long it is well over seventy percent. So accuracy degrades very rapidly with reverberation.

To deal with this, we begin by modeling the effect of reverberation itself. Consider how we compute features for speech recognition; look only at the grey blocks for now. The speech signal goes through a bank of filters, such as mel-frequency filters; we compute the power at the output of each of these filters, compress the power using a log function, and eventually compute a DCT, which gives us the features.
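To make that grey-block pipeline concrete, here is a minimal sketch of the feature computation in Python. The triangular filterbank construction, frame length, hop size, and number of coefficients are my own illustrative choices, not values taken from the paper.

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_mels, n_fft, sr):
    """Simple triangular mel filterbank, shape (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_like_features(signal, sr, n_fft=512, hop=160, n_mels=40, n_ceps=13):
    """Filterbank -> power -> log -> DCT, as described in the talk."""
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), n=n_fft, axis=1)) ** 2  # power spectra
    mel_power = spec @ mel_filterbank(n_mels, n_fft, sr).T              # filterbank outputs
    log_mel = np.log(mel_power + 1e-10)                                 # log compression
    return dct(log_mel, type=2, norm='ortho', axis=1)[:, :n_ceps]       # DCT -> cepstra
```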
Reverberation affects the input to each of these filters: it convolves the signal with the room impulse response, so you can model reverberation by the red block at the input. Because both the room response and the analysis filters are linear, you can swap their order, filtering the clean signal with the mel-frequency filters first and then applying the room response at the output of each filter. The two arrangements are strictly equivalent in terms of their effect on the features computed from the signal.

Now we introduce the main approximation. We say that computing the power of the reverberated signal is roughly equivalent to reverberating the sequence of power values that you get in every channel, and the filters over here, h_1 through h_M, are simply the filters you would get if you squared the room impulse response sample by sample. Approximating reverberation in this manner is of course not perfect; it introduces some error, and that error depends on the autocorrelation of the signal. Here we have a plot that shows what kind of error it makes. The red line is the spectrum of a clean signal, specifically the output of a mel-frequency filter centered at about five hundred hertz. We then reverberated the signal, in this case, I believe, with a three-hundred-millisecond RT60 reverberation. The output of the filter, shown by the green line, is what you should get, and what you get using this approximate model is shown by the blue line. You can see that the approximation, which we get from simulating the reverberation directly on the power values, does not introduce very much error.

We also have a more quantitative result. It turns out that applying this filter to the power is different from applying it to the magnitude, where you take the square root of these terms. When you apply the filter to the power you introduce some distortion in the output of the reverberation model, but if you apply it to the magnitude the error is much smaller. So we actually model reverberation as a filtering that the room imposes on the magnitudes of the mel spectra.

The process can be modeled like this: you have an N-channel mel filterbank or its equivalent, you have a power or magnitude computation, and then you have the spectral filtering which applies the room response to impose the effect of reverberation. We have expanded it here: the magnitude or power spectra go through the room-response filters to give the reverberated magnitude or power spectra, and then of course you have the log and the DCT. So what we have done is to convert a convolution on the signal, which is the room response, into a convolution on the sequences of magnitude or power spectra.
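As a rough illustration of this forward model, here is a minimal sketch that reverberates a clean magnitude spectrogram by filtering each channel's sequence of values with a frame-rate filter derived from the squared room impulse response. The way I aggregate the squared impulse response over analysis hops, and the square root used to move to the magnitude domain, are my own simplifications of what the paper describes.

```python
import numpy as np

def frame_rate_filter(rir, hop, n_taps):
    """Frame-rate reverberation filter: sum the squared RIR over each analysis hop.
    This aggregation scheme is an assumption made for illustration."""
    energy = rir ** 2
    taps = np.array([energy[k * hop:(k + 1) * hop].sum() for k in range(n_taps)])
    taps = np.sqrt(taps)              # magnitude-domain filter (a power-domain filter would skip the sqrt)
    return taps / (taps.sum() + 1e-12)

def reverberate_spectrogram(clean_mag, h):
    """Apply the filter h to the sequence of magnitude values in every channel.
    clean_mag: (n_channels, n_frames) non-negative magnitude spectrogram."""
    n_ch, n_fr = clean_mag.shape
    reverb = np.zeros_like(clean_mag)
    for c in range(n_ch):
        reverb[c] = np.convolve(clean_mag[c], h)[:n_fr]
    return reverb
```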
All we observe is the reverberated sequence of spectral values, and the problem is to determine both the room response and the spectra of the clean signal. This is obviously an under-constrained problem, so we have to impose some constraints. Because we are dealing with magnitudes, all of these terms are non-negative. In addition, we do not merely observe the reverberant signal; we actually observe a noise-corrupted version of the reverberant signal. So what we do is try to estimate the clean spectra and the room response such that the error between the output of our model and what we actually observe is minimized, with a sparsity constraint on the spectra, and, because there is a scaling ambiguity, we impose the additional constraint that the room-response terms sum to one. This turns out to be essentially a standard non-negative matrix factorization problem. I will not go through the derivation here, but if you do it you find that you get update rules, an iterative solution very similar to non-negative matrix factorization: you start from an estimate, and at each iteration you apply a multiplicative update to each term, which ensures that everything stays positive. By the way, this formulation is not something we are introducing in this paper; it has been proposed before, and we also proposed it separately in a paper, I believe last year.

The basic form of the procedure is this: you take the standard short-time Fourier transform, compute the power, perform the NMF decomposition, which gives you an estimate of the clean spectra, and then you perform an overlap-add to resynthesize an estimate of the clean signal. The contribution of this paper is that we do not work directly on this power spectrum. Instead we first apply a gammatone filter bank, so the filter bank here is a gammatone filter bank, and after applying it we compute the NMF decomposition in that sub-band domain and then perform the overlap-add. The gammatone filter bank can be thought of as a dimensionality-reducing linear operation on the power or magnitude spectra, and inverting it is simply equivalent to multiplying its output by the pseudo-inverse of the filter-bank matrix.
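This is not the exact update rule from the paper, but here is a minimal sketch of the kind of multiplicative-update estimation involved: for a single channel, it alternately updates a non-negative clean sequence x and a non-negative, sum-to-one filter h so that their convolution matches the observed reverberant sequence y, using Lee-Seung-style least-squares updates. The sparsity term and the gammatone sub-band processing with pseudo-inverse reconstruction, which the paper adds on top of this, are omitted.

```python
import numpy as np

def conv_matrix(v, n_cols, n_rows):
    """Truncated convolution matrix C with C @ x == np.convolve(v, x)[:n_rows]."""
    C = np.zeros((len(v) + n_cols - 1, n_cols))
    for j in range(n_cols):
        C[j:j + len(v), j] = v
    return C[:n_rows]

def dereverb_channel(y, n_taps=10, n_iter=100, eps=1e-12):
    """Estimate x and h with y ~ conv(h, x), x >= 0, h >= 0, sum(h) = 1 (small-example sketch)."""
    T = len(y)
    x = np.maximum(y.astype(float).copy(), eps)   # initialize the clean estimate with the observation
    h = np.ones(n_taps) / n_taps                  # flat initial room-response filter
    for _ in range(n_iter):
        A = conv_matrix(h, T, T)                  # with h fixed, y ~ A @ x
        x *= (A.T @ y) / (A.T @ (A @ x) + eps)    # multiplicative update keeps x non-negative
        B = conv_matrix(x, n_taps, T)             # with x fixed, y ~ B @ h
        h *= (B.T @ y) / (B.T @ (B @ h) + eps)
        s = h.sum() + eps                         # resolve the scale ambiguity:
        h /= s                                    # force the filter to sum to one
        x *= s                                    # and push the scale into the clean estimate
    return x, h

# Each channel of the reverberant magnitude spectrogram would be processed this way,
# or, as in the paper, in a gammatone sub-band domain and mapped back via a pseudo-inverse.
```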
Here is an example of what we get. This is a reverberated signal; I unfortunately do not have working audio, but believe me, after processing the reverberation is very much reduced. So, given that we can do this, when you run perceptual tests you can show all sorts of nice improvements, but when you put the dereverberated signal into a recognizer, do the improvements show? Here we ran experiments on the Resource Management database, with a model trained on clean speech, and this is what you get when the test signal is reverberated with a room response of three hundred milliseconds reverberation time. The error goes down if you try to dereverberate using the basic NMF mechanism from the earlier work, applied to the power, and it goes down a lot more if you apply it to the magnitude, which shows that we are better off working on the magnitude. Then here are the gammatone sub-band NMF variants. In these cases, if you assume that the room response is the same in every channel of the processing, which is really the truth, what happens is that because you are observing a noisy version of the signal, that assumption actually gives you a bad estimate. If instead you estimate a different room response in each channel, you get some improvement, shown by these bars. And when you apply the gammatone filtering, this is what you get on the power and this is what you get on the magnitude: if you assume the room response is the same in all channels you get this performance, and if you allow it to be different for different channels the performance improves again. The gist of it is that dereverberating the signal, performing the gammatone-domain NMF and pseudo-inverse filtering, and then computing features from the dereverberated signal, results in error rates that are less than half of what you get with no enhancement. We got this on a number of other test sets as well; this is the setup with a three-hundred-millisecond reverberation time, this is with three hundred and five hundred milliseconds, and we compared with a bunch of other techniques which I will not take the time to explain. Ours does better.

Note that I am making a modeling assumption here, namely that you can swap the power computation and the room response before performing the dereverberation. What if this swap in the procedure were not an approximation but truly what happened? In this experiment we actually faked reverberation by applying the reverberation filter directly to the sequence of power values, and then tried to dereverberate the signal. Do not be distracted by these spurious marks, which are there because the plot came out of somebody's thesis; you can see that when the model actually holds, the improvements you get can be very, very large.

So far we had tried everything on simulated room responses, where the room response was computed using the image method. We therefore also applied some true, measured room responses; this one has about a four-hundred-and-seventy-millisecond reverberation time and this one about six hundred milliseconds, and again we see improvements.

Now, one of the things that everybody knows is that the setup I have used so far is artificial: a speech recognition system is never trained on clean data, you actually train it on matched data, the kind of data you expect to recognize. So what happens when you perform matched-condition training? Sure enough, you observe that the improvements you get from dereverberating the signal, when the recognizer is trained on clean speech, do not even get you to the kind of performance you get if you simply train the recognizer on reverberated speech, which is the performance shown by the yellow bars in each case. But if you dereverberate both the training and the test data using our technique, you get an additional improvement, which is about twenty to forty percent relative, and it helps in every case.

The truth of the matter, though, is that you do not merely have reverberation; you also have additive noise, so the reverberant signal gets corrupted by additive noise, as shown here. We can modify the process I have just described with some additional processing to compensate for the noise, and for that we refer to a paper presented yesterday, where we introduced something called delta-spectral cepstral coefficients, DSCC. In summary, in the DSCC computation, instead of directly working with the magnitude spectra and compressing them, you compute differences between adjacent magnitude spectra. What this does is that stationary signals largely cancel, so they contribute very little, and it turns out that speech is not nearly as stationary as the noise that corrupts it. As a result, simply performing this difference operation on the magnitude spectra improves robustness. The difference now has both positive and negative values, so we have to normalize its distribution, and then, without any compression whatsoever, we apply the DCT and perform the remaining operations as before; this output is what we use for recognition. We showed in that paper that using this feature you get much more robust recognition than what you get with just regular mel-frequency cepstra.
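Here is a rough sketch of a DSCC-style feature computation as just described: differences of successive mel magnitude spectra, a normalization of their distribution, and a DCT without any log compression. The single-frame difference and the simple mean-and-variance normalization used here are my own stand-ins, not necessarily the exact choices in the DSCC paper.

```python
import numpy as np
from scipy.fft import dct

def dscc_like_features(mel_mag, n_ceps=13, eps=1e-10):
    """mel_mag: (n_frames, n_mels) non-negative mel magnitude spectra (not log-compressed).
    Returns delta-spectral cepstral features, one row per frame difference."""
    # Frame-to-frame differences of the magnitude spectra: stationary components
    # (such as quasi-stationary additive noise) largely cancel out.
    delta = np.diff(mel_mag, axis=0)
    # The differences take both signs, so normalize their distribution per channel;
    # mean/variance normalization is used here as a simple stand-in.
    delta = (delta - delta.mean(axis=0)) / (delta.std(axis=0) + eps)
    # No log compression: apply the DCT directly to the normalized differences.
    return dct(delta, type=2, norm='ortho', axis=1)[:, :n_ceps]
```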
It turns out that the dereverberation scheme I just talked about can be combined with this delta-spectral cepstral coefficient computation: you dereverberate the signal first, then compute the mel spectra (these are not log mel spectra), compute the differences, compute the features, and then perform recognition on those features. Sure enough, you can see that, firstly, when all you have is a reverberated signal, the DSCC features make things just marginally worse, but the moment you begin adding noise they give a lot of improvement. This is all with a three-hundred-millisecond room response; the blue line shows the performance you get when you dereverberate the signal and then use the DSCC features, and here we have corrupted the signal with noise at two different levels. You can see that the best performance by far is what you get when you dereverberate the signal and then perform the DSCC computation.

In summary, we modeled the effect of reverberation on speech spectra as a phenomenon that filters the sequence of magnitude spectra, and used an NMF-style factorization to perform dereverberation. We also used a gammatone sub-band non-negative matrix factorization, so that there is a perceptual weighting in the process, and we compared the magnitude and power domains, where the magnitude works better. Finally, we studied the joint noise-and-reverberation problem by integrating a noise-robust feature computation with the dereverberation, and got significant improvements. Thank you.

Since this is the last talk, we have some time for questions. [The question-and-answer discussion that followed is largely inaudible in the recording.]