0:00:16and uh so far
0:00:19thank you for being
0:00:22through the session and reading flat annotation
0:00:24a this paper or well so a that by she it's come on a just think was upstairs that manning
0:00:29a poster or
0:00:30myself and its tunnels go
0:00:32and shift pitch could make it its i will be presenting a
0:00:36So, the problem we're talking about today is robustness to reverberation.
0:00:42Speech is a natural medium of communication for humans,
0:00:45and we've been applying speech technologies everywhere. These work great in controlled lab conditions;
0:00:51when we get to real-world conditions, things sort of break down,
0:00:55and one of the reasons for this is reverberation.
0:00:58Reverberation is what happens when we have reflections. So,
0:01:01in getting from me to you,
0:01:03the sound not only takes the direct path, it also bounces off walls:
0:01:08there are reflections, reflections of reflections, and so on.
0:01:11So if you look at the plot to the right,
0:01:13that shows a typical impulse response from a source to a listener,
0:01:18and you can see the direct path; each of the spikes represents one of the reflections
0:01:23as it dies off.
0:01:25These things continue for some time,
0:01:28so the sound that gets from the source to the listener can be thought of as
0:01:33linearly filtered by the response
0:01:34of the room, which is the h of n here.
0:01:37The amount of reverberation is characterized through the
0:01:42RT60 time, which indicates
0:01:44how much time
0:01:46a sound takes to die off by 60 dB.
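The RT60 definition above can be made concrete with a small sketch. This is not from the talk: it assumes a synthetic impulse response (noise under an exponential envelope) and estimates RT60 with Schroeder backward integration, a common technique that the speaker does not name. The function names `synthetic_rir` and `estimate_rt60` and the fit range of -5 to -25 dB are illustrative choices.

```python
import numpy as np

def synthetic_rir(rt60, fs, duration, seed=0):
    """Noise with an exponential envelope that drops 60 dB in `rt60` seconds (assumed model)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration * fs)) / fs
    decay = 3.0 * np.log(10.0) / rt60  # amplitude falls by 10**-3 (60 dB) at t = rt60
    return rng.standard_normal(t.size) * np.exp(-decay * t)

def estimate_rt60(h, fs, fit_db=(-5.0, -25.0)):
    """Estimate RT60 from the Schroeder energy decay curve of impulse response h."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]           # energy remaining from each sample onward
    edc_db = 10.0 * np.log10(energy / energy[0])     # decay curve in dB, starts at 0 dB
    hi, lo = fit_db
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # decay rate in dB per second (negative)
    return -60.0 / slope                             # time needed to fall by 60 dB

h = synthetic_rir(rt60=0.3, fs=16000, duration=1.0)
rt60_est = estimate_rt60(h, fs=16000)
print(f"estimated RT60: {rt60_est:.3f} s")
```

Fitting only the early part of the decay curve and extrapolating to 60 dB is a common way to avoid the noisy tail of a measured response.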
0:01:49yeah if
0:01:50after are reflections
0:01:52And, uh, reverberation:
0:01:55here is what effect reverberation has on a speech signal.
0:01:59The left
0:01:59top panel shows a spectrogram of a signal
0:02:03from the Resource Management database.
0:02:05We have reverberated it using an artificial room response for a room that had a
0:02:11T60 time of about three hundred milliseconds, and you can see what happens
0:02:15to the spectrogram.
0:02:16It's smeared.
0:02:20You can note that the entire spectrogram is smeared,
0:02:25but it actually looks as if this spectrogram itself has been passed through a linear filter.
0:02:30and she is brought
0:02:34recognition accuracy because of reverberation
0:02:37In this experiment we, uh,
0:02:39trained our models with clean data from the Resource Management database.
0:02:43We simulated room responses for a five-by-four-by-three room with the image method.
0:02:49We had reverberation times of a few hundred to five hundred milliseconds.
0:02:53If we recognize clean speech, you get an error of less than ten percent, which is the leftmost bar,
0:02:58but with a reverberation time of only about three hundred milliseconds, which is fairly standard for a
0:03:02room, we get that
0:03:08i don't see that is not audio so
0:03:11if you that it great
0:03:13and no
0:03:14so we
0:03:15it's a fairly standard room,
0:03:17and you can see that the error immediately has gone up to around fifty percent,
0:03:20and when the room response is about half a second, it's
0:03:23well over seventy percent. So,
0:03:24right,
0:03:26it degrades very rapidly with reverberation.
0:03:28So, in order to deal with that, we begin by modeling the effect of reverberation
0:03:33itself. Now,
0:03:35here is how we compute features
0:03:37for speech recognition.
0:03:38You have the speech signal; look only at the grey blocks for now.
0:03:42The speech signal goes through a bunch of filters, such as mel-frequency filters,
0:03:46and then at the output we compute
0:03:48the power. At the output of these, you compress the power using a log function,
0:03:53and then eventually compute a DCT, which gives you the features.
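The feature pipeline just described (filter bank, power, log compression, DCT) can be sketched as follows. This is a minimal illustration, not the speaker's front end: it assumes the common STFT-based shortcut in which triangular mel weights are applied to the power spectrum, and the frame sizes, filter counts, and the helper names `mel_filterbank`, `dct_matrix`, and `cepstral_features` are all illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with edges equally spaced on the mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, freqs.size))
    for m in range(n_filters):
        lo, mid, hi = edges[m:m + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis, one row per output coefficient."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def cepstral_features(signal, fs=16000, frame_len=400, hop=160,
                      n_fft=512, n_filters=40, n_ceps=13):
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # power per frame and bin
    mel_power = power @ mel_filterbank(n_filters, n_fft, fs).T  # filter-bank channel power
    log_mel = np.log(mel_power + 1e-10)                         # log compression
    return log_mel @ dct_matrix(n_ceps, n_filters).T            # DCT gives the cepstra

feats = cepstral_features(np.random.default_rng(0).standard_normal(16000))
print(feats.shape)
```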
0:03:57Now, reverberation affects
0:03:59the input to each of these filters; it affects the signal, which is the equivalent of
0:04:05affecting the input of each of these filters. So you can model
0:04:08reverberation in this manner, by the red blocks.
0:04:12And, uh, because of the linearity of the, uh,
0:04:14convolution, of the filtering that is going on over here,
0:04:18you can flip
0:04:20the initial analysis filters, which would be the mel-frequency filters, and the room response.
0:04:25So these two are
0:04:28strictly equivalent in terms of the effect on the features that are computed
0:04:34for all the signal that intent
0:04:37Here we introduce this
0:04:39minor approximation:
0:04:41we say that
0:04:42computing the power
0:04:45of the reverberated signal is equivalent
0:04:49to reverberating the series of power values that you get in every channel.
0:04:54And the filters over here, these H1 to HM,
0:04:58are simply the filters that you'd get if you,
0:05:02essentially, sample by sample, squared the
0:05:06impulse response of the room.
0:05:10Approximating it in this manner,
0:05:13for this order
0:05:15is of course not perfect: it gives you some error, and the error is dependent on the autocorrelation of the signal.
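A small numerical sketch can illustrate this approximation. It is not the speaker's experiment: it assumes a white-noise signal with a slow amplitude modulation and a synthetic exponentially decaying room response, and compares the frame-averaged power of the reverberated signal with the power sequence convolved with the sample-by-sample squared response. All signal parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
n = 4 * fs
t = np.arange(n) / fs

# Assumed test signal: white noise with a slow, syllable-like amplitude modulation.
envelope = 1.0 + 0.9 * np.sin(2.0 * np.pi * 3.0 * t)
x = envelope * rng.standard_normal(n)

# Assumed room response: noise under an exponential decay with a 300 ms RT60.
rt60 = 0.3
th = np.arange(int(0.4 * fs)) / fs
h = rng.standard_normal(th.size) * np.exp(-3.0 * np.log(10.0) * th / rt60)
h /= np.sqrt(np.sum(h ** 2))

y = np.convolve(x, h)[:n]                       # reverberate the signal itself
power_exact = y ** 2                            # then compute power
power_approx = np.convolve(x ** 2, h ** 2)[:n]  # power filtered by the squared response

def frame_average(p, frame_len=400):
    n_frames = p.size // frame_len
    return p[:n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1)

a = frame_average(power_exact)
b = frame_average(power_approx)
rel_err = np.linalg.norm(a - b) / np.linalg.norm(a)
corr = np.corrcoef(a, b)[0, 1]
print(f"relative error {rel_err:.3f}, correlation {corr:.3f}")
```

The mismatch comes from the cross terms between differently delayed copies of the signal. They average out for a noise-like input and grow when the signal is strongly autocorrelated, which is the dependence mentioned above.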
0:05:20Here we have a plot which shows what kind of error it makes.
0:05:25The, uh, red line is the spectrum of a signal; this is actually the output of a
0:05:30mel-frequency filter centered at, uh, five hundred hertz.
0:05:35Now, in the room where we have reverberated the signal, in this case we applied a,
0:05:39I believe, a, uh,
0:05:41three-hundred-millisecond RT60
0:05:43reverberation.
0:05:45The output of the filter is shown by the green line;
0:05:48that's just
0:05:49what you should get in this case.
0:05:50This is what you get.
0:05:55Using this approximate model, what we get out here
0:05:58is shown by the blue line,
0:06:00and you can see that this approximation, which we get from
0:06:04applying reverberation on the power,
0:06:07doesn't introduce very much error. In fact, we have a more quantitative result here,
0:06:12you know
0:06:13It turns out that applying this filter
0:06:16to the power is different from applying this filter to the magnitude,
0:06:21where you'd actually be taking the
0:06:22square root of these terms at all points.
0:06:25And when you apply the filter to the power, you introduce an error
0:06:30which results in some distortion in the output of the,
0:06:36in the output of the, uh,
0:06:40reverberation model,
0:06:42but if you apply it to the magnitude, the error is much smaller.
0:06:46So we actually model
0:06:48reverberation as the filtering
0:06:52that occurs on the magnitude
0:06:54of the output
0:06:56of your mel filters.
0:06:59So the process can be thought of like so: you have an
0:07:03M-channel mel filter bank or its equivalent, you have a power or magnitude computation,
0:07:08and then you have the spectral modification, which is applied on that
0:07:12to impose the effect of reverberation. We've expanded on it here:
0:07:17we have the
0:07:19magnitude or power spectrum going into the room response to get the reverberated
0:07:24magnitude or power, and then of course you have the log and the DCT.
0:07:28So what we have done is we have effectively converted
0:07:32a convolution on the signal, which is the room response,
0:07:36into a convolution on the magnitude or power spectrum.
0:07:40We only observe this:
0:07:43the reverberated sequence of power, right?
0:07:46And then
0:07:48the problem is to determine
0:07:50both the room response
0:07:52as well as
0:07:54the clean signal itself.
0:07:57This is obviously an underconstrained problem, so we have to impose some constraints,
0:08:03and in imposing constraints we're going to say that, uh,
0:08:06because we are dealing with magnitudes, all terms are nonnegative.
0:08:10In addition,
0:08:12we don't merely observe the
0:08:14reverberated signal; we actually observe a noise-corrupted version of the reverberant signal.
0:08:19So what we will do is try to estimate the signal
0:08:23and the room response
0:08:25such that the error between the output of our model
0:08:29and what we actually observe is minimized,
0:08:33with some sparsity constraints
0:08:35on the spectrum.
0:08:37Now, because
0:08:38there is a scaling factor going on there, we also impose an additional constraint that these room response terms
0:08:44sum to one.
0:08:46And this, it turns out, is simply a standard nonnegative matrix factorization problem.
0:08:52I won't actually go through the derivation over here,
0:08:54but if you do, you'd find that you get update rules; it's an iterative solution which gives you
0:08:59updates that are very similar to
0:09:01matrix factorization: you can start off with an estimate,
0:09:05and at each iteration you get a multiplicative update
0:09:08to this estimate,
0:09:09which ensures that the terms always stay nonnegative.
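The kind of iterative solution described here can be sketched with Lee-Seung style multiplicative updates for a least-squares cost. This is an illustrative reconstruction under assumptions, not the update rules from the paper: each channel k is modeled as Y[k, t] ≈ sum over tau of H[k, tau] * X[k, t - tau], the sparsity weight is simply added to the denominator of the X update, and the filter is renormalized to sum to one after every iteration. The names `convolve_rows` and `nmf_dereverb` are hypothetical.

```python
import numpy as np

def convolve_rows(X, H):
    """Model output: Y_hat[k, t] = sum_tau H[k, tau] * X[k, t - tau]."""
    K, T = X.shape
    Y_hat = np.zeros((K, T))
    for tau in range(H.shape[1]):
        Y_hat[:, tau:] += H[:, tau:tau + 1] * X[:, :T - tau]
    return Y_hat

def nmf_dereverb(Y, n_taps=8, n_iter=100, sparsity=0.0, eps=1e-12):
    """Estimate nonnegative clean spectra X and per-channel filters H from Y."""
    K, T = Y.shape
    X = Y + eps                                           # start from the observation
    H = np.tile(np.exp(-np.arange(n_taps) / 3.0), (K, 1))
    H /= H.sum(axis=1, keepdims=True)
    errors = []
    for _ in range(n_iter):
        Y_hat = convolve_rows(X, H)
        errors.append(np.linalg.norm(Y - Y_hat))
        num = np.zeros_like(X)
        den = np.zeros_like(X)
        for tau in range(n_taps):                         # correlate with the filter
            num[:, :T - tau] += H[:, tau:tau + 1] * Y[:, tau:]
            den[:, :T - tau] += H[:, tau:tau + 1] * Y_hat[:, tau:]
        X = X * num / (den + sparsity + eps)              # multiplicative, stays >= 0
        Y_hat = convolve_rows(X, H)
        for tau in range(n_taps):                         # same idea for each filter tap
            num_h = np.sum(X[:, :T - tau] * Y[:, tau:], axis=1)
            den_h = np.sum(X[:, :T - tau] * Y_hat[:, tau:], axis=1)
            H[:, tau] *= num_h / (den_h + eps)
        scale = H.sum(axis=1, keepdims=True) + eps
        H /= scale                                        # filter terms sum to one
        X *= scale                                        # keeps the model output unchanged
    errors.append(np.linalg.norm(Y - convolve_rows(X, H)))
    return X, H, errors

rng = np.random.default_rng(0)
X_true = rng.random((4, 120)) ** 3
H_true = np.tile(np.exp(-np.arange(8) / 2.0), (4, 1))
H_true /= H_true.sum(axis=1, keepdims=True)
Y = convolve_rows(X_true, H_true)
X_est, H_est, errors = nmf_dereverb(Y)
print(f"fit error: {errors[0]:.4f} -> {errors[-1]:.4f}")
```

Because every update multiplies the current estimate by a ratio of nonnegative quantities, a nonnegative starting point can never turn negative. Without a sparsity term the trivial solution (X equal to Y, H a unit impulse) also fits perfectly, which is one way to see why the extra constraints are needed.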
0:09:13Apropos of this: this formulation we have here is not something that we are introducing in this paper.
0:09:18It has been proposed by Kameoka, and we also proposed it separately in a paper in, uh,
0:09:24I believe, last year.
0:09:26The basic form of it is what we proposed.
0:09:29Here is what we do: we have the standard short-time Fourier transform, then you compute the power,
0:09:34and the NMF decomposition, which is what we have here,
0:09:37gives you an estimate of the clean spectrum,
0:09:40from which you can perform an overlap-add and
0:09:43estimate the
0:09:44clean signal.
0:09:47The contribution over here is that we are not going to work directly on this;
0:09:53we actually apply a gammatone filter bank.
0:09:56So basically,
0:09:57the filter bank here is going to be a gammatone filter bank,
0:10:01and after having applied the gammatone filter bank we compute the
0:10:05NMF decomposition,
0:10:07then invert the filter bank and then perform the overlap-add.
0:10:11So the gammatone filter bank can be thought of as a dimensionality-reducing
0:10:14linear operation
0:10:16on the
0:10:17power or the magnitude,
0:10:19and inverting it is simply going to be the equivalent of multiplying the output of NMF by
0:10:23the pseudo-inverse of this
0:10:25filter bank matrix.
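The idea of treating the gammatone filter bank as a dimensionality-reducing matrix, and undoing it with a pseudo-inverse, can be sketched as below. The weights are an assumption made for illustration: they use a standard approximation to the magnitude response of a fourth-order gammatone filter with ERB-spaced center frequencies, which may differ from the filter bank used in the talk. The function names, band count, and frequency range are also illustrative.

```python
import numpy as np

def erb_bandwidth(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def erb_space(f_low, f_high, n_bands):
    """Center frequencies equally spaced on the ERB-rate scale."""
    to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return from_erb(np.linspace(to_erb(f_low), to_erb(f_high), n_bands))

def gammatone_weights(n_bands, n_fft, fs, f_low=100.0, order=4):
    """Rows are approximate gammatone magnitude responses sampled at the FFT bins."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fc = erb_space(f_low, 0.9 * fs / 2.0, n_bands)
    b = 1.019 * erb_bandwidth(fc)
    W = (1.0 + ((freqs[None, :] - fc[:, None]) / b[:, None]) ** 2) ** (-order / 2.0)
    return W / W.sum(axis=1, keepdims=True)

W = gammatone_weights(n_bands=32, n_fft=512, fs=16000)   # (32 bands, 257 bins)
W_pinv = np.linalg.pinv(W)                               # (257 bins, 32 bands)

S = np.abs(np.random.default_rng(0).standard_normal((257, 50)))  # stand-in magnitude spectra
G = W @ S                                # band-domain spectra, where NMF would be run
S_back = np.maximum(W_pinv @ G, 0.0)     # back to FFT bins; clip small negative values
print(G.shape, S_back.shape)
```

The pseudo-inverse cannot restore detail that the 32 bands never captured, so the reconstruction is a smooth version of the original spectrum. Running the decomposition on 32 bands instead of 257 bins is also cheaper.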
0:10:27So here is an example of what we get. This is a reverberated signal; I don't have
0:10:31audio, so
0:10:34i this sort of
0:10:38a a a a a
0:10:44it's a lot of maybe
0:10:47uh by this is what we hard with the
0:10:56so i don't know yeah
0:10:58the the a signal
0:11:04that that that my what right that liberation as very used
0:11:08it believe me
0:11:12a given that we can actually do this
0:11:14that was actually can or when you are but in a perceptions a great thing you can you are all
0:11:19sorts of nice stuff
0:11:20but and then you put this to that signal at a recogniser
0:11:25the improvements don't sure
0:11:27So here we ran some experiments on the Resource Management database.
0:11:31This is a model trained on clean speech, and here is what you get
0:11:34when the signal is reverberated with, uh, a room response with a three-hundred-millisecond reverberation time,
0:11:40and the error goes down if you actually try to dereverberate it using the basic NMF
0:11:46proposed by Kameoka and colleagues.
0:11:49If this is applied on the power,
0:11:52it goes down a bit, but if you apply it on the magnitude, it goes down a lot more,
0:11:56which shows that you're better off working on the magnitude.
0:12:00and then
0:12:06G she M S R be don't and of for nmf variance
0:12:09again when we apply the garment to one
0:12:12and and of it so
0:12:13In these cases
0:12:15we assume that the room response is the same in every
0:12:20channel of the processing, which is really the truth.
0:12:23But what happens is that, because you are observing a processed version
0:12:27of the signal,
0:12:28it does not make sense to assume that the room response is the same in every channel; that actually gives
0:12:32you a bad estimate.
0:12:34So if you estimate a different room response at each channel,
0:12:37you get some improvements, which are shown by these guys.
0:12:41And then if you actually apply the gammatone filtering,
0:12:45here is what you get when you work on the power, but if you work on the magnitude,
0:12:49we can see that if you assume that the room response is the same in all channels you get
0:12:54this performance;
0:12:55if you allow it to be different for different channels,
0:12:58it improves performance again.
0:12:59So the gist of it is that
0:13:01dereverberating the signal
0:13:03after gammatone filtering, and then
0:13:05inverse filtering it
0:13:07and performing overlap-add on the
0:13:12dereverberated signal, results in error rates which are less than half of what you'd get
0:13:16when there is no enhancement.
0:13:19We ran this on a few other,
0:13:21uh, test sets. The previous one was where we had a three-hundred-millisecond reverberation time;
0:13:26this is with
0:13:26three hundred and five hundred.
0:13:28And we compared it with a bunch of other techniques, which I won't bother to explain;
0:13:32it would take time.
0:13:34Again, the proposed method was better.
0:13:40We're actually making a modeling assumption over here, namely that you can flip
0:13:45the power computation and the room response
0:13:48and then perform the dereverberation.
0:13:51How would this procedure work
0:13:54if that were not an approximation but were truly what happened?
0:14:00So in this experiment we actually sort of
0:14:02faked reverberation,
0:14:04applying the reverberation to a sequence of power values and then
0:14:08trying to dereverberate the signal, and you can see
0:14:11but that that's a and a interested in
0:14:13these uh kind of
0:14:15spurious use which are there because
0:14:17the plot came out should be it's thesis
0:14:19And you can see that when the, uh,
0:14:22model actually holds true, the improvements you get can be very, very large.
0:14:28In this case we tried this out; all of the stuff we showed earlier was on fake room responses, where the
0:14:33room response was computed using the image method.
0:14:36So we applied some true room responses obtained from ATR; this is a room response with
0:14:41a four-seventy-millisecond response time, this is at six hundred milliseconds, and again
0:14:45there are improvements.
0:14:47Now, one of the things that everybody knows
0:14:49is that the
0:14:51setup I have used so far
0:14:53is not realistic:
0:14:55a speech recognition system is never trained on clean data; you actually train it on matched data,
0:15:01the kind of data that you actually expect to recognize.
0:15:06Does all of this hold when you perform matched-condition training?
0:15:10Sure enough, you observe that the improvements you get
0:15:13from dereverberating the signal,
0:15:16if you train the recognizer on clean speech,
0:15:20don't even begin to get to the kind of performance you get if you simply train the recognizer on
0:15:26reverberated speech.
0:15:27This is the performance you get: the yellow bars in each case.
0:15:31But even here,
0:15:32if you dereverberate both the training and test data using our technique, you get an additional
0:15:36improvement, which is about,
0:15:38I believe, twenty to forty percent relative,
0:15:40if i can find my at all
0:15:43It helps in every case.
0:15:46The truth of the matter is you don't merely have reverberation;
0:15:50we also have additive noise,
0:15:51so the reverberated signal gets corrupted by additive noise, as explained here.
0:15:56So we can modify this process that I've just shown with
0:16:00some additional processing to compensate for the noise.
0:16:04i of that's is to for that was presented but it's done yesterday at the leaf
0:16:08We presented something called
0:16:10delta-spectral
0:16:12cepstral coefficients, DSCC.
0:16:15And, uh,
0:16:16so this is a procedure that,
0:16:18in
0:16:20summary:
0:16:21in the DSCC computation,
0:16:23instead of directly working on the magnitude spectra and compressing the magnitude spectra,
0:16:28you compute differences between adjacent magnitude spectra. What this does is that
0:16:33stationary signals get cancelled out,
0:16:36that things that maybe even a little bit
0:16:39As it turns out, speech is
0:16:41nowhere near as stationary as the noise that corrupts it.
0:16:44So as a result, simply performing this
0:16:47difference operation and working only on the difference magnitude spectra
0:16:53gives robust performance.
0:16:54So the difference spectra now have both positive and negative values, so we actually have to sort
0:16:58of normalize their distribution,
0:17:00and then, without any compression whatsoever,
0:17:03just apply a DCT,
0:17:05and then perform additional operations as before,
0:17:08and this output is what we use for recognition.
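The delta-spectral steps described here (temporal differences of uncompressed mel spectra, distribution normalization, then a DCT) can be sketched as follows. The details are assumptions made for illustration: the difference uses a two-frame lag on each side, the normalization is a rank-based mapping to standard-normal quantiles, and the "additional operations" mentioned by the speaker are left out. None of these choices is confirmed by the talk.

```python
import numpy as np
from statistics import NormalDist

def delta(spec, lag=2):
    """Temporal difference spec[:, t + lag] - spec[:, t - lag], with edge padding."""
    padded = np.pad(spec, ((0, 0), (lag, lag)), mode="edge")
    return padded[:, 2 * lag:] - padded[:, :-2 * lag]

def gaussianize(rows):
    """Map each row to standard-normal quantiles by rank (an assumed normalization)."""
    n_frames = rows.shape[1]
    ranks = np.argsort(np.argsort(rows, axis=1), axis=1)
    probs = (ranks + 0.5) / n_frames                 # strictly between 0 and 1
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(probs)

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis, one row per output coefficient."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def dscc(mel_spec, n_ceps=13, lag=2):
    """mel_spec: (n_channels, n_frames) mel spectra that are NOT log-compressed."""
    d = delta(mel_spec, lag)      # stationary components cancel in the difference
    g = gaussianize(d)            # differences are signed, so normalize the distribution
    return dct_matrix(n_ceps, mel_spec.shape[0]) @ g   # no compression before the DCT

mel = np.random.default_rng(0).random((40, 200))       # stand-in mel spectra
feats = dscc(mel)
normalized = gaussianize(delta(mel))
print(feats.shape)
```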
0:17:11And we showed in our poster that, using that, you
0:17:15get much more robust recognition than what you get with just regular
0:17:19mel-frequency cepstra.
0:17:21So it turns out that the dereverberation that I just talked about can be
0:17:27combined with this delta-spectral cepstral coefficient computation.
0:17:31So you actually dereverberate the signal first, and then
0:17:34you compute the mel spectra (these are not the log mel spectra), compute differences, and then compute features,
0:17:41and then you perform recognition on those features. And sure enough, you can see that
0:17:47when all you have is a reverberated signal, it makes things just marginally worse,
0:17:51but the moment you begin adding noise,
0:17:54the blue line shows a lot of improvement. So this, uh, is all on
0:17:57a a a a a room
0:17:59but millisecond room response
0:18:01and the blue line shows the performance you get
0:18:05when you dereverberate the signal and
0:18:07then use the DSCC features.
0:18:10In each plot, uh, we've got noise of two different levels
0:18:13corrupting the signal,
0:18:15and you can see that the best performance by far is what you get
0:18:19when you dereverberate the signal and
0:18:22then perform the DSCC computation.
0:18:25So in summary:
0:18:26we modeled reverberation
0:18:28on speech spectra;
0:18:30we modeled it as
0:18:32a phenomenon that affects the sequence of magnitude spectra;
0:18:36we used NMF
0:18:37to factorize this to perform dereverberation.
0:18:41We also used gammatone sub-band non-negative matrix factorization,
0:18:44so that you have a perceptual weighting on this,
0:18:47perceptual weighting;
0:18:49and, uh, we compared the magnitude and power domains over here,
0:18:53and studied the joint
0:18:55noise and reverberation problem by integrating a noise-robust feature along with
0:19:00dereverberation, and got significant improvements.
0:19:03Thank you.
0:19:10This is the last talk, and we do seem to have time for
0:21:16a question.
0:21:19and yeah
0:22:26to i have the last question i have to prove
0:22:29uh talk about
0:22:32some sort of has to X and H
0:22:38just just what don't have it is but have a bribe and was once had to it
0:22:44that's perhaps
0:22:47the so the optimized is not have to do that
0:23:26i how to do this and the this one
0:23:29experimental to have
0:23:36the questions
0:23:40well i guess so
0:23:41people a a a a i want to
0:23:46as the i'm to say of this paper is and in particular
0:23:51so two and and sent much