was going to continue the theme of looking at what we can get from biologically inspired auditory processing. He's been doing a lot of work on figuring out how you can extract useful features from auditory models that can be used to do interesting tasks, in this case making use of sparsity.

Thanks. It's nice to see so many old and new friends here, friends like Malcolm, friends like Josh, and so many people attending. The system I'm discussing today, and the representations of audio that it uses, were reported in considerable detail in, I think, the September issue of Neural Computation, and then discussed again in a Signal Processing Magazine column around that time. What's new in today's paper is the last couple of words in the title, the ones that say "in interference": we've done some more tests to see to what extent these auditory-model-based representations perform better than more conventional MFCC-like representations on mixed sounds.

So let me just describe what the representation is, since that's the main topic of this session. We start off with a cochlear simulation that produces this thing we call the neural activity pattern. You can think of the neural activity pattern as sort of a cochleagram, a representation of the firing rate or firing probability of primary auditory neurons as a function of time and place, with the high frequencies at the top here representing the base of the cochlea and the low frequencies at the bottom representing the apex.

From those we detect certain events called strobe events, as a way to make an efficient and realistic computation of what Roy Patterson has done with the stabilized auditory image, which is a lot like the auditory correlogram that Malcolm and I did a lot of work on back in the eighties and nineties to make these nice pictures. It's stabilized in the sense that, unlike this one that's kind of racing by with time, this one is like a triggered display on an oscilloscope: the trigger events cause it to stand still. So as you have pitch pulses, this central feature at zero time lag just stays there, and as the pitch period varies, the other features that look like copies of it move back and forth so that their spacing equals the pitch period. You get these very nice-looking dynamic movies here.
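To make the trigger-and-overlay idea a bit more concrete, here is a minimal sketch of strobed temporal integration in Python. It is not the system's actual code: the simple local-maximum strobe criterion, the plain averaging, and the hypothetical `nap` array (channels by time, non-negative firing probabilities) are all assumptions for illustration.

```python
import numpy as np

def stabilized_auditory_image(nap, max_lag=400, threshold=0.1):
    """Toy strobed temporal integration.

    nap: array of shape (channels, time), a non-negative neural activity pattern.
    Returns a (channels, max_lag) image where lag 0 is the strobe point, so
    periodic structure shows up as ridges at multiples of the pitch period.
    """
    channels, time = nap.shape
    sai = np.zeros((channels, max_lag))
    for ch in range(channels):
        x = nap[ch]
        # Crude strobe detection: local maxima above a threshold.
        strobes = [t for t in range(1, time - 1)
                   if x[t] > threshold and x[t] >= x[t - 1] and x[t] > x[t + 1]]
        count = 0
        for t in strobes:
            if t + max_lag <= time:
                # Overlay the NAP segment starting at each strobe ("trigger").
                sai[ch] += x[t:t + max_lag]
                count += 1
        if count:
            sai[ch] /= count
    return sai
```

A real implementation would use a more careful strobe criterion and blend successive frames over time with a decay, but the idea of overlaying NAP segments at trigger points, so that pitch-locked structure stands still, is the same.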
The problem we've come up against over the years has been figuring out how to go from this rich, movie-like representation to some kind of features that we can use to do classification, recognition tasks, and things like that. When we started on this at Google in two thousand six, we were joined by Samy Bengio, who had been doing very interesting work in image classification using a sort of high-dimensional, sparse-feature, bag-of-visual-words approach, with systems trained on those features. When he described that system to me, I said: that's exactly what we need to analyze these movies of sounds, to get into a feature space on which we can train up classifiers.

So what we've done is this next box, which we call multi-scale segmentation. It's motivated by a lot of the work they do in visual analysis, where they try to detect features all over the image at different scales, using keypoints or other strategies, just looking at regions of the image, saying which of several features that region is close to, and doing that at multiple scales. We came up with a way to do that, and we get a bunch of really just abstract features. They're sparse, so they're mostly zeros and occasionally some of them get ones; that's what sparse coding gives you. For each frame of this movie you get a long vector that has a few ones in it and a lot of zeros. Then we aggregate that over a sound file: we basically just add those up, and what you get, the sum of all the sparse vectors, is what's called a bag representation. It tells you how many times each feature occurred; it's just a histogram, really, a count of how many times each of those abstract features occurred in the sound file. It's still relatively sparse, and that's the kind of feature vector we use to represent a sound file.
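As a small illustration of that aggregation step, here is a sketch, assuming each frame has already been turned into a sparse vector with one active entry per codebook segment. The function name and the dense NumPy representation are just for clarity; a real system would use sparse data structures.

```python
import numpy as np

def bag_of_features(frame_codes):
    """Sum sparse per-frame codes into a document-level histogram.

    frame_codes: iterable of 1-D arrays, each mostly zeros with a single 1
    per codebook segment. The result counts how often each abstract feature
    fired anywhere in the sound file; it stays fairly sparse.
    """
    frame_codes = list(frame_codes)
    bag = np.zeros_like(frame_codes[0], dtype=np.int64)
    for code in frame_codes:
        bag += code.astype(np.int64)
    return bag
```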
Going through these stages in a little more detail, to see what we do in there: the peripheral model, the cochlear simulation. If you know anything about my work, you know I've spent the last thirty years or so working on the filter cascade as an approach to simulating the cochlea, because it's a way to connect the underlying wave-propagation dynamics to an efficient digital-signal-processing filtering architecture, in a way that lets you do both good models of the quasi-linear filtering and easy ways to incorporate the nonlinear effects, the things that give you compression of dynamic range, generation of cubic distortion tones, and so on. It's basically just a cascade of simple filter stages; half-wave detectors that represent inner hair cells and put out a signal representing instantaneous neural firing probability; and a feedback network that takes the output of the cochlea and sends it back to control the parameters of the filter stages by reducing the Q of the filters. In a cascade like this you don't have to change the Q or the bandwidth very much to change the gain a lot, so you get a pretty nice compressive result from that.

We stabilize the image using Roy Patterson's technique of strobed temporal integration, sort of like looking at an oscilloscope, as I mentioned: each line of this image is independently triggered, so that at zero time interval you get this nice stable vertical feature. It doesn't really mean anything, it's just a zero point, and the other stuff moves around relative to it as pitch changes and as formants go up and down and so on. This is a frame of speech, where the horizontal bands you see are resonances of the vocal tract, the formants, and the pattern that repeats in the time-lag dimension reflects the pitch pulses. Other sounds that are less periodic than speech have different-looking but still very interesting, characteristic, fairly unique patterns.

The problem was to summarize a complex sound file using some statistics of these patterns, in a way that lets you do recognition and retrieval and so on. We did a retrieval and ranking task that we've reported in a couple of different contexts, and I'll show you the results in a second.

The features that we extract from these stabilized auditory images are pulled out of a bunch of different boxes: long skinny boxes and short fat boxes, small boxes and big boxes. Within each box, for the current set of features we're using, we just take row and column marginals to reduce it to a somewhat lower dimensionality, and then we vector quantize that. We do it at a fixed resolution, thirty-two rows and sixteen columns, which gives us a forty-eight-dimensional feature vector for each one of those boxes. Those forty-eight dimensions then go into a vector quantizer, with a different codebook for each box size and position, so we get a whole bunch of codebooks, a whole bunch of vector quantizations. The sizes can be quite large, up to several thousand codewords per codebook, several hundred thousand dimensions total. It's sparse in the sense that only the one codeword that's closest gets a one and all the others get zeros. So for each frame of the video you can think of the sparse code as being segmented, one segment per codebook, with a single one within each segment; you could use any kind of sparse code here. When you accumulate that over the frames to make a summary for the whole document, you just add it up, and that's it.
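Here is a rough sketch of what one of those per-box steps might look like: take the row and column means of a rectangular patch, resample them to a fixed 32 + 16 = 48 dimensions, and match against that box's codebook to get a one-hot segment. The function names, the interpolation-based resampling, and the plain Euclidean nearest-codeword search are assumptions for illustration, not necessarily what the actual system does.

```python
import numpy as np

def box_marginals(sai_frame, row0, col0, height, width, n_rows=32, n_cols=16):
    """Cut one rectangular box out of an SAI frame and reduce it to marginals."""
    patch = sai_frame[row0:row0 + height, col0:col0 + width]
    row_marginal = patch.mean(axis=1)   # one value per row of the box
    col_marginal = patch.mean(axis=0)   # one value per column of the box
    # Resample marginals to a fixed size so every box yields the same
    # dimensionality (here 32 + 16 = 48), regardless of box shape.
    rows = np.interp(np.linspace(0, height - 1, n_rows), np.arange(height), row_marginal)
    cols = np.interp(np.linspace(0, width - 1, n_cols), np.arange(width), col_marginal)
    return np.concatenate([rows, cols])

def sparse_code(feature, codebook):
    """One-hot code: the nearest codeword gets a 1, all others get 0."""
    dists = np.linalg.norm(codebook - feature, axis=1)
    code = np.zeros(len(codebook))
    code[np.argmin(dists)] = 1.0
    return code
```

The per-frame sparse vector is then just the concatenation of these one-hot segments over all box sizes and positions, and the bag for a whole file is their sum over frames, as described above.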
So again, that's the overview of the system. What we do at the end is take this document feature vector and train up a ranking and retrieval system on it, using Samy Bengio's PAMIR system, which stands for passive-aggressive model for image retrieval; we're doing the same thing for sound retrieval. His student David Grangier was the lead author on that paper in oh-seven. It basically computes a scoring function between a query and an audio document. The query q here is a bag of words, the terms in a query: if I were searching for "fast car", it would look for audio documents that have a good score between the bag of words "fast" and "car" and the bag of abstract audio features that we get from that histogram. The scoring function is computed as just a bilinear transformation: there's a weight matrix that simply maps the sparse audio features and the sparse query terms into a score for that query. All we have to do is train that weight matrix, and there's a simple stochastic-gradient-descent-style method for that. A nice thing about the PAMIR method is that the algorithm is extremely fast, and we could handle large data sets, so we were able to run many different experiments with variations on the features.

This is a scatter plot of many different variations on the auditory features, as well as a bunch of MFCC variants. The little x's here are mostly one codebook size, but with varying numbers of MFCC coefficients, different window lengths, and a bunch of other things. You can see that the MFCC result is not too bad in terms of precision at top one in retrieval. We can beat it by a fair amount with very large codebooks in the auditory features, but the difference between what we could do with the best MFCC system and the best auditory-based system here was kind of small, and I was a little disappointed in that. These are the results we reported before. The line on top is kind of the convex hull; it shows that perhaps the most important thing here is just the size of the abstract feature space you use with this method. The nice thing about this matrix is that when you get up to a hundred thousand dimensions in your sparse feature vectors, and you've got, say, three thousand query terms, that comes out to three hundred million elements that we're training in that matrix. You can train a three-hundred-million-parameter system here without overfitting, due to the nature of the regularized training algorithm, and it actually works quite nicely.

So what we did for the ICASSP paper was to see how well this works in interference. From a database of sound files we took pairs of files at random and added them together. You might have a sound file whose tags say it's a fast car and another one that says it's a barking dog, or something like that. You just add the files together, and you've got both sounds in there; a person listening to it can typically still tell you what both things are. So we take the union of the tags. We truncated the mixture to the length of the shorter file, because we noticed that in almost all cases a few seconds of the sound was enough to tell you what it was, and an extra thirty seconds of fast car didn't really help any. Truncating everything gave us a nominal zero-dB signal-to-noise ratio, if you like, if you consider one of the sounds to be the true sound and the other to be interference.

Then we did the same kind of ranking and retrieval task using PAMIR. Given a query like "fast car", you still want to get that file that has the barking dog in it, because it has the fast-car tags; you don't want the barking dog to interfere too much with the retrieval. We did that kind of test, and the results showed a much bigger difference between the best MFCC system and the best stabilized-auditory-image-based system.

So the punch line of this experiment is that the sparse-coded stabilized auditory images show a bigger advantage over MFCCs for sounds in interference than for clean sounds. This is what we had hoped, based on the idea that these boxes in the stabilized auditory image, by focusing on different regions, will sometimes pick up regions where the sounds separate out. There are certain combinations of frequency bands and time-lag patterns that will be robust, that will represent the car and still lead to the same codewords, the same sparse features, and the same with the dog, so features for both sounds will be present in the sparse vector even though other features may be wiped out by the interference. It's this locality in the higher-dimensional space that was the motivation, and that we believe is the explanation for this, though we don't have a great way to test or prove that.
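For reference, the ranking machinery described a moment ago boils down to a bilinear score, score(q, x) = qᵀ W x, between a sparse query bag q and a sparse audio bag x, with W trained on (query, relevant document, irrelevant document) triplets. The sketch below shows a generic passive-aggressive, hinge-loss update of that kind; it is meant as an illustration of the idea, not the exact PAMIR algorithm, and it uses dense arrays where a real implementation would exploit sparsity.

```python
import numpy as np

def score(W, q, x):
    """Bilinear relevance score between a query bag q and an audio bag x."""
    return q @ W @ x

def train_ranker(triplets, n_terms, n_features, C=0.1, epochs=1):
    """Passive-aggressive-style training on (query, relevant, irrelevant) triplets.

    Each triplet is (q, x_pos, x_neg): the relevant document should outscore
    the irrelevant one by a margin of 1. W has n_terms x n_features entries,
    which is where the "hundreds of millions of parameters" come from.
    """
    W = np.zeros((n_terms, n_features))
    for _ in range(epochs):
        for q, x_pos, x_neg in triplets:
            loss = 1.0 - score(W, q, x_pos) + score(W, q, x_neg)
            if loss > 0:
                # Hinge-loss gradient direction with respect to W.
                grad = np.outer(q, x_pos - x_neg)
                norm_sq = np.sum(grad * grad)
                if norm_sq > 0:
                    tau = min(C, loss / norm_sq)   # aggressiveness-capped step size
                    W += tau * grad
    return W
```

Because both q and x are sparse, each update touches only a small block of W, which is part of why training a matrix this large stays fast and well regularized in practice.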
So, in conclusion: the auditory representations do work pretty well; MFCCs work pretty well too, if you run them through an appropriately powerful training system; and the sparse coding works well. When you put these together, sparse codes that are somewhat localized in the stabilized-auditory-image space take advantage of how the auditory system separates certain features of sound, at least at a fairly low level. Thank you. Any questions?

Question: For the sparse code to work well, you're in effect conditioning the signal so that some property of the statistics of the original makes the information easier to get at. The information is all there in the waveform, so how are these two different?

Yeah, the information is all there in the waveform, but it's not easily accessible directly if you want a sparse code that corresponds to features of how you hear the sound. Similarly, in a short-time spectral representation like the MFCCs, a lot of the information is there, but some of it is lost by that sort of noncoherent spectral detection, so you no longer have an easy way to take advantage of the separation of things by pitch in the different regions of the auditory image, or by other characteristic time patterns in non-periodic sounds. The idea was that this duplex representation, of the kind Licklider proposed long ago for pitch, captures a lot of different aspects of psychological pitch perception, and we thought it would be a better starting point than either a short-time spectral representation or the waveform as a way to get abstract features that correlate with what you hear. And we're not trying to suppress the interfering sounds; we're just trying to get a bag of features in which both sounds have some features come through.

Question: A quick question, just picking up on your comment at the start. You've shown the cochlear simulation and you've got your stabilized auditory images; is Nima's model, the cortical model, something you would consider trying, to see how well it performs?

Yes and no. My representations are sort of midbrain, or closer to the periphery; Nima's is probably more the cortical stuff. It's amazing how well it works: in his human experiments they can resynthesize the spectrogram of the sound that the human is paying attention to. That suggests it's a representation that comes after some sound-separation process, and I think that's a layer we need to put in somehow, explicitly, before the cortical spectro-temporal receptive fields really make a lot of sense. If you go there directly, without doing the separation first, I think you'll have the same problem as the other short-time spectral techniques, in that the representation won't capture the features of the interfering sound. You have to do some separation before you give up the fine time structure and go to a purely spectral, short-time spectral, or spectro-temporal approach; something else has to be done in there. I was just talking to Morgan about that earlier, and Nima and I have talked before too, and we do want to figure out a way to put all these ideas together; it's just not clear exactly how to bridge that right now. This debate about spectral versus temporal representations has been going on for a long time, and it isn't settled yet.