...is going to continue on the theme of looking at what we can get from biologically inspired auditory processing. He's been doing a lot of work on figuring out how you can extract useful features from auditory models that can be used to do interesting tasks, in this case exploiting the use of sparsity. Thanks.
Thank you. It's nice to see so many old and new friends here, friends like Malcolm, friends like Josh, so many people attending.
The system I'm discussing today, and the representations of audio that it uses, were reported in considerable detail in, I think, the September issue of Neural Computation, and then discussed again in a Signal Processing Magazine column around that time.
What's new in today's paper is the last couple of words in the title, which say "in interference": we've done some more tests to see to what extent these auditory-model-based representations perform better than more conventional MFCC-like representations on mixed sounds.
So let me just describe what the representation is, since that's the main topic of this session.
We start off with a simulation that produces this thing we call the neural activity pattern, the first one up on the screen. You can think of the neural activity pattern as sort of a cochleagram, a representation of the firing rate or firing probability of primary auditory neurons as a function of time and place, with the high frequencies at the top here representing the base of the cochlea and the low frequencies at the bottom, the apex of the cochlea.
From those we detect certain events called strobe events, as a way to make an efficient and realistic computation of what Roy Patterson has done with the stabilised auditory image, which is a lot like the auditory correlogram that Malcolm and I did a lot of work on back in the eighties and nineties. It makes these nice pictures, and it's stabilised in the sense that, unlike this one that's kind of racing by with time, this one is like a triggered display on an oscilloscope: the trigger events cause it to stand still. So as you have pitch pulses, this central feature at zero time lag just stays there, and as the pitch period varies, these other ones that look like copies of it move back and forth to keep a spacing equal to the pitch period. So you get these very nice looking dynamic movies here.
The problem that we've had with this over the years is figuring out how to go from this rich, movie-like representation to some kind of features that we can use to do classification, recognition tasks and things like that.
We had started trying this at Google; in two thousand and six we were joined by Samy Bengio, who had been doing very interesting work in image classification using a sort of high-dimensional sparse feature, bag-of-visual-words approach, with systems trained on those features. And when he described that system to me, I said: that's exactly what we need to analyse these movies of sounds, to get into a feature space on which we can train up classifiers.
So what we've done is this next box, which we call multi-scale segmentation. It's motivated by a lot of the work they do in visual analysis, where they try to detect features all over the image at different scales, using keypoints or other strategies based on just looking at regions of the image, saying which of several features that region is close to, and doing that at multiple scales. So we came up with a way to do that.
We get a bunch of really just abstract features. They're sparse, so they're mostly zeros, and occasionally some of them get ones. That sparse coding gives you, for each frame of this movie, a long vector that has a few ones in it and a lot of zeros. Then, when we aggregate that over a sound file, we basically just add those up, and what you get, the sum of all the sparse vectors, is what's called a bag representation. It tells you how many times each feature occurred; it's just a histogram, really. So it's a count of how many times each of those abstract features occurred in the sound file. It's still relatively sparse, and that's the kind of feature vector that we use to represent the sound.
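To make that aggregation step concrete, here is a minimal sketch in Python (with NumPy) of summing per-frame sparse codes into a bag-of-features histogram; the array shapes and names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def bag_of_features(frame_codes):
    """Sum per-frame sparse codes into one histogram for a whole sound file.

    frame_codes: (num_frames, num_features) array that is mostly zeros, with
    ones marking the codewords chosen in each frame.  Returns a
    (num_features,) count of how often each abstract feature occurred."""
    return np.asarray(frame_codes).sum(axis=0)

# Illustrative use with made-up sizes: 3 frames, 8-dimensional sparse codes.
codes = [[1, 0, 0, 0, 0, 1, 0, 0],
         [1, 0, 0, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 0, 1, 0, 0]]
print(bag_of_features(codes))   # [2 1 0 0 0 2 1 0] -- still fairly sparse
```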
Going over these stages in a little more detail, let's see what we do in there.
The peripheral model, the cochlear simulation: if you know anything about my work, you know I've spent the last thirty years or so working on filter cascades as an approach to simulating the cochlea, because it's a way to connect the underlying wave propagation dynamics to an efficient digital signal processing filtering architecture, in a way that lets you do both good models of the quasi-linear filtering and easy ways to incorporate the nonlinear effects, out of which you get compression of dynamic range, generation of cubic distortion tones, and so on. It's basically just a cascade of these simple filter stages, with some half-wave detectors at the end that represent inner hair cells and send out a signal that represents instantaneous neural firing probability, and a feedback network that takes the output of the cochlea and sends it back to control the parameters of the filter stages. By reducing the Q of the filters you can reduce the gain a lot; in a cascade like this you don't have to change the Q or the bandwidth very much to change the gain a lot, so you get a pretty nice compressive result from that.
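As a rough illustration of the kind of structure being described, and not the actual model, here is a sketch in Python of a cascade of simple two-pole filter stages with half-wave-rectified taps; the stage count, frequency spacing, Q value, and coefficient choices are all assumptions, and the coupled feedback network described above is omitted.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(freq_hz, q, fs):
    """One two-pole stage; lowering q pulls the poles inward and lowers the peak gain."""
    theta = 2 * np.pi * freq_hz / fs
    r = np.exp(-np.pi * freq_hz / (q * fs))
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    b = [sum(a)]                              # normalise for unity gain at DC
    return b, a

def cascade_nap(x, fs=16000, n_stages=40, q=6.0):
    """Filter-cascade sketch of a cochlea: each stage's output feeds the next,
    from high frequencies (base) to low (apex); half-wave rectifying each tap
    gives a crude neural activity pattern (channels x time)."""
    freqs = np.geomspace(6000.0, 100.0, n_stages)
    y, channels = np.asarray(x, dtype=float), []
    for f in freqs:
        b, a = resonator(f, q, fs)
        y = lfilter(b, a, y)                   # propagate through this section
        channels.append(np.maximum(y, 0.0))    # half-wave detector (inner hair cell)
    return np.array(channels)
```

Reducing q in this sketch lowers the peak gain of every stage, and those small reductions compound through the cascade, loosely mirroring the compressive behaviour described above.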
We stabilise the image using Roy Patterson's technique of strobed temporal integration, sort of like looking at an oscilloscope, as I mentioned, where each line of this image is independently triggered, so that at the zero time interval you get this nice, stable vertical feature. It doesn't really mean anything; it's just kind of a zero point, and the other stuff moves around relative to it as pitch changes and as formants go up and down and so on. So this is a frame of speech, where the horizontal bands you see are resonances of the vocal tract, the formants, and the pattern that repeats along the time-lag dimension comes from the pitch pulses. Other sounds that are less periodic than speech have different-looking, but still very interesting and kind of characteristic, unique patterns.
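To give a flavour of the strobed temporal integration idea, here is a toy sketch (not Patterson's actual algorithm; the thresholding, windowing, and averaging choices are assumptions): in each channel, local peaks are taken as strobe points, a fixed window after each strobe is extracted, and the triggered segments are averaged, so structure locked to the strobes stays at the same lag while unrelated structure smears out.

```python
import numpy as np

def strobed_integration(channel, strobe_frac=0.5, lag_window=200):
    """Toy strobed temporal integration for one cochlear channel.

    channel: 1-D half-wave-rectified activity for one frequency channel.
    Local maxima above strobe_frac * max are taken as strobe events; the
    lag_window samples after each strobe are averaged to give one row of a
    stabilised-image frame (activity versus time lag)."""
    x = np.asarray(channel, dtype=float)
    thresh = strobe_frac * x.max()
    strobes = [t for t in range(1, len(x) - lag_window)
               if x[t] >= thresh and x[t] >= x[t - 1] and x[t] > x[t + 1]]
    if not strobes:
        return np.zeros(lag_window)
    return np.mean([x[t:t + lag_window] for t in strobes], axis=0)

def sai_frame(nap, **kw):
    """Apply the toy integration to every channel of a (channels x time) NAP."""
    return np.stack([strobed_integration(ch, **kw) for ch in nap])
```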
And the problem was to try to summarise a complex sound file using some statistics of these patterns, in a way that lets you do recognition and retrieval and so on. We did a retrieval and recognition task that we've reported in a couple of different contexts, but I'll show you the results in a second.
The features that we extract from these stabilised auditory images are pulled out of a bunch of different boxes: we have long skinny boxes and short fat boxes, small boxes and big boxes. Within each box, for the current set of features we're using, we just take sort of row and column marginals to reduce it to a somewhat lower dimensionality, and then we vector quantise that. We do that at a fixed resolution, like thirty-two rows and sixteen columns, which gives us a forty-eight-dimensional feature vector for each one of those boxes, and then those forty-eight dimensions go into a vector quantiser, with a different codebook for each different box size and position. So we get a whole bunch of codebooks and do a whole bunch of vector quantisations. The codebook sizes can be quite large, up to several thousand per codebook, several hundred thousand dimensions in total, and the code is sparse in the sense that only the codeword that's closest gets a one and all the others get zeros.
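Here is a small sketch, under assumed shapes and names, of how one box could be turned into its piece of the sparse code: take the row and column marginals of the box (32 + 16 = 48 dimensions in the example above), find the nearest codeword in that box's codebook, and set a single one at its index. Training the codebooks themselves (for instance with k-means) is omitted.

```python
import numpy as np

def box_marginals(box):
    """box: one rectangular region of an SAI frame, assumed already resampled
    to 32 rows x 16 columns.  Concatenating the row sums (32) and column sums
    (16) gives the 48-dimensional feature described in the talk."""
    return np.concatenate([box.sum(axis=1), box.sum(axis=0)])

def one_hot_code(feature, codebook):
    """Sparse code for one box: a single 1 at the nearest codeword.

    codebook: (codebook_size, 48) array of codewords for this box size/position."""
    dists = np.linalg.norm(codebook - feature, axis=1)
    code = np.zeros(len(codebook))
    code[np.argmin(dists)] = 1.0
    return code

def frame_code(boxes, codebooks):
    """Concatenate the per-box one-hot codes into one long sparse vector:
    one segment per codebook, a single 1 inside each segment."""
    return np.concatenate([one_hot_code(box_marginals(b), cb)
                           for b, cb in zip(boxes, codebooks)])
```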
So for each frame of the video, you could think of this sparse code as being segmented, one segment for each codebook, and within each segment there's a single one. You could use any kind of sparse code here. And when you accumulate that over the frames to make a summary for the whole document, you just add it up, as I showed before.
Again, that's the overview of the system.
What we do at the end here is take this document feature vector and train up a ranking and retrieval system on it, using Samy Bengio's PAMIR system, which stands for Passive-Aggressive Model for Image Retrieval; we're doing the same thing for sound retrieval. His student David Grangier is the lead author on that paper, from oh-seven.
It basically computes a scoring function between a query and an audio document. The query q here is a bag of words, the terms in the query. Like, if I was searching for "fast car", it would look for audio documents that have a good score between the bag of words "fast" and "car" and the bag of abstract audio features that we have from that histogram. That scoring function is computed as just a bilinear transformation, so there's a weight matrix that simply maps the sparse audio features and the sparse query terms into this score for that query and that document.
All we had to do was train that weight matrix, and there's a simple stochastic gradient descent method for that. A nice thing about the PAMIR method is that the algorithm is extremely fast, so we could handle large data sets.
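As a sketch of the scoring and training just described (assumed shapes and hyperparameters, and a simplified passive-aggressive step rather than the exact PAMIR code): the score is the bilinear form q^T W a between a sparse query bag-of-words q and a sparse audio bag-of-features a, and W is nudged whenever a relevant document fails to beat an irrelevant one by a margin.

```python
import numpy as np

def score(q, W, a):
    """Bilinear relevance score between query terms q and audio features a."""
    return q @ W @ a

def pa_step(W, q, a_relevant, a_irrelevant, C=0.1):
    """One simplified passive-aggressive update (illustrative, not exact PAMIR):
    if the relevant document does not beat the irrelevant one by a margin of 1,
    move W along the gradient of the hinge loss, with step size capped by C."""
    loss = 1.0 - score(q, W, a_relevant) + score(q, W, a_irrelevant)
    if loss > 0.0:
        grad = np.outer(q, a_relevant - a_irrelevant)   # d(score gap)/dW
        tau = min(C, loss / (np.sum(grad * grad) + 1e-12))
        W += tau * grad
    return W

# With, say, 3,000 query terms and 100,000 audio features, W has 300 million
# entries, but the sparsity of q and of the bags keeps each update cheap.
```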
So we were able to run many different experiments with variations on the features. This is a kind of scatter plot of many different variations on the auditory features, as well as a bunch of MFCC variants; the little x's here are mostly at one codebook size, but varying the number of MFCC coefficients, the window length, and a bunch of other things. And you can see here that the MFCC result is not too bad in terms of precision at top one in retrieval. We can beat that by a fair amount with very large codebooks on the auditory features, but the difference between what we could do with the best MFCC and what we could do with the best auditory-based system here was kind of small; I was a little disappointed in that.
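For reference, a minimal sketch of the metric quoted here, precision at top one: for each query, rank all the sound files by score and count a success when the single top-ranked file is relevant. The function and argument names are hypothetical.

```python
import numpy as np

def precision_at_1(W, queries, docs, relevant):
    """queries: (n_q, n_terms) sparse query vectors
    docs: (n_d, n_features) bag-of-features vectors for the audio documents
    relevant: list of sets; relevant[i] = indices of documents relevant to query i"""
    hits = 0
    for i, q in enumerate(queries):
        scores = docs @ (q @ W)          # score(q, d) = q^T W d for every document
        if int(np.argmax(scores)) in relevant[i]:
            hits += 1
    return hits / len(queries)
```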
So these are the results we reported before. The line on top is kind of the convex hull; it shows that perhaps the most important thing here is just the size of the abstract feature space you use with this method. It's nice that, with this matrix, when you get up to a hundred thousand dimensions in your sparse feature vectors and you've got, say, three thousand query terms, that comes out to three hundred million elements that we're training in that matrix. You can train a three-hundred-million-parameter system here without overfitting, due to the nature of the regularized training algorithm, and it actually works quite nicely.
So what we did for the ICASSP paper was to see how well this works in interference. We took a database of sound files, took pairs of files at random, and added them together. So you might have a sound file whose tags say it represents a fast car, and another one that says it's a barking dog, or something like that; you just add the files together and you've got both sounds in there, and a person listening to it can typically still tell you what both things are. So we take the union of the tags. We truncated the sound files to the length of the shorter one, because we noticed that in almost all cases a few seconds of the sound was enough to tell you what it was, and an extra thirty seconds of fast car didn't really help you any. So we just truncated everything, which gives a sort of nominal zero dB signal-to-noise ratio, if you consider one of the sounds to be the true sound and the other to be interference. Then we did this same kind of ranking and retrieval task using PAMIR. So given a query like "fast car", you still want to get that file that has the barking dog in it, because it has the fast car tags; you don't want that barking dog to interfere too much with the retrieval.
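A minimal sketch of how such a mixed test set could be built, assuming each item is a (waveform, tag set) pair at a common sample rate; any level normalisation used in the actual experiments is not specified here, so this simply truncates to the shorter file and adds the samples for a nominal 0 dB mix, taking the union of the tags.

```python
import random
import numpy as np

def mix_pair(item_a, item_b):
    """item_a, item_b: (samples, tags) pairs, with samples as 1-D NumPy arrays.
    Truncate both to the shorter length, add the waveforms sample by sample,
    and take the union of the tags."""
    (x_a, tags_a), (x_b, tags_b) = item_a, item_b
    n = min(len(x_a), len(x_b))
    return np.asarray(x_a)[:n] + np.asarray(x_b)[:n], set(tags_a) | set(tags_b)

def make_mixed_set(items, seed=0):
    """Pair the sound files up at random and mix each pair."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [mix_pair(shuffled[i], shuffled[i + 1])
            for i in range(0, len(shuffled) - 1, 2)]
```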
So we did that kind of test, and the results showed a much bigger difference between the best MFCC system and the best stabilised-auditory-image-based system. So the punch line of this experiment is that the sparse-coded stabilised auditory images show a bigger advantage over MFCCs for sounds in interference than for clean sounds. This is what we had hoped, based on the idea that these boxes in the stabilised auditory image, by focusing on different regions, will sometimes pick up regions where the sounds separate out, so that certain combinations of frequency bands and time-lag patterns will be robust: they will represent that car and still lead to the same code words, the same sparse features, and the same goes for the dog, so that features for both sounds will be present in the sparse vector, even though other features may be wiped out by the interference. So it's this locality in this higher-dimensional space that was the motivation, and that we believe is the explanation, for this result, though we don't have a great way to test or prove that yet.
So, in conclusion: the auditory representations do work pretty well. MFCCs work pretty well too, if you run them through an appropriately powerful training system. And the sparse coding works well. When you put these together, sparse codes that are somewhat localised in the stabilised auditory image space take advantage of how the auditory system separates certain features of sound, at least at a fairly low level. Thank you.
Any questions?
(Question from the audience, partly inaudible.) For the sparse coding to work well, do you somehow have to condition the signal, so that there is some property of the original, or of the statistics of those mixes, some information that facilitates the sparse code? In each case it comes back to the waveform, so my question is how you relate these two different representations.
Yeah, so the information is all there in the waveform, but it's not easily accessible directly, to get a sparse code that corresponds to features of how you hear the sound. Similarly, in a short-time spectral representation like the MFCCs, a lot of the information is there, but some of it is lost by that sort of noncoherent spectral detection, so you no longer have an easy way to take advantage of the separation of things by pitch in the different regions of the auditory image, whether by pitches or by other characteristic time patterns in non-periodic sounds. The idea was that this duplex representation that Licklider proposed long ago for pitch captures a lot of different aspects of psychological pitch perception, and we thought that would be a better starting point than either short-time spectra or the waveform as a way to get abstract features that correlate with what you hear. And we're not trying to suppress the interfering sounds; we're just trying to get a bag of features in which both sounds have some features come through.
(Another question from the audience, partly inaudible.) A quick question, just taking off from your comment at the start. You've got your stabilised auditory image, and then there is also the cortical model; is that something that you would consider trying, to see how well it performs?
Yes and no. My representations are sort of midbrain level, the inferior colliculus probably. The cortical stuff that Nima has, it's amazing how well it works: in his human experiments they can resynthesize the spectrograms of the sound that the human is paying attention to. That sort of suggests it's a representation that comes after some sound separation process, and I think that is a layer we need to put in explicitly somehow, before the cortical spectro-temporal receptive fields really make a lot of sense. If you get there directly, without doing the separation first, I think you'll have the same problem as the other short-time spectral techniques, in that the representation won't capture the features of the interfering sound. You have to do some separation before you give up the fine time structure and go to a purely spectral, or short-time spectral, spectro-temporal approach; something else has to be done in there. I was just talking to Morgan about that earlier, and Nima and I have talked before too, and we do want to figure out a way to put all these ideas together; it's just not clear exactly how to bridge that right now. But, you know, this debate about spectral versus temporal has been going on for a long time, and it's not settled yet.