Speech Transcript - SPARSE CODING OF AUDITORY FEATURES FOR MACHINE HEARING IN INTERFERENCE

0:00:13	is gonna continue on the theme um
0:00:15	looking at what we can get from biologically inspired auditory processing
0:00:18	um he's been doing a lot of work on
0:00:20	figuring out how you can extract
0:00:22	useful features from auditory models
0:00:25	um that can be used to do interesting tasks and in this case exploits the use of sparsity
0:00:29	thanks to
0:00:33	make josh
0:00:34	nice to see so many uh old and new friends here
0:00:37	old friends like malcolm
0:00:38	new friends like josh
0:00:40	so many people attending
0:00:42	um
0:00:44	the system i'm discussing today and the representations of audio that it uses were reported in uh considerable detail in
0:00:51	uh i think september
0:00:53	neural computation magazine and then discussed again in a uh signal processing magazine column about that time
0:00:59	what's new in today's paper is uh
0:01:02	the last couple of words in the title that says in interference so we've done some more tests
0:01:07	to see
0:01:09	to what extent these uh
0:01:11	kinds of auditory model based representations uh
0:01:14	perform better than more conventional mfcc like representations
0:01:19	in mixed sounds
0:01:21	so let me just describe what the representation is since that's
0:01:24	the uh main topic of this session
0:01:28	uh
0:01:29	we start off with a uh a cochlear simulation that produces this uh
0:01:33	this thing we call the uh
0:01:35	neural activity pattern
0:01:37	a
0:01:38	purse round screens
0:01:40	um
0:01:41	the neural activity pattern you can think of that as sort of a a a cochlea gram or a a
0:01:45	representation of uh
0:01:48	firing rate or firing probability of primary auditory neurons
0:01:52	as a function of
0:01:53	time and place
0:01:55	with the high frequencies at the
0:01:57	top here representing the base of the cochlea and low frequencies at the bottom
0:02:01	the apex of the cochlea
0:02:04	from those we detect certain events called strobe events as a way to um
0:02:10	make an efficient and realistic computation of what uh roy patterson has called the stabilised auditory image which
0:02:16	is a lot like the auditory correlogram that malcolm and i did a lot of work on back in the
0:02:21	eighties and nineties
0:02:23	it makes these nice pictures it's stabilised in the sense that um you know unlike this one that's kinda racing
0:02:29	by with time this one is like a
0:02:32	like a triggered display on an oscilloscope where the trigger events cause it to uh to stand still
0:02:37	so as you have pitch pulses the
0:02:40	this uh central feature at zero time lag just
0:02:43	stays there and as the pitch period
0:02:45	varies these other ones that look like copies of it move back and forth
0:02:49	to have a spacing equal to the pitch period
0:02:52	so you get these very nice looking dynamic movies here
0:02:55	the problem that we've uh had with these over the years is figuring out how to go
0:02:59	from this
0:03:00	rich
0:03:02	movie like representation to some kind of features that we can use to do
0:03:06	classification recognition tasks and things like that
0:03:10	um
0:03:11	uh we started at google in two thousand and six we were joined by samy bengio who'd
0:03:17	been doing uh very interesting work in
0:03:19	image classification
0:03:21	using a um
0:03:23	sort of
0:03:23	high dimensional sparse feature
0:03:26	bag of visual words approach and a system trained with those and
0:03:31	and when he described the system to me i said that's exactly what we need
0:03:35	to analyse uh these movies of sounds
0:03:38	to get into a feature space on which we can train up
0:03:41	classifiers
0:03:43	so
0:03:44	what we've done is uh this next box we call multi-scale segmentation that's kind of motivated by a lot of
0:03:51	the work they do in visual analysis where they try to detect
0:03:55	features all over the image at different scales
0:03:58	kind of different uh
0:03:59	keypoints or other strategies based on just looking at regions of the image
0:04:04	and saying which of
0:04:06	several features that region is close to and doing that at multiple scales so we came up with a
0:04:10	way to do that
0:04:12	we get a bunch of um
0:04:14	really just abstract features um
0:04:17	and they're sparse so they're mostly zeros and occasionally some of them get ones and
0:04:21	that sparse coding gives you for each
0:04:23	frame of this movie
0:04:25	this long vector
0:04:27	that has some ones in it and a lot of zeros
0:04:29	then when we aggregate that over sound files we just basically add those up and so what you get here
0:04:34	in the
0:04:35	the sum of all the sparse vectors is
0:04:37	what's called a bag representation that tells you how many times each feature occurred it's just a histogram really
0:04:44	so it's a count of how many times each of those abstract features occurred in a sound file
0:04:48	it's still relatively sparse and that's the kind of feature vector that we use to represent
0:04:53	a that
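The aggregation step described above, summing the per-frame sparse codes into one histogram per sound file, can be sketched in a few lines. This is a minimal illustration and not the system from the talk: the function name, the representation of each frame as a list of active feature indices, and the toy sizes are all assumptions.

```python
import numpy as np

def bag_of_features(frame_codes, num_features):
    """Sum per-frame sparse codes into one count histogram for a sound file.

    frame_codes: one list of active feature indices per frame (hypothetical
    representation; each listed index is a position holding a one).
    """
    bag = np.zeros(num_features, dtype=np.int64)
    for active in frame_codes:
        # np.add.at accumulates correctly even if an index repeats.
        np.add.at(bag, active, 1)
    return bag

# Three toy frames, each with two active features out of eight.
frames = [[0, 5], [0, 7], [2, 5]]
bag = bag_of_features(frames, num_features=8)
```

The result counts how often each abstract feature occurred, so it stays sparse as long as most features never fire in a given file.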
0:04:54	going over these stages in a little bit
0:04:56	more detail to see what we do in there
0:04:58	the
0:04:59	the peripheral model the cochlear
0:05:01	simulation is uh you know if you know anything about my work you know i've spent the last thirty years
0:05:06	or so working on a filter cascade as an approach to
0:05:09	simulating the cochlea because it's a way to connect the
0:05:13	uh the underlying uh wave propagation hydrodynamics to an efficient digital signal processing uh filtering architecture
0:05:21	in a way that
0:05:22	that lets you do both the uh
0:05:24	good models of the quasi linear filtering as well as uh easy ways to incorporate the nonlinear effects and
0:05:30	so that you get
0:05:31	compression of dynamic range
0:05:33	generation of cubic distortion tones and stuff like that
0:05:37	it's basically just a cascade of these simple filter stages some uh half wave detectors that
0:05:42	represent inner hair cells and send a signal that represents instantaneous neuron firing probability
0:05:49	and a feedback network that takes the output of the cochlea and sends it back to control the parameters
0:05:53	of the filter stages
0:05:55	by
0:05:55	by reducing the Q of the filters you can reduce the gain a lot in a cascade like this you
0:06:00	don't have to change the Q or the bandwidth very much to
0:06:02	change the gain a lot so you get
0:06:04	pretty nice compressive
0:06:06	result from that
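The point that a small change per stage produces a large change in overall gain follows from the fact that gains multiply along a cascade, so they add in decibels. The arithmetic below is a rough sketch under simplifying assumptions that are not from the talk: every stage contributes the same peak gain, ten stages contribute at the frequency of interest, and the two per-stage gain values are hypothetical.

```python
import math

def cascade_gain_db(stage_gain, num_stages):
    """Total gain in dB of identical stages in series (gains multiply)."""
    return num_stages * 20.0 * math.log10(stage_gain)

# Hypothetical per-stage peak gains before and after the feedback lowers Q.
high = cascade_gain_db(2.0, 10)
low = cascade_gain_db(1.5, 10)
change_db = high - low  # about 25 dB from a modest per-stage change
```

In a real cochlear filter cascade only the stages near resonance for a given frequency add significant gain, so the number of stages here is a stand-in for that effective count.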
0:06:09	we stabilise the image using roy patterson's technique of strobed temporal integration
0:06:13	sort of like looking at an oscilloscope as i mentioned where
0:06:16	each line of this image is independently uh triggered so that at this uh uh zero time interval you get this
0:06:23	nice stable vertical feature it doesn't really mean anything it's just kind of a zero point
0:06:27	and the other stuff
0:06:28	moves around relative to it as pitch changes and as formants go up and down and so on
0:06:32	so this is a frame of speech where the the horizontal bands you see are are resonances of the vocal
0:06:37	tract the uh formants
0:06:39	and then the
0:06:41	pattern that repeats in the time lag dimension
0:06:44	is the pitch pulses
0:06:45	other sounds uh that are less
0:06:48	periodic than speech have
0:06:50	different looking but still very interesting and and kind of
0:06:53	characteristic and unique kind of patterns
0:06:55	and the problem was to try to
0:06:57	summarise a complex sound file
0:07:00	using some statistics of these patterns in a way that
0:07:02	you could do recognition and retrieval and so on
0:07:05	we we did a
0:07:06	retrieval and recognition task
0:07:10	um um
0:07:10	that we've reported in a couple of different contexts
0:07:13	but uh
0:07:14	i'll show you the results in a second
0:07:17	the um
0:07:18	the features that we extract from these stabilised auditory images are pulled out of a bunch of different boxes we
0:07:23	have like you know long skinny boxes and short fat boxes and
0:07:26	small boxes and big boxes and
0:07:28	within each box we
0:07:30	we
0:07:31	um
0:07:31	for the current
0:07:32	uh set of features we're using we just
0:07:35	do a uh
0:07:37	uh
0:07:37	sort of row and column marginals to reduce it to a somewhat lower dimensionality and then we vector quantise
0:07:42	that and we do that at a fixed
0:07:45	resolution like thirty two
0:07:47	um
0:07:48	thirty two rows and sixteen columns which gives us a forty eight dimensional
0:07:53	feature vector for each one of those boxes and then that forty eight dimensions goes into a vector quantizer with
0:07:58	um
0:07:59	a different codebook for each different box
0:08:01	size and position so we get a whole bunch of codebooks to do a whole bunch of vector quantisations
0:08:06	codebook sizes can be quite large up to several thousand per codebook
0:08:10	several hundred thousand dimensions total
0:08:13	um
0:08:13	sparse in the sense that only the one codeword that's closest gets a one and all the others get
0:08:18	zeros
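The per-box step described above can be sketched as follows: reduce a box of 32 rows by 16 columns to its row and column marginals (32 + 16 = 48 values), then vector quantise against that box's codebook so only the nearest codeword gets a one. This is a sketch under assumptions: the function names are hypothetical, the marginals are taken as means, the distance is Euclidean, and the codebook here is random rather than trained.

```python
import numpy as np

def box_feature(box):
    """Row and column marginals of a box already resampled to 32 x 16."""
    row_marginal = box.mean(axis=1)  # one value per row, 32 values
    col_marginal = box.mean(axis=0)  # one value per column, 16 values
    return np.concatenate([row_marginal, col_marginal])

def one_hot_code(feature, codebook):
    """Sparse code for one box: a one at the closest codeword, zeros elsewhere."""
    distances = np.sum((codebook - feature) ** 2, axis=1)
    code = np.zeros(len(codebook), dtype=np.int8)
    code[np.argmin(distances)] = 1
    return code

rng = np.random.default_rng(0)
box = rng.random((32, 16))        # stand-in for one region of the auditory image
codebook = rng.random((256, 48))  # hypothetical codebook for this box
feature = box_feature(box)
code = one_hot_code(feature, codebook)
```

Concatenating the codes from all boxes, each with its own codebook, would give the long segmented sparse vector for one frame.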
0:08:20	so for each frame of the video we get
0:08:24	uh you could think of this sparse code as being segmented one segment for each codebook and within each
0:08:29	segment there's a single one
0:08:30	you could use any kind of a sparse code here
0:08:33	and when you accumulate that over the frames to make a summary for the whole document you
0:08:38	just add that up
0:08:39	as i said before
0:08:42	again that's the uh
0:08:43	the overview of the
0:08:45	the system
0:08:47	um what we do at the end here we we going to this document
0:08:51	we take this document feature vector and we train up a a uh a ranking and retrieval system on that
0:08:57	using uh samy bengio's pamir system that stands for uh passive aggressive model for
0:09:03	image retrieval
0:09:04	so we're doing the same thing for sound retrieval
0:09:08	um
0:09:09	his student david grangier is the lead author on that
0:09:12	paper in oh seven
0:09:14	it basically computes a scoring function between a uh
0:09:18	a query and an audio document the query Q here is a uh is a bag of words
0:09:24	the
0:09:25	terms in a query
0:09:27	like if i was searching for uh a fast car
0:09:31	it would
0:09:32	it would look for audio documents that have a
0:09:35	good score between
0:09:38	the bag of words fast and car and the audio the bag of abstract audio features that we have from
0:09:43	that histogram
0:09:44	and that scoring function is computed as just a bilinear transformation so there's a weight matrix that
0:09:50	that simply
0:09:52	maps the sparse audio features and the sparse query terms into this uh
0:09:57	score for that query and that
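The bilinear scoring function described above amounts to a query vector times a weight matrix times an audio feature vector. The sketch below uses a hypothetical three-word vocabulary, a four-dimensional audio feature space, and made-up weights purely to show the shape of the computation.

```python
import numpy as np

def score(query, weights, audio):
    """Bilinear score: query^T W audio."""
    return float(query @ weights @ audio)

# Hypothetical vocabulary: index 0 = "fast", 1 = "car", 2 = "dog".
query = np.array([1.0, 1.0, 0.0])            # bag of words for "fast car"
weights = np.array([[1.0, 0.0, 0.0, 0.5],    # one row per query term,
                    [0.0, 2.0, 0.0, 0.0],    # one column per audio feature
                    [0.0, 0.0, 3.0, 0.0]])
audio = np.array([2.0, 1.0, 0.0, 4.0])       # feature histogram of one document
s = score(query, weights, audio)
```

Ranking a collection for a query then means computing this score for every audio document and sorting.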
0:10:01	all we had to do is train that weight matrix and there's a
0:10:03	a simple uh you know stochastic gradient descent method for that and that's
0:10:07	it's actually um a nice thing about the pamir method is that
0:10:11	that algorithm is
0:10:13	extremely fast we could do large data sets
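One way to picture the training is a passive-aggressive style ranking update on the weight matrix: for a query, a relevant document and an irrelevant one, do nothing if the relevant one already wins by a margin, otherwise take a capped step that raises its score. The sketch below is written in that spirit only. The margin of one, the step-size cap, the parameter name aggressiveness and the way triplets are chosen are assumptions, not details given in the talk.

```python
import numpy as np

def pa_rank_update(weights, query, audio_pos, audio_neg, aggressiveness=0.1):
    """One passive-aggressive style update on a (query, relevant, irrelevant) triplet."""
    diff = audio_pos - audio_neg
    # Hinge loss: zero once the relevant document wins by a margin of 1.
    loss = max(0.0, 1.0 - float(query @ weights @ diff))
    grad = np.outer(query, diff)
    norm_sq = float(np.sum(grad ** 2))
    if loss == 0.0 or norm_sq == 0.0:
        return weights, loss  # passive: ranking already satisfied
    step = min(aggressiveness, loss / norm_sq)  # aggressive, but capped
    return weights + step * grad, loss

w0 = np.zeros((2, 3))
q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0, 0.0])
neg = np.array([0.0, 1.0, 0.0])
w1, loss_before = pa_rank_update(w0, q, pos, neg)
margin_after = float(q @ w1 @ (pos - neg))
```

Because the query and feature vectors are sparse, each update touches only a small part of the matrix, which is consistent with the speed the talk describes.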
0:10:15	so we were able to run many different um
0:10:18	experiments with variations on the features this
0:10:21	kind of scatter plot
0:10:23	of of uh many different
0:10:25	uh variations on the auditory features as well as a bunch of uh mfcc variants
0:10:30	the little x's here
0:10:32	are mostly at one codebook size but varying the number of uh
0:10:36	mfcc coefficients and so on um
0:10:39	and the the window length
0:10:41	and a bunch of different things and you can see here that the the mfcc result is not too bad
0:10:46	in terms of precision at top one and retrieval
0:10:49	um we can beat that by a fair amount with very large codebooks in the auditory
0:10:54	features but the the difference in terms of what we could do with the best mfcc and what we could
0:10:59	do with the best
0:11:00	uh auditory based system here was kind of small i was a little disappointed in that
0:11:04	so these are the results we uh reported before
0:11:07	the line on top is kind of the
0:11:09	convex hull it shows that
0:11:11	perhaps the most important thing
0:11:13	here is just the size of the abstract feature space you use with this method
0:11:17	it's nice that
0:11:19	this matrix when you get up here to a hundred thousand uh
0:11:22	a hundred thousand dimensions in your sparse feature vectors and you've got
0:11:26	say three thousand query terms
0:11:28	what's that come out to three hundred million uh
0:11:31	elements that we're training in that matrix
0:11:33	you can train a three hundred million parameter system here without overfitting it due to the nature of the
0:11:38	uh
0:11:39	the regularized training algorithm
0:11:41	and it actually works quite nicely
0:11:44	so what we did for the icassp paper was to
0:11:47	see how well this works in interference we actually took um
0:11:51	from a database of sound files we took pairs of files at random and added them together
0:11:55	so you might have a
0:11:57	a sound file
0:11:58	whose
0:11:59	tags say it represents a fast car and another one that says it's a barking dog or something like that
0:12:04	you just add the files together you've got both sounds in there
0:12:07	a person listening to it can still tell you what it is typically
0:12:11	what both things are so we just take the union of the uh tags
0:12:15	we truncated the sound file just to the length of the shorter one 'cause we noticed that in
0:12:20	almost all cases uh a few seconds of the sound was enough to
0:12:24	to tell you what it was and
0:12:26	an extra thirty seconds of fast car didn't really help you any so
0:12:30	um
0:12:31	so we just truncated everything so we had sort of a nominal zero db signal-to-noise ratio if you like
0:12:36	if you consider one of the sounds to be the true sound and the other one to be interference
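The construction of the mixed test set described above (truncate both files to the shorter length, add the waveforms, take the union of the tags) can be sketched as below. The function name and toy signals are hypothetical, and the sketch assumes both files share a sample rate. No level normalisation is applied, since the talk only describes the result as a nominal zero db signal-to-noise ratio.

```python
import numpy as np

def mix_pair(sound_a, tags_a, sound_b, tags_b):
    """Add two waveforms truncated to the shorter one and merge their tags."""
    length = min(len(sound_a), len(sound_b))
    mixture = sound_a[:length] + sound_b[:length]
    return mixture, set(tags_a) | set(tags_b)

# Toy stand-ins for two sound files at the same sample rate.
car = np.ones(5)
dog = 2.0 * np.ones(3)
mixture, tags = mix_pair(car, {"fast", "car"}, dog, {"barking", "dog"})
```

A query for any tag of either source file should then still retrieve the mixture.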
0:12:40	then we did this uh same kind of ranking and retrieval task using pamir too uh
0:12:45	so given a query like fast car again you still wanna get that file that has the barking dog in
0:12:50	it because it has the
0:12:51	the fast car tags
0:12:52	you don't want that barking dog to interfere too much with the retrieval
0:12:57	so we did that that kind of test and the
0:12:59	the results showed a
0:13:01	a much bigger difference between the best mfcc system and the best uh auditory stabilised auditory image based system
0:13:08	so the the punch line of this experiment is that the the sparse coded
0:13:12	stabilised auditory images show
0:13:14	a bigger advantage over mfcc for sounds in interference
0:13:18	than for clean sounds
0:13:20	this is what we had hoped based on the idea that these boxes in the stabilised auditory image focusing on
0:13:26	different regions will
0:13:28	sometimes pick up regions
0:13:30	where the sounds separate out so there are
0:13:32	certain combinations of sort of frequency bands and time lag patterns that will be robust
0:13:38	that will represent that car and will still lead to the same uh code words the same sparse features
0:13:44	and you know same way with the dog so that
0:13:46	features for both sounds will be present
0:13:48	in the sparse vector even though other features may be wiped out by interference
0:13:54	so it's this locality and this higher dimensional space that we believe is the
0:13:58	was the motivation and we believe is the explanation for this though we don't have a
0:14:02	a great way to test or prove that yet
0:14:05	so
0:14:06	in conclusion the uh
0:14:09	auditory representations do work pretty well
0:14:12	mfccs work
0:14:13	pretty well too if you run them through an appropriately powerful training system
0:14:17	and the sparse coding works well and when you put these together sparse codes that are somewhat localised in
0:14:22	the stabilised auditory image space
0:14:24	take advantage of how the auditory system separates
0:14:27	certain features of sound at least at a fairly low level
0:14:31	thank you
0:14:40	any questions
0:14:45	i i think i
0:14:45	so for a sparse this to well
0:14:48	oh and you kind of role i condition the signal
0:14:50	in a to have some after then somehow be some property in the the original or
0:14:55	that
0:14:57	um
0:14:59	that the statistics
0:15:00	all those makes
0:15:02	information formation is ice information
0:15:05	a such that
0:15:07	can facilitate the spot cult to get we don't know i used
0:15:11	you know
0:15:12	i for information
0:15:14	it was like a a lot more
0:15:15	not which even we think here
0:15:18	i mean so each case will come back to the way for
0:15:20	i all went that i question how i what to this this
0:15:24	these two different
0:15:25	i
0:15:26	yeah so the
0:15:28	yeah the information's all there in the waveform but it's not
0:15:32	easily accessible directly to get a a sparse code that
0:15:37	corresponds to features of
0:15:39	how you hear the sound similarly in um a short-time spectral representation like the mfccs a lot of
0:15:46	the information is there but some of it is lost by that sorta
0:15:50	noncoherent spectral detection
0:15:52	stuff so that you no longer have an easy way to take advantage of uh
0:15:57	separation of things by pitch and the different
0:16:00	regions of the auditory image
0:16:02	uh pictures or other characteristic time patterns in these non periodic sounds
0:16:07	so i
0:16:08	um
0:16:10	the idea was that this you know this uh
0:16:12	duplex representation that licklider proposed before
0:16:15	for pitch captures a lot of different aspects of
0:16:18	uh psychological pitch perception and we thought that would be a good uh
0:16:22	a better starting point than either short-time spectral or waveform as a way to uh
0:16:28	get abstract features that correlate with what you hear
0:16:31	and we're not trying to suppress the interfering sounds we're just trying to get a bag of features in which
0:16:35	both sounds have
0:16:37	some features come through
0:16:41	click a question most what's um map types
0:16:43	correct
0:16:44	yeah down so we can use talk
0:16:46	yeah so i'm i was just taking off phone your calm night at the start so we see now in
0:16:53	text
0:16:54	scanlon i
0:16:55	and you've got your stabilise a very you maybe is um
0:16:58	and you you want them are now there is a would will lead his he's
0:17:03	he's model B
0:17:05	and
0:17:06	something thing that you would consider trying to see how well it performs
0:17:12	a uh yes and no
0:17:13	i
0:17:16	so my representations are sort of uh
0:17:19	um
0:17:19	mid brain to to a very or can y'all is probably
0:17:24	the cortical stuff that nima has um
0:17:29	it's amazing how well it works like in his human experiments we can resynthesize the spectrograms of the sound that
0:17:34	the human is paying attention to
0:17:36	that's sort suggests that it's a representation that comes after
0:17:40	some sound separation process
0:17:43	and
0:17:43	i think that is a layer we need to put in
0:17:46	somehow explicitly before for the uh cortical
0:17:50	spectro-temporal receptive fields
0:17:53	really make a lot of sense if you if you get there directly without doing the separation first i think
0:17:58	you'll have the same problem as the other short time spectral techniques in that interfering sounds won't be
0:18:04	um the representation won't capture the differences between
0:18:08	the you know the features of the interfering sound
0:18:12	you have to do some separation before you
0:18:16	um before you give up the fine time structure and go to a purely spectral
0:18:20	or short time spectro-temporal approach
0:18:23	uh something else has to be done in there so
0:18:26	um
0:18:27	i was just talking to morgan about that earlier and nima and i have talked before too and i
0:18:32	we do want to figure out a way to put all these ideas together
0:18:34	just not clear exactly
0:18:37	how to
0:18:37	how to bridge that right now
0:18:40	but do you turn use the area of the spectral versus temporal or is going on for a long time
0:18:45	and
0:18:45	and and settled yeah
0:18:47	X

Innovative Representations of Audio

Presented by: Richard Lyon, Author(s): Richard Lyon, Jay Ponte, Gal Chechik, Google Inc., United States