| 0:00:13 | so now we move away from the perceptual side of things to the statistical properties of | 
|---|
| 0:00:18 | sound | 
|---|
| 0:00:19 | you know, the sound of wind, or the sound of a babbling brook, | 
|---|
| 0:00:22 | is not formants; you know, it's not the kind of deterministic signal | 
|---|
| 0:00:26 | that a lot of us deal with in terms of speech | 
|---|
| 0:00:28 | so how do you recognise the statistical behavior of a sound | 
|---|
| 0:00:31 | Dan Ellis of Columbia and Josh McDermott | 
|---|
| 0:00:34 | will describe their sound texture representation | 
|---|
| 0:00:37 | and talk about its use in sound classification | 
|---|
| 0:00:40 | well | 
|---|
| 0:00:40 | alright, thanks Malcolm, and thanks everybody for showing up to our session | 
|---|
| 0:00:45 | so as Malcolm said, i'm gonna talk about texture | 
|---|
| 0:00:47 | today | 
|---|
| 0:00:48 | and textures are sounds that result from large numbers of acoustic events; they include things you hear | 
|---|
| 0:00:53 | all the time: the sound of rain, | 
|---|
| 0:00:59 | wind, | 
|---|
| 0:01:02 | birds | 
|---|
| 0:01:05 | running water | 
|---|
| 0:01:13 | crowd noise | 
|---|
| 0:01:16 | applause | 
|---|
| 0:01:19 | and fire | 
|---|
| 0:01:23 | so these kinds of sounds are common in the world, and it seems like they are important for a lot | 
|---|
| 0:01:26 | of tasks that humans have to perform and that we might wanna get machines to perform, | 
|---|
| 0:01:30 | like figure out where you are or what the weather's like for instance | 
|---|
| 0:01:34 | but in contrast to the vast literature on visual texture, both in human and machine vision, sound | 
|---|
| 0:01:39 | textures are largely unstudied | 
|---|
| 0:01:41 | so | 
|---|
| 0:01:42 | uh the question that we've been looking into is how can textures be represented | 
|---|
| 0:01:46 | and recognised. so there is some previous work on modeling sound texture; this is probably not a | 
|---|
| 0:01:51 | completely exhaustive list of the publications, but it's certainly a big chunk of them, so it's a pretty small | 
|---|
| 0:01:56 | literature | 
|---|
| 0:01:57 | there's also a lot of work on environmental sounds that's often inclusive of textures | 
|---|
| 0:02:00 | the work i'll be talking about is | 
|---|
| 0:02:02 | a little bit different from these approaches | 
|---|
| 0:02:04 | in that our perspective is that machine recognition might be able to get some clues from human texture | 
|---|
| 0:02:09 | perception, and so to this end | 
|---|
| 0:02:10 | this is very much in the spirit of the work that Nima talked about and what Dan was | 
|---|
| 0:02:14 | just talking about before me | 
|---|
| 0:02:17 | so | 
|---|
| 0:02:18 | we've been looking into how humans represent and recognise textures, and | 
|---|
| 0:02:22 | the starting point for the work is the observation that unlike | 
|---|
| 0:02:26 | the sounds that are made by individual events, like a spoken word, | 
|---|
| 0:02:29 | textures are stationary, so their essential properties don't change over time, and that's sort of one of the defining | 
|---|
| 0:02:34 | properties. so whereas | 
|---|
| 0:02:35 | the waveform of a word, here, clearly has a beginning and an end and a temporal evolution, | 
|---|
| 0:02:41 | the sound of rain is just kind of there | 
|---|
| 0:02:42 | right, so the qualities that make it rain | 
|---|
| 0:02:45 | don't change over time | 
|---|
| 0:02:46 | and so the key proposal is that, because they're stationary, textures can be captured by statistics, | 
|---|
| 0:02:51 | that is, just time averages of acoustic measurements | 
|---|
| 0:02:54 | if the thing doesn't change, we can just make the measurements, average them over time, and that ought to do | 
|---|
| 0:02:57 | a good job of capturing its qualities | 
|---|
| 0:03:00 | so what we propose is that | 
|---|
| 0:03:01 | when you recognise the sound of fire or the sound of rain | 
|---|
| 0:03:05 | you're recognising these summary statistics | 
|---|
| 0:03:08 | and whatever statistics your auditory system is measuring are presumably derived from auditory representations that we | 
|---|
| 0:03:15 | know something about; you've heard a bit about this in the first few talks | 
|---|
| 0:03:19 | so we know that sound is filtered by the cochlea; you can think of | 
|---|
| 0:03:22 | the output of the cochlea as sort of a subband representation | 
|---|
| 0:03:26 | we know that a lot of the information in the subbands is conveyed | 
|---|
| 0:03:28 | by their amplitude envelopes, | 
|---|
| 0:03:30 | after they've been compressed by the cochlea | 
|---|
| 0:03:33 | and there is now quite a bit of evidence that | 
|---|
| 0:03:35 | the envelopes, or the spectrogram-like representation that | 
|---|
| 0:03:39 | they comprise, | 
|---|
| 0:03:41 | are then | 
|---|
| 0:03:41 | subsequently filtered by | 
|---|
| 0:03:43 | another stage of filters that are often called modulation filters, and so | 
|---|
| 0:03:46 | the things that Nima was showing, those receptive fields of cortical neurons, | 
|---|
| 0:03:50 | are like this, although he was showing examples from the cortex where the tuning is a little | 
|---|
| 0:03:54 | bit more complicated; you see | 
|---|
| 0:03:56 | patterns in both frequency and time | 
|---|
| 0:03:58 | the modulation filters that we typically | 
|---|
| 0:04:00 | look at are those that mimic the things you find subcortically, in the inferior colliculus and | 
|---|
| 0:04:05 | thalamus, where things are | 
|---|
| 0:04:06 | primarily tuned for temporal modulation. so these little things here | 
|---|
| 0:04:09 | represent | 
|---|
| 0:04:10 | um | 
|---|
| 0:04:11 | the passband in temporal modulation | 
|---|
| 0:04:13 | frequency | 
|---|
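The two-stage front end described above (cochlear bandpass filters, compressed amplitude envelopes, then temporal modulation filters) could be sketched roughly as below. This is an illustrative reconstruction, not the speakers' code; the filter shapes, half-octave bandwidths, and compression exponent are assumptions.

```python
# Rough sketch of the auditory front end described in the talk: cochlear-style
# bandpass filters, compressed amplitude envelopes, then temporal modulation
# filters applied to each envelope. All parameters are illustrative.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def subband_envelopes(x, fs, center_freqs, compression=0.3):
    """Bandpass-filter x at each center frequency; return compressed envelopes."""
    envs = []
    for fc in center_freqs:
        lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25           # ~half-octave band (assumed)
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envs.append(np.abs(hilbert(band)) ** compression)  # envelope + compression
    return np.array(envs)                                  # shape: (n_bands, n_samples)

def modulation_bands(env, fs, mod_freqs=(2, 4, 8, 16, 32, 64)):
    """Apply octave-spaced temporal modulation filters to one envelope.
    (In practice the envelopes would be downsampled before this step.)"""
    out = []
    for fm in mod_freqs:
        sos = butter(2, [fm / 2 ** 0.5, fm * 2 ** 0.5], btype="band", fs=fs, output="sos")
        out.append(sosfiltfilt(sos, env))
    return np.array(out)                                   # shape: (n_mod_bands, n_samples)
```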
| 0:04:15 | so the question that we've been looking into is how much of texture perception can be captured with | 
|---|
| 0:04:21 | relatively simple summary statistics of representations like these, that we believe to be present in biological auditory systems, | 
|---|
| 0:04:27 | and whether these | 
|---|
| 0:04:27 | summary statistics would then be useful for machine recognition tasks | 
|---|
| 0:04:32 | so the methodological proposal that underlies most of the work | 
|---|
| 0:04:35 | is that synthesis is a very powerful way to test a perceptual theory | 
|---|
| 0:04:39 | and the notion is that if your brain represents sounds with some set of measurements, like statistics, | 
|---|
| 0:04:45 | then signals that have the same values of those measurements ought to sound the same to you | 
|---|
| 0:04:49 | and in particular, | 
|---|
| 0:04:50 | sounds that we synthesise | 
|---|
| 0:04:52 | to have the same measurements as some real-world recording | 
|---|
| 0:04:55 | ought to sound like another example of the same kind of thing, if the measurements that we use in | 
|---|
| 0:05:00 | synthesis | 
|---|
| 0:05:01 | are like the ones the brain uses to represent sound | 
|---|
| 0:05:04 | and we've been taking this approach with sound texture perception, synthesizing textures from statistics measured | 
|---|
| 0:05:09 | in real-world sounds | 
|---|
| 0:05:11 | so the basic idea is to take some example signal, like | 
|---|
| 0:05:14 | a recording of rain | 
|---|
| 0:05:15 | measure some statistics, and then synthesize new signals, | 
|---|
| 0:05:18 | constraining them only to have the same statistics, and in other respects making them as random as possible | 
|---|
| 0:05:23 | and the approach that we've taken here is very much inspired by work that was done | 
|---|
| 0:05:27 | quite a while back on visual texture, some of the authors of which | 
|---|
| 0:05:31 | are mentioned here | 
|---|
| 0:05:33 | so | 
|---|
| 0:05:34 | i'm just gonna give you a very simple toy example to illustrate the logic. let's suppose that we | 
|---|
| 0:05:38 | want to test the role of the power spectrum | 
|---|
| 0:05:40 | you might think that the power spectrum captures texture, so what we do is we measure the spectrum of some real- | 
|---|
| 0:05:44 | world texture like this | 
|---|
| 0:05:49 | now we just want to synthesize a random signal with the same spectrum; obviously this is really easy, we just filter noise, | 
|---|
| 0:05:53 | and then we listen to them and see what they sound like, and | 
|---|
| 0:05:56 | unfortunately for adherents of the power spectrum theory of texture, | 
|---|
| 0:06:00 | things generally sound like noise when you do this | 
|---|
| 0:06:04 | so something like rain sounds like noise | 
|---|
| 0:06:11 | and that is as opposed to this | 
|---|
| 0:06:18 | right, so this is not realistic | 
|---|
| 0:06:20 | and this tells us that we're not simply registering the spectrum when we recognise textures | 
|---|
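As a concrete version of the toy experiment just described, here is a minimal sketch of imposing only a recording's power spectrum on noise; it is illustrative only and is not the actual synthesis code.

```python
# Minimal sketch of the spectrum-only synthesis: keep the noise's random phase
# but impose the magnitude spectrum of the recorded texture.
import numpy as np

def spectrum_matched_noise(x):
    """Return a noise signal whose magnitude spectrum matches recording x."""
    target_mag = np.abs(np.fft.rfft(x))
    noise_phase = np.angle(np.fft.rfft(np.random.randn(len(x))))
    return np.fft.irfft(target_mag * np.exp(1j * noise_phase), n=len(x))
```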
| 0:06:24 | alright so the question is | 
|---|
| 0:06:26 | would additional simple statistics do any better | 
|---|
| 0:06:28 | and so we've been mostly looking at statistics of these two stages of representation: the envelopes of | 
|---|
| 0:06:34 | subbands | 
|---|
| 0:06:35 | and the modulation bands you can derive from them with simple linear filters | 
|---|
| 0:06:40 | and we've looked into how far we can get with very simple statistics: things like marginal moments, like | 
|---|
| 0:06:44 | the variance and the skew and the kurtosis, in addition to the mean, | 
|---|
| 0:06:47 | as well as pairwise correlations between different pieces of the representation, for instance different | 
|---|
| 0:06:52 | envelopes of different subbands | 
|---|
| 0:06:54 | or different modulation bands | 
|---|
| 0:06:56 | these statistics are generic | 
|---|
| 0:06:58 | they're not tailored to any specific natural sound | 
|---|
| 0:07:00 | but they are simple and they're easy to measure | 
|---|
| 0:07:02 | on the other hand, because of this, it's not obvious that they would account for much of sound recognition, but | 
|---|
| 0:07:06 | maybe they're a reasonable place to start | 
|---|
| 0:07:09 | now for these statistics to have any hope of being useful for recognition, at a minimum they have to yield | 
|---|
| 0:07:14 | different values for different types of sounds, and so | 
|---|
| 0:07:16 | i'm gonna quickly just give you a couple of examples to give you some intuition for what kinds | 
|---|
| 0:07:20 | of things these might capture | 
|---|
| 0:07:22 | so let's quickly look | 
|---|
| 0:07:23 | at some of the marginal moments of cochlear envelopes, the envelopes of the cochlear subbands | 
|---|
| 0:07:29 | these moments, again, are things like the mean and the variance and the skew, | 
|---|
| 0:07:32 | statistics that describe how the envelope amplitude is distributed. so you take | 
|---|
| 0:07:36 | a stripe of a | 
|---|
| 0:07:37 | cochlear spectrogram, | 
|---|
| 0:07:39 | you take the envelope, | 
|---|
| 0:07:40 | and collapse that across time to give you a histogram that gives you the frequency of occurrence of different amplitudes | 
|---|
| 0:07:45 | and this is a very simple sort of representation of sound, but as many of you will know, these | 
|---|
| 0:07:50 | kinds of amplitude distributions generally differ | 
|---|
| 0:07:53 | for natural sounds and for noise, and they vary between different kinds of natural sounds. so here's just a | 
|---|
| 0:07:57 | quick example | 
|---|
| 0:07:58 | these are amplitude histograms for noise, a recording of a stream, | 
|---|
| 0:08:02 | and a recording of geese, | 
|---|
| 0:08:04 | from one particular channel | 
|---|
| 0:08:06 | and the thing to note here is that although these distributions have about the same mean, | 
|---|
| 0:08:10 | indicating that there's roughly the same acoustic power in this channel, | 
|---|
| 0:08:13 | they have rather different shapes | 
|---|
| 0:08:16 | and you can also see this visually | 
|---|
| 0:08:17 | if you just look at the spectrograms: you can see that the pink noise is mostly grey, | 
|---|
| 0:08:22 | whereas the stream and the geese have got more black and white, and so in this case the white | 
|---|
| 0:08:25 | would be down here | 
|---|
| 0:08:26 | and the black would be up here, so they deviate more | 
|---|
| 0:08:29 | from the mean, with more high amplitudes and more low amplitudes | 
|---|
| 0:08:33 | so many of you probably recognise that | 
|---|
| 0:08:35 | this is an indication of the common observation that natural signals are sparser than noise | 
|---|
| 0:08:39 | so the intuition is that natural sounds contain events, like raindrops and geese calls, and | 
|---|
| 0:08:43 | these are infrequent, but when they occur they produce large amplitudes, and when they don't occur the amplitude | 
|---|
| 0:08:48 | tends to be low | 
|---|
| 0:08:49 | and the sparsity behaviour | 
|---|
| 0:08:51 | which alters the shape of these histograms | 
|---|
| 0:08:53 | is reflected in pretty simple statistics like the variance, which measures the spread of the distribution, | 
|---|
| 0:08:58 | and the skew, which measures the asymmetry about the mean | 
|---|
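A minimal sketch of the envelope marginal statistics just described (mean, variance, skew, and kurtosis of each cochlear envelope, computed as time averages); normalisation conventions here are assumptions, not the exact definitions used in the work.

```python
# Marginal moments of the subband envelopes, computed as plain time averages.
import numpy as np
from scipy.stats import skew, kurtosis

def envelope_moments(envs):
    """envs: (n_bands, n_samples) array of compressed subband envelopes."""
    return {
        "mean": envs.mean(axis=1),
        "variance": envs.var(axis=1),        # spread of the amplitude distribution
        "skew": skew(envs, axis=1),          # asymmetry about the mean
        "kurtosis": kurtosis(envs, axis=1),  # heavy tails, i.e. sparsity
    }
```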
| 0:09:02 | alright so one more example let's take a quick look at | 
|---|
| 0:09:04 | what kind of correlations we can observe | 
|---|
| 0:09:06 | between the envelopes of different channels | 
|---|
| 0:09:09 | and these things also vary across sounds, and one of the main reasons for this is the presence of broadband | 
|---|
| 0:09:13 | events | 
|---|
| 0:09:13 | so if you listen to the sound of fire | 
|---|
| 0:09:19 | fire has all these crackles and pops and clicks, | 
|---|
| 0:09:21 | and those crackles and pops are visible on the spectrogram as these vertical streaks | 
|---|
| 0:09:27 | so these broadband events induce dependencies between channels, because they excite them all at once, and you can | 
|---|
| 0:09:31 | see this if you look at correlations between channels. so this is just a big matrix of correlation coefficients | 
|---|
| 0:09:36 | between pairs of | 
|---|
| 0:09:37 | cochlear filters | 
|---|
| 0:09:39 | going from low frequency to high and low to high | 
|---|
| 0:09:41 | so the diagonal here has gotta be one, | 
|---|
| 0:09:43 | but the off-diagonal entries can be whatever, and you can see that for fire there's a lot of yellow | 
|---|
| 0:09:47 | and a lot of red, indicating that there are these correlations between channels. and not all sounds are like this; here's a | 
|---|
| 0:09:52 | stream, | 
|---|
| 0:09:53 | and you can see that there's mostly green here; oh, it looks yellow on the screen, but | 
|---|
| 0:09:56 | trust me, it's green, | 
|---|
| 0:09:58 | presumably because for a lot of water sounds the envelopes of the channels are mostly uncorrelated | 
|---|
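The cross-band correlation statistic shown in these matrices can be sketched very simply; this is illustrative only.

```python
# Correlation coefficients between every pair of subband envelopes.
# Broadband events (fire crackles, claps) drive these values up.
import numpy as np

def envelope_correlations(envs):
    """envs: (n_bands, n_samples); returns an (n_bands, n_bands) matrix."""
    return np.corrcoef(envs)  # diagonal is 1 by construction
```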
| 0:10:02 | okay, so these statistics, although they're simple, capture variation across sounds | 
|---|
| 0:10:07 | and the question we're | 
|---|
| 0:10:08 | trying to get at is whether they actually capture the sound of real-world textures | 
|---|
| 0:10:12 | so again, the strategy | 
|---|
| 0:10:13 | is to synthesize signals constrained only to have the same statistics as some real-world sound, | 
|---|
| 0:10:18 | but in other respects being as random as possible. the way we do that | 
|---|
| 0:10:21 | is by starting with a noise signal | 
|---|
| 0:10:23 | and then adjusting the noise signal to get it to have the desired statistics, | 
|---|
| 0:10:26 | turning it into some new signal | 
|---|
| 0:10:28 | the basic idea | 
|---|
| 0:10:29 | is to filter the noise with the same set of filters, giving you a subband representation, | 
|---|
| 0:10:34 | and then to adjust the subband envelopes via gradient descent | 
|---|
| 0:10:38 | to cause them to have the desired statistical properties. and so | 
|---|
| 0:10:41 | the statistics are just functions of the envelopes, and we can compute their gradients | 
|---|
| 0:10:44 | and then change the envelopes in the gradient direction till we get the desired statistics. so that gives us new | 
|---|
| 0:10:49 | subbands, and we add them back up to get a new sound signal we can listen to | 
|---|
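A toy sketch of the adjustment step just described: take gradient steps on a noise subband envelope so that one statistic (here just the variance) moves toward the value measured from the target recording. The real procedure imposes the whole statistic set jointly and iterates to convergence; the fixed step size and loop below are illustrative and assume roughly unit-scale envelopes.

```python
# Toy gradient descent driving the variance of a noise envelope toward a target.
import numpy as np

def match_variance(env, target_var, steps=500, lr=0.05):
    """Adjust `env` so its variance approaches `target_var` (illustrative only)."""
    env = env.copy()
    for _ in range(steps):
        mu = env.mean()
        err = env.var() - target_var              # current mismatch
        # gradient of (var - target)^2 w.r.t. each sample is 4*err*(env - mu)/N;
        # the 1/N is dropped so step sizes do not shrink with signal length
        env -= lr * 4.0 * err * (env - mu)
    return env
```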
| 0:10:53 | here's just a flowchart; i won't give you all the details here, but | 
|---|
| 0:10:56 | the basic strategy is to first measure the statistics of a real-world sound texture, after processing | 
|---|
| 0:11:02 | it in the auditory model, | 
|---|
| 0:11:03 | and then processing noise in the same way and altering its envelopes to give it the same statistics; there's | 
|---|
| 0:11:08 | an iterative process that you have to do to get this to converge, | 
|---|
| 0:11:11 | but the end result is a sound signal that shares the statistics of a real-world sound. so the question | 
|---|
| 0:11:15 | is, how do they sound | 
|---|
| 0:11:16 | um so | 
|---|
| 0:11:17 | we're asking this question because, again, if the statistics account for texture perception, then the synthetic signals should | 
|---|
| 0:11:22 | sound like new examples of the real thing | 
|---|
| 0:11:24 | um | 
|---|
| 0:11:26 | and interestingly, in many cases they do. so i'm gonna play you | 
|---|
| 0:11:28 | a sequence of synthetic sounds that were just generated from noise | 
|---|
| 0:11:32 | by forcing the noise to have some of the same statistics as various real-world sounds. so you get | 
|---|
| 0:11:36 | things that sound like rain, | 
|---|
| 0:11:41 | streams | 
|---|
| 0:11:44 | bubbles | 
|---|
| 0:11:47 | fire | 
|---|
| 0:11:50 | applause | 
|---|
| 0:11:53 | wind, | 
|---|
| 0:12:01 | birds | 
|---|
| 0:12:05 | and crowd noise | 
|---|
| 0:12:09 | so it also works for a lot of unnatural sounds, things like rustling paper | 
|---|
| 0:12:16 | or a jackhammer | 
|---|
| 0:12:21 | so the success of the synthesis suggests these statistics could underlie the representation and recognition of textures | 
|---|
| 0:12:27 | so we did a quick experiment to see whether this was true in human listeners | 
|---|
| 0:12:31 | people were presented with a five-second sound clip and had to identify it from five choices, so chance performance here | 
|---|
| 0:12:37 | is twenty percent | 
|---|
| 0:12:38 | and we presented them with | 
|---|
| 0:12:40 | synthetic signals that were synthesized with different numbers of statistical constraints, | 
|---|
| 0:12:44 | as well as the original | 
|---|
| 0:12:46 | you can see here that when we just match the power spectrum, people are above chance but not | 
|---|
| 0:12:50 | very good | 
|---|
| 0:12:51 | but the performance improves as we add in more statistics, and with the full set | 
|---|
| 0:12:55 | of | 
|---|
| 0:12:55 | the model that i showed you previously, | 
|---|
| 0:12:57 | it's about as good as with the originals | 
|---|
| 0:13:01 | so this all suggests that these simple statistics can in fact support recognition of real-world textures | 
|---|
| 0:13:07 | another point that's just worth quickly mentioning is that the synthesis here is not simply reproducing the original | 
|---|
| 0:13:11 | waveform. because the procedure | 
|---|
| 0:13:14 | is initialised with noise | 
|---|
| 0:13:15 | it turns out a different sound signal every time, one that shares only the statistical properties, and these are just three | 
|---|
| 0:13:20 | examples of | 
|---|
| 0:13:21 | waveforms that were synthesized from a single set of statistics measured in a single recording, and you get a | 
|---|
| 0:13:26 | very different thing each time, and you can make as many of these as you want. so the statistics are | 
|---|
| 0:13:30 | really capturing | 
|---|
| 0:13:31 | some more abstract property of the sound signal | 
|---|
| 0:13:34 | alright, so one other question is whether these texture statistics that seem to be implicated in human texture | 
|---|
| 0:13:39 | perception would also be useful for machine recognition | 
|---|
| 0:13:43 | and at present we don't really have an ideal task | 
|---|
| 0:13:45 | with which to test this, because we need lots and lots of labeled textures, and if any | 
|---|
| 0:13:49 | of you have those i'd be interested to get them | 
|---|
| 0:13:52 | but we had an idea that an interesting potential application for this | 
|---|
| 0:13:56 | might be video soundtrack classification. so as everybody knows, there's lots of interest these days | 
|---|
| 0:14:02 | in | 
|---|
| 0:14:02 | being able to search for video clips depending on their content | 
|---|
| 0:14:06 | and so we got our hands on | 
|---|
| 0:14:08 | a dataset courtesy of a colleague at Columbia, Yu-Gang Jiang, | 
|---|
| 0:14:12 | where they went and | 
|---|
| 0:14:13 | had a bunch of people view video clips in an interface like this | 
|---|
| 0:14:16 | so they would | 
|---|
| 0:14:17 | watch something like this and then | 
|---|
| 0:14:21 | uh | 
|---|
| 0:14:22 | but they would hear something like this | 
|---|
| 0:14:37 | alright, so that was the soundtrack, and then they would look at this thing and check all | 
|---|
| 0:14:41 | the boxes that applied. so some of these things are attributes you probably can't see in the back, but | 
|---|
| 0:14:45 | some of them are attributes of the video, | 
|---|
| 0:14:46 | others describe the audio; in this case the person has checked cheering and clapping, | 
|---|
| 0:14:51 | they said this was an outdoor environment, and so a whole bunch of labels get attached to | 
|---|
| 0:14:55 | each of these videos | 
|---|
| 0:14:57 | and so the idea is that the texture statistics can be used as features for svm classification: you | 
|---|
| 0:15:03 | can train up svms | 
|---|
| 0:15:04 | to recognise these particular labels and distinguish them from others | 
|---|
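A minimal sketch of the classification setup described here: collapse the texture statistics into one fixed-length feature vector per clip and train a one-versus-rest SVM per label. The feature choices and the scikit-learn classifier are assumptions for illustration, not the actual experiment code.

```python
# Texture statistics as fixed-length SVM features for one soundtrack label.
import numpy as np
from sklearn.svm import SVC

def texture_features(envs):
    """Concatenate summary statistics of the envelopes into one feature vector."""
    corr = np.corrcoef(envs)
    iu = np.triu_indices_from(corr, k=1)          # unique band pairs only
    return np.concatenate([envs.mean(axis=1), envs.var(axis=1), corr[iu]])

# X: one feature vector per clip, y: 1 where the label (e.g. "cheering") applies
# clf = SVC(kernel="rbf").fit(X_train, y_train)
# predicted = clf.predict(X_test)
```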
| 0:15:09 | in order for this to work, | 
|---|
| 0:15:10 | of course, the statistics have to give you different values for different labels, and what this plot shows is just | 
|---|
| 0:15:15 | the average values of the different statistics for the different labels in the set | 
|---|
| 0:15:20 | so the labels are going along the vertical axis and the different statistics along the horizontal axis, and the | 
|---|
| 0:15:25 | point here is just that as you scan down the columns the colours change | 
|---|
| 0:15:28 | right so | 
|---|
| 0:15:29 | the different labels are on average associated with different statistics | 
|---|
| 0:15:33 | um | 
|---|
| 0:15:33 | and so | 
|---|
| 0:15:34 | we find that you can do some degree of classification with these statistics | 
|---|
| 0:15:39 | um this is the average performance across all the categories | 
|---|
| 0:15:42 | which is overall modest, but one of the things to point out here is that | 
|---|
| 0:15:46 | you see a pattern that's qualitatively like what you see in the human observers; that is, | 
|---|
| 0:15:50 | performance is not that great when you just match the mean value of the envelopes, that is, the | 
|---|
| 0:15:55 | spectrum, | 
|---|
| 0:15:56 | but it gets better as you add in more statistics | 
|---|
| 0:15:59 | some labels get classified better than others; speech or music are pretty easy, as a lot of | 
|---|
| 0:16:03 | you probably know | 
|---|
| 0:16:04 | i'm showing this here in part | 
|---|
| 0:16:06 | just to show that the pattern that you get is a little different for different categories. so for music, | 
|---|
| 0:16:11 | for instance, | 
|---|
| 0:16:12 | the modulation power in the modulation bands seems to matter a lot; you get a gain in classification there, | 
|---|
| 0:16:17 | whereas for speech | 
|---|
| 0:16:18 | the cross-band correlations, | 
|---|
| 0:16:20 | which measure things like comodulation, matter more | 
|---|
| 0:16:24 | performance is poor for some of the labels, in part because they're acoustically heterogeneous, that is, they're | 
|---|
| 0:16:28 | just not really well suited to | 
|---|
| 0:16:30 | the representation we have. so one of the labels is urban, and you can imagine that that consists of like | 
|---|
| 0:16:35 | a lot of different kinds of sound textures, | 
|---|
| 0:16:37 | and so the statistics are not great there | 
|---|
| 0:16:39 | so it's not really an ideal task; it's sort of just more of a proof of concept | 
|---|
| 0:16:43 | i think to really use these statistics for classifying semantic categories like this, like urban, | 
|---|
| 0:16:48 | you probably have to first recognise particular textures like | 
|---|
| 0:16:51 | traffic or crowd noise | 
|---|
| 0:16:52 | and then link those labels to the category | 
|---|
| 0:16:55 | so the take-away messages here: the first thing is just that textures are ubiquitous | 
|---|
| 0:17:00 | and i think | 
|---|
| 0:17:01 | important and worth studying, and i think they may involve | 
|---|
| 0:17:04 | a unique form of representation relative to other kinds of auditory phenomena, namely summary statistics | 
|---|
| 0:17:11 | and so we find that | 
|---|
| 0:17:12 | naturalistic textures can be generated from | 
|---|
| 0:17:15 | relatively simple summary statistics of early auditory representations: marginal moments and pairwise correlations of cochlear and modulation filter outputs | 
|---|
| 0:17:23 | and the suggestion is that listeners are using similar statistics | 
|---|
| 0:17:26 | to recognise sound textures. so when you remember the sound of a fire or the sound of rain, | 
|---|
| 0:17:30 | we think you're just remembering | 
|---|
| 0:17:32 | some of these summary statistics | 
|---|
| 0:17:33 | um | 
|---|
| 0:17:34 | and the suggestion is that these statistics should be useful for machine recognition of textures; that's something that we'll | 
|---|
| 0:17:39 | continue to explore | 
|---|
| 0:17:40 | thanks | 
|---|
| 0:17:46 | thanks | 
|---|
| 0:17:47 | questions? | 
|---|
| 0:17:52 | i have one question: | 
|---|
| 0:17:53 | do you expect the same kind of statistics to be useful in recognizing speech or other | 
|---|
| 0:17:58 | acoustic signals | 
|---|
| 0:17:59 | that are | 
|---|
| 0:17:59 | different, | 
|---|
| 0:18:00 | or something in between, you know, or something completely different? | 
|---|
| 0:18:02 | yes, i think one interesting notion | 
|---|
| 0:18:05 | is that, so, textures are things that are stationary, right, and so it makes sense to compute these summary | 
|---|
| 0:18:09 | statistics where you're averaging things over the | 
|---|
| 0:18:12 | length of the signal | 
|---|
| 0:18:13 | for signals where you're interested in the nonstationary structure, | 
|---|
| 0:18:17 | um | 
|---|
| 0:18:18 | what you might wanna do is | 
|---|
| 0:18:20 | compute those | 
|---|
| 0:18:20 | same statistics but averaged over local time windows, so that the statistics would give you sort of a | 
|---|
| 0:18:25 | trajectory | 
|---|
| 0:18:26 | over time | 
|---|
| 0:18:27 | of the way that the local structure changes. and so there's a lot of sounds | 
|---|
| 0:18:32 | that are kind of locally texture-like, but that have some kind of temporal evolution | 
|---|
| 0:18:36 | yeah so like | 
|---|
| 0:18:37 | yeah, when you get into bed, | 
|---|
| 0:18:39 | like the sounds | 
|---|
| 0:18:40 | you make as you kind of move over the sheets and everything's rustling, | 
|---|
| 0:18:44 | you know, those are different textures, but they're sort of sequenced in a particular way | 
|---|
| 0:18:48 | and | 
|---|
| 0:18:48 | um | 
|---|
| 0:18:49 | i don't know how useful that would be for something like speech, but certainly for, i think, some kinds of | 
|---|
| 0:18:53 | nonstationary sounds | 
|---|
| 0:18:55 | that sort of approach | 
|---|
| 0:18:56 | of looking at the temporal evolution of the statistics | 
|---|
| 0:18:59 | may be useful | 
|---|
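The windowed variant suggested in this answer could be sketched as below: compute the same summary statistics over short local windows rather than the whole clip, giving a trajectory of statistics over time. The window and hop lengths are illustrative choices.

```python
# Trajectory of local envelope statistics for nonstationary sounds.
import numpy as np

def statistic_trajectory(envs, fs, win_s=1.0, hop_s=0.5):
    """Per-window envelope means and variances: shape (n_windows, 2 * n_bands)."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    frames = []
    for start in range(0, envs.shape[1] - win + 1, hop):
        seg = envs[:, start:start + win]
        frames.append(np.concatenate([seg.mean(axis=1), seg.var(axis=1)]))
    return np.array(frames)
```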
| 0:19:08 | (audience question, inaudible) | 
|---|
| 0:19:09 | yeah, it's a very good question | 
|---|
| 0:19:11 | so | 
|---|
| 0:19:13 | i'm very interested in this; i think that | 
|---|
| 0:19:16 | what you might need to do is, actually, so all these statistics, | 
|---|
| 0:19:20 | like, i guess they're time averages, right, so like the correlation is an average of a product, right, or whatever, | 
|---|
| 0:19:24 | and the variance is the average of a squared deviation, | 
|---|
| 0:19:27 | and before you do the time average what you might need to do is do some kind of clustering | 
|---|
| 0:19:31 | um | 
|---|
| 0:19:31 | because | 
|---|
| 0:19:32 | you know, for instance, in the case of | 
|---|
| 0:19:34 | like, let's say you have, you know, wind, | 
|---|
| 0:19:36 | which | 
|---|
| 0:19:37 | um | 
|---|
| 0:19:38 | isn't really very correlated across channels, and, like, clapping, which is, | 
|---|
| 0:19:42 | right | 
|---|
| 0:19:43 | so you're gonna have sort of a mixture of these two very different kinds of | 
|---|
| 0:19:46 | events, | 
|---|
| 0:19:47 | some of which will be close to a hundred percent correlated and others of which will not, | 
|---|
| 0:19:50 | so if you just combine them you're gonna get a correlation of point five, which is not | 
|---|
| 0:19:54 | really a good representation | 
|---|
| 0:19:59 | yeah | 
|---|
| 0:20:07 | yeah, so i think it's an interesting problem, and i mean it is related to | 
|---|
| 0:20:10 | um | 
|---|
| 0:20:11 | some of the ways that people are thinking about | 
|---|
| 0:20:14 | sound segregation um | 
|---|
| 0:20:15 | in terms of clustering um | 
|---|
| 0:20:17 | and so you may really, you may have to do some kind of segregation in order to model it | 
|---|
| 0:20:22 | there are cases where it works, but | 
|---|
| 0:20:24 | cases where it doesn't | 
|---|
| 0:20:28 | i just have a quick question: for all these experiments, especially the subject and classification experiments, what | 
|---|
| 0:20:34 | was the sampling rate of the sound | 
|---|
| 0:20:35 | or clips you used? | 
|---|
| 0:20:37 | what was the sampling rate of the sound signals | 
|---|
| 0:20:39 | uh, probably | 
|---|
| 0:20:40 | twenty k | 
|---|
| 0:20:47 | yes | 
|---|
| 0:20:48 | yeah | 
|---|
| 0:20:49 | that's correct yeah | 
|---|
| 0:20:50 | i mean, the envelopes, so you know, the statistics are computed on envelopes, and of course those | 
|---|
| 0:20:54 | are, | 
|---|
| 0:20:55 | you know, slow things, right, so | 
|---|
| 0:20:58 | they have an effective sampling rate of something much lower like you know | 
|---|
| 0:21:01 | a few hundred or | 
|---|
| 0:21:04 | that's right, the actual sound files have, you know, pretty high, | 
|---|
| 0:21:06 | normal sampling rates | 
|---|
| 0:21:08 | did you check to see how | 
|---|
| 0:21:10 | the accuracy of the statistical estimates holds up, | 
|---|
| 0:21:14 | especially for short clip lengths, and whether | 
|---|
| 0:21:17 | the higher-order statistics really start failing when the length is short? yeah, yeah, it's a great question | 
|---|
| 0:21:23 | how | 
|---|
| 0:21:24 | how long were your files? well, so i tend to really work with pretty long things, like | 
|---|
| 0:21:29 | five seconds, but i have looked at shorter things, and | 
|---|
| 0:21:34 | most of the statistics are robust down to pretty short lengths | 
|---|
| 0:21:38 | when you start measuring things like kurtosis and stuff then it gets a bit less robust but | 
|---|
| 0:21:42 | that actually ends up not being that important for the synthesis, yeah, yeah | 
|---|
| 0:21:47 | well, the kurtosis of the, um, so the variance kind of does most of the work there, | 
|---|
| 0:21:53 | and the correlation stuff. have you tried higher-order statistics? | 
|---|
| 0:21:57 | uh, nothing beyond kurtosis, yeah | 
|---|
| 0:22:00 | i think | 
|---|
| 0:22:01 | (audience question, largely inaudible; it asks about nonparametric, kernel-based alternatives to the fixed parametric statistics) | 
|---|
| 0:22:45 | yeah, i think that's an interesting direction, i mean, | 
|---|
| 0:22:49 | we've been moving in directions like that but haven't tried it yet | 
|---|
| 0:22:55 | in the tutorial on music signal processing there was a discussion of texture-like effects and the | 
|---|
| 0:23:00 | perception of | 
|---|
| 0:23:02 | combination tones and chords | 
|---|
| 0:23:04 | have you looked at any correlations between your work and that kind of | 
|---|
| 0:23:10 | um | 
|---|
| 0:23:11 | i don't know exactly which you're referring to, but i mean, i know that the | 
|---|
| 0:23:14 | word texture is used a lot in the context of music, | 
|---|
| 0:23:17 | um | 
|---|
| 0:23:18 | where they're typically talking about kind of higher-level | 
|---|
| 0:23:20 | types of structure | 
|---|
| 0:23:21 | um | 
|---|
| 0:23:22 | i mean, i have tried to synthesize musical textures; | 
|---|
| 0:23:25 | it doesn't work that well | 
|---|
| 0:23:26 | um | 
|---|
| 0:23:27 | and there's | 
|---|
| 0:23:28 | a lot of interesting reasons for why i think that is; we need some more complicated statistics | 
|---|
| 0:23:33 | essentially for that | 
|---|
| 0:23:34 | but music is one of the things that it really doesn't work very well on; in general, things that | 
|---|
| 0:23:38 | are composed of sounds with pitch do not come out | 
|---|
| 0:23:40 | as | 
|---|
| 0:23:40 | well | 
|---|
| 0:23:41 | so it works great on like environmental sounds and | 
|---|
| 0:23:44 | you know, machine noise and things like that | 
|---|
| 0:23:47 | thanks so much | 
|---|