0:00:13so now we move away from the perceptual side of things to the statistical properties
0:00:19you know the sound of wind um person the sound of building block
0:00:22um is a not formants you know it's that it's to ms to new deterministic signal
0:00:26which a lot of us deal with in terms of speech
0:00:28so how do you recognise the statistical behavior of a sound
0:00:31dan ellis at columbia and josh mcdermott
0:00:34will describe their sound texture representation
0:00:37and talk about its use in sound classification
0:00:40alright thanks malcolm and thanks everybody for showing up to our session
0:00:45so like malcolm said i'm gonna talk about textures
0:00:48and textures are sounds that result from large numbers of acoustic events and they include things you hear
0:00:53all the time like the sound of rain
0:01:05running water
0:01:13crowd noise
0:01:19and fire
0:01:23so these kinds of sounds are common in the world and it seems like they are important for a lot
0:01:26of tasks that humans have to perform and that we might wanna get machines to perform
0:01:30like figure out where you are or what the weather's like for instance
0:01:34but in contrast to the vast literature on visual texture both in human and machine vision sound
0:01:39textures are largely unstudied
0:01:42so the question that we've been looking into is how can textures be represented
0:01:46and recognised so there is some previous work on modeling sound texture and this is probably not a
0:01:51completely exhaustive list of the publications but it's certainly a big chunk of them so it's a pretty small list
0:01:57there's also a lot of work on environmental sounds that's often inclusive of texture
0:02:00the work i'll be talking about is
0:02:02a little bit different from these approaches
0:02:04in that our perspective is that machine recognition might be able to get some clues from human texture
0:02:09perception and so to this end
0:02:10this is very much in the spirit of the work that nima talked about um and in what did was
0:02:14just talking about um for me
0:02:18so we've been looking into how humans represent and recognise textures and
0:02:22the starting point for the work is the observation that unlike
0:02:26the sounds that are made by individual events like a spoken word
0:02:29textures are stationary so their essential properties don't change over time and that's sort of one of the defining
0:02:34properties so whereas
0:02:35the waveform of a word here clearly has a beginning and an end and temporal evolution
0:02:41the sound of rain is just kind of there
0:02:42right so the qualities that make it rain
0:02:45don't change over time
0:02:46and so the key proposal is that because they're stationary textures can be captured by statistics
0:02:51that are just time averages of acoustic measurements
0:02:54if the thing doesn't change we can just make some measurements average them over time and that ought to do
0:02:57a good job of capturing its qualities
0:03:00so what we propose is that
0:03:01when you recognise the sound of fire or the sound of rain
0:03:05you're recognising these summary statistics
0:03:08and whatever statistics your auditory system is measuring are presumably derived from peripheral auditory representations that we
0:03:15know something about and you've heard a bit about this in the first few talks
0:03:19so we know that sound is filtered by the cochlea and you can think of
0:03:22the output of the cochlea as sort of a subband representation
0:03:26we know that a lot of information in subbands is conveyed
0:03:28by their amplitude envelopes
0:03:30after they've been compressed by the cochlea
0:03:33and there is now quite a bit of evidence that
0:03:35the envelopes or the spectrogram like representation that
0:03:39they comprise
0:03:41are then
0:03:41subsequently filtered by
0:03:43another stage of filters that are often called modulation filters and so the
0:03:46things that nima was showing those receptive fields of cortical neurons
0:03:50are like this although he was showing examples from the cortex where the tuning is a little
0:03:54bit more complicated you see
0:03:56patterns in both frequency and time
0:03:58the modulation filters that we typically
0:04:00look at are those that mimic the things you find subcortically in the inferior colliculus and
0:04:05thalamus where things are
0:04:06primarily tuned for temporal modulation so these little things here
0:04:11are the passbands in temporal modulation
0:04:13frequency
0:04:15so the question that we've been looking into is how much of texture perception can be captured with
0:04:21relatively simple summary statistics of representations like these that we believe to be present in biological auditory systems
0:04:27and whether these
0:04:27summary statistics would then be useful for machine recognition tasks
0:04:32so the methodological proposal that underlies most of the work
0:04:35is that synthesis is a very powerful way to test a perceptual theory
0:04:39and the notion is that if your brain represents sounds with some set of measurements like statistics
0:04:45then signals that have the same values of those measurements ought to sound the same to you
0:04:49and in particular
0:04:50sounds that we synthesise
0:04:52to have the same measurements as some real-world recording
0:04:55ought to sound like another example of the same kind of thing if the measurements that we used
0:05:01are like the ones that the brain uses to represent sound
0:05:04and we've been taking this approach with sound texture perception synthesizing textures from statistics measured
0:05:09in real-world sounds
0:05:11so the basic idea is to take some example signal like
0:05:14a recording of rain
0:05:15measure some statistics and then synthesize new signals
0:05:18constraining them only to have the same statistics and in other respects making them as random as possible
0:05:23and the approach that we've taken here is very much inspired by work that was done
0:05:27quite a while back on visual texture some of the authors of which
0:05:31are mentioned here
0:05:34i'm just gonna give you a very simple toy example to illustrate the logic let's suppose that we
0:05:38want to test the role of the power spectrum
0:05:40you might think that the power spectrum plays a role in texture so what we do is we measure the spectrum of some real-world
0:05:44texture like this
0:05:49now we just want to synthesize a random signal with the same spectrum and that's really easy we just filter noise
0:05:53and then we listen to them and see what they sound like and
0:05:56unfortunately for the power spectrum theory of texture
0:06:00things generally sound like noise when you do this
0:06:04a certain like grant sounds like noise
0:06:11and this is as opposed to this
0:06:18right so this is not realistic
0:06:20and this tells us that we're not simply registering the spectrum when we recognise textures
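The power-spectrum toy example can be sketched in a few lines. This is only an illustration under stated assumptions, not the speaker's code: the talk says the noise is filtered, while the sketch below takes the equivalent shortcut of keeping the phases of a noise signal and imposing the magnitude spectrum of the target. The names `dft`, `inverse_dft_real`, `spectrum_matched_noise` and the toy click-train signal are all hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (adequate for short toy signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft_real(spec):
    """Inverse DFT, keeping the real part (input is Hermitian-symmetric)."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def spectrum_matched_noise(target, seed=0):
    """Return a noise signal whose magnitude spectrum equals that of `target`."""
    rng = random.Random(seed)
    target_mags = [abs(c) for c in dft(target)]
    noise_spec = dft([rng.gauss(0.0, 1.0) for _ in target])
    # Keep the phase of the noise, impose the magnitude of the target.
    shaped = [m * c / abs(c) for m, c in zip(target_mags, noise_spec)]
    return inverse_dft_real(shaped)


# Hypothetical toy "recording": a decaying click train, 64 samples long.
toy_texture = [math.exp(-0.3 * (t % 16)) for t in range(64)]
noise_version = spectrum_matched_noise(toy_texture)
```

The two signals share a magnitude spectrum but not a waveform, which is the situation the demo plays: same spectrum, yet the result sounds like noise rather than like the texture.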
0:06:24alright so the question is
0:06:26will additional simple statistics do any better
0:06:28and so we've been mostly looking at statistics of these two stages of representation the envelopes
0:06:35and the modulation bands you can derive from them with simple linear filters
0:06:40and we've looked into how far we can get with very simple statistics things like marginal moments like
0:06:44the variance and the skew the kurtosis and the mean
0:06:47as well as pairwise correlations between different pieces of the representation for instance between
0:06:52different subband envelopes
0:06:54or different modulation bands
0:06:56these statistics are generic
0:06:58they're not tailored to any specific natural sound
0:07:00but they are simple and they're easy to measure
0:07:02on the other hand because of this it's not obvious that they would account for much of sound recognition but
0:07:06maybe they're a reasonable place to start
0:07:09now for these statistics to have any hope of being useful for recognition at a minimum they have to yield
0:07:14different values for different types of sounds and so
0:07:16i'm gonna quickly give you a couple of examples to give you some intuition for what kinds
0:07:20of things these might capture
0:07:22so let's quickly look
0:07:23at some of the marginal moments of cochlear envelopes the envelopes of cochlear subbands
0:07:29these moments again things like the mean and the variance and the skew
0:07:32are statistics that describe how the envelope is distributed so you take
0:07:36a stripe of a
0:07:37cochlear spectrogram
0:07:39you take the envelope
0:07:40collapse that across time to give you a histogram that gives you the frequency of occurrence of different amplitudes
0:07:45and this is a very simple sort of representation of sound but as many of you will know these
0:07:50kinds of amplitude distributions generally differ
0:07:53for natural sounds and for noise and they vary between different kinds of natural sounds so just a
0:07:57quick example
0:07:58these are amplitude histograms for noise a recording of a stream
0:08:02and a recording of geese
0:08:04from one particular channel
0:08:06and the the thing to note here is that although these distributions have about the same mean
0:08:10indicating that there's roughly the same acoustic power in this channel
0:08:13the distributions have different shapes
0:08:16and you can also see this visually
0:08:17if you just look at the spectrograms you see that the pink noise is mostly grey
0:08:22whereas the stream and the geese have got more black and white and so in this case the white
0:08:25would be down here
0:08:26and the black would be up here so they deviate more
0:08:29from the mean with more high amplitudes and more low amplitudes
0:08:33so many of you will probably recognise that
0:08:35this is an indication of the common observation that natural signals are sparser than noise
0:08:39so the intuition is that natural sounds contain events like raindrops and geese calls and
0:08:43these are infrequent but when they occur they produce large amplitudes and when they don't occur the amplitude
0:08:48tends to be low
0:08:49and this sparsity behaviour
0:08:51which alters the shape of these histograms
0:08:53is reflected in pretty simple statistics like the variance which measures the spread of the distribution
0:08:58and the skew which measures the asymmetry about the mean
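As a rough illustration of the marginal moments just described, the sketch below computes the mean, variance, skew and kurtosis of an envelope and compares a dense, noise-like envelope with a sparse one that has the same mean. The two envelopes are hypothetical stand-ins written as plain lists; in the talk the envelopes come from cochlear subbands, which are not modelled here.

```python
import math


def marginal_moments(envelope):
    """Return (mean, variance, skew, kurtosis) of a list of envelope samples."""
    n = len(envelope)
    mean = sum(envelope) / n
    var = sum((v - mean) ** 2 for v in envelope) / n
    std = math.sqrt(var)
    skew = sum((v - mean) ** 3 for v in envelope) / (n * std ** 3)
    kurt = sum((v - mean) ** 4 for v in envelope) / (n * var ** 2)
    return mean, var, skew, kurt


# Hypothetical noise-like envelope: always close to its mean.
dense_envelope = [0.9 if t % 2 == 0 else 1.1 for t in range(1000)]

# Hypothetical sparse envelope: rare large events, silence otherwise.
sparse_envelope = [10.0 if t % 10 == 0 else 0.0 for t in range(1000)]

dense_stats = marginal_moments(dense_envelope)
sparse_stats = marginal_moments(sparse_envelope)
```

Both envelopes carry the same average amplitude, but the sparse one has a much larger variance and a strongly positive skew, which is the difference in histogram shape pointed out on the slide.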
0:09:02alright so one more example let's take a quick look at
0:09:04what kind of correlations we can observe
0:09:06between envelopes of different channels
0:09:09and these things also vary across sounds and one of the main reasons for this is the presence of broadband events
0:09:13so if you listen to the sound of fire
0:09:19fire has lots of crackles and pops and clicks
0:09:21and those crackles and pops are visible in the spectrogram as these vertical streaks
0:09:27so these broadband events induce dependencies between channels because they excite them all at once and you can
0:09:31see this if you look at correlations between channels so this is just a big matrix of correlation coefficients
0:09:36between pairs of
0:09:37cochlear filters
0:09:39going from low frequency to high and low to high
0:09:41so the diagonal here has gotta be one
0:09:43but the off-diagonal elements can be whatever and you can see that for fire there's a lotta yellow
0:09:47and a lot of red indicating that there are correlations between channels and not all sounds are like this
0:09:53and you can see that there's mostly green here it looks yellow on the screen but
0:09:56trust me it's green
0:09:58and that's because for a lot of water sounds the envelopes of channels are mostly uncorrelated
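The cross-channel correlation matrix can be sketched as follows. This is a minimal illustration with made-up data: the "fire-like" channels share a common train of broadband clicks, and the "water-like" channels are independent. The function names and both toy signals are assumptions for the example only.

```python
import math
import random


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (std_a * std_b)


def correlation_matrix(channels):
    """Matrix of correlation coefficients between all pairs of channels."""
    return [[correlation(a, b) for b in channels] for a in channels]


rng = random.Random(0)
n_samples = 2000

# Broadband clicks hit every channel at the same instants.
clicks = [5.0 if t % 20 == 0 else 0.0 for t in range(n_samples)]
fire_like = [[c + rng.random() for c in clicks] for _ in range(3)]

# Independent fluctuations in every channel.
water_like = [[rng.random() for _ in range(n_samples)] for _ in range(3)]

fire_matrix = correlation_matrix(fire_like)
water_matrix = correlation_matrix(water_like)
```

The diagonal is one in both matrices; the off-diagonal entries are large for the click-driven channels and close to zero for the independent ones.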
0:10:02okay so these statistics although they're simple they capture variation across sounds
0:10:07and the question we're
0:10:08trying to get at is whether they actually capture the sound of real-world textures
0:10:12so again the strategy
0:10:13is to synthesize signals constrained only to have the same statistics as some real-world sound
0:10:18but in other respects being as random as possible and the way we do that
0:10:21is by starting with the noise signal
0:10:23and then adjusting the noise signal to get it to have the desired statistics
0:10:26turning it into some new signal
0:10:28the basic idea
0:10:29is to filter the noise with the same set of filters giving you a subband representation
0:10:34and then to adjust the subband envelopes via gradient descent
0:10:38to cause them to have the desired statistical properties and so
0:10:41the statistics are just functions of the envelope so we can compute their gradient
0:10:44and then change the envelope in the gradient direction till we get the desired statistics so that gives us new
0:10:49subbands and we add them back up to get a new sound signal we can listen to
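The gradient-descent step can be sketched on a much smaller problem. The example below is an assumption-laden toy, not the procedure from the talk: it starts from gaussian noise and adjusts the samples until just two statistics, the mean and the variance, match target values. The real method works on subband envelopes and many more statistics; the function name, step size and targets here are hypothetical.

```python
import random


def synthesize(target_mean, target_var, n=500, steps=500, rate=0.02, seed=0):
    """Adjust a noise signal by gradient descent until its mean and variance
    match the targets. The loss is
        (mean - target_mean)^2 + (var - target_var)^2
    and `rate` is the step size after multiplying the gradient by n."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(steps):
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x) / n
        # d(mean)/dx_i = 1/n and d(var)/dx_i = 2 (x_i - mean)/n, so n times
        # the loss gradient is the expression in parentheses below.
        x = [
            v - rate * (2.0 * (mean - target_mean)
                        + 4.0 * (var - target_var) * (v - mean))
            for v in x
        ]
    return x


def mean_and_var(x):
    mean = sum(x) / len(x)
    return mean, sum((v - mean) ** 2 for v in x) / len(x)


# Two syntheses from different noise seeds, same target statistics.
first = synthesize(3.0, 4.0, seed=0)
second = synthesize(3.0, 4.0, seed=1)
```

Because each run starts from different noise, the two results agree in their statistics but not in their samples.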
0:10:53here is just a flowchart i won't give you all the details here but
0:10:56the basic strategy is to first measure the statistics of a real world sound texture after processing
0:11:02it in the auditory model
0:11:03and then processing noise in the same way and altering its envelopes to give them the same statistics it's
0:11:08an iterative process that you have to do to get this to converge
0:11:11but the end result is a sound signal that shares the statistics of a real world sound so the question
0:11:15is how do they sound
0:11:16so
0:11:17we're asking this question again because if the statistics account for texture perception well then the synthetic signals should
0:11:22sound like new examples of the real thing
0:11:26and interestingly in many cases they do so i'm gonna play you
0:11:28a sequence of synthetic sounds that are just generated from noise
0:11:32by forcing the noise to have some of the same statistics as various real world sounds so you get
0:11:36things that sound like rain
0:12:05a crowd noise
0:12:09so it also works for a lot of a natural sounds things like rustling paper
0:12:16or a jackhammer
0:12:21so the success of the synthesis suggests these statistics could underlie the representation and recognition of textures
0:12:27so we did a quick experiment to see whether this was true in human listeners
0:12:31people were presented with a five second sound clip and had to identify it from five choices so chance performance here
0:12:37is twenty percent
0:12:38and we presented them with
0:12:40synthetic signals that were synthesized with different numbers of statistical constraints
0:12:44as well as the original
0:12:46you can see here that when we just match the power spectrum people are above chance but not
0:12:50very good
0:12:51but the performance improves as we add in more statistics and with the full set
0:12:55of
0:12:55the model that i showed you previously
0:12:57it almost gets to the level of the originals
0:13:01so this all suggests that these simple statistics can in fact support recognition of real world textures
0:13:07another point that's just worth quickly mentioning is that the synthesis here is not simply reproducing the original
0:13:11waveform so because the procedure
0:13:14is initialised with noise
0:13:15it turns out a different sound signal every time that shares only the statistical properties and these are just three
0:13:20examples of
0:13:21waveforms that were synthesized from a single set of statistics measured in a single recording and you get a
0:13:26very different thing each time and you can make as many of these as you want so the statistics are
0:13:30really capturing
0:13:31some more abstract property of the sound signal
0:13:34alright so one other interesting question is whether these texture statistics that seem to be implicated in human texture
0:13:39perception would also be useful for machine recognition
0:13:43and at present we don't really have an ideal task
0:13:45with which to test this because we need lots and lots of labeled textures and if any
0:13:49of you have those i'd be interested to get them
0:13:52but dan had an idea that an interesting potential application for this
0:13:56might be video soundtrack classification so as everybody knows there's lots of interest these days
0:14:02in being able to search for video clips depending on their content
0:14:06and dan got his hands on
0:14:08a dataset courtesy of a colleague at columbia yu-gang jiang
0:14:12where yu-gang
0:14:13had a bunch of people view video clips in an interface like this
0:14:16so they would
0:14:17watch something like this and then
0:14:22but they would hear something like this
0:14:30funny here
0:14:32powerpoint got a low
0:14:37alright so that was the soundtrack and then they would look at this thing and check all
0:14:41the boxes that applied and so some of these things are attributes you probably can't see them in the back but
0:14:45some of them are attributes of the video
0:14:46others describe the audio so in this case the person has checked cheering and clapping
0:14:51they said this was a outdoor in a all environments so there's a whole bunch of labels that get attached to
0:14:55each of these videos
0:14:57and so the idea is that the texture statistics can be used as features for svm classification so you
0:15:03can train up svms
0:15:04to recognise these particular labels and distinguish them from others
0:15:09in order for this to work
0:15:10of course the statistics have to give you different values for different labels and what this plot shows is just
0:15:15the average values of the different statistics for the different labels in the set
0:15:20so the labels are going on the vertical axis and the different statistics on the horizontal axis and the
0:15:25point here is just that as you scan down the columns the colours change
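To make the classification step concrete, here is a minimal sketch of a linear svm trained by stochastic subgradient descent on the hinge loss. Everything in it is an assumption for illustration: the talk does not say which svm variant or kernel was used, and the two-dimensional feature vectors below are made-up stand-ins for texture statistics of clips that do or do not carry some label.

```python
import random


def train_linear_svm(features, labels, epochs=20, rate=0.1, reg=0.01):
    """Fit weights w and bias b by stochastic subgradient descent on the
    regularised hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1.0:
                # Inside the margin: the hinge term is active.
                w = [wi + rate * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += rate * y
            else:
                # Outside the margin: only the regulariser contributes.
                w = [wi - rate * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 for a single feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0 else -1


# Hypothetical data: each clip is summarised by two statistics, and clips
# with the label (+1) have higher values than clips without it (-1).
rng = random.Random(0)
features, labels = [], []
for i in range(40):
    y = 1 if i % 2 == 0 else -1
    features.append([2.0 * y + rng.uniform(-0.5, 0.5),
                     2.0 * y + rng.uniform(-0.5, 0.5)])
    labels.append(y)

w, b = train_linear_svm(features, labels)
accuracy = sum(predict(w, b, x) == y
               for x, y in zip(features, labels)) / len(labels)
```

On well-separated toy data like this the classifier is nearly trivial; the soundtrack labels in the talk are far less cleanly separated, which is consistent with the modest performance reported.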
0:15:28right so
0:15:29the different labels are on average associated with different statistics
0:15:33and so
0:15:34we find that you can do some degree of classification with these statistics
0:15:39this is the average performance across all the categories
0:15:42which is overall modest but one of the things to point out here is that
0:15:46you see a pattern that's qualitatively like what you see in the human observers that is
0:15:50performance is not that great when you just match the mean value of the envelopes that is the spectrum
0:15:56and it gets better as you add in more statistics
0:15:59some labels get categorized better than others speech or music are pretty easy as a lot
0:16:03of you probably know
0:16:04i'm showing here part of
0:16:06this to show that the pattern that you get is a little different for different categories so for music
0:16:11for instance
0:16:12the modulation power in the modulation bands seems to matter a lot you get an improvement in classification there
0:16:17whereas for speech
0:16:18the cross-band correlations
0:16:20that measure things like comodulation matter more
0:16:24performance is poor for some of the labels i think in part because they're acoustically heterogeneous that is it's
0:16:28just not really well suited to
0:16:30the representation we have so one of the labels is urban and you can imagine that that consists of like
0:16:35a lot of different kinds of sound textures
0:16:37and so the statistics are not great there
0:16:39so it's not really an ideal task it's sort of just more of a proof of concept
0:16:43um i think to really use these statistics for classifying semantic categories like this like urban
0:16:48you probably have to first recognise particular textures like
0:16:51traffic or crowd noise
0:16:52and then link those labels to the category
0:16:55so take home messages here the first thing is just that textures are ubiquitous
0:17:00and i think
0:17:01important and worth studying and i think they may involve
0:17:04a unique form of representation relative to other kinds of auditory phenomena namely summary statistics
0:17:11and so we find that
0:17:12naturalistic textures can be generated from
0:17:15relatively simple summary statistics of early auditory representations marginal moments and pairwise correlations of cochlear and modulation filters
0:17:23and the suggestion is that listeners are using similar statistics
0:17:26to recognise sound texture so when you remember the sound of a fire or the sound of rain
0:17:30we think you're just remembering
0:17:32some of these summary statistics
0:17:34and the suggestion is that summary statistics should be useful for machine recognition of textures that's something that we'll
0:17:39continue to explore
0:17:52i have one question
0:17:53do you expect the same kind of statistics to be useful in recognizing speech or other
0:17:58acoustic signals
0:18:00something between you know something completely that
0:18:02yes i think one interesting notion
0:18:05is that so textures are things that are stationary right and so it makes sense to compute these summary
0:18:09statistics where you're averaging things over the
0:18:12length of the signal
0:18:13for signals where you're interested in the nonstationary structure
0:18:18what you might wanna do is
0:18:20compute those
0:18:20statistics but averaged over local time windows so that the statistics would give you sort of a
0:18:26uh over time
0:18:27of the way that the local structure changes and so there's a lot of sounds
0:18:32that are kind of locally texture like but they have some kind of temporal evolution
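The idea of averaging over local time windows rather than the whole signal can be sketched as below. This is a hedged toy: the window length, the choice of mean and variance as the only statistics, and the two-part envelope are all assumptions made for the example.

```python
def windowed_stats(envelope, window):
    """Return a list of (mean, variance) pairs, one per non-overlapping
    window of `window` samples."""
    stats = []
    for start in range(0, len(envelope) - window + 1, window):
        chunk = envelope[start:start + window]
        mean = sum(chunk) / window
        var = sum((v - mean) ** 2 for v in chunk) / window
        stats.append((mean, var))
    return stats


# Hypothetical envelope: a quiet texture followed by a louder, more
# variable one, each 200 samples long.
quiet = [0.1 + 0.05 * ((t * 7) % 5) for t in range(200)]
loud = [1.0 + 0.5 * ((t * 7) % 5) for t in range(200)]
trajectory = windowed_stats(quiet + loud, window=100)
```

Each entry of `trajectory` summarises one window, so the sequence of entries shows how the local statistics change from the first texture to the second.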
0:18:36yeah so like
0:18:37when you get into bed
0:18:39like the sound
0:18:40you make as you kind of go over the sheets and everything's rustling
0:18:44you know those are different textures but they're sort of sequenced in a particular way
0:18:49i don't know how useful that would be for something like speech but certainly i think for some kinds of
0:18:53nonstationary sounds
0:18:55that sort of approach
0:18:56of looking at the temporal evolution of the statistics
0:18:59may be useful
0:19:09yeah that's a very good question
0:19:13i'm very interested in this i think that
0:19:16what you might need to do is actually so all these statistics
0:19:20i guess they're time averages right so like the correlation is an average of a product right or
0:19:24the variance is the average of a squared deviation
0:19:27and before you do the time average what you might need to do is do some kind of clustering
0:19:32you know for instance in in the case of
0:19:34um like let's see have you know we and
0:19:38that isn't really very correlated across channels and like clapping that is
0:19:43so you're gonna have sort of a mixture of these two very different kinds of
0:19:47events some of which will be close to a hundred percent correlated and others which are not
0:19:50so if you just combine them you're gonna get a correlation of point five which is not
0:19:54really a good representation
0:20:07yeah so i think it's an interesting problem and i mean it is related to
0:20:11some of the ways that people are thinking about
0:20:14sound segregation
0:20:15in terms of clustering
0:20:17and so you may really have to do some kind of segregation in your model
0:20:22there are cases where it works but
0:20:24case word
0:20:28i just have a quick question for all the experiments especially the soundtrack classification what
0:20:34was the sampling rate of the sound
0:20:35clips you used
0:20:37what was the sampling rate of the sound signals
0:20:39uh pride
0:20:40when T K
0:20:49that's correct yeah
0:20:50i mean the envelopes so you know these are envelopes of course those
0:20:55you know they're slow things right so
0:20:58they have an effective sampling rate of something much lower like you know
0:21:01a few hundred or
0:21:04that's right the actual sound files have you know pretty high
0:21:06normal sent
0:21:08it you check to see down
0:21:10fact and the accuracy at a statistical nature its um
0:21:14especially by don't links then that K to attack channel i
0:21:17it's higher arg statistics really stack failing we you the link it's right or yeah yeah i a great question
0:21:24how long were your files so i typically work with pretty long things like
0:21:29five seconds but i have looked at things shorter and
0:21:34most of the statistics are robust down to pretty short lengths
0:21:38when you start measuring things like kurtosis and stuff then it gets a bit less robust but
0:21:42that actually ends up not being that important for the synthesis
0:21:47so the variance kind of does most of the work there
0:21:53and the correlations stuff half i did you yeah why that order statistics
0:21:57uh nothing more than kurtosis yeah
0:22:00i think
0:22:03it looks like that way you want
0:22:05a of you all take non parametric
0:22:08a man
0:22:09where you collect all somewhere i
0:22:11a to anyone at all
0:22:13now have a look at that you and power macs
0:22:17actually may
0:22:20oh oh in
0:22:22each it
0:22:25it's a little
0:22:26you know which
0:22:35i can say what
0:22:36you you have and from
0:22:41oh those i one what
0:22:45yeah i think that that's an interesting direction um i mean
0:22:49we've been moving in directions like that but haven't tried that yet
0:22:55in the tutorial on music signal processing there was a discussion of texture like effects and the
0:23:00perception of
0:23:02combination tones and chords
0:23:04have you looked at any correlations between your work and that kind of
0:23:11i don't know exactly which you're referring to but i mean i know that the
0:23:14word texture is used a lot in the context of music
0:23:18where typically they're talking about kind of higher level
0:23:20types of a
0:23:22i mean i have tried to synthesize musical textures
0:23:25it doesn't work that well
0:23:27and there's
0:23:28a lot of interesting reasons for why i think that is i think we need some more complicated statistics
0:23:33essentially for that
0:23:34but music is one of the things that really doesn't work very well and in general things that
0:23:38are composed of sounds with pitch do not
0:23:40as that
0:23:41so it works great on like environmental sounds and
0:23:44machine noise and things like that
0:23:47thanks to us