So now we move away from the perceptual side of things to the statistical properties of sound. The sound of wind, or, say, the sound of a babbling brook, has no formants; it's not the kind of deterministic signal that most of us deal with in speech. So how do you recognize the statistical behavior of a sound? Dan Ellis of Columbia and Josh McDermott will describe their sound texture representation and talk about its use in sound classification.

Alright, thanks Malcolm, and thanks everybody for showing up to our session. As Malcolm said, I'm going to talk about texture today. Textures are sounds that result from large numbers of acoustic events, and they include things you hear all the time: the sound of rain, wind, birds, running water, crowd noise, applause, and fire. These kinds of sounds are common in the world, and it seems like they're important for a lot of tasks that humans perform, and that we might want machines to perform, like figuring out where you are or what the weather's like. But in contrast to the vast literature on visual texture, in both human and machine vision, sound textures are largely unstudied. So the question we've been looking into is how textures can be represented and recognized.

There is some previous work on modeling sound texture. This is probably not a completely exhaustive list of the publications, but it's certainly a big chunk of them, so it's a pretty small literature. There's also a lot of work on environmental sounds that is often inclusive of texture. The work I'll be talking about differs a little from these approaches, in that our perspective is that machine recognition might be able to get some clues from human texture perception, and in this sense it's very much in the spirit of the work described in the earlier talks. So we've been looking into
how humans represent and recognize textures. The starting point for the work is the observation that, unlike the sounds made by individual events, like a spoken word, textures are stationary: their essential properties don't change over time, and that's one of their defining properties. Whereas the waveform of a word clearly has a beginning, an end, and a temporal evolution, the sound of rain is just kind of there; the qualities that make it rain don't change over time. So the key proposal is that, because they're stationary, textures can be captured by statistics, that is, by time averages of acoustic measurements. If the thing doesn't change, we can just make the measurements, average them over time, and hope to do a good job of capturing its qualities. What we propose is that when you recognize the sound of fire or the sound of rain, you're recognizing these summary statistics.

Whatever statistics your auditory system is measuring are presumably derived from peripheral auditory representations that we know something about; you've heard a bit about this in the first few talks. We know that sound is filtered by the cochlea, and you can think of the output of the cochlea as a subband representation. We know that a lot of the information in the subbands is conveyed by their amplitude envelopes, after they've been compressed by the cochlea. And there is now quite a bit of evidence that the envelopes, or the spectrogram-like representation that they comprise, are subsequently filtered by another stage of filters that are often called modulation filters. The receptive fields of cortical neurons shown in the earlier talk are like this, although those were examples from the cortex, where the tuning is a little more complicated and you see patterns in both frequency and time.
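The two filtering stages just described (a cochlear bandpass filter plus envelope extraction, then modulation filtering of the envelope) can be sketched as below. The filter types, orders, and cutoff frequencies here are illustrative stand-ins, not the parameters of the actual auditory model.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 20000                       # audio sampling rate (Hz)
x = np.random.randn(fs)          # 1 s stand-in for a recorded texture

# Stage 1: one "cochlear" bandpass channel plus amplitude envelope.
sos = butter(4, [1000, 2000], btype="bandpass", fs=fs, output="sos")
subband = sosfiltfilt(sos, x)
envelope = np.abs(hilbert(subband))   # amplitude envelope
envelope = envelope ** 0.3            # compressive nonlinearity

# Stage 2: a "modulation" bandpass filter applied to the envelope,
# here tuned to roughly 4-8 Hz temporal modulations.
sos_mod = butter(2, [4, 8], btype="bandpass", fs=fs, output="sos")
modband = sosfiltfilt(sos_mod, envelope)

print(envelope.shape, modband.shape)
```

In the full model there would be a whole bank of cochlear channels and a bank of modulation filters per channel; this shows one path through the cascade.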
The modulation filters that we typically look at mimic the tuning you find in subcortical structures like the inferior colliculus and the thalamus, where responses are primarily tuned for temporal modulation; the little curves here represent the passbands in temporal modulation frequency.

So the question we've been looking into is how much of texture perception can be captured with relatively simple summary statistics of representations like these, which we believe to be present in biological auditory systems, and whether those summary statistics would then be useful for machine recognition tasks. The methodological proposal that underlies most of the work is that synthesis is a very powerful way to test a perceptual theory. The notion is that if your brain represents sounds with some set of measurements, like statistics, then signals that have the same values of those measurements ought to sound the same to you. In particular, sounds that we synthesize to have the same measurements as some real-world recording ought to sound like another example of the same kind of thing, if the measurements we use in the synthesis are like the ones the brain uses to represent sound.

We've been taking this approach with sound texture perception, synthesizing textures from statistics measured in real-world sounds. The basic idea is to take some example signal, like a recording of rain, measure some statistics, and then synthesize new signals, constraining them only to have the same statistics and in other respects making them as random as possible. The approach we've taken here is very much inspired by work done quite a while back on visual texture, some of the authors of which are mentioned here.

I'm going to give you a very simple toy example to illustrate the logic. Suppose we hypothesize that the power spectrum is all that matters; you might think the power spectrum explains texture. What we do is measure the spectrum of some real-world texture, like this one.
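The toy version of this hypothesis is easy to write down: keep the random phases of a noise signal but give it the Fourier magnitudes of the target recording. This is a minimal sketch with a synthetic stand-in for the real recording.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
target = rng.standard_normal(n) * np.hanning(n)  # stand-in for a recording
noise = rng.standard_normal(n)

# Combine the target's spectral magnitudes with the noise's phases.
T = np.fft.rfft(target)
N = np.fft.rfft(noise)
matched = np.fft.irfft(np.abs(T) * np.exp(1j * np.angle(N)), n)

# The new signal is random but has the target's power spectrum.
err = np.max(np.abs(np.abs(np.fft.rfft(matched)) - np.abs(T)))
print(err)
```

The residual `err` is at floating-point precision, confirming that the synthetic signal exactly matches the target spectrum while sharing none of its temporal structure.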
Now we just want a random signal with the same spectrum, which is really easy: we just filter noise. Then we listen to the results and see what they sound like. Unfortunately for the hypothesis that the power spectrum explains texture, things generally sound like noise when you do this. A stream, for instance, sounds like noise [plays synthetic version], as opposed to this [plays original]. So this is not realistic, and it tells us that we're not simply registering the spectrum when we recognize textures.

The question, then, is whether additional simple statistics do any better. We've mostly been looking at statistics of these two stages of representation: the cochlear subbands, and the modulation bands you can derive from them with simple linear filters. And we've looked into how far we can get with very simple statistics: marginal moments, like the variance, the skew, the kurtosis, and the mean, as well as pairwise correlations between different pieces of the representation, for instance different subbands or different modulation bands. These statistics are generic; they're not tailored to any specific natural sound. But they are simple, and they're easy to measure. On the other hand, because of this, it's not obvious that they would account for much of sound recognition; still, maybe they're a reasonable place to start.

Now, for these statistics to have any hope of being useful for recognition, at a minimum they have to yield different values for different types of sounds, so I'm going to quickly give you a couple of examples to provide some intuition for what kinds of things they might capture. Let's first look at some of the marginal moments of the cochlear envelopes, the envelopes of the cochlear subbands. These moments, again, are things like the mean, the variance, and the skew: statistics that describe how the amplitude is distributed. You take one stripe of a cochlear spectrogram, take its envelope, and collapse it across time to give you a histogram.
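The marginal moments just mentioned are straightforward to compute from an envelope. Here's a sketch, using a synthetic sparse envelope (mostly quiet, with rare large events) and a dense noise-like one; sparser signals push the skew and kurtosis up.

```python
import numpy as np

def envelope_moments(env):
    """Mean, variance, skew, kurtosis of an amplitude distribution."""
    mu = env.mean()
    var = env.var()
    skew = np.mean((env - mu) ** 3) / var ** 1.5
    kurt = np.mean((env - mu) ** 4) / var ** 2
    return mu, var, skew, kurt

rng = np.random.default_rng(1)
# Dense, noise-like envelope vs. a sparse one with rare large events.
dense = np.abs(rng.standard_normal(100000))
sparse = np.abs(rng.standard_normal(100000)) * (rng.random(100000) < 0.05)

print(envelope_moments(dense))
print(envelope_moments(sparse))
```

Running this shows much larger skew and kurtosis for the sparse envelope, which is exactly the difference between the natural-sound and noise histograms discussed here.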
The histogram gives you the frequency of occurrence of different amplitudes. This is a very simple sort of representation of sound, but as many of you will know, these kinds of amplitude distributions generally differ between natural sounds and noise, and they vary between different kinds of natural sounds. As a quick example, these are amplitude histograms, from one particular channel, for noise, for a recording of a stream, and for a recording of geese. The thing to note here is that although the distributions have about the same mean, indicating roughly the same acoustic power in this channel, their shapes are different. You can also see this visually if you just look at the spectrograms: the pink noise is mostly grey, whereas the stream and the geese have more black and white. In this case the white would be down here and the black up here, so those sounds deviate more from the mean, with more high amplitudes and more low amplitudes. Many of you will probably recognize this as an indication of the common observation that natural signals are sparser than noise. The intuition is that natural sounds contain events, like raindrops or geese calls, that are infrequent; when they occur they produce large amplitudes, and when they don't occur the amplitude tends to be low. This sparsity, which alters the shape of these histograms, is reflected in pretty simple statistics like the variance, which measures the spread of the distribution, and the skew, which measures the asymmetry about the mean.

Alright, one more example: let's take a quick look at what kinds of correlations we can observe between the envelopes of different channels. These also vary across sounds, and one of the main reasons is the presence of broadband events. If you listen to the sound of fire [plays fire], you hear the crackles and pops and clicks, and those crackles and pops are visible on a spectrogram as these vertical streaks. These broadband events induce dependencies between channels, because they excite them all at once, and you can see this if you look at correlations between channels. This is just a big matrix of correlation coefficients between pairs of cochlear filters, going from low frequency to high along each axis. The diagonal has to be one, but the off-diagonal entries can be anything, and you can see that for fire there's a lot of yellow and a lot of red, indicating substantial correlations between channels. Not all sounds are like this: here's a stream, and you can see it's mostly green (it looks yellow on this screen, but trust me, it's green), presumably because for a lot of water sounds the envelopes of different channels are mostly uncorrelated.

Okay, so these statistics, simple though they are, capture variation across sounds, and the question we're trying to get at is whether they actually capture the sound of real-world textures. Our strategy is to synthesize signals constrained only to have the same statistics as some real-world sound, but in other respects as random as possible. The way we do that is by starting with a noise signal and then adjusting it until it has the desired statistics, turning it into some new signal. The basic idea is to filter the noise with the same set of filters, giving you a subband representation, and then to adjust the subband envelopes via gradient descent to cause them to have the desired statistical properties. The statistics are just functions of the envelopes, so we can compute their gradients and change the envelopes in the gradient direction until we reach the desired values. That gives us new subbands, which we add back up to get a new sound signal we can listen to. Here's a flowchart; I won't give you all the details, but the basic strategy is to first measure the statistics of a real-world sound texture after processing it through the auditory model.
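As a concrete instance of the "measure the statistics" step, here's a sketch of the cross-channel envelope correlations from a moment ago. The "envelopes" are synthetic, with a shared train of broadband events standing in for the crackles of the fire example.

```python
import numpy as np

rng = np.random.default_rng(2)
n_channels, n_samples = 6, 50000

# Shared broadband "events" (crackle-like) that hit every channel at once.
events = 5.0 * (rng.random(n_samples) < 0.02)

# Each channel: independent noise plus the shared event train.
envs = np.abs(rng.standard_normal((n_channels, n_samples))) + events

C = np.corrcoef(envs)   # channel-by-channel correlation matrix
print(np.round(C, 2))   # diagonal is 1; off-diagonals raised by the events
```

Removing the shared `events` term gives a nearly diagonal matrix, i.e., the mostly uncorrelated, water-like case.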
Then we process noise in the same way, altering its envelopes to give it the same statistics. There's an iterative process you have to go through to get this to converge, but the end result is a sound signal that shares the statistics of the real-world sound.

So the question is: how do they sound? We're asking because, if the statistics account for texture perception, then the synthetic signals should sound like new examples of the real thing. And interestingly, in many cases they do. I'm going to play you a sequence of synthetic sounds, each generated from noise by forcing the noise to have some of the same statistics as a real-world sound. You'll hear things that sound like rain, streams, bubbles, fire, applause, wind, insects, birds, and crowd noise. It also works for a lot of unnatural sounds, things like rustling paper or a jackhammer. The success of the synthesis suggests that these statistics could underlie the representation and recognition of texture.

So we did a quick experiment to see whether this was true of human listeners. People were presented with a five-second sound clip and had to identify it from five choices, so chance performance here is twenty percent. We presented them with synthetic signals that were synthesized with different numbers of statistical constraints, as well as with the originals. You can see that when we just match the power spectrum, people are above chance but not very good; performance improves as we add in more statistics, and with the full set of the model I showed you previously, it's nearly as good as with the originals. This all suggests that these simple statistics can in fact support recognition of real-world textures.

Another point worth quickly mentioning is that the synthesis is not simply reproducing the original waveform. Because the procedure is initialized with noise, it turns out a different sound signal every time, signals that share only the statistical properties.
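The envelope-adjustment step can be illustrated on a single envelope. This is a minimal sketch, not the actual synthesis procedure: gradient descent drives one noise "envelope" toward a chosen target mean and variance (the targets and step size here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
env = np.abs(rng.standard_normal(2000))   # initial noise envelope
target_mean, target_var = 0.5, 0.2        # arbitrary target statistics
n = env.size
lr = n / 4.0                              # step size

for _ in range(200):
    mu, var = env.mean(), env.var()
    # Gradients of the squared statistic errors with respect to env.
    g_mean = 2.0 * (mu - target_mean) * np.ones(n) / n
    g_var = 2.0 * (var - target_var) * 2.0 * (env - mu) / n
    env -= lr * (g_mean + g_var)

print(env.mean(), env.var())
```

After the loop the envelope's mean and variance sit at the targets; the real procedure does the same thing jointly over many envelopes and many statistics, which is why it needs the iterative machinery mentioned above.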
These are just three examples of waveforms synthesized from a single set of statistics, measured on a single recording, and you get a very different signal each time; you can make as many of these as you want. So the synthesis really is capturing some more abstract property of the sound signal.

Alright, one other interesting question is whether these texture statistics, which seem to be implicated in human texture perception, would also be useful for machine recognition. At present we don't really have an ideal task with which to test this, because we'd need lots and lots of labeled textures, and if any of you have those I'd be interested to get them. But Dan had an idea that an interesting potential application might be video soundtrack classification. As everybody knows, there's lots of interest these days in being able to search for video clips depending on their content, and Dan got his hands on a dataset, courtesy of a colleague at Columbia, Yu-Gang Jiang, in which a bunch of people viewed video clips in an interface like this. They would watch something like this, and then they would hear something like this. [plays video soundtrack] Alright, so that was the soundtrack. Then they would look at this form and check all the boxes that applied. You probably can't see them in the back, but some of these are attributes of the video and others describe the audio; in this case the person has checked cheering and clapping, and they said this was an outdoor environment. So a whole bunch of labels get attached to each of these videos.

The idea is that the texture statistics can be used as features for SVM classification: you can train up SVMs to recognize these particular labels and distinguish them from the others. For this to work, of course, the statistics have to give you different values for different labels.
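The classification plumbing looks roughly like this. The talk uses SVMs; to keep this sketch dependency-free it substitutes a nearest-centroid classifier, and the two "labels" are synthetic clips (sparse, rain-like envelopes versus dense, noise-like ones), so everything here is an illustrative stand-in rather than the actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(4)

def texture_features(envs):
    """Per-channel marginal moments plus mean cross-channel correlation."""
    mu = envs.mean(axis=1)
    var = envs.var(axis=1)
    skew = np.mean((envs - mu[:, None]) ** 3, axis=1) / var ** 1.5
    c = np.corrcoef(envs)
    off_diag = c[np.triu_indices_from(c, k=1)].mean()
    return np.concatenate([mu, var, skew, [off_diag]])

def make_clip(sparse):
    envs = np.abs(rng.standard_normal((4, 5000)))
    if sparse:                          # rain-like: rare large events
        envs *= rng.random((4, 5000)) < 0.05
        envs *= 10
    return envs

X = np.array([texture_features(make_clip(s)) for s in [True] * 20 + [False] * 20])
y = np.array([1] * 20 + [0] * 20)

# Nearest-centroid stand-in for the SVMs (no train/test split; just plumbing).
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
print((pred == y).mean())
```

The point is only the shape of the pipeline: each clip is reduced to one fixed-length statistic vector, and a classifier is trained on those vectors per label.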
What this plot shows is the average values of the different statistics for the different labels in the set. The labels run along the vertical axis and the different statistics along the horizontal axis, and the point is just that as you scan down the columns, the colours change: different labels are, on average, associated with different statistics.

And we find that you can indeed do some degree of classification with these statistics. This is the average performance across all the categories, which is overall modest, but one thing to point out is that you see a pattern qualitatively like the one in the human observers: performance is not that great when you just match the mean value of the envelopes, that is, the spectrum, and it gets better as you add in more statistics. Some labels get categorized better than others; speech and music are pretty easy, as a lot of you probably know. I'm showing this in part because the pattern you get is a little different for different categories. For music, for instance, the power in the modulation bands seems to matter a lot; you get a big gain in classification there. Whereas for speech, the cross-band correlations, which measure things like comodulation, matter more. Performance is poor for some of the labels, in part because they're acoustically heterogeneous, that is, they're just not well suited to the representation we have. One of the labels is "urban", and you can imagine that it covers a lot of different kinds of sound textures, so the statistics are not great there. So it's not really an ideal task; it's more of a proof of concept. I think that to really use these statistics for classifying semantic categories like "urban", you'd probably have to first recognize particular textures, like traffic or crowd noise, and then link those texture labels to the category.

So, take-home messages. The first
thing is just that textures are ubiquitous, and I think they're important and worth studying, and I think they may involve a unique form of representation relative to other kinds of auditory phenomena, namely summary statistics. We find that naturalistic textures can be generated from relatively simple summary statistics of early auditory representations: marginal moments and pairwise correlations of cochlear and modulation filter responses. The suggestion is that listeners use similar statistics to recognize sound textures; when you remember the sound of a fire or the sound of rain, we think you're just remembering some of these summary statistics. And the further suggestion is that such statistics should be useful for machine recognition of textures, something we'll continue to explore. Thanks. [applause]

Q: Do you expect the same kind of statistics to be useful in recognizing speech, or other acoustic signals that are different, something completely nonstationary?

A: Yes, I think one interesting notion is this: textures are stationary, so it makes sense to compute summary statistics by averaging over the length of the signal. For signals where you're interested in the nonstationary structure, what you might want to do is compute those same statistics, but averaged over local time windows, so that the statistics give you a trajectory over time of the way the local structure changes. There are a lot of sounds that are locally texture-like but have some kind of temporal evolution. For instance, when you get into bed, the sound you make as you move over the sheets, with everything rustling: those are different textures, but they're sequenced in a particular way. I don't know how useful that would be for something like speech.
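The local-window idea in that answer can be sketched directly: compute the same statistics over short windows rather than over the whole signal, giving a trajectory. The window length and the synthetic "envelope" (which switches texture halfway through) are arbitrary choices here.

```python
import numpy as np

rng = np.random.default_rng(5)
fs = 400                                     # envelope sampling rate (Hz)
env = np.abs(rng.standard_normal(4 * fs))    # 4 s of envelope
env[2 * fs:] *= rng.random(2 * fs) < 0.05    # texture changes halfway through

win = fs // 2                                # 0.5 s windows
n_win = env.size // win
frames = env[: n_win * win].reshape(n_win, win)

local_var = frames.var(axis=1)               # one statistic per window
print(local_var)
```

The variance trajectory drops sharply at the halfway point, so the windowed statistics track the change that a whole-signal average would smear out.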
But certainly for some kinds of nonstationary sounds, I think that sort of approach, looking at the temporal evolution of the statistics, may be useful.

Q: [partially inaudible question about statistics of sound mixtures]

A: Yeah, it's a very good question, and I'm very interested in this. All of these statistics are time averages, right? The correlation is an average of a product, and the variance is the average of a squared deviation. I think what you might need to do, before you take the time average, is some kind of clustering. For instance, suppose you have wind, which is not very correlated across channels, mixed with clapping, which is. You're going to have a mixture of two very different kinds of events, some of which will be close to a hundred percent correlated and some of which won't, and if you just combine them you're going to get a correlation of, say, point five, which is not really a good representation of either. So I think it's an interesting problem, and it's related to some of the ways people are thinking about sound segregation in terms of clustering; you may really have to do some kind of segregation early on, or a model of it. There are cases where it works, but...

Q: I just have a quick question: for all the experiments, especially the subject classification, what was the sampling rate of the sound clips you used?

A: Probably twenty kilohertz, yes, that's correct. But the envelopes, which is where the statistics are computed, are of course lowpass things, so they have an effective sampling rate of something much lower, like a few hundred hertz. The actual sound files have a pretty high nominal rate, that's right.

Q: Did you check how the accuracy of the statistical estimates depends on the length of the clips, whether
the higher-order statistics really start failing when you shorten them?

A: Yeah, that's a great question. How long were your files, by the way? I've generally worked with pretty long clips, like five seconds, but I have looked at shorter ones, and most of the statistics are robust down to pretty short lengths. When you start measuring things like kurtosis it gets a bit less robust, but that actually turns out not to be that important for the synthesis.

Q: Which statistics do most of the work?

A: The variance kind of does most of the work there, and the correlations and so forth.

Q: Why stop at that order of statistics, nothing higher than kurtosis? [remainder partially inaudible; the questioner suggests a nonparametric alternative, e.g., collecting whole distributions or kernel-based measures rather than a fixed set of moments]

A: I think that's an interesting direction. We've been thinking about moving in directions like that, but we haven't tried it yet.

Q: In the tutorial on music signal processing there was a discussion of texture-like effects and the perception of combination tones and chords. Have you looked at any connections between your work and that kind of thing?

A: I don't know exactly what you're referring to, but I know the word "texture" is used a lot in the context of music, where they're typically talking about higher-level kinds of structure. I have tried to synthesize musical textures, and it doesn't work that well. There are a lot of interesting reasons why I think that is; essentially, we'd need some more complicated statistics for that. Music is one of the things that really doesn't work very well; in general, things that are composed of sounds with pitch
do not come out as well. It works great on things like environmental sounds and machine noise. Thanks. [applause]