| 0:00:13 | so now we move away from the perceptual side of things to the statistical properties of | 
|---|
| 0:00:18 | sound | 
|---|
| 0:00:19 | you know, the sound of wind, or the sound of a babbling brook, | 
|---|
| 0:00:22 | is not formants; you know, it's not the kind of deterministic signal | 
|---|
| 0:00:26 | that a lot of us deal with in terms of speech | 
|---|
| 0:00:28 | so how do you recognise the statistical behavior of a sound | 
|---|
| 0:00:31 | Dan Ellis of Columbia and Josh McDermott | 
|---|
| 0:00:34 | will describe their sound texture representation | 
|---|
| 0:00:37 | and talk about its use in sound classification | 
|---|
| 0:00:40 | well | 
|---|
| 0:00:40 | alright, thanks Malcolm, and thanks everybody for showing up to our session | 
|---|
| 0:00:45 | so as Malcolm said, i'm gonna talk about texture | 
|---|
| 0:00:47 | today | 
|---|
| 0:00:48 | and textures are sounds that result from large numbers of acoustic events; they include things you hear | 
|---|
| 0:00:53 | all the time: the sound of rain, | 
|---|
| 0:00:59 | wind, | 
|---|
| 0:01:02 | birds | 
|---|
| 0:01:05 | running water | 
|---|
| 0:01:13 | crowd noise | 
|---|
| 0:01:16 | applause | 
|---|
| 0:01:19 | and fire | 
|---|
| 0:01:23 | so these kinds of sounds are common in the world, and it seems like they are important for a lot | 
|---|
| 0:01:26 | of tasks that humans have to perform and that we might wanna get machines to perform, | 
|---|
| 0:01:30 | like figure out where you are or what the weather's like for instance | 
|---|
| 0:01:34 | but in contrast to the vast literature on visual texture, both in human and machine vision, sound | 
|---|
| 0:01:39 | textures are largely unstudied | 
|---|
| 0:01:41 | so | 
|---|
| 0:01:42 | uh the question that we've been looking into is how can textures be represented | 
|---|
| 0:01:46 | and recognised. so there is some previous work on modeling sound texture; this is probably not a | 
|---|
| 0:01:51 | completely exhaustive list of the publications, but it's certainly a big chunk of them, so it's a pretty small | 
|---|
| 0:01:56 | literature | 
|---|
| 0:01:57 | there's also a lot of work on environmental sounds that's often inclusive of textures | 
|---|
| 0:02:00 | the work i'll be talking about is | 
|---|
| 0:02:02 | a little bit different from these approaches | 
|---|
| 0:02:04 | in that our perspective is that machine recognition might be able to get some clues from human texture | 
|---|
| 0:02:09 | perception, and so to this end | 
|---|
| 0:02:10 | this is very much in the spirit of the work that Nima talked about and what Dan was | 
|---|
| 0:02:14 | just talking about before me | 
|---|
| 0:02:17 | so | 
|---|
| 0:02:18 | we've been looking into how humans represent and recognise textures, and | 
|---|
| 0:02:22 | the starting point for the work is the observation that unlike | 
|---|
| 0:02:26 | the sounds that are made by individual events, like a spoken word, | 
|---|
| 0:02:29 | textures are stationary, so their essential properties don't change over time, and that's sort of one of the defining | 
|---|
| 0:02:34 | properties. so whereas | 
|---|
| 0:02:35 | the waveform of a word, here, clearly has a beginning and an end and a temporal evolution, | 
|---|
| 0:02:41 | the sound of rain is just kind of there | 
|---|
| 0:02:42 | right, so the qualities that make it rain | 
|---|
| 0:02:45 | don't change over time | 
|---|
| 0:02:46 | and so the key proposal is that, because they're stationary, textures can be captured by statistics, | 
|---|
| 0:02:51 | that is, just time averages of acoustic measurements | 
|---|
| 0:02:54 | if the thing doesn't change, we can just make the measurements, average them over time, and that ought to do | 
|---|
| 0:02:57 | a good job of capturing its qualities | 
|---|
| 0:03:00 | so what we propose is that | 
|---|
| 0:03:01 | when you recognise the sound of fire or the sound of rain | 
|---|
| 0:03:05 | you're recognising these summary statistics | 
|---|
| 0:03:08 | and whatever statistics your auditory system is measuring are presumably derived from auditory representations that we | 
|---|
| 0:03:15 | know something about; you've heard a bit about this in the first few talks | 
|---|
| 0:03:19 | so we know that sound is filtered by the cochlea; you can think of | 
|---|
| 0:03:22 | the output of the cochlea as sort of a subband representation | 
|---|
| 0:03:26 | we know that a lot of the information in the subbands is conveyed | 
|---|
| 0:03:28 | by their amplitude envelopes, | 
|---|
| 0:03:30 | after they've been compressed by the cochlea | 
|---|
| 0:03:33 | and there is now quite a bit of evidence that | 
|---|
| 0:03:35 | the envelopes, or the spectrogram-like representation that | 
|---|
| 0:03:39 | they comprise, | 
|---|
| 0:03:41 | are then | 
|---|
| 0:03:41 | subsequently filtered by | 
|---|
| 0:03:43 | another stage of filters that are often called modulation filters, and so | 
|---|
| 0:03:46 | the things that Nima was showing, those receptive fields of cortical neurons, | 
|---|
| 0:03:50 | are like this, although he was showing examples from the cortex where the tuning is a little | 
|---|
| 0:03:54 | bit more complicated; you see | 
|---|
| 0:03:56 | patterns in both frequency and time | 
|---|
| 0:03:58 | the modulation filters that we typically | 
|---|
| 0:04:00 | look at are those that mimic the things you find subcortically, in the inferior colliculus and | 
|---|
| 0:04:05 | thalamus, where things are | 
|---|
| 0:04:06 | primarily tuned for temporal modulation. so these little things here | 
|---|
| 0:04:09 | represent | 
|---|
| 0:04:10 | um | 
|---|
| 0:04:11 | the passband in temporal modulation | 
|---|
| 0:04:13 | frequency | 
|---|
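The two-stage front end described above (cochlear bandpass filters, compressed amplitude envelopes, then temporal modulation filters) could be sketched roughly as below. This is an illustrative reconstruction, not the speakers' code; the filter shapes, half-octave bandwidths, and compression exponent are assumptions.

```python
# Rough sketch of the auditory front end described in the talk: cochlear-style
# bandpass filters, compressed amplitude envelopes, then temporal modulation
# filters applied to each envelope. All parameters are illustrative.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def subband_envelopes(x, fs, center_freqs, compression=0.3):
    """Bandpass-filter x at each center frequency; return compressed envelopes."""
    envs = []
    for fc in center_freqs:
        lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25           # ~half-octave band (assumed)
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envs.append(np.abs(hilbert(band)) ** compression)  # envelope + compression
    return np.array(envs)                                  # shape: (n_bands, n_samples)

def modulation_bands(env, fs, mod_freqs=(2, 4, 8, 16, 32, 64)):
    """Apply octave-spaced temporal modulation filters to one envelope.
    (In practice the envelopes would be downsampled before this step.)"""
    out = []
    for fm in mod_freqs:
        sos = butter(2, [fm / 2 ** 0.5, fm * 2 ** 0.5], btype="band", fs=fs, output="sos")
        out.append(sosfiltfilt(sos, env))
    return np.array(out)                                   # shape: (n_mod_bands, n_samples)
```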
| 0:04:15 | so the question that we've been looking into is how much of texture perception can be captured with | 
|---|
| 0:04:21 | relatively simple summary statistics of representations like these, that we believe to be present in biological auditory systems, | 
|---|
| 0:04:27 | and whether these | 
|---|
| 0:04:27 | summary statistics would then be useful for machine recognition tasks | 
|---|
| 0:04:32 | so the methodological proposal that underlies most of the work | 
|---|
| 0:04:35 | is that synthesis is a very powerful way to test a perceptual theory | 
|---|
| 0:04:39 | and the notion is that if your brain represents sounds with some set of measurements, like statistics, | 
|---|
| 0:04:45 | then signals that have the same values of those measurements ought to sound the same to you | 
|---|
| 0:04:49 | and in particular, | 
|---|
| 0:04:50 | sounds that we synthesise | 
|---|
| 0:04:52 | to have the same measurements as some real-world recording | 
|---|
| 0:04:55 | ought to sound like another example of the same kind of thing, if the measurements that we use in | 
|---|
| 0:05:00 | synthesis | 
|---|
| 0:05:01 | are like the ones the brain uses to represent sound | 
|---|
| 0:05:04 | and we've been taking this approach with sound texture perception, synthesizing textures from statistics measured | 
|---|
| 0:05:09 | in real-world sounds | 
|---|
| 0:05:11 | so the basic idea is to take some example signal, like | 
|---|
| 0:05:14 | a recording of rain | 
|---|
| 0:05:15 | measure some statistics, and then synthesize new signals, | 
|---|
| 0:05:18 | constraining them only to have the same statistics, and in other respects making them as random as possible | 
|---|
| 0:05:23 | and the approach that we've taken here is very much inspired by work that was done | 
|---|
| 0:05:27 | quite a while back on visual texture, some of the authors of which | 
|---|
| 0:05:31 | are mentioned here | 
|---|
| 0:05:33 | so | 
|---|
| 0:05:34 | i'm just gonna give you a very simple toy example to illustrate the logic. let's suppose that we | 
|---|
| 0:05:38 | want to test the role of the power spectrum | 
|---|
| 0:05:40 | you might think that the power spectrum captures texture, so what we do is we measure the spectrum of some real- | 
|---|
| 0:05:44 | world texture like this | 
|---|
| 0:05:49 | now we just want to synthesize a random signal with the same spectrum; obviously this is really easy, we just filter noise, | 
|---|
| 0:05:53 | and then we listen to them and see what they sound like, and | 
|---|
| 0:05:56 | unfortunately for adherents of the power spectrum theory of texture, | 
|---|
| 0:06:00 | things generally sound like noise when you do this | 
|---|
| 0:06:04 | so something like rain sounds like noise | 
|---|
| 0:06:11 | and that is as opposed to this | 
|---|
| 0:06:18 | right, so this is not realistic | 
|---|
| 0:06:20 | and this tells us that we're not simply registering the spectrum when we recognise textures | 
|---|
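As a concrete version of the toy experiment just described, here is a minimal sketch of imposing only a recording's power spectrum on noise; it is illustrative only and is not the actual synthesis code.

```python
# Minimal sketch of the spectrum-only synthesis: keep the noise's random phase
# but impose the magnitude spectrum of the recorded texture.
import numpy as np

def spectrum_matched_noise(x):
    """Return a noise signal whose magnitude spectrum matches recording x."""
    target_mag = np.abs(np.fft.rfft(x))
    noise_phase = np.angle(np.fft.rfft(np.random.randn(len(x))))
    return np.fft.irfft(target_mag * np.exp(1j * noise_phase), n=len(x))
```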
| 0:06:24 | alright so the question is | 
|---|
| 0:06:26 | would additional simple statistics do any better | 
|---|
| 0:06:28 | and so we've been mostly looking at statistics of these two stages of representation: the envelopes of | 
|---|
| 0:06:34 | subbands | 
|---|
| 0:06:35 | and the modulation bands you can derive from them with simple linear filters | 
|---|
| 0:06:40 | and we've looked into how far we can get with very simple statistics: things like marginal moments, like | 
|---|
| 0:06:44 | the variance and the skew and the kurtosis, in addition to the mean, | 
|---|
| 0:06:47 | as well as pairwise correlations between different pieces of the representation, for instance different | 
|---|
| 0:06:52 | envelopes of different subbands | 
|---|
| 0:06:54 | or different modulation bands | 
|---|
| 0:06:56 | these statistics are generic | 
|---|
| 0:06:58 | they're not tailored to any specific natural sound | 
|---|
| 0:07:00 | but they are simple and they're easy to measure | 
|---|
| 0:07:02 | on the other hand, because of this, it's not obvious that they would account for much of sound recognition, but | 
|---|
| 0:07:06 | maybe they're a reasonable place to start | 
|---|
| 0:07:09 | now for these statistics to have any hope of being useful for recognition, at a minimum they have to yield | 
|---|
| 0:07:14 | different values for different types of sounds, and so | 
|---|
| 0:07:16 | i'm gonna quickly just give you a couple of examples to give you some intuition for what kinds | 
|---|
| 0:07:20 | of things these might capture | 
|---|
| 0:07:22 | so let's quickly look | 
|---|
| 0:07:23 | at some of the marginal moments of cochlear envelopes, the envelopes of the cochlear subbands | 
|---|
| 0:07:29 | these moments, again, are things like the mean and the variance and the skew, | 
|---|
| 0:07:32 | statistics that describe how the envelope amplitude is distributed. so you take | 
|---|
| 0:07:36 | a stripe of a | 
|---|
| 0:07:37 | cochlear spectrogram, | 
|---|
| 0:07:39 | you take the envelope, | 
|---|
| 0:07:40 | and collapse that across time to give you a histogram that gives you the frequency of occurrence of different amplitudes | 
|---|
| 0:07:45 | and this is a very simple sort of representation of sound, but as many of you will know, these | 
|---|
| 0:07:50 | kinds of amplitude distributions generally differ | 
|---|
| 0:07:53 | for natural sounds and for noise, and they vary between different kinds of natural sounds. so here's just a | 
|---|
| 0:07:57 | quick example | 
|---|
| 0:07:58 | these are amplitude histograms for noise, a recording of a stream, | 
|---|
| 0:08:02 | and a recording of geese, | 
|---|
| 0:08:04 | from one particular channel | 
|---|
| 0:08:06 | and the thing to note here is that although these distributions have about the same mean, | 
|---|
| 0:08:10 | indicating that there's roughly the same acoustic power in this channel, | 
|---|
| 0:08:13 | they have rather different shapes | 
|---|
| 0:08:16 | and you can also see this visually | 
|---|
| 0:08:17 | if you just look at the spectrograms: you can see that the pink noise is mostly grey, | 
|---|
| 0:08:22 | whereas the stream and the geese have got more black and white, and so in this case the white | 
|---|
| 0:08:25 | would be down here | 
|---|
| 0:08:26 | and the black would be up here, so they deviate more | 
|---|
| 0:08:29 | from the mean, with more high amplitudes and more low amplitudes | 
|---|
| 0:08:33 | so many of you probably recognise that | 
|---|
| 0:08:35 | this is an indication of the common observation that natural signals are sparser than noise | 
|---|
| 0:08:39 | so the intuition is that natural sounds contain events, like raindrops and geese calls, and | 
|---|
| 0:08:43 | these are infrequent, but when they occur they produce large amplitudes, and when they don't occur the amplitude | 
|---|
| 0:08:48 | tends to be low | 
|---|
| 0:08:49 | and the sparsity behaviour | 
|---|
| 0:08:51 | which alters the shape of these histograms | 
|---|
| 0:08:53 | is reflected in pretty simple statistics like the variance, which measures the spread of the distribution, | 
|---|
| 0:08:58 | and the skew, which measures the asymmetry about the mean | 
|---|
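A minimal sketch of the envelope marginal statistics just described (mean, variance, skew, and kurtosis of each cochlear envelope, computed as time averages); normalisation conventions here are assumptions, not the exact definitions used in the work.

```python
# Marginal moments of the subband envelopes, computed as plain time averages.
import numpy as np
from scipy.stats import skew, kurtosis

def envelope_moments(envs):
    """envs: (n_bands, n_samples) array of compressed subband envelopes."""
    return {
        "mean": envs.mean(axis=1),
        "variance": envs.var(axis=1),        # spread of the amplitude distribution
        "skew": skew(envs, axis=1),          # asymmetry about the mean
        "kurtosis": kurtosis(envs, axis=1),  # heavy tails, i.e. sparsity
    }
```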
| 0:09:02 | alright so one more example let's take a quick look at | 
|---|
| 0:09:04 | what kind of correlations we can observe | 
|---|
| 0:09:06 | between the envelopes of different channels | 
|---|
| 0:09:09 | and these things also vary across sounds, and one of the main reasons for this is the presence of broadband | 
|---|
| 0:09:13 | events | 
|---|
| 0:09:13 | so if you listen to the sound of fire | 
|---|
| 0:09:19 | fire has all these crackles and pops and clicks, | 
|---|
| 0:09:21 | and those crackles and pops are visible on the spectrogram as these vertical streaks | 
|---|
| 0:09:27 | so these broadband events induce dependencies between channels, because they excite them all at once, and you can | 
|---|
| 0:09:31 | see this if you look at correlations between channels. so this is just a big matrix of correlation coefficients | 
|---|
| 0:09:36 | between pairs of | 
|---|
| 0:09:37 | cochlear filters | 
|---|
| 0:09:39 | going from low frequency to high and low to high | 
|---|
| 0:09:41 | so the diagonal here has gotta be one, | 
|---|
| 0:09:43 | but the off-diagonal entries can be whatever, and you can see that for fire there's a lot of yellow | 
|---|
| 0:09:47 | and a lot of red, indicating that there are these correlations between channels. and not all sounds are like this; here's a | 
|---|
| 0:09:52 | stream, | 
|---|
| 0:09:53 | and you can see that there's mostly green here; oh, it looks yellow on the screen, but | 
|---|
| 0:09:56 | trust me, it's green, | 
|---|
| 0:09:58 | presumably because for a lot of water sounds the envelopes of the channels are mostly uncorrelated | 
|---|
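The cross-band correlation statistic shown in these matrices can be sketched very simply; this is illustrative only.

```python
# Correlation coefficients between every pair of subband envelopes.
# Broadband events (fire crackles, claps) drive these values up.
import numpy as np

def envelope_correlations(envs):
    """envs: (n_bands, n_samples); returns an (n_bands, n_bands) matrix."""
    return np.corrcoef(envs)  # diagonal is 1 by construction
```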
| 0:10:02 | okay, so these statistics, although they're simple, capture variation across sounds | 
|---|
| 0:10:07 | and the question we're | 
|---|
| 0:10:08 | trying to get at is whether they actually capture the sound of real-world textures | 
|---|
| 0:10:12 | so again, the strategy | 
|---|
| 0:10:13 | is to synthesize signals constrained only to have the same statistics as some real-world sound, | 
|---|
| 0:10:18 | but in other respects being as random as possible. the way we do that | 
|---|
| 0:10:21 | is by starting with a noise signal | 
|---|
| 0:10:23 | and then adjusting the noise signal to get it to have the desired statistics, | 
|---|
| 0:10:26 | turning it into some new signal | 
|---|
| 0:10:28 | the basic idea | 
|---|
| 0:10:29 | is to filter the noise with the same set of filters, giving you a subband representation, | 
|---|
| 0:10:34 | and then to adjust the subband envelopes via gradient descent | 
|---|
| 0:10:38 | to cause them to have the desired statistical properties. and so | 
|---|
| 0:10:41 | the statistics are just functions of the envelopes, and we can compute their gradients | 
|---|
| 0:10:44 | and then change the envelopes in the gradient direction till we get the desired statistics. so that gives us new | 
|---|
| 0:10:49 | subbands, and we add them back up to get a new sound signal we can listen to | 
|---|
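A toy sketch of the adjustment step just described: take gradient steps on a noise subband envelope so that one statistic (here just the variance) moves toward the value measured from the target recording. The real procedure imposes the whole statistic set jointly and iterates to convergence; the fixed step size and loop below are illustrative and assume roughly unit-scale envelopes.

```python
# Toy gradient descent driving the variance of a noise envelope toward a target.
import numpy as np

def match_variance(env, target_var, steps=500, lr=0.05):
    """Adjust `env` so its variance approaches `target_var` (illustrative only)."""
    env = env.copy()
    for _ in range(steps):
        mu = env.mean()
        err = env.var() - target_var              # current mismatch
        # gradient of (var - target)^2 w.r.t. each sample is 4*err*(env - mu)/N;
        # the 1/N is dropped so step sizes do not shrink with signal length
        env -= lr * 4.0 * err * (env - mu)
    return env
```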
| 0:10:53 | here's just a flowchart; i won't give you all the details here, but | 
|---|
| 0:10:56 | the basic strategy is to first measure the statistics of a real-world sound texture, after processing | 
|---|
| 0:11:02 | it in the auditory model, | 
|---|
| 0:11:03 | and then processing noise in the same way and altering its envelopes to give it the same statistics; there's | 
|---|
| 0:11:08 | an iterative process that you have to do to get this to converge, | 
|---|
| 0:11:11 | but the end result is a sound signal that shares the statistics of a real-world sound. so the question | 
|---|
| 0:11:15 | is, how do they sound | 
|---|
| 0:11:16 | um so | 
|---|
| 0:11:17 | we're asking this question because, again, if the statistics account for texture perception, then the synthetic signals should | 
|---|
| 0:11:22 | sound like new examples of the real thing | 
|---|
| 0:11:24 | um | 
|---|
| 0:11:26 | and interestingly, in many cases they do. so i'm gonna play you | 
|---|
| 0:11:28 | a sequence of synthetic sounds that were just generated from noise | 
|---|
| 0:11:32 | by forcing the noise to have some of the same statistics as various real-world sounds. so you get | 
|---|
| 0:11:36 | things that sound like rain, | 
|---|
| 0:11:41 | streams | 
|---|
| 0:11:44 | bubbles | 
|---|
| 0:11:47 | fire | 
|---|
| 0:11:50 | applause | 
|---|
| 0:11:53 | wind, | 
|---|
| 0:12:01 | birds | 
|---|
| 0:12:05 | and crowd noise | 
|---|
| 0:12:09 | so it also works for a lot of unnatural sounds, things like rustling paper | 
|---|
| 0:12:16 | or a jackhammer | 
|---|
| 0:12:21 | so the success of the synthesis suggests these statistics could underlie the representation and recognition of textures | 
|---|
| 0:12:27 | so we did a quick experiment to see whether this was true in human listeners | 
|---|
| 0:12:31 | people were presented with a five-second sound clip and had to identify it from five choices, so chance performance here | 
|---|
| 0:12:37 | is twenty percent | 
|---|
| 0:12:38 | and we presented them with | 
|---|
| 0:12:40 | synthetic signals that were synthesized with different numbers of statistical constraints, | 
|---|
| 0:12:44 | as well as the original | 
|---|
| 0:12:46 | you can see here that when we just match the power spectrum, people are above chance but not | 
|---|
| 0:12:50 | very good | 
|---|
| 0:12:51 | but the performance improves as we add in more statistics, and with the full set | 
|---|
| 0:12:55 | of | 
|---|
| 0:12:55 | the model that i showed you previously, | 
|---|
| 0:12:57 | it's about as good as with the originals | 
|---|
| 0:13:01 | so this all suggests that these simple statistics can in fact support recognition of real-world textures | 
|---|
| 0:13:07 | another point that's just worth quickly mentioning is that the synthesis here is not simply reproducing the original | 
|---|
| 0:13:11 | waveform. because the procedure | 
|---|
| 0:13:14 | is initialised with noise | 
|---|
| 0:13:15 | it turns out a different sound signal every time, one that shares only the statistical properties, and these are just three | 
|---|
| 0:13:20 | examples of | 
|---|
| 0:13:21 | waveforms that were synthesized from a single set of statistics measured in a single recording, and you get a | 
|---|
| 0:13:26 | very different thing each time, and you can make as many of these as you want. so the statistics are | 
|---|
| 0:13:30 | really capturing | 
|---|
| 0:13:31 | some more abstract property of the sound signal | 
|---|
| 0:13:34 | alright, so one other question is whether these texture statistics that seem to be implicated in human texture | 
|---|
| 0:13:39 | perception would also be useful for machine recognition | 
|---|
| 0:13:43 | and at present we don't really have an ideal task | 
|---|
| 0:13:45 | with which to test this, because we need lots and lots of labeled textures, and if any | 
|---|
| 0:13:49 | of you have those i'd be interested to get them | 
|---|
| 0:13:52 | but we had an idea that an interesting potential application for this | 
|---|
| 0:13:56 | might be video soundtrack classification. so as everybody knows, there's lots of interest these days | 
|---|
| 0:14:02 | in | 
|---|
| 0:14:02 | being able to search for video clips depending on their content | 
|---|
| 0:14:06 | and so we got our hands on | 
|---|
| 0:14:08 | a dataset courtesy of a colleague at Columbia, Yu-Gang Jiang, | 
|---|
| 0:14:12 | where they went and | 
|---|
| 0:14:13 | had a bunch of people view video clips in an interface like this | 
|---|
| 0:14:16 | so they would | 
|---|
| 0:14:17 | watch something like this and then | 
|---|
| 0:14:21 | uh | 
|---|
| 0:14:22 | but they would hear something like this | 
|---|
| 0:14:37 | alright, so that was the soundtrack, and then they would look at this thing and check all | 
|---|
| 0:14:41 | the boxes that applied. so some of these things are attributes you probably can't see in the back, but | 
|---|
| 0:14:45 | some of them are attributes of the video, | 
|---|
| 0:14:46 | others describe the audio; in this case the person has checked cheering and clapping, | 
|---|
| 0:14:51 | they said this was an outdoor environment, and so a whole bunch of labels get attached to | 
|---|
| 0:14:55 | each of these videos | 
|---|
| 0:14:57 | and so the idea is that the texture statistics can be used as features for svm classification: you | 
|---|
| 0:15:03 | can train up svms | 
|---|
| 0:15:04 | to recognise these particular labels and distinguish them from others | 
|---|
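A minimal sketch of the classification setup described here: collapse the texture statistics into one fixed-length feature vector per clip and train a one-versus-rest SVM per label. The feature choices and the scikit-learn classifier are assumptions for illustration, not the actual experiment code.

```python
# Texture statistics as fixed-length SVM features for one soundtrack label.
import numpy as np
from sklearn.svm import SVC

def texture_features(envs):
    """Concatenate summary statistics of the envelopes into one feature vector."""
    corr = np.corrcoef(envs)
    iu = np.triu_indices_from(corr, k=1)          # unique band pairs only
    return np.concatenate([envs.mean(axis=1), envs.var(axis=1), corr[iu]])

# X: one feature vector per clip, y: 1 where the label (e.g. "cheering") applies
# clf = SVC(kernel="rbf").fit(X_train, y_train)
# predicted = clf.predict(X_test)
```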
| 0:15:09 | in order for this to work, | 
|---|
| 0:15:10 | of course, the statistics have to give you different values for different labels, and what this plot shows is just | 
|---|
| 0:15:15 | the average values of the different statistics for the different labels in the set | 
|---|
| 0:15:20 | so the labels are going along the vertical axis and the different statistics along the horizontal axis, and the | 
|---|
| 0:15:25 | point here is just that as you scan down the columns the colours change | 
|---|
| 0:15:28 | right so | 
|---|
| 0:15:29 | the different labels are on average associated with different statistics | 
|---|
| 0:15:33 | um | 
|---|
| 0:15:33 | and so | 
|---|
| 0:15:34 | we find that you can do some degree of classification with these statistics | 
|---|
| 0:15:39 | um this is the average performance across all the categories | 
|---|
| 0:15:42 | which is overall modest, but one of the things to point out here is that | 
|---|
| 0:15:46 | you see a pattern that's qualitatively like what you see in the human observers; that is, | 
|---|
| 0:15:50 | performance is not that great when you just match the mean value of the envelopes, that is, the | 
|---|
| 0:15:55 | spectrum, | 
|---|
| 0:15:56 | but it gets better as you add in more statistics | 
|---|
| 0:15:59 | some labels get classified better than others; speech or music are pretty easy, as a lot of | 
|---|
| 0:16:03 | you probably know | 
|---|
| 0:16:04 | i'm showing this here in part | 
|---|
| 0:16:06 | just to show that the pattern that you get is a little different for different categories. so for music, | 
|---|
| 0:16:11 | for instance, | 
|---|
| 0:16:12 | the modulation power in the modulation bands seems to matter a lot; you get a gain in classification there, | 
|---|
| 0:16:17 | whereas for speech | 
|---|
| 0:16:18 | the cross-band correlations, | 
|---|
| 0:16:20 | which measure things like comodulation, matter more | 
|---|
| 0:16:24 | performance is poor for some of the labels, in part because they're acoustically heterogeneous, that is, they're | 
|---|
| 0:16:28 | just not really well suited to | 
|---|
| 0:16:30 | the representation we have. so one of the labels is urban, and you can imagine that that consists of like | 
|---|
| 0:16:35 | a lot of different kinds of sound textures, | 
|---|
| 0:16:37 | and so the statistics are not great there | 
|---|
| 0:16:39 | so it's not really an ideal task; it's sort of just more of a proof of concept | 
|---|
| 0:16:43 | i think to really use these statistics for classifying semantic categories like this, like urban, | 
|---|
| 0:16:48 | you probably have to first recognise particular textures like | 
|---|
| 0:16:51 | traffic or crowd noise | 
|---|
| 0:16:52 | and then link those labels to the category | 
|---|
| 0:16:55 | so the take-away messages here: the first thing is just that textures are ubiquitous | 
|---|
| 0:17:00 | and i think | 
|---|
| 0:17:01 | important and worth studying, and i think they may involve | 
|---|
| 0:17:04 | a unique form of representation relative to other kinds of auditory phenomena, namely summary statistics | 
|---|
| 0:17:11 | and so we find that | 
|---|
| 0:17:12 | naturalistic textures can be generated from | 
|---|
| 0:17:15 | relatively simple summary statistics of early auditory representations: marginal moments and pairwise correlations of cochlear and modulation filter outputs | 
|---|
| 0:17:23 | and the suggestion is that listeners are using similar statistics | 
|---|
| 0:17:26 | to recognise sound textures. so when you remember the sound of a fire or the sound of rain, | 
|---|
| 0:17:30 | we think you're just remembering | 
|---|
| 0:17:32 | some of these summary statistics | 
|---|
| 0:17:33 | um | 
|---|
| 0:17:34 | and the suggestion is that these statistics should be useful for machine recognition of textures; that's something that we'll | 
|---|
| 0:17:39 | continue to explore | 
|---|
| 0:17:40 | thanks | 
|---|
| 0:17:46 | thanks | 
|---|
| 0:17:47 | questions? | 
|---|
| 0:17:52 | i have one question: | 
|---|
| 0:17:53 | do you expect the same kind of statistics to be useful in recognizing speech or other | 
|---|
| 0:17:58 | acoustic signals | 
|---|
| 0:17:59 | that are | 
|---|
| 0:17:59 | different, | 
|---|
| 0:18:00 | or something in between, you know, or something completely different? | 
|---|
| 0:18:02 | yes, i think one interesting notion | 
|---|
| 0:18:05 | is that, so, textures are things that are stationary, right, and so it makes sense to compute these summary | 
|---|
| 0:18:09 | statistics where you're averaging things over the | 
|---|
| 0:18:12 | length of the signal | 
|---|
| 0:18:13 | for signals where you're interested in the nonstationary structure, | 
|---|
| 0:18:17 | um | 
|---|
| 0:18:18 | what you might wanna do is | 
|---|
| 0:18:20 | compute those | 
|---|
| 0:18:20 | same statistics but averaged over local time windows, so that the statistics would give you sort of a | 
|---|
| 0:18:25 | trajectory | 
|---|
| 0:18:26 | over time | 
|---|
| 0:18:27 | of the way that the local structure changes. and so there's a lot of sounds | 
|---|
| 0:18:32 | that are kind of locally texture-like, but that have some kind of temporal evolution | 
|---|
| 0:18:36 | yeah so like | 
|---|
| 0:18:37 | yeah, when you get into bed, | 
|---|
| 0:18:39 | like the sounds | 
|---|
| 0:18:40 | you make as you kind of move over the sheets and everything's rustling, | 
|---|
| 0:18:44 | you know, those are different textures, but they're sort of sequenced in a particular way | 
|---|
| 0:18:48 | and | 
|---|
| 0:18:48 | um | 
|---|
| 0:18:49 | i don't know how useful that would be for something like speech, but certainly for, i think, some kinds of | 
|---|
| 0:18:53 | nonstationary sounds | 
|---|
| 0:18:55 | that sort of approach | 
|---|
| 0:18:56 | of looking at the temporal evolution of the statistics | 
|---|
| 0:18:59 | may be useful | 
|---|
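The windowed variant suggested in this answer could be sketched as below: compute the same summary statistics over short local windows rather than the whole clip, giving a trajectory of statistics over time. The window and hop lengths are illustrative choices.

```python
# Trajectory of local envelope statistics for nonstationary sounds.
import numpy as np

def statistic_trajectory(envs, fs, win_s=1.0, hop_s=0.5):
    """Per-window envelope means and variances: shape (n_windows, 2 * n_bands)."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    frames = []
    for start in range(0, envs.shape[1] - win + 1, hop):
        seg = envs[:, start:start + win]
        frames.append(np.concatenate([seg.mean(axis=1), seg.var(axis=1)]))
    return np.array(frames)
```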
| 0:19:08 | (audience question, inaudible) | 
|---|
| 0:19:09 | yeah, it's a very good question | 
|---|
| 0:19:11 | so | 
|---|
| 0:19:13 | i'm very interested in this; i think that | 
|---|
| 0:19:16 | what you might need to do is, actually, so all these statistics, | 
|---|
| 0:19:20 | like, i guess they're time averages, right, so like the correlation is an average of a product, right, or whatever, | 
|---|
| 0:19:24 | and the variance is the average of a squared deviation, | 
|---|
| 0:19:27 | and before you do the time average what you might need to do is do some kind of clustering | 
|---|
| 0:19:31 | um | 
|---|
| 0:19:31 | because | 
|---|
| 0:19:32 | you know, for instance, in the case of | 
|---|
| 0:19:34 | like, let's say you have, you know, wind, | 
|---|
| 0:19:36 | which | 
|---|
| 0:19:37 | um | 
|---|
| 0:19:38 | isn't really very correlated across channels, and, like, clapping, which is, | 
|---|
| 0:19:42 | right | 
|---|
| 0:19:43 | so you're gonna have sort of a mixture of these two very different kinds of | 
|---|
| 0:19:46 | events, | 
|---|
| 0:19:47 | some of which will be close to a hundred percent correlated and others of which will not, | 
|---|
| 0:19:50 | so if you just combine them you're gonna get a correlation of point five, which is not | 
|---|
| 0:19:54 | really a good representation | 
|---|
| 0:19:59 | yeah | 
|---|
| 0:20:07 | yeah, so i think it's an interesting problem, and i mean it is related to | 
|---|
| 0:20:10 | um | 
|---|
| 0:20:11 | some of the ways that people are thinking about | 
|---|
| 0:20:14 | sound segregation um | 
|---|
| 0:20:15 | in terms of clustering um | 
|---|
| 0:20:17 | and so you may really, you may have to do some kind of segregation in order to model it | 
|---|
| 0:20:22 | there are cases where it works, but | 
|---|
| 0:20:24 | cases where it doesn't | 
|---|
| 0:20:28 | i just have a quick question: for all these experiments, especially the subject and classification experiments, what | 
|---|
| 0:20:34 | was the sampling rate of the sound | 
|---|
| 0:20:35 | or clips you used? | 
|---|
| 0:20:37 | what was the sampling rate of the sound signals | 
|---|
| 0:20:39 | uh, probably | 
|---|
| 0:20:40 | twenty k | 
|---|
| 0:20:47 | yes | 
|---|
| 0:20:48 | yeah | 
|---|
| 0:20:49 | that's correct yeah | 
|---|
| 0:20:50 | i mean, the envelopes, so you know, the statistics are computed on envelopes, and of course those | 
|---|
| 0:20:54 | are, | 
|---|
| 0:20:55 | you know, slow things, right, so | 
|---|
| 0:20:58 | they have an effective sampling rate of something much lower like you know | 
|---|
| 0:21:01 | a few hundred or | 
|---|
| 0:21:04 | that's right, the actual sound files have, you know, pretty high, | 
|---|
| 0:21:06 | normal sampling rates | 
|---|
| 0:21:08 | did you check to see how | 
|---|
| 0:21:10 | the accuracy of the statistical estimates holds up, | 
|---|
| 0:21:14 | especially for short clip lengths, and whether | 
|---|
| 0:21:17 | the higher-order statistics really start failing when the length is short? yeah, yeah, it's a great question | 
|---|
| 0:21:23 | how | 
|---|
| 0:21:24 | how long were your files? well, so i tend to really work with pretty long things, like | 
|---|
| 0:21:29 | five seconds, but i have looked at shorter things, and | 
|---|
| 0:21:34 | most of the statistics are robust down to pretty short lengths | 
|---|
| 0:21:38 | when you start measuring things like kurtosis and stuff then it gets a bit less robust but | 
|---|
| 0:21:42 | that actually ends up not being that important for the synthesis, yeah, yeah | 
|---|
| 0:21:47 | well, the kurtosis of the, um, so the variance kind of does most of the work there, | 
|---|
| 0:21:53 | and the correlation stuff. have you tried higher-order statistics? | 
|---|
| 0:21:57 | uh, nothing beyond kurtosis, yeah | 
|---|
| 0:22:00 | i think | 
|---|
| 0:22:01 | (audience question, largely inaudible; it asks about nonparametric, kernel-based alternatives to the fixed parametric statistics) | 
|---|
| 0:22:45 | yeah, i think that's an interesting direction, i mean, | 
|---|
| 0:22:49 | we've been moving in directions like that but haven't tried it yet | 
|---|
| 0:22:55 | in the tutorial on music signal processing there was a discussion of texture-like effects and the | 
|---|
| 0:23:00 | perception of | 
|---|
| 0:23:02 | combination tones and chords | 
|---|
| 0:23:04 | have you looked at any correlations between your work and that kind of | 
|---|
| 0:23:10 | um | 
|---|
| 0:23:11 | i don't know exactly which you're referring to, but i mean, i know that the | 
|---|
| 0:23:14 | word texture is used a lot in the context of music, | 
|---|
| 0:23:17 | um | 
|---|
| 0:23:18 | where they're typically talking about kind of higher-level | 
|---|
| 0:23:20 | types of structure | 
|---|
| 0:23:21 | um | 
|---|
| 0:23:22 | i mean, i have tried to synthesize musical textures; | 
|---|
| 0:23:25 | it doesn't work that well | 
|---|
| 0:23:26 | um | 
|---|
| 0:23:27 | and there's | 
|---|
| 0:23:28 | a lot of interesting reasons for why i think that is; we need some more complicated statistics | 
|---|
| 0:23:33 | essentially for that | 
|---|
| 0:23:34 | but music is one of the things that it really doesn't work very well on; in general, things that | 
|---|
| 0:23:38 | are composed of sounds with pitch do not come out | 
|---|
| 0:23:40 | as | 
|---|
| 0:23:40 | well | 
|---|
| 0:23:41 | so it works great on like environmental sounds and | 
|---|
| 0:23:44 | you know, machine noise and things like that | 
|---|
| 0:23:47 | thanks so much | 
|---|