Okay, so this presentation is on a completely different project: far less theoretical, far fewer convergence proofs. We have a bioacoustics group that is interested in learning species distributions, basically a connection between signal processing, machine learning, and ecology, and we formed a group that looks at this problem: placing microphones in the forest and listening to birds. Lawrence Neal, who at the time we wrote this paper was, I think, a third-year grad student, was very enthusiastic about this problem and wanted to help out; he got into the problem and very quickly caught up with the signal processing and machine learning techniques used. Forrest Briggs is a PhD student in the group, and basically we all collaborate on this. So the bird bioacoustics project deals with a variety of ecological questions and how to solve them using audio: set out a microphone in the forest, listen to the audio, and determine species distributions, individual counts, and so on. The other motivation for this particular paper comes from audio segmentation. Rather than the classical one-dimensional audio segmentation that is typically used on this type of data, we present a time-frequency approach to audio segmentation which, rather than being a threshold-based method, is based on classification, that is, learning from examples. Specifically, I will present a segmentation system that uses a random forest classifier to segment, and we will show some results. Here is the general idea. I think this particular picture was taken at the Hubbard Brook long-term ecological research site, and the idea is the following: people interested in estimating species distributions for the purposes of ecology research send an individual out into the field.
At each of these points (this is actually a topographic map) the individual will stand for ten minutes and then move on to the next point. Since that sounds pretty exhausting, they send multiple individuals to do it. From a sample of ten minutes per point during the day, the idea is to estimate the distribution of species, and of course everybody here who knows sampling realizes that something must be very wrong with just ten minutes during the day if we want an accurate distribution of species. Part of the idea is to keep these maps across the years and learn how species distributions change over time, and, once we estimate the distribution, to integrate that with ecological research, in other words to connect the distributions to environmental parameters. In this project we have actually placed microphones at the H. J. Andrews long-term ecological research site. We are using the Song Meter, which is commonly used in bird sound analysis, placed in a variety of locations; in this preliminary study we put it in fifteen places. This is not fully automatic: somebody has to replace the batteries every two weeks because of the amount of recording, and also replace the memory cards, because it takes a lot of memory to record over a couple of weeks. So the system we developed for estimating species distributions involves placing automated recorders, collecting the data from these recorders, converting the recordings to spectrograms, and performing segmentation on the spectrograms in order to extract syllables. The reason we work in the spectrogram is that birds tend to vocalize simultaneously, so a one-dimensional segmentation would not have been useful here.
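The first automated stage above, turning a field recording into a spectrogram, can be sketched as follows. This is a minimal numpy short-time Fourier transform; the frame length, hop size, and Hann window are illustrative assumptions, not the settings used in the actual deployment.

```python
# Sketch of the recording-to-spectrogram stage. Frame/hop sizes are
# illustrative assumptions, not the values used in the real system.
import numpy as np

def spectrogram(audio, frame_len=512, hop=256):
    """Log-magnitude STFT with a Hann window; rows = frames, cols = freq bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only non-negative frequencies: frame_len // 2 + 1 bins
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-10)  # log compression, as commonly displayed

# Example: one second of a synthetic 2 kHz "syllable" sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 2000 * t))
print(spec.shape)  # → (61, 257)
```

With these settings a 2 kHz tone at 16 kHz sampling lands exactly in frequency bin 64 (bin spacing 16000/512 = 31.25 Hz), which is an easy sanity check on the axis orientation.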
The idea is that you take those syllables, extract features from each syllable, and then build a probabilistic model. The goal and the focus of the project, though not of this particular paper, is how to build a probabilistic model that takes a collection of recordings, each recording labeled with the species present (or a subset of the recordings labeled with species present and species absent), learns from that collection, and is then able to analyze a new recording that has no labels, in other words to pinpoint in the recording which syllable belongs to which bird and which species are present in that recording. As I mentioned before, birds tend to have many independent vocalizations, and often these vocalizations overlap in time. In order to communicate, birds tend to pick different frequency channels so that they do not overlap; after all, what is the point of communicating with your mate if they cannot hear you? So birds tend to space themselves in frequency as well. A one-dimensional segmentation may look fine when there is no overlap between singing birds, but in practice a lot of the recordings we get look something like this; some of it is barely visible. What I would like to point out is that this is the frequency axis and this is the time axis, and one can see that at an individual time instant there is overlap between syllables from one bird species and syllables from other species. So in this particular paper the focus is on how to obtain a good segmentation. Even though segmentation is not the final goal for us (the signal processing goal is developing the models and estimating the species distribution in a new recording), it is probably one of the most important problems in practice.
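To make the weakly labeled setup concrete, here is a minimal sketch of the data layout: each training recording is a bag of syllable feature vectors plus a recording-level set of species, and the goal is per-syllable labels on a new, unlabeled recording. The nearest-centroid labeler below is a deliberately naive stand-in for the project's probabilistic model, and the feature values and species names are invented for illustration.

```python
# Minimal sketch of the weakly labeled setting: bags of syllable features
# with recording-level species sets. The nearest-centroid labeler is a
# naive placeholder, NOT the project's actual model; all values are toy.
import numpy as np

# Training bags: (syllable_features, species present in the whole recording)
recordings = [
    (np.array([[0.9, 0.1], [0.8, 0.2]]), {"song sparrow"}),
    (np.array([[0.1, 0.9], [0.2, 0.8]]), {"hermit warbler"}),
    (np.array([[0.85, 0.15], [0.15, 0.85]]), {"song sparrow", "hermit warbler"}),
]

# Crude per-species centroids: average the syllables of every recording in
# which the species appears (ignores the ambiguity a real model must resolve).
centroids = {}
for feats, species in recordings:
    for s in species:
        centroids.setdefault(s, []).append(feats.mean(axis=0))
centroids = {s: np.mean(v, axis=0) for s, v in centroids.items()}

def label_syllables(syllables):
    """Assign each syllable of an unlabeled recording to the nearest centroid."""
    names = sorted(centroids)
    dists = np.array([[np.linalg.norm(x - centroids[n]) for n in names]
                      for x in syllables])
    return [names[i] for i in dists.argmin(axis=1)]

new_recording = np.array([[0.88, 0.12], [0.1, 0.9]])
print(label_syllables(new_recording))  # → ['song sparrow', 'hermit warbler']
```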
An important aspect of this project is that when you break a syllable, either over-segmenting a syllable or segmenting multiple syllables jointly, a lot of the models seem to fail. So the process is the following. We take these spectrograms and have somebody sit down and use Paint or another program to go in and mark what is noise and what is signal. The idea is to give the computer enough examples of this: we take the audio, generate spectrograms, extract features, and feed them into a classifier along with the labels from those hand-created masks, and the hope is that, given a new recording, the algorithm will classify each particular pixel of the spectrogram as belonging to background or foreground. The framework is very similar to image processing and computer vision, and we borrow some of the principles from there. The details are as follows. We obtain the spectrograms, and what you can see on the screen, as well as on my laptop here, is that after converting the recordings into spectrograms there is a stationary background noise, corresponding to a stream or other environmental noise, that sits at low frequencies, so we apply a whitening filter to get rid of it. At this point we take the spectrogram and extract features for each pixel: we collect a neighborhood of values around the pixel and extract features from that neighborhood. Once we have the features, we apply the random forest classifier (I will skip the details of the classifier). The idea is then to be able to predict, on a new recording, the right label for each pixel, either foreground or background. Another advantage of the random forest classifier, which makes it very suitable for segmentation compared, for example, to SVMs and other classifiers, is that it can give you a soft, weighted output that you can threshold.
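The preprocessing just described can be sketched under simple assumptions: here "whitening" normalizes each frequency bin across time (which flattens stationary noise such as a low-frequency stream), and each pixel's feature vector is simply its flattened neighborhood of spectrogram values. The actual whitening filter and features in the paper may differ.

```python
# Sketch of the two preprocessing steps: per-bin whitening and per-pixel
# neighborhood features. Both are simplified assumptions, not the paper's
# exact recipe.
import numpy as np

def whiten(spec):
    """Normalize each frequency bin across time; spec is (freq, time)."""
    mu = spec.mean(axis=1, keepdims=True)
    sd = spec.std(axis=1, keepdims=True) + 1e-10
    return (spec - mu) / sd

def pixel_features(spec, radius=2):
    """Return one flattened (2r+1)x(2r+1) neighborhood per pixel."""
    padded = np.pad(spec, radius, mode="edge")
    f, t = spec.shape
    size = 2 * radius + 1
    return np.stack([padded[i:i + size, j:j + size].ravel()
                     for i in range(f) for j in range(t)])

rng = np.random.default_rng(0)
spec = rng.normal(size=(64, 100))   # stand-in spectrogram, 64 bins x 100 frames
feats = pixel_features(whiten(spec))
print(feats.shape)  # → (6400, 25): one 25-dim feature vector per pixel
```

These per-pixel feature vectors, together with the hand-painted foreground/background masks, are exactly what a per-pixel classifier consumes.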
In a random forest you obtain multiple trees, and each tree provides a classification, either zero or one; taking, for example, fifty of those and averaging, you get the probability of belonging to one class or the other, which you can then threshold, and that helps in segmentation. Here is an example: after applying the classifier for segmentation we obtain this result, and of course we filter out small connected components, with the noise level dictating the cutoff. We then extract the individual syllables from the recordings, which is really the input to all the work we do later. The data we worked on for these experiments is 16-bit PCM recorded at 16 kHz; we will increase that in the future, because some birds vocalize at frequencies higher than 8 kHz. We have a dataset of 625 fifteen-second audio segments: two segments per hour, over twenty-four hours per site, from thirteen sites, so there is a mix of sites and hours of the day. To do this evaluation properly we had to label all 625 spectrograms and then see whether the segmentation algorithm could predict the human labeling. In this particular experiment we used forty trees for the random forest, and rather than using all the patches in the spectrogram, we cut out neighborhoods at random: half a million examples. We consider two evaluations of the ROC. One is in terms of time-frequency area, that is, the number of spectrogram pixels correctly classified. The other considers an energy weighting: in other words, there may be pixels that are hard to get, and perhaps you want to take that into account.
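The post-classification steps described above (averaging tree votes into a probability map, thresholding it, and dropping small connected components) can be sketched as follows. A hand-made stack of tree votes stands in for the real forest's output, a simple 4-connected flood fill replaces a library labeling routine, and the threshold and minimum-area values are illustrative.

```python
# Sketch: tree-vote averaging -> probability map -> threshold -> remove
# small connected components. Toy vote stack; parameters are illustrative.
import numpy as np

def segment(votes, thresh=0.5, min_area=3):
    """votes: (n_trees, H, W) array of 0/1 tree decisions per pixel."""
    prob = votes.mean(axis=0)          # soft output in [0, 1]
    mask = prob > thresh
    keep = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                stack, comp = [(i, j)], []
                seen[i, j] = True
                while stack:                      # 4-connected flood fill
                    y, x = stack.pop()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(comp) >= min_area:        # keep only sizable syllables
                    for y, x in comp:
                        keep[y, x] = True
    return keep

rng = np.random.default_rng(1)
votes = (rng.random((40, 8, 8)) < 0.2).astype(float)  # 40 trees, weak noise votes
votes[:, 2:5, 2:6] = 1.0                              # unanimous block = "syllable"
out = segment(votes)
print(int(out.sum()))  # → 12: only the 3x4 consensus region survives
```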
How prominent a pixel is, that is, how high its energy is, is what the second form incorporates. Next I am presenting the results. We ran the classifier and scanned through the threshold in order to obtain the receiver operating characteristic (ROC). The first question is: why not just use energy thresholding, and what performance do we get with it? Energy thresholding works pixel-wise, which can be inaccurate, so perhaps one should also take advantage of the neighborhood; we tried that as well, thresholding after blurring. Then of course we compare both to the classifier. The closer the ROC curve is to this corner, the better, and we can see that the classifier does far better than energy thresholding, which is the common method. We also look at the ROC in terms of total acoustic energy: rather than counting pixels as zero or one, we assign each pixel a weight corresponding to its value in the spectrogram. Once again we see the same relationship: the classifier does better than thresholding with blurring, and thresholding with blurring does somewhat better than simple thresholding. Part of the idea is that once we have this segmentation, in terms of future work our goal is to use these syllables as a dictionary, and once you have a dictionary a lot of methods can be applied; for example, topic models are among our research interests, and we are interested in applying topic models to identify the bird species in these recordings. Here are some examples of the clusters that were formed after applying the segmentation: you can see fairly repetitive patterns within each class, though you can barely see them here, I guess, due to the contrast.
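The two ROC variants described above, pixel-counting and energy-weighted, differ only in how each pixel contributes to the true-positive and false-positive rates. A sketch with toy arrays:

```python
# Sketch of one ROC operating point computed two ways: counting pixels,
# and weighting each pixel by its spectral energy. All inputs are toy data.
import numpy as np

def roc_point(score, truth, energy, thresh):
    """Return ((TPR, FPR) pixel-counted, (TPR, FPR) energy-weighted)."""
    pred = score >= thresh
    tp, fp = pred & truth, pred & ~truth
    pixel = (tp.sum() / truth.sum(), fp.sum() / (~truth).sum())
    weighted = (energy[tp].sum() / energy[truth].sum(),
                energy[fp].sum() / energy[~truth].sum())
    return pixel, weighted

truth  = np.array([1, 1, 1, 0, 0, 0], dtype=bool)    # human mask
score  = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])    # classifier soft output
energy = np.array([5.0, 1.0, 1.0, 4.0, 1.0, 1.0])    # spectrogram energy
pixel, weighted = roc_point(score, truth, energy, thresh=0.5)
print(pixel)     # (TPR, FPR) counting pixels
print(weighted)  # (TPR, FPR) weighting by energy
```

At a given threshold, the energy weighting rewards detecting high-energy foreground pixels and penalizes admitting high-energy background, which matches the motivation given in the talk for the second evaluation.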
The point here is that we are getting a segmentation at exactly the level we are interested in, neither over-segmenting nor under-segmenting, which then allows us to perform clustering and basically convert this audio into documents.

Q: This is a very interesting problem. Do you happen to have, a priori, the number of classes of "words" that you are looking at, or is this an open set?

A: We do have a guide for the area, and we think the humans are better than the computer here, so we trust what they tell us. They do know the species that are present in that area. The downside is that there are some rare species that occasionally get into the area, may be present and vocalize in the recordings, and we will not be able to detect them.

Q: Because my idea is that if you know the classes, you could probably adapt the segmentation to the classifier.

A: Right, but one of the problems that is interesting to us is new-class detection, and we would like to avoid relying on the classes that we already know.

Q: Thanks for the presentation, really impressive work. You got pretty good results with the random forest; could you describe the kinds of features you used? And since you already have really good results, could you improve even on that?

A: That is a good question; I do not know that I can answer it directly. What I want to point out is that the results are highly dependent on how people segment the data. We have tried this process many times with different features and with different people doing the labeling, and what we notice is that the results depend strongly on that: we still see the same relationship between the methods.
The curves themselves vary quite a lot depending on which features you use to estimate syllables and on how you segment. In fact we are using (I forget its name right now) a simple application, as simple as it gets, to mark the syllables, and everybody uses a different brush, so it actually generates different estimates and a different classifier each time. So I cannot answer that fully; maybe it is worth considering different features or a different representation that we have not tried. I know that in this field people use correlograms as well, another method we simply have not tried.

Q: [inaudible question about data availability]

A: Right, yes, we will make it available. We basically have a paper in submission right now; once it gets accepted, we will make the data available for everyone to try. Thank you very much.