0:00:28 Okay, so this presentation is on a completely different project: far less analytical, with far fewer convergence proofs.
0:00:38 We have a bioacoustics group that is interested in learning species distributions, basically a connection between signal processing, machine learning, and ecology.
0:00:54 We formed a group that looks at that problem: placing microphones in the forest and listening to birds.
0:01:06 Our co-author here, Lawrence Neal, was, I think, at the time he wrote this paper a third-year undergrad student in our group. He was very enthusiastic about this problem and wanted to help out; he got into the problem and very quickly caught up with the signal processing and machine learning techniques used.
0:01:27 Forrest Briggs is a PhD student in the group, and basically myself and Mike collaborate as well.
0:01:37 Okay, so the bird bioacoustics project deals with a variety of ecological questions and how to solve them using audio: the idea is to set a microphone in the forest, listen to the audio, and determine species distributions and so on, down to individual birds.
0:02:01 The other motivation for this particular paper comes from audio segmentation. Rather than the classical one-dimensional audio segmentation that is typically used on this type of data, we present a time-frequency way to do audio segmentation which, rather than being a thresholding method, is based on classification, that is, learning from examples.
0:02:29 Specifically, I'll present a segmentation system that uses a random forest classifier to segment, and we will show some results.
0:02:38 So here's the general idea. I think this particular picture was taken at the Hubbard Brook long-term ecological research site, and the idea is the following. People are interested in exploring species distributions for the purpose of ecology research. They will send an individual who will stand at each of these points (this is actually a topographic map) for ten minutes and then move on to the next point. That sounds pretty exhausting, so they send multiple individuals to do that.
0:03:13 And from a sampling of ten minutes here and there in the day, the idea was to estimate the distribution of species. Of course, everybody here who knows sampling realizes that something must be very wrong with just ten minutes during the day if we want to obtain an accurate distribution of species.
0:03:34 And of course, part of the idea here is to keep these maps across the years and learn how species distributions change over time, and, for example, once we estimate the distribution, how to integrate that with ecological research; in other words, how to connect the distributions to environmental parameters.
0:03:56 In this project we have actually placed microphones at the H. J. Andrews long-term ecological research site. We're using the Song Meter, which is commonly used in bird sound analysis, and placing it in a variety of places; I think in this preliminary study we put it in fifteen places.
0:04:17 What happens is that this is not fully automatic yet: somebody has to replace the batteries every two weeks because of the amount of recording time, and also replace the memory cards, because it takes a lot of memory to record over a couple of weeks.
0:04:34 So the system that we developed in order to do species distribution involves placing automated recorders, collecting the data from these recorders, converting the recordings to spectrograms, and performing segmentation on these spectrograms in order to extract syllables. The reason we do that is because birds tend to vocalize simultaneously, so a one-dimensional segmentation would not have been useful here.
0:05:08 So the idea is that you get those syllables, you extract features from each syllable, and then you build a probabilistic model.
0:05:16 The goal, and the focus of this project (though not of this particular paper), is how to build a probabilistic model that takes a collection of recordings, where each recording, or at least a subset of the recordings, is labeled with the species present and species absent. The idea is to learn from that collection of recordings and to be able to analyze a new recording which doesn't have labels; in other words, to pinpoint in the recording which syllable belongs to which bird and which birds are present in that recording.
0:05:55 So as I mentioned before, birds tend to have many independent vocalizations, and often these vocalizations overlap in time. In order to communicate, birds tend to pick different frequency channels so that they're not overlapping. This communication is used, I guess, for chasing off rivals, with short simple calls, or for mating. And the idea is: what's the point of communicating with your mate when they can't hear you? So birds tend to space themselves in frequency as well.
0:06:33 So as I mentioned, one-dimensional segmentation may look like this when there's no overlap of birds singing. But in practice, a lot of the recordings that we get look something like this; some of the stuff is barely visible. What I'd like to point out is that this is the frequency axis and this is the time axis, and one can see that at an individual time instance there's an overlap between syllables from one bird species and syllables from another species.
0:07:04 So in this particular paper the focus is how to obtain a good segmentation. Even though segmentation is not as interesting for us (for us the interesting signal processing is developing the models and estimating species distributions in new recordings using those models), it is probably one of the most important problems and important aspects of this project: when you break a syllable, either over-segmenting a syllable or segmenting syllables jointly, a lot of the models seem to fail.
0:07:39 So the process is the following. We get these spectrograms, and we have somebody sit and use either Paint or another program to go in and mark what is noise and what is signal. The idea is that we give the computer enough examples of this: basically, we take the audio, generate spectrograms, extract features, and feed them into a classifier along with the labels from those particular masks that were created, and hope that, given a new recording, the algorithm will allow us to classify a particular pixel of the spectrogram as belonging to background or foreground.
0:08:20 So for those of you who are in image processing and computer vision, the framework is very similar; we borrow some of the principles from there.
0:08:32 The details are as follows. We obtain the spectrograms, and what you can see on the screen, as well as on my laptop here, is that after converting the recordings into spectrograms there is stationary background noise, corresponding to a stream or other environmental noise, that exists at low frequency, so we apply a whitening filter to get rid of it.
0:08:57 At this point we take the spectrogram and extract features: for each pixel in the spectrogram we collect a neighborhood of values and extract features from that neighborhood.
0:09:09 Once we have the features, we apply the random forest classifier; I will skip the details of the classifier.
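To make this pipeline concrete, here is a minimal sketch of the training step, under assumptions spelled out in the comments; the whitening scheme, the raw-neighborhood features, and the helper names are illustrative, not the paper's exact choices:

```python
# A minimal sketch of the training step described above, not the authors'
# actual code. Assumptions: whitening = dividing each frequency band by its
# median energy over time; features = the flattened 9x9 pixel neighborhood;
# labels come from a hand-painted binary mask with the spectrogram's shape.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def whiten(spec):
    """Suppress stationary background noise (e.g., a stream) by
    normalizing each frequency band by its median energy over time."""
    return spec / (np.median(spec, axis=1, keepdims=True) + 1e-12)

def pixel_features(spec, radius=4):
    """For every pixel, collect the surrounding (2r+1) x (2r+1)
    neighborhood of spectrogram values as its feature vector."""
    padded = np.pad(spec, radius, mode="reflect")
    h, w = spec.shape
    feats = np.empty((h * w, (2 * radius + 1) ** 2))
    for i in range(h):
        for j in range(w):
            feats[i * w + j] = padded[i:i + 2 * radius + 1,
                                      j:j + 2 * radius + 1].ravel()
    return feats

def train_segmenter(spec, mask, n_samples=500_000, seed=0):
    """Fit a 40-tree random forest on a random subset of labeled pixels,
    matching the numbers quoted later in the talk."""
    X = pixel_features(whiten(spec))
    y = mask.ravel()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(n_samples, len(y)), replace=False)
    clf = RandomForestClassifier(n_estimators=40, n_jobs=-1,
                                 random_state=seed)
    clf.fit(X[idx], y[idx])
    return clf
```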
0:09:16 So the idea is to then be able to predict, in a new recording, the right label, either foreground or background, for each pixel.
0:09:27 Another advantage of the random forest classifier, which makes it very suitable for segmentation compared, for example, to SVM or other classifiers, is the fact that it can give you a soft output to threshold. In a random forest classifier you obtain multiple trees, and each tree provides a classification, either zero or one. By taking, for example, fifty of those and averaging, you can get the probability of belonging to one class or the other, which allows you to then threshold, and that helps in segmentation.
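A sketch of how that soft output turns into a segmentation, reusing the illustrative helpers from the training sketch above; the 0.5 default threshold is mine, and sweeping it is what produces the ROC curves shown later:

```python
# Sketch of the prediction step, reusing whiten() and pixel_features()
# from the training sketch. predict_proba averages the 0/1 votes of the
# individual trees, giving a per-pixel probability of foreground.
def probability_map(clf, spec):
    proba = clf.predict_proba(pixel_features(whiten(spec)))[:, 1]
    return proba.reshape(spec.shape)

def segment(clf, spec, threshold=0.5):
    """Binarize the probability map; sweeping `threshold` trades off
    false positives against false negatives."""
    return probability_map(clf, spec) >= threshold
```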
0:09:59 And here is an example: after applying the classifier for segmentation, we obtain this result, and of course we filter out small segments, small connected components. Once you do that, you can then extract individual syllables from the recordings, which is really the input to all the work that we do later.
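A sketch of that filtering step using SciPy's connected-component labeling; the minimum area of 40 pixels is an illustrative value, not the paper's:

```python
# Sketch of the post-processing: keep only connected components of the
# binary segmentation above a minimum area, and treat each surviving
# component as one extracted syllable.
from scipy import ndimage

def extract_syllables(binary_mask, min_area=40):
    labeled, n = ndimage.label(binary_mask)
    areas = ndimage.sum(binary_mask, labeled, index=range(1, n + 1))
    return [labeled == i + 1
            for i, area in enumerate(areas) if area >= min_area]
```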
0:10:27 The data that we worked on for the experiment is stereo PCM data recorded at 16 kHz. We will increase that in the future, because some birds vocalize at frequencies higher than 8 kHz. We have a dataset of 625 audio segments that we collected, fifteen seconds each.
0:10:48 Basically, the segments were chosen across twenty-four hours per site from thirteen sites, so there's a mix of sites and a mix of hours of the day. In order to do this evaluation properly, we had to label all 625 spectrograms and then see whether the segmentation algorithm can predict the human labeling.
0:11:14 In this particular experiment we used forty trees for the random forest, and we cut out a random subset of patches, since we can't use all the patches in the spectrograms; we cut out half a million examples.
0:11:31 And we're considering two evaluations of the ROC. One is in terms of time-frequency area, the number of spectrogram units correctly classified: you look at the number of pixels that were correctly classified. The other one considers an energy weighting; in other words, there could be pixels that are hard to get, and perhaps you want to take that into account, so how obvious a pixel is, how high the energy is there, is incorporated in the second form.
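In code, the two evaluations differ only in a per-pixel weight; a sketch, assuming the flattened hand-labeled mask and probability map as inputs (the variable names are mine):

```python
# Sketch of the two ROC evaluations: every pixel counted equally, and
# each pixel weighted by its acoustic energy so that faint, hard-to-get
# pixels count for less. y_true is the flattened hand-labeled mask,
# scores the flattened probability map, spec the spectrogram magnitudes.
from sklearn.metrics import roc_curve

def pixel_roc(y_true, scores):
    return roc_curve(y_true, scores)

def energy_weighted_roc(y_true, scores, spec):
    return roc_curve(y_true, scores, sample_weight=spec.ravel())
```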
0:12:02 Next I'm presenting the results. We ran the classifier and scanned through the threshold in order to obtain the receiver operating characteristic, the ROC. The first question is: why not just use energy thresholding? This is the performance that we get with energy thresholding.
0:12:24 Energy thresholding works pixel-wise, and that can be inaccurate, so an alternative is thresholding with blurring, to take advantage of the neighborhood; we tried that as well. Then of course we compare that to the classifier. The closer the ROC is to this corner, the better, and we can see that the classifier does far better than energy thresholding, which is the common method.
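For reference, a sketch of the two baselines as I understand them; the Gaussian kernel and its width are my assumptions:

```python
# Sketch of the two baselines: plain per-pixel energy thresholding, and
# thresholding after a Gaussian blur that pools evidence from the
# neighborhood. sigma=2.0 is an illustrative choice.
from scipy.ndimage import gaussian_filter

def energy_threshold(spec, tau):
    return spec >= tau

def blur_then_threshold(spec, tau, sigma=2.0):
    return gaussian_filter(spec, sigma=sigma) >= tau
```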
0:12:54 And of course we look as well at the ROC in terms of total acoustic energy: rather than counting pixels as zero or one, we assign a weight corresponding to the pixel on the spectrogram. Once again we see the same relationship: basically, the classifier does better than thresholding with blurring, and thresholding with blurring does somewhat better than simple thresholding.
0:13:23 I think part of the idea is that, once we do this classification, in terms of future work our goal is to use these syllables as a dictionary, and once we have a dictionary there's a lot that can be applied. For example, topic models are one of our research interests, and we're interested in applying topic models to identifying bird species in these recordings.
0:13:49 Here are some examples of the clusters that were formed after applying the segmentation. You can see fairly repetitive patterns within each cluster, though you can barely see it here, I guess due to the contrast.
0:14:06 So the point here is that we're getting a segmentation at the level that we're interested in; we're not over-segmenting or under-segmenting, and that allows us to perform clustering and basically convert this audio into documents.
0:14:26 Questions?
0:14:41 This is a very interesting problem. Do you happen to have a priori the number of classes of birds that you're looking at, or is this an open set?
0:14:51 Yes and no. We do have experts, and we think the humans are better than the computer, so we trust that what they tell us is right, and they do know about the species that are present in that area. The "no" part is that there are some rare species that occasionally get into that area, that may be present and vocalize in the recordings, and we will not be able to detect that.
0:15:13 Yeah, because my thinking is that if you know the classes, you probably can adapt the segmentation to the classifier.
0:15:23 Right, right. One of the problems that is interesting to us is new class detection, and we would like to avoid relying on the classes that we know.
0:15:46 Thanks for the presentation, really impressive work. Could you describe the kind of features that you used with the random forest? Since you already have really good results, the question is whether you can improve even further.
0:16:09 I think that's a good question. I don't know that I can answer your question directly, but what I want to point out is that this is highly dependent on how people segment the data. We've tried this process many times with different features and with different people doing the labeling, and what we notice is that the results are highly dependent on that in terms of performance. We still see the same relationship between the methods, but the curves themselves vary quite a lot depending on what features you're picking to estimate syllables and depending on how you're segmenting.
0:16:45 In fact, we're using, I forget, Paint or something about as simple as it gets to mark the syllables, and everybody uses a different brush, so it actually does generate different estimates, and the classifier performs differently.
0:17:06 But we can talk afterwards to get into the details.
0:17:20 I don't know, maybe it's worth considering different features or a different representation; we haven't tried that. I know that in this field people also use correlograms and other methods that we simply have not tried.
0:17:41 Yes, we will make it available. We basically have a paper in submission right now; once it gets accepted, we will make the data available for everyone to try.
0:17:56 Thank you.