0:00:29 So this presentation is on a completely different project. This one is far less theoretical, with far fewer convergence proofs. Basically, we have a bioacoustics group here that is interested in species distributions, so there is a connection between signal processing, machine learning, and ecology. We formed a group, between myself and a few others, that looks at this problem:
0:01:02 placing microphones in the forest and listening to birds.
0:01:06 My co-author here is Lawrence Neal. I think at the time we wrote this paper he was a third-year undergraduate student in our group. He was very enthusiastic about this problem and wanted to help out; he got into the problem and very quickly caught up with the signal processing and machine learning techniques used. Forrest Briggs is a PhD student in the group, advised basically by myself and Mike.
0:01:37 Okay, so the bird bioacoustics project deals with a variety of ecological questions and how to answer them using recorded audio: you set a microphone in the forest, record the audio, and then determine species distributions and so on, down to individual birds.
0:02:01 The other motivation for this particular paper comes from audio segmentation. Rather than the classical one-dimensional audio segmentation that is typically used with this type of data, we present a time-frequency approach to audio segmentation which, rather than being a threshold-based method, is based on classification, that is, learning from examples. I'll present a segmentation system that uses a random forest classifier to segment, and we will show some results.
0:02:38 So here's the general idea. I think this particular picture was taken at the Hubbard Brook long-term ecological research site, and the idea is the following. People interested in exploring species distributions for the purposes of ecology research will send an individual who will go and stand at each of these points (this is actually a topographic map) for ten minutes and then move on to the next point. That sounds pretty exhausting, so they send multiple individuals to do it.
0:03:13 From a sampling of ten minutes here and there in the day, the idea is to estimate the distribution of species. Of course, everybody here who knows sampling realizes that something must be very wrong with just ten minutes during the day if we are to obtain an accurate distribution of species. And of course, part of the idea here is to keep these maps across the years and learn how species distributions change over time; for example, once we estimate the distribution, how to integrate it with ecological research, in other words, how to connect the distributions to environmental parameters.
0:03:57 In this project we have actually placed microphones in the H.J. Andrews long-term ecological research site. We're using the Song Meter, which is commonly used in bird sound analysis, and placing it in a variety of places; in this preliminary study we put it in fifteen places. Note that this is not fully automatic: somebody has to replace the batteries every two weeks because of the length of the recordings, and also replace the memory cards, because recording over a couple of weeks takes a lot of memory.
0:04:34 So the system that we developed in order to estimate species distributions involves placing automated recorders, collecting data from these recorders, converting the recordings to spectrograms, and performing segmentation on the spectrograms in order to extract syllables. The reason we do that is that birds tend to vocalize simultaneously, so a one-dimensional segmentation would not have been useful here. The idea is that you take those syllables, extract features from each syllable, and then build a probabilistic model.
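The pipeline described above starts by converting each recording into a spectrogram. A minimal sketch of that first step using SciPy; the window length, overlap, and log scaling here are illustrative choices, not necessarily the ones used in the talk:

```python
import numpy as np
from scipy import signal

def audio_to_spectrogram(samples, fs=16000, nperseg=512, noverlap=256):
    """Convert a mono recording into a log-magnitude spectrogram (freq x time)."""
    freqs, times, sxx = signal.spectrogram(
        samples, fs=fs, nperseg=nperseg, noverlap=noverlap)
    # Log scale compresses the dynamic range, as is standard for audio.
    return freqs, times, np.log(sxx + 1e-10)

# Toy example: one second of a 2 kHz tone in white noise.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 2000 * t) + 0.1 * np.random.randn(fs)
freqs, times, spec = audio_to_spectrogram(x, fs)

# The strongest frequency bin should sit near 2 kHz.
peak_bin = spec.mean(axis=1).argmax()
print(freqs[peak_bin])
```

Each column of `spec` is one time frame; the segmentation that follows operates on this matrix as an image.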
0:05:16 So the goal, and the focus of this project (though not of this particular paper), is how to build a probabilistic model that takes a collection of recordings, where each recording, or a subset of the recordings, is labeled with the species present and the species absent. The idea here is to learn from the collection of recordings and then be able to analyze a new recording which doesn't have labels; in other words, to pinpoint in the recording which syllable belongs to which bird and which birds are present in that recording.
0:05:55 As I mentioned before, birds tend to have many independent vocalizations, and often these vocalizations overlap in time. In order to communicate, birds tend to pick different frequency channels so that they do not overlap. Their vocalizations are used for communication, anything from territorial calls and alarms, which are short simple calls, to mating songs; and what's the point of communicating with your mate if they can't hear you? So birds tend to space themselves in frequency as well.
0:06:33 So, as I mentioned, a one-dimensional segmentation may look fine like this when there is no overlap of birds singing, but in practice a lot of the recordings that we get look something like this, where some of the signal is barely visible. What I'd like to point out is that this is the frequency axis and this is the time axis, and one can see that at a given time instant there is overlap between syllables from one bird species and syllables from another.
0:07:04 So in this particular paper the focus is how to obtain a good segmentation. As I mentioned, even though segmentation is not as interesting for us, from the signal processing side, as developing the models for estimating species distributions in a new recording, it is probably one of the most important problems and an important aspect of this project: when you break a syllable, either over-segmenting a syllable or segmenting several syllables jointly, a lot of the models seem to fail.
0:07:39 So the process is the following. We take these spectrograms and have somebody sit and use either Paint or another program to go and mark what is noise and what is signal, and the idea here is that we give the computer enough examples of this. Basically, we take the audio, generate spectrograms, extract features, and feed them into a classifier along with the labels from those manually created masks, and hope that, given a new recording, the algorithm will allow us to classify each particular pixel of the spectrogram as belonging to background or foreground. So the framework is very similar, for those of you who are familiar with image processing and computer vision; we borrow some of the principles from there.
0:08:34 The details are as follows. We obtain the spectrograms, and what you can see on the screen, as well as on my laptop here, is that after converting the recordings into spectrograms there is a stationary background noise, corresponding to a stream or other environmental noise, that exists at low frequency, so we apply a whitening filter to get rid of it. At this point we take the spectrogram and extract features: for each pixel in the spectrogram we collect a neighborhood of values and extract features from that neighborhood. Once we have the features, we apply the random forest classifier; I will skip the details of the classifier.
0:09:17 The idea is then to be able to predict, on a new recording, the right label for each pixel, either foreground or background. Another advantage of the random forest classifier, which makes it very suitable for segmentation compared, for example, to SVMs and other classifiers, is the fact that it can give you a thresholdable output. In a random forest classifier you obtain multiple trees, and each tree provides a classification, either zero or one; by taking, for example, fifty of those and averaging, you can get the probability of belonging to one class or the other, which allows you to threshold, and that helps in segmentation.
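The vote-averaging idea can be sketched with scikit-learn; the toy data and forest size here are illustrative. Note that with fully grown trees, averaging the hard 0/1 tree votes coincides with the forest's own probability estimate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy two-class pixel features: "foreground" pixels are shifted by +2.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 5), rng.randn(200, 5) + 2.0])
y = np.array([0] * 200 + [1] * 200)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each tree casts a hard 0/1 vote; averaging the votes across trees
# yields a probability of the pixel belonging to the foreground class.
votes = np.stack([tree.predict(X) for tree in forest.estimators_])
p_foreground = votes.mean(axis=0)

# Sweeping this threshold is what traces out the ROC curve later on.
mask = p_foreground > 0.5
```

An SVM's raw decision value can also be thresholded, but the forest's averaged votes come out directly as a probability, which is what makes the threshold sweep natural here.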
0:09:59 And so here's an example: after applying the classifier for segmentation we obtain this result, and of course we filter out small segments, small connected components. Once you do that, the next step is to extract the individual syllables from the recording, which is really the input to all the work that we do later.
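The small-component filtering step might look like the following sketch, where the minimum size threshold is an illustrative assumption:

```python
import numpy as np
from scipy import ndimage

def filter_small_segments(mask, min_size=10):
    """Drop connected components of a binary foreground mask that
    are smaller than min_size pixels."""
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    keep = np.zeros_like(mask)
    for lab, size in enumerate(sizes, start=1):
        if size >= min_size:
            keep[labels == lab] = True
    return keep

mask = np.zeros((10, 20), dtype=bool)
mask[2:6, 3:9] = True       # a 24-pixel blob: a plausible syllable, kept
mask[8, 15] = True          # an isolated pixel: noise, removed
clean = filter_small_segments(mask, min_size=10)
print(clean.sum())          # 24
```

Each surviving connected component then becomes one extracted syllable.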
0:10:27 The data that we worked on for the experiments is stereo PCM data recorded at 16 kHz; we will increase that in the future because some birds vocalize at frequencies higher than 8 kHz. We have a dataset of 625 audio segments that we collected, fifteen seconds each; the way the data was collected is two segments out of every twenty-four hours per site, from thirteen sites, so there is a mix of sites and a mix of hours of the day. In order to do this evaluation properly, we had to label all 625 spectrograms and then see whether the segmentation algorithm can predict the human labeling.
0:11:14 In this particular experiment we use forty trees for the random forest, and what we do is cut out random neighborhoods, since we cannot use all the patches in the spectrogram; we cut out half a million of them.
0:11:31and we're considering
0:11:33to to evaluations uh of the roc C one is in terms of time-frequency area the number of spectrogram units
0:11:39correctly classified you look at the number of pixels the work are like could correctly classify
0:11:44and i another one is you consider the energy weighting in other words
0:11:48um there could be these pixels that are hard to get
0:11:52and perhaps you want take that into account
0:11:54so how how obvious as a picks a how high the energy there
0:11:59is incorporated in the second form
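Both evaluations can be sketched with scikit-learn's ROC utilities: the unweighted curve counts every pixel equally, while passing per-pixel energies as sample weights gives the energy-weighted variant. The synthetic labels, scores, and energies below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.RandomState(1)
y_true = rng.randint(0, 2, 1000)           # human pixel labels (0/1)
scores = y_true + 0.8 * rng.randn(1000)    # noisy classifier scores
energy = rng.rand(1000) + 0.1              # per-pixel spectrogram energy

# Plain ROC: every pixel contributes equally.
fpr, tpr, _ = roc_curve(y_true, scores)

# Energy-weighted ROC: each pixel contributes proportionally to its energy,
# so low-energy pixels that are hard to label matter less.
fpr_w, tpr_w, _ = roc_curve(y_true, scores, sample_weight=energy)

print(auc(fpr, tpr), auc(fpr_w, tpr_w))
```

Sweeping the forest's probability threshold is what generates the score axis that these curves are traced over.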
0:12:02 Next I'm presenting the results. We ran the classifier and scanned through the threshold in order to obtain the receiver operating characteristic, the ROC. The first question is: why not just use energy thresholding? So we compare against the performance that we get with energy thresholding. Energy thresholding works pixel-wise, and that can be inaccurate, so it is natural to apply thresholding with blurring, to take advantage of the neighborhood; we tried that as well. Then of course we compare that to the classifier, and the closer the ROC is to this corner, the better it is. We can see that the classifier does far better than energy thresholding, which is the common method.
0:12:54 And of course we look at the ROC in terms of total acoustic energy as well: rather than counting pixels as zero or one, we assign each pixel a weight corresponding to its value in the spectrogram. Once again we see the same relationship: basically, the classifier does better than thresholding with blurring, and thresholding with blurring does somewhat better than simple thresholding.
0:13:23 Part of the idea is that, in terms of future work, once we do this classification our goal is to use these syllables as a dictionary, and once we have a dictionary a lot of methods can be applied. For example, topic models are one of our research interests, and we are interested in applying topic models to identify bird species in these recordings. Here are some examples of how visually similar the clusters are that were formed after applying the segmentation: you can see fairly repetitive patterns within each cluster, though you can barely see it, I guess, due to the different contrast. So the point here is that we are getting a segmentation at the level that we are interested in, neither over-segmenting nor under-segmenting, which allows us to perform clustering and basically convert this audio into documents.
0:14:41 [Question] This is a very interesting problem. Do you happen to have, a priori, the number of classes of birds that you're looking at, or is this an open set?
0:14:52 [Answer] Yes and no. We do have an idea here: at this point we think the humans are better than the computer, so we trust that what they tell us is right, and they do know the species that are present in that area. The unknown part is that there are some rare species that occasionally get into that area, that may be present and vocalize in the recordings, which we would not be able to detect.
0:15:13 [Question] Because my idea is that if you know the classes, you can probably adapt the segmentation to the classifier.
0:15:24 [Answer] Right, but we want to be careful: one of the problems that is interesting to us is new class detection, and we would like to avoid relying on the classes that we already know.
0:15:46 [Question] Thanks for the presentation, really impressive work. You got pretty good results with the random forest; could you describe the kinds of failures that you got? Since you already have really good results, the question is whether you can go even further.
0:16:09 [Answer] I think that's a good question. I don't know that I can answer it directly, but what I want to point out is that this is highly dependent on how people segment the data. We have tried this process many times with different features and with different people doing the labeling, and what we noticed is that the results are highly dependent on that. We still see the same relationship between the methods, but the curves themselves vary quite a lot depending on what features you are picking to estimate syllables and depending on how you are segmenting. In fact, forgive me, we are using Microsoft Paint, as simple as it gets, to mark the syllables, and everybody uses a different brush, so that actually does generate different estimates, and the classifier differs accordingly. But I can talk to you afterwards to get to the point of your question.
0:17:20 I don't know, maybe it is worth considering different features or a different representation; we haven't tried that. I know that in this field people also use correlograms and other methods; we simply did not try them.
0:17:41 We will make it available. We basically have a paper in submission right now; once it gets accepted, we will make the data available for everyone to try.
0:17:56 Thank you.