0:00:13 The goal of this work was to improve upon state-of-the-art transcription by explicitly incorporating information about note onsets and offsets.
0:00:25 Some general background on transcription, which you may have heard before: it is the process of converting an audio recording into some form of music notation. It has numerous applications in MIR and in interactive music systems, such as automated score following, and also in computational musicology. It can be divided into several subtasks, such as multi-pitch estimation, detection of note onsets and offsets, instrument identification, and extraction of rhythmic information such as the tempo. In the multi-pitch, multiple-instrument case it still remains an open problem.
0:01:04 Some related work which is linked to this work: the iterative spectral-subtraction-based system by Klapuri, which proposed the spectral smoothness principle; the rule-based system by Zhou, who also proposed as a time-frequency representation the resonator time-frequency image (RTFI), which is also used in this work; Yeh's joint multiple-F0 estimation method, which continues to rank first in the MIREX public evaluations for multiple-F0 estimation and note tracking; and an iterative estimation system for multi-pitch estimation which exploits temporal evolution, previously proposed by the authors.
0:01:50 Also, some related work on onset detection: the well-known fused onset detection functions by Bello et al., which combine energy-based and phase-based measures, and a more recent development, late fusion by Holzapfel et al., which fuses the onset descriptors at the decision level.
0:02:16 In this work we propose a system for joint multi-pitch estimation which also exploits onset and offset detection, in an effort to improve multiple-pitch estimation results. Novel onset detection features were developed and proposed, derived from the preprocessing steps of the transcription system, and offsets are, we believe for the first time, explicitly exploited, by using hidden Markov models.
0:02:48 This is the basic outline of the system. There is a preprocessing step, where the time-frequency representation is extracted, spectral whitening is performed, noise is suppressed, and a pitch salience (or pitch strength) function is extracted. Afterwards comes the core of the system: onset detection using late fusion and the proposed descriptors, followed by joint multi-pitch estimation. Afterwards, pitch-wise offset detection is performed, and the result is the final transcription in MIDI form.
0:03:21 This is an example of the time-frequency representation we used, the resonator time-frequency image, which is produced by a resonator filter bank. We used it, in contrast with the more common constant-Q transform for example, because, due to its exponential decay factor, it has better temporal resolution in the low frequencies, as you might see here. This is a very typical recording from the MIREX 2007 multiple-F0 competition, which is usually employed in evaluations.
0:03:56 After the extraction of the time-frequency representation, spectral whitening is performed in order to suppress timbral information and make the system more robust to different sound sources. The method by Klapuri was used to that end, and it was followed by a half-octave-span filtering procedure for noise suppression.
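The whitening step can be sketched roughly as follows. This is a minimal illustration only: the band layout and the compression exponent `nu` are placeholder values, not the exact parameters of Klapuri's method as used in the system.

```python
import math

def spectral_whiten(spectrum, num_bands=8, nu=0.33):
    """Flatten the spectral envelope: scale each frequency band by
    power**(nu - 1), so loud bands are compressed toward quiet ones
    and timbral colouring is suppressed."""
    n = len(spectrum)
    out = [0.0] * n
    band_size = n // num_bands
    for b in range(num_bands):
        lo = b * band_size
        hi = (b + 1) * band_size if b < num_bands - 1 else n
        band = spectrum[lo:hi]
        # root-mean-square power of this band
        power = math.sqrt(sum(x * x for x in band) / max(len(band), 1))
        gain = power ** (nu - 1) if power > 0 else 1.0
        for i in range(lo, hi):
            out[i] = gain * spectrum[i]
    return out
```

With `nu < 1`, a band that is ten times louder ends up much less than ten times louder after whitening, which is the desired source-robustness effect.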
0:04:18 Based on that whitened and noise-suppressed representation, a pitch salience (or pitch strength) function is extracted, along with tuning and inharmonicity coefficients. In the bottom figure you can see the RTFI spectrum of a C4 piano note, and in the lower left and right figures you can see the corresponding pitch salience function, where you see a prominent peak at the C4 note; but you can also see several peaks at subharmonic or superharmonic positions of that note.
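Why subharmonics also produce salience peaks can be seen with a toy harmonic-sum salience function. This is a hypothetical sketch, not the system's actual salience definition (which also handles tuning and inharmonicity): a candidate F0 one octave below a played note shares every other harmonic with it, so it still accumulates salience.

```python
def pitch_salience(spectrum, bin_freqs, f0, n_harm=5):
    """Toy salience: sum the spectral magnitude at the bins nearest
    the first n_harm harmonics of the candidate fundamental f0 (Hz)."""
    sal = 0.0
    for h in range(1, n_harm + 1):
        target = h * f0
        # index of the frequency bin closest to this harmonic
        i = min(range(len(bin_freqs)), key=lambda j: abs(bin_freqs[j] - target))
        sal += spectrum[i]
    return sal
```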
0:04:53 Afterwards, onset detection is performed. Two onset descriptors were extracted and proposed, which utilise information from the preprocessing steps of the multi-pitch estimation stage. The first proposed descriptor is spectral-flux-based and also incorporates tuning information. It was motivated by the fact that in many cases you have false alarms caused by vibrato or by tuning changes, which can mislead a normal energy-based onset detection measure. The proposed measure is basically a half-wave rectified difference over a semitone-resolution filterbank, which also incorporates information from the extracted pitch salience function, so onsets can be easily detected by peak-picking that function.
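The core idea can be sketched as plain half-wave rectified spectral flux followed by peak picking. This is an illustrative simplification: the tuning and salience refinements described in the talk are omitted, and the threshold is a hypothetical parameter.

```python
def spectral_flux_odf(frames):
    """Onset detection function: half-wave rectified spectral flux.
    `frames` is a list of magnitude spectra, one per time frame.
    Only positive energy increases contribute, so note decays do not
    trigger false onsets."""
    odf = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        odf.append(sum(max(c - p, 0.0) for p, c in zip(prev, cur)))
    return odf

def pick_peaks(odf, threshold):
    """Simple peak picking: local maxima above a fixed threshold."""
    return [i for i in range(1, len(odf) - 1)
            if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]]
```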
0:05:48 A second function for onset detection was also proposed, in order to detect soft onsets: the ones that are produced without any notable energy change and might be produced by bowed strings, for example. The proposed function was based on a chroma-wrapped version of the extracted pitch salience function, which was also half-wave rectified over time.
0:06:14 In order to combine these two onset descriptors, late fusion was applied, and in order to train the late fusion parameters, a development set from Ghent University was used, which consists of ten thirty-second classical music excerpts.
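At its simplest, decision-level (late) fusion can be sketched as merging the two descriptors' onset candidate lists and collapsing near-duplicates. This is an illustrative sketch with an assumed tolerance value; the actual fusion scheme and its trained parameters are more involved.

```python
def late_fuse(onsets_a, onsets_b, tol=0.05):
    """Decision-level fusion of two onset candidate lists (times in
    seconds): merge both lists and collapse detections that fall
    within `tol` seconds of an already-accepted onset."""
    fused = []
    for t in sorted(onsets_a + onsets_b):
        if not fused or t - fused[-1] > tol:
            fused.append(t)
    return fused
```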
0:06:32 For multiple-F0 estimation, for each frame pitch candidates are extracted, and for each possible combination the overlapped partials are estimated and overlapping partial treatment is applied. Basically, for each combination the partial collisions are computed, and afterwards the amplitudes of the overlapped partials are estimated by a discrete-cepstrum-based spectral envelope estimation procedure in the log-frequency domain. In the figure on the right you can see the harmonic partial sequence of a G5 piano note and the corresponding spectral envelope estimate.
0:07:14 Afterwards, for each possible pitch combination for each frame, a score function is computed which exploits several spectral features and also aims to minimize the residual spectrum. The features that were used are: the spectral flatness of the harmonic partial sequence; a smoothness measure based on the spectral smoothness principle; the spectral centroid, which is the centre of gravity of the harmonic partial sequence, aiming for a low spectral centroid, which is usually an indication of a musical instrument's harmonic sound; a novel proposed feature, the harmonically related pitch ratio, which was used in order to suppress any harmonic or subharmonic errors; and finally, the flatness of the residual spectrum, which we try to maximize so that no harmonic structure remains in the residual.
0:08:13 So the optimal pitch candidate set is the one that actually maximises that score function.
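The search over pitch combinations described here can be sketched as an exhaustive weighted-score maximization. The feature function and weights below are hypothetical stand-ins: the talk's actual features are the spectral flatness, smoothness, centroid, harmonically related pitch ratio, and residual flatness.

```python
from itertools import combinations

def best_pitch_set(candidates, feature_fn, weights, max_poly=3):
    """Score every candidate pitch combination up to polyphony
    `max_poly` and return the combination maximizing the weighted sum
    of its features. `feature_fn(combo)` returns a feature vector."""
    best, best_score = None, float("-inf")
    for k in range(1, max_poly + 1):
        for combo in combinations(candidates, k):
            s = sum(w * f for w, f in zip(weights, feature_fn(combo)))
            if s > best_score:
                best, best_score = combo, s
    return best
```

The cost is exponential in the polyphony bound, which is why the search is restricted to a small set of pitch candidates per frame.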
0:08:20 The weight parameters for the score function were trained using a development set of one hundred piano samples from the MIDI-Aligned Piano Sounds (MAPS) database, developed by Emiya et al. from INRIA.
0:08:38 After the pitch estimation stage, the proposed offset detection is applied, and it is done using two-state, on/off hidden Markov models for each single pitch. In this system, an offset is detected within the time frame between two consecutive onsets, as the point where the estimated pitch first turns to an inactive state. In order to compute the state priors and the state transitions for the HMMs, MIDI files from the RWC database were used, from the classic and jazz genres. For the observation probability, we use information from the previously extracted pitch salience function; basically, the observation function for an active pitch is essentially a sigmoid of the extracted salience function. Here you can see the basic structure of the pitch-wise HMM for offset detection.
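The per-pitch decoding can be sketched as Viterbi decoding over a two-state on/off HMM. This is a minimal sketch: the transition matrix, prior, and observation values below are illustrative examples, not the parameters trained on the RWC MIDI files.

```python
import math

def viterbi_two_state(obs_on, trans, prior):
    """Viterbi decoding for one pitch's two-state (0=off, 1=on) HMM.
    obs_on[t] is the likelihood of the 'on' state at frame t (e.g. a
    sigmoid of the pitch salience); the 'off' likelihood is taken as
    1 - obs_on[t]."""
    T = len(obs_on)
    emit = lambda s, t: obs_on[t] if s == 1 else 1.0 - obs_on[t]
    delta = [[0.0, 0.0] for _ in range(T)]   # log-probability trellis
    psi = [[0, 0] for _ in range(T)]         # backpointers
    for s in (0, 1):
        delta[0][s] = math.log(prior[s]) + math.log(emit(s, 0))
    for t in range(1, T):
        for s in (0, 1):
            cands = [delta[t - 1][p] + math.log(trans[p][s]) for p in (0, 1)]
            psi[t][s] = 0 if cands[0] >= cands[1] else 1
            delta[t][s] = cands[psi[t][s]] + math.log(emit(s, t))
    # backtrack the most likely state sequence
    path = [0] * T
    path[-1] = 0 if delta[-1][0] >= delta[-1][1] else 1
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path  # the first on->off transition after an onset marks the offset
```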
0:09:38 For evaluation, we used as ground truth a set of twelve twenty-three-second excerpts from the RWC database, which consists of classic and jazz music excerpts. Most of these pieces, but not all of them, involve piano; there are also some guitars, and there is a very nice string quartet as well.
0:09:58 Here's a basic example of a transcription. In the upper figure you can see the pitch ground truth for a guitar excerpt, and in the lower one you can see the transcription. This is what the original recording sounds like. [audio] And this is the synthesized transcription of the same recording. [audio] Generally, you can see that the algorithm doesn't produce many false alarms, but in some cases it tends to underestimate the number of concurrent notes, the polyphony number, so it has some missed detections; but overall the transcription is quite good.
0:10:50 These are the results for the system. In terms of accuracy, using a ten-millisecond frame-based evaluation, the system reaches 60.5% for purely frame-based evaluation, without onset or offset detection. It drops to 59.7% when utilizing information from onsets only, because that configuration has many more false alarms, since it never deactivates any pitches. And it rises up to 61.2% for the joint onset and offset case. When compared to various methods from the literature, such as a method based on Gaussian spectral models and the HTC-based method, which was also presented before, our results show about a two-percent improvement in terms of accuracy. More detail is given by some additional metrics, where it can be seen that most of the errors are false negatives, that is, missed detections, whereas the number of false positives is relatively smaller.
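The frame-based accuracy figure quoted here is commonly computed as Acc = TP / (TP + FP + FN) over all frames; a short sketch (assuming per-frame pitch sets as input):

```python
def frame_accuracy(ref, est):
    """Frame-based accuracy Acc = TP / (TP + FP + FN).
    `ref` and `est` are lists of per-frame sets of active pitches."""
    tp = fp = fn = 0
    for r, e in zip(ref, est):
        tp += len(r & e)   # correctly detected pitches
        fp += len(e - r)   # spurious detections (false alarms)
        fn += len(r - e)   # missed detections
    return tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
```

Because false negatives sit in the denominator, a system that underestimates the polyphony (many missed notes, few false alarms) shows exactly the error profile described above.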
0:12:02 And finally, some results on the onset detection procedure. It should be noted that we were aiming for a high recall, not a high F-measure, because we were not interested in segmenting the signal, but rather in capturing as many onsets as possible.
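The recall-oriented evaluation can be stated with the standard tolerance-window matching for onsets; a sketch, where the ±50 ms tolerance is a common choice rather than necessarily the one used in this work:

```python
def onset_precision_recall(ref, est, tol=0.05):
    """Onset precision/recall with a +-tol matching window (seconds).
    Each estimated onset may match at most one reference onset."""
    est = sorted(est)
    used = [False] * len(est)
    tp = 0
    for r in sorted(ref):
        for i, e in enumerate(est):
            if not used[i] and abs(e - r) <= tol:
                used[i] = True
                tp += 1
                break
    prec = tp / len(est) if est else 1.0
    rec = tp / len(ref) if ref else 1.0
    return prec, rec
```

Maximizing recall tolerates extra detections (lower precision) as long as no true onset is missed, which is the trade-off stated above.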
0:12:21 Finally, the contributions of this work were: onset detection features derived from the multi-pitch estimation preprocessing; a score function that combines several features for multi-pitch estimation, including a novel feature for suppressing harmonically related pitches; offset detection using pitch-wise HMMs; and, as shown by results on the RWC database, transcription performance that improves upon the state of the art. In the future, we would like to explicitly model the note's sound states, such as the attack, transient, sustain, and decay parts of the produced sound; to perform joint multi-pitch estimation and note tracking, rather than doing them separately; and finally, to publicly release the methods, as was done with a previous method of ours.
0:13:12thank you
0:13:21 [Session chair] Right, we have time for some questions.
0:13:34 [Question] Hi. I noticed that you said you trained your onset detection on piano.
0:13:41 [Answer] Actually, I think the onset detection was trained on general classical music excerpts; most of them were strings, in fact.
0:13:47 [Question] Okay, I guess that was going to be my next question, because a lot of the time, on plucked strings and struck instruments, it's much, much easier to do onset detection. I was just wondering if you feel your onset detection has anything new to say about detecting onsets in things with bows, or with singing, things without sharp attacks.
0:14:06 [Answer] That's why the second onset measure was proposed: to detect soft onsets. It is a pitch-based measure, which is actually, I think, the only reliable way to detect onsets without any energy change. In fact, we have put some transcription examples on the web, like the one you heard before, where there is another example, a string quartet transcription, which is actually pretty accurate.
0:14:36 [Question] One question: can you say more about onsets? Onsets are perceptually important, and I wonder why they were so important to your performance.
0:14:46 [Answer] Well, the thing is that most multi-pitch estimation methods do not explicitly exploit information about the activation time of the produced sound, and also its deactivation. By incorporating that information, as we demonstrated, we can also improve a bit on frame-based multi-pitch estimation accuracy. Generally, I think onsets should in fact be used more widely, and not be left out of transcription systems, as is usually done.
0:15:30 [Question] It looked like a lot of your errors were missing notes. Do you mean the onsets were helping there, or was performance impaired?
0:15:36 [Answer] The missing notes were mostly produced in the case of dense chords, where you might also have some octave errors. Sometimes the upper pitch might not be detected, and in that case you have a missed detection. That doesn't have anything to do with the onsets, because the onsets are there for the lower note, but it does say something about the features we might use for multi-pitch estimation: we need features that are more robust, let's say, in the case of overlapped notes.
0:16:12 [Session chair] More questions?
0:16:15 [Question] One more, about polyphony: when you have a chord, can we really hope to get all the notes by automatic means?
0:16:23 [Answer] Well, that depends on the instrument models that you have. If, for example, you have trained the parameters of your system on a specific instrument, then it might be generally easier, compared to, let's say, training your parameters on one instrument and testing on a string quartet. So instrument-dependent models can be used, and I think the general trend in the future will be to perform instrument-specific transcription, so that it will also include joint instrument identification.
0:17:01 [Session chair] Anything else? Okay, then let's thank the speaker again. [Applause]
0:17:04 [Speaker] Thank you.