0:00:13 The goal of this work was to improve upon state-of-the-art transcription by explicitly incorporating information about note onsets and offsets.
0:00:25 Some general background on transcription, which you may have heard before: it is the process of converting an audio recording into some form of music notation. It has numerous applications in MIR and in interactive music systems, such as automated score following, and also in computational musicology. It can be divided into several subtasks, such as multi-pitch estimation, detection of note onsets and offsets, instrument identification, and extraction of rhythmic information such as the tempo. In the multi-pitch, multiple-instrument case it still remains an open problem.
0:01:04 Some related work which is linked to this work: the iterative spectral-subtraction-based system by Klapuri, which proposed the spectral smoothness principle; the rule-based system by Zhou, who also proposed as a time-frequency representation the resonator time-frequency image (RTFI), which is also used in this work; Yeh's joint multiple-F0 estimation method, which continues to rank first in the MIREX public evaluations for multiple-F0 estimation and note tracking; and an iterative estimation system for multi-pitch estimation which exploits temporal evolution, previously proposed by the authors.
0:01:50 Also, some related work on onset detection: the well-known fused onset detection functions by Bello et al., which combine energy-based and phase-based measures, and a more recent development, late fusion by Holzapfel et al., which fuses the onset descriptors at the decision level.
0:02:16 In this work we propose a system for joint multi-pitch estimation which also exploits onset and offset detection, in an effort to improve multiple-pitch estimation results. Novel onset detection features were developed and proposed, derived from the preprocessing steps of the transcription system, and offsets are, we believe for the first time, explicitly exploited, by using hidden Markov models.
0:02:48 This is the basic outline of the system. There is a preprocessing step, where the time-frequency representation is extracted, spectral whitening is performed, noise is suppressed, and a pitch salience (or pitch strength) function is extracted. Afterwards comes the core of the system: onset detection using late fusion and the proposed descriptors, followed by joint multi-pitch estimation. Afterwards, pitch-wise offset detection is performed, and the result is the final transcription in MIDI form.
0:03:21 This is an example of the time-frequency representation we used, the resonator time-frequency image, which is produced by a resonator filter bank. We used it, in contrast with the more common constant-Q transform for example, because, due to its exponential decay factor, it has better temporal resolution in the low frequencies, as you might see here. This is a very typical recording from the MIREX 2007 multiple-F0 competition, which is usually employed in evaluations.
0:03:56 After the extraction of the time-frequency representation, spectral whitening is performed in order to suppress timbral information and make the system more robust to different sound sources. The method by Klapuri was used to that end, and it was followed by a half-octave-span filtering procedure for noise suppression.
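The whitening step can be sketched roughly as follows. This is a minimal illustration only: the band layout and the compression exponent `nu` are placeholder values, not the exact parameters of Klapuri's method as used in the system.

```python
import math

def spectral_whiten(spectrum, num_bands=8, nu=0.33):
    """Flatten the spectral envelope: scale each frequency band by
    power**(nu - 1), so loud bands are compressed toward quiet ones
    and timbral colouring is suppressed."""
    n = len(spectrum)
    out = [0.0] * n
    band_size = n // num_bands
    for b in range(num_bands):
        lo = b * band_size
        hi = (b + 1) * band_size if b < num_bands - 1 else n
        band = spectrum[lo:hi]
        # root-mean-square power of this band
        power = math.sqrt(sum(x * x for x in band) / max(len(band), 1))
        gain = power ** (nu - 1) if power > 0 else 1.0
        for i in range(lo, hi):
            out[i] = gain * spectrum[i]
    return out
```

With `nu < 1`, a band that is ten times louder ends up much less than ten times louder after whitening, which is the desired source-robustness effect.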
0:04:18 Based on that whitened and noise-suppressed representation, a pitch salience (or pitch strength) function is extracted, along with tuning and inharmonicity coefficients. In the bottom figure you can see the RTFI spectrum of a C4 piano note, and in the lower left and right figures you can see the corresponding pitch salience function, where you see a prominent peak at the C4 note; but you can also see several peaks at subharmonic or superharmonic positions of that note.
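Why subharmonics also produce salience peaks can be seen with a toy harmonic-sum salience function. This is a hypothetical sketch, not the system's actual salience definition (which also handles tuning and inharmonicity): a candidate F0 one octave below a played note shares every other harmonic with it, so it still accumulates salience.

```python
def pitch_salience(spectrum, bin_freqs, f0, n_harm=5):
    """Toy salience: sum the spectral magnitude at the bins nearest
    the first n_harm harmonics of the candidate fundamental f0 (Hz)."""
    sal = 0.0
    for h in range(1, n_harm + 1):
        target = h * f0
        # index of the frequency bin closest to this harmonic
        i = min(range(len(bin_freqs)), key=lambda j: abs(bin_freqs[j] - target))
        sal += spectrum[i]
    return sal
```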
0:04:53 Afterwards, onset detection is performed. Two onset descriptors were extracted and proposed, which utilise information from the preprocessing steps of the multi-pitch estimation stage. The first proposed descriptor is spectral-flux-based and also incorporates tuning information. It was motivated by the fact that in many cases you have false alarms caused by vibrato or by tuning changes, which can mislead a normal energy-based onset detection measure. The proposed measure is basically a half-wave rectified difference over a semitone-resolution filterbank, which also incorporates information from the extracted pitch salience function, so onsets can be easily detected by peak-picking that function.
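The core idea can be sketched as plain half-wave rectified spectral flux followed by peak picking. This is an illustrative simplification: the tuning and salience refinements described in the talk are omitted, and the threshold is a hypothetical parameter.

```python
def spectral_flux_odf(frames):
    """Onset detection function: half-wave rectified spectral flux.
    `frames` is a list of magnitude spectra, one per time frame.
    Only positive energy increases contribute, so note decays do not
    trigger false onsets."""
    odf = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        odf.append(sum(max(c - p, 0.0) for p, c in zip(prev, cur)))
    return odf

def pick_peaks(odf, threshold):
    """Simple peak picking: local maxima above a fixed threshold."""
    return [i for i in range(1, len(odf) - 1)
            if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]]
```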
0:05:48 A second function for onset detection was also proposed, in order to detect soft onsets: the ones that are produced without any notable energy change and might be produced by bowed strings, for example. The proposed function was based on a chroma-wrapped version of the extracted pitch salience function, which was also half-wave rectified over time.
0:06:14 In order to combine these two onset descriptors, late fusion was applied, and in order to train the late fusion parameters, a development set from Ghent University was used, which consists of ten thirty-second classical music excerpts.
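At its simplest, decision-level (late) fusion can be sketched as merging the two descriptors' onset candidate lists and collapsing near-duplicates. This is an illustrative sketch with an assumed tolerance value; the actual fusion scheme and its trained parameters are more involved.

```python
def late_fuse(onsets_a, onsets_b, tol=0.05):
    """Decision-level fusion of two onset candidate lists (times in
    seconds): merge both lists and collapse detections that fall
    within `tol` seconds of an already-accepted onset."""
    fused = []
    for t in sorted(onsets_a + onsets_b):
        if not fused or t - fused[-1] > tol:
            fused.append(t)
    return fused
```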
0:06:32 For multiple-F0 estimation, for each frame pitch candidates are extracted, and for each possible combination the overlapped partials are estimated and overlapping partial treatment is applied. Basically, for each combination the partial collisions are computed, and afterwards the amplitudes of the overlapped partials are estimated by a discrete-cepstrum-based spectral envelope estimation procedure in the log-frequency domain. In the figure on the right you can see the harmonic partial sequence of a G5 piano note and the corresponding spectral envelope estimate.
0:07:14 Afterwards, for each possible pitch combination for each frame, a score function is computed which exploits several spectral features and also aims to minimize the residual spectrum. The features that were used are: the spectral flatness of the harmonic partial sequence; a smoothness measure based on the spectral smoothness principle; the spectral centroid, which is the centre of gravity of the harmonic partial sequence, aiming for a low spectral centroid, which is usually an indication of a musical instrument's harmonic sound; a novel proposed feature, the harmonically related pitch ratio, which was used in order to suppress any harmonic or subharmonic errors; and finally, the flatness of the residual spectrum, which we try to maximize so that no harmonic structure remains in the residual.
0:08:13 So the optimal pitch candidate set is the one that actually maximises that score function.
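The search over pitch combinations described here can be sketched as an exhaustive weighted-score maximization. The feature function and weights below are hypothetical stand-ins: the talk's actual features are the spectral flatness, smoothness, centroid, harmonically related pitch ratio, and residual flatness.

```python
from itertools import combinations

def best_pitch_set(candidates, feature_fn, weights, max_poly=3):
    """Score every candidate pitch combination up to polyphony
    `max_poly` and return the combination maximizing the weighted sum
    of its features. `feature_fn(combo)` returns a feature vector."""
    best, best_score = None, float("-inf")
    for k in range(1, max_poly + 1):
        for combo in combinations(candidates, k):
            s = sum(w * f for w, f in zip(weights, feature_fn(combo)))
            if s > best_score:
                best, best_score = combo, s
    return best
```

The cost is exponential in the polyphony bound, which is why the search is restricted to a small set of pitch candidates per frame.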
0:08:20 The weight parameters for the score function were trained using a development set of one hundred piano samples from the MIDI-Aligned Piano Sounds (MAPS) database, developed by Emiya et al. from INRIA.
0:08:38 After the pitch estimation stage, the proposed offset detection is applied, and it is done using two-state, on/off hidden Markov models for each single pitch. In this system, an offset is detected within the time frame between two consecutive onsets, as the point where the estimated pitch first turns to an inactive state. In order to compute the state priors and the state transitions for the HMMs, MIDI files from the RWC database were used, from the classic and jazz genres. For the observation probability, we use information from the previously extracted pitch salience function; basically, the observation function for an active pitch is essentially a sigmoid of the extracted salience function. Here you can see the basic structure of the pitch-wise HMM for offset detection.
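The per-pitch decoding can be sketched as Viterbi decoding over a two-state on/off HMM. This is a minimal sketch: the transition matrix, prior, and observation values below are illustrative examples, not the parameters trained on the RWC MIDI files.

```python
import math

def viterbi_two_state(obs_on, trans, prior):
    """Viterbi decoding for one pitch's two-state (0=off, 1=on) HMM.
    obs_on[t] is the likelihood of the 'on' state at frame t (e.g. a
    sigmoid of the pitch salience); the 'off' likelihood is taken as
    1 - obs_on[t]."""
    T = len(obs_on)
    emit = lambda s, t: obs_on[t] if s == 1 else 1.0 - obs_on[t]
    delta = [[0.0, 0.0] for _ in range(T)]   # log-probability trellis
    psi = [[0, 0] for _ in range(T)]         # backpointers
    for s in (0, 1):
        delta[0][s] = math.log(prior[s]) + math.log(emit(s, 0))
    for t in range(1, T):
        for s in (0, 1):
            cands = [delta[t - 1][p] + math.log(trans[p][s]) for p in (0, 1)]
            psi[t][s] = 0 if cands[0] >= cands[1] else 1
            delta[t][s] = cands[psi[t][s]] + math.log(emit(s, t))
    # backtrack the most likely state sequence
    path = [0] * T
    path[-1] = 0 if delta[-1][0] >= delta[-1][1] else 1
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path  # the first on->off transition after an onset marks the offset
```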
0:09:38 For evaluation, we used as ground truth a set of twelve twenty-three-second excerpts from the RWC database, which consists of classic and jazz music excerpts. Most of these pieces, but not all of them, involve piano; there are also some guitars, and there is a very nice string quartet as well.
0:09:58 Here's a basic example of a transcription. In the upper figure you can see the pitch ground truth for a guitar excerpt, and in the lower one you can see the transcription. This is what the original recording sounds like. [audio] And this is the synthesized transcription of the same recording. [audio] Generally, you can see that the algorithm doesn't produce many false alarms, but in some cases it tends to underestimate the number of concurrent notes, the polyphony number, so it has some missed detections; but overall the transcription is quite good.
0:10:50 These are the results for the system. In terms of accuracy, using a ten-millisecond frame-based evaluation, the system reaches 60.5% for purely frame-based evaluation, without onset or offset detection. It drops to 59.7% when utilizing information from onsets only, because that configuration has many more false alarms, since it never deactivates any pitches. And it rises up to 61.2% for the joint onset and offset case. When compared to various methods from the literature, such as a method based on Gaussian spectral models and the HTC-based method, which was also presented before, our results show about a two-percent improvement in terms of accuracy. More detail is given by some additional metrics, where it can be seen that most of the errors are false negatives, that is, missed detections, whereas the number of false positives is relatively smaller.
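The frame-based accuracy figure quoted here is commonly computed as Acc = TP / (TP + FP + FN) over all frames; a short sketch (assuming per-frame pitch sets as input):

```python
def frame_accuracy(ref, est):
    """Frame-based accuracy Acc = TP / (TP + FP + FN).
    `ref` and `est` are lists of per-frame sets of active pitches."""
    tp = fp = fn = 0
    for r, e in zip(ref, est):
        tp += len(r & e)   # correctly detected pitches
        fp += len(e - r)   # spurious detections (false alarms)
        fn += len(r - e)   # missed detections
    return tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
```

Because false negatives sit in the denominator, a system that underestimates the polyphony (many missed notes, few false alarms) shows exactly the error profile described above.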
0:12:02 And finally, some results on the onset detection procedure. It should be noted that we were aiming for a high recall, not a high F-measure, because we were not interested in segmenting the signal, but rather in capturing as many onsets as possible.
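The recall-oriented evaluation can be stated with the standard tolerance-window matching for onsets; a sketch, where the ±50 ms tolerance is a common choice rather than necessarily the one used in this work:

```python
def onset_precision_recall(ref, est, tol=0.05):
    """Onset precision/recall with a +-tol matching window (seconds).
    Each estimated onset may match at most one reference onset."""
    est = sorted(est)
    used = [False] * len(est)
    tp = 0
    for r in sorted(ref):
        for i, e in enumerate(est):
            if not used[i] and abs(e - r) <= tol:
                used[i] = True
                tp += 1
                break
    prec = tp / len(est) if est else 1.0
    rec = tp / len(ref) if ref else 1.0
    return prec, rec
```

Maximizing recall tolerates extra detections (lower precision) as long as no true onset is missed, which is the trade-off stated above.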
0:12:21 Finally, the contributions of this work were: onset detection features derived from the multi-pitch estimation preprocessing; a score function that combines several features for multi-pitch estimation, including a novel feature for suppressing harmonically related pitches; offset detection using pitch-wise HMMs; and, as shown by results on the RWC database, transcription performance that improves upon the state of the art. In the future, we would like to explicitly model the note's sound states, such as the attack, transient, sustain, and decay parts of the produced sound; to perform joint multi-pitch estimation and note tracking, rather than doing them separately; and finally, to publicly release the methods, as was done with a previous method of ours.
0:13:12thank you
0:13:21 [Session chair] Right, we have time for some questions.
0:13:34 [Question] Hi. I noticed that you said you trained your onset detection on piano.
0:13:41 [Answer] Actually, I think the onset detection was trained on general classical music excerpts; most of them were strings, in fact.
0:13:47 [Question] Okay, I guess that was going to be my next question, because a lot of the time, on plucked strings and struck instruments, it's much, much easier to do onset detection. I was just wondering if you feel your onset detection has anything new to say about detecting onsets in things with bows, or with singing, things without sharp attacks.
0:14:06 [Answer] That's why the second onset measure was proposed: to detect soft onsets. It is a pitch-based measure, which is actually, I think, the only reliable way to detect onsets without any energy change. In fact, we have put some transcription examples on the web, like the one you heard before, where there is another example, a string quartet transcription, which is actually pretty accurate.
0:14:36 [Question] One question: can you say more about onsets? Onsets are perceptually important, and I wonder why they were so important to your performance.
0:14:46 [Answer] Well, the thing is that most multi-pitch estimation methods do not explicitly exploit information about the activation time of the produced sound, and also its deactivation. By incorporating that information, as we demonstrated, we can also improve a bit on frame-based multi-pitch estimation accuracy. Generally, I think onsets should in fact be used more widely, and not be left out of transcription systems, as is usually done.
0:15:30 [Question] It looked like a lot of your errors were missing notes. Do you mean the onsets were helping there, or was performance impaired?
0:15:36 [Answer] The missing notes were mostly produced in the case of dense chords, where you might also have some octave errors. Sometimes the upper pitch might not be detected, and in that case you have a missed detection. That doesn't have anything to do with the onsets, because the onsets are there for the lower note, but it does say something about the features we might use for multi-pitch estimation: we need features that are more robust, let's say, in the case of overlapped notes.
0:16:12 [Session chair] More questions?
0:16:15 [Question] One more, about polyphony: when you have a chord, can we really hope to get all the notes by automatic means?
0:16:23 [Answer] Well, that depends on the instrument models that you have. If, for example, you have trained the parameters of your system on a specific instrument, then it might be generally easier, compared to, let's say, training your parameters on one instrument and testing on a string quartet. So instrument-dependent models can be used, and I think the general trend in the future will be to perform instrument-specific transcription, so that it will also include joint instrument identification.
0:17:01 [Session chair] Anything else? Okay, then let's thank the speaker again. [Applause]
0:17:04 [Speaker] Thank you.