0:00:13a
0:00:14um
0:00:14and which a yeah and i can be talking to about better work on um
0:00:18if search search of music pitch contours
0:00:21uh a first uh i'm gonna
0:00:23if
0:00:23start with a short overview about uh content
0:00:26based music search especially melody based music search
0:00:29a first we need to define what's a we mean by main melody
0:00:33um
0:00:34the exact mean logical definition is probably a lot more complicated but we're just gonna use something that's except in
0:00:39the uh
0:00:40music information retrieval community
0:00:42which is the single modify nick pitch sequence that a listener might trip reproduce it fast to was so or
0:00:48hum a piece of polyphonic music
0:00:50and that a listener would recognise as being be
0:00:53as sense of that music when when heard in compares
0:00:56um so based on this definition of main melody are we're just gonna make it even
0:01:00further loose sure
0:01:02and just assume that melody means dominant pitch contour
0:01:05which in turn we're gonna assume a means
0:01:08dominant the fundamental frequency
0:01:10contour and of course all three of these concepts
0:01:13you know are actually
0:01:14subtly different uh especially uh
0:01:17not only between melody and pitch but also between pitch and fundamental frequency but
0:01:21here we're just gonna lose assume that
0:01:24fundamental frequency under some kind of map map or notion of of down minutes
0:01:28um
0:01:29uh define as the main melody
0:01:31so um
0:01:32oh we know that a you know in searching we based on melody is is a very interesting problem people
0:01:37have done for a long time
0:01:38a a query by humming is a very well known
0:01:41application
0:01:42um in doing this
0:01:43a task
0:01:44um there are different methods of representing melodies
0:01:47for for music search
0:01:48and uh the most well-known known um is note
0:01:52transcriptions were or note interval transcriptions
0:01:54so if you have some kind of melody and then for example as you can see this example here
0:01:59um you can just
0:02:00a
0:02:01represent or mobile T as a sequence of notes
0:02:03see for D for you forty three G three
0:02:05or even just model the a transitions between notes
0:02:08such says up down repeat
0:02:11um and and using these transcriptions people have been quite successful when in doing a query by humming
0:02:16uh another method is to just use the pay be continuous or or frame based pitch contour
0:02:22a note the course i have a around continuous "'cause" it's not really continuous it's still a sample digital
0:02:28uh signal but nevertheless we all continuous because it
0:02:31so pretty much uh resembles a a a a a smooth curve
0:02:35um so just to give an example
0:02:37um it it in the
0:02:39uh
0:02:40example here
0:02:41um at the top and see a lot frequency
0:02:44spectrum that's actually a uh you know a small a small segment of of a beatles song
0:02:48and then based on that you can extract some kind of dominant F zero contour
0:02:52uh that can see a um at the bottom
0:02:55and again you have a a a a a time axis and then you have or frequency axis
0:02:59yeah and you just basically just plot the evolution of the uh the dominant a phone and this is what
0:03:03we're gonna call
0:03:04continuous F O contour now the reason why this is interesting
0:03:08is um for i'll get to that of a in a bit
0:03:11but uh
0:03:12let's see how we would actually do a melody based music search uh using that
0:03:16so this is just an example uh you have a query melody they can see here
0:03:20and then you have a a a a a much longer target
0:03:22a music piece that's and maybe some kind of music database
0:03:25and you can see that there's a small part of that target
0:03:28that
0:03:29a very closely resembles a the query
0:03:32um
0:03:32it's just one small segment of that target
0:03:35no and the object of music search is to know given this query you try to match it up with
0:03:40that part
0:03:41and the target
0:03:41now ready you see that there was you know a number of of
0:03:44problems most you have a solve here for example the length of the query here
0:03:47is about seventeen seconds but then the corresponding part
0:03:50um and the target is maybe a little over ten seconds
0:03:54so the target
0:03:55has a higher tempo it's been sound faster
0:03:58and the query so you have to be able to just for that are from the temple
0:04:01they can also be differences in in the in the uh musical key so maybe the queries
0:04:05as
0:04:06you know
0:04:07some that a higher key or at a low key compared to the target
0:04:11so
0:04:13well when you do a music search a first all you have to be able to search all possible starting
0:04:17locations of each target because people may sing you know for starting from the beginning of some saw
0:04:22not this are so at the beginning
0:04:24um
0:04:26a people may start thing if the middle of of of the song
0:04:29uh you want to be able to adjust for differences in speed or temple adjust for differences in key
0:04:33um people may also have some inconsistencies in rhythm in pitch
0:04:38within their saying it
0:04:41so uh dynamic time warping is a very popular uh
0:04:45no technique that's used to uh compare two different uh sequences of pitch
0:04:50uh for doing uh a music search and then there's been other state-of-the-art fingerprinting or hashing techniques
0:04:55uh that have a allowed very efficient and effective search using the note transcription data that's the example were showed
0:05:01you where you maybe have a sequence of notes or sequence of
0:05:04of transitions
0:05:05um
0:05:07uh is it it has however also been suggested that sort of using these no transcriptions using the continuous pitch
0:05:13contours that we just saw a a a can work better
0:05:16and the the prime time primary argument for this is that the no transcriptions can only be very real right
0:05:21we obtained from on a funny music
0:05:23if you have polyphonic music
0:05:25then you're no transcription is very easily break down um if there
0:05:29there errors in the pitch transcription for a fun music the no transcriptions can compound those errors
0:05:34so uh people have suggested just working on these original can pitch contour
0:05:39uh of course that's a problem
0:05:40because
0:05:41the amount of data is
0:05:42so much larger compared to just using the nodes
0:05:45so it's very computationally um
0:05:47expensive
0:05:48so if if for example we've uh in in our previous examples
0:05:51if you had maybe five notes
0:05:53a if you using the no transcription then your input data is just five
0:05:57by values but if we using the contain pitch contour it can be hundreds and hundreds of values that if
0:06:01you're sampling every
0:06:02there two miliseconds
0:06:05so
0:06:06uh in order to uh try to use
0:06:08uh is uh
0:06:10a tune is
0:06:11a pitch contours well also efficiently doing music search we previously
0:06:15a proposed of a method of indexing melody fragments using a set of
0:06:19a what we call time normalized key normalized we would coefficients and store these coefficients in a tree
0:06:25uh to search then
0:06:26them efficiently um there was a lot of mathematical teacher that when it to this that um
0:06:31uh i won't get into here
0:06:32a the problem with this method is that it just compares home of melody fragments to rigid late that it
0:06:37it just uses a simple euclidean distance between a different melody fragments is it does not really a a with
0:06:43the a query changes
0:06:44in the read them although it does allow well i differences in tempo
0:06:48so uh to really do this problem
0:06:50properly we have to do some kind of
0:06:52some kind of dynamic time warping
0:06:55so that's
0:06:56just take a look at
0:06:57a what exactly that how that dynamic time warping can be um
0:07:00formulated related
0:07:01so if you assume
0:07:02some query sequence Q
0:07:04a uh Q
0:07:05which
0:07:06yeah it's just a set of of uh
0:07:08uh a pitch samples and then you have a target sequence P
0:07:11um the classic dtw equation
0:07:14is
0:07:15what you see here are you based be a a a a scene you have R warping operations and you
0:07:19try to
0:07:20you you set up your compass your total cost as a summation
0:07:24of of costs
0:07:25a a a your warping
0:07:27uh dimension or warping operation the dimension
0:07:30and uh the path that you take a along this path
0:07:33is defined by uh these warping functions um fee if Q and fee of P this is just
0:07:38from just very traditional speech recognition or
0:07:42or other pattern recognition text
0:07:44um and here
0:07:45a we just defined the the distance
0:07:47between a query value and a pitch value as a that euclidean dist
0:07:52no note that here go we also have a and and and an extra parameter called the B
0:07:57which uh
0:07:58which actually represents the difference in the musical key between the query and the pitch because as you can see
0:08:04in the uh calculation of the uh
0:08:06of the just as
0:08:07are you also have to
0:08:08uh
0:08:09um take into account that the crew can be sung at a at a different key from the from targets
0:08:14we can't just subtract to balance you have to at some kind of offset
0:08:18and you you add here because the
0:08:20frequencies are log frequency so it of ski shift
0:08:23and the lot frequency "'cause" you domain just becomes a a a a constant by
0:08:27so the B is is some bias factor model difference in musical key and um
0:08:32you we can't of course constraint this true main roughly cons
0:08:35constant uh which does make sense if you just let the
0:08:38be change for every single different pitch value
0:08:41it wouldn't make any sense at all you have to seen that
0:08:43that that both the query and the target
0:08:45and they may have different keys
0:08:47between each other but you still maintain your key a within that so so i'm me sing higher then
0:08:53then
0:08:53john one and but then
0:08:55um
0:08:56i was still maintain a
0:08:59my musical key
0:09:00even though i choose a different one
0:09:02so i'm and i'm fruit to choose whatever key at one
0:09:05so you can see ut
0:09:06replace the be so by with with just a cost of valid B
0:09:10but even in this case uh when you actually tried to do the dynamic programming to to minimize this whole
0:09:14equation
0:09:15you basically have to compute a was over three dimensional space you have you
0:09:20your your query access you have your target access and then you also have to
0:09:24are i try every single possible
0:09:26difference
0:09:27in the uh in the musical key so you have to compute
0:09:31are are the costs in a Q times P to be
0:09:33E space so the cost is it's just too high
0:09:36um
0:09:37the on original study that try to use ms cool pitch a pitch contours actually did exactly back
0:09:44so i to to address this concern um you know people have
0:09:47have
0:09:48uh
0:09:48a propose different ways
0:09:50and
0:09:51uh what we did was to to speed up the dtw we we partition the query into segments
0:09:57treating each segment as a as their rhythmically consistent unit
0:10:00no no this idea of partitioning up
0:10:03queries or or or melodies into to segments
0:10:05that's not really do you and people have done it in in one form or another
0:10:08um
0:10:09what we did here though is uh
0:10:11there are a number of mathematical issues that you have to um
0:10:15i H
0:10:16you know a handle in terms of how the
0:10:18the the by can change it and that things like that
0:10:21and uh so these are just more algorithmic details that that are in the paper so i won't really get
0:10:26it to them
0:10:27here
0:10:28but really matters is that we are uh just dividing up the query it to segments and this drastically reduces
0:10:34the the amount of computation and we propose
0:10:37a framework for for for doing that
0:10:39um
0:10:41so this can actually solve
0:10:43quite elegantly
0:10:44um
0:10:45using a yeah just a very classical level building but with just some once subtle different
0:10:51um it's kind of like in connected word recognition what the classical word recognition technique that maybe you seen the
0:10:57rip be or too long speech recognition book
0:10:59um where you
0:11:00but upon levels were each level represents
0:11:03one word segment
0:11:04um and here we have a
0:11:05or query that's divided up into to different segments
0:11:08but then we also introduce what we call can something like buffer zones
0:11:11between the query segments and these are employed
0:11:14to a lower the uh some target segments to to overlap
0:11:18and that's provides a much greater flexibility in the dtw if you if you don't know about this overlap
0:11:23then um you introduce too much rigidity um into to the system
0:11:27so all all this
0:11:28uh it can be done quite elegantly uh with with a level building scheme
0:11:34so uh we actually um
0:11:36implemented this
0:11:37and
0:11:38uh uh we we and this
0:11:40on the uh mare's you thousand six
0:11:41test set
0:11:43and these are just some
0:11:44uh mean a standard deviations for the for this average start sisters there's time per query that you could it
0:11:50cheap
0:11:51uh using these techniques
0:11:53and uh what you see at the bottom are just asymptotic search costs
0:11:56um
0:11:57uh that's a so did associated with each different type of of of a search technique
0:12:02um and T W is actually are original work where we just rigidly the
0:12:06um try to
0:12:08compare
0:12:09um mel what you fragments was together using a simple euclidean distance
0:12:12and for that
0:12:13yeah that each search or or i i just the each comparison can just be done add in constant time
0:12:19because it's just
0:12:19a euclidean distance
0:12:21a multi-scale query windowing is just a variation of that
0:12:24um and the be a search cost is going to be asymptotically uh big oh of and what are and
0:12:30is the number
0:12:31a ship query segments
0:12:33and then the segmented dtw is is our proposed method
0:12:36uh for which be there's cost is
0:12:38it's and times P
0:12:40where again and is number of query segments and P is the link
0:12:43of of your target signal
0:12:44and the dtw is a is the brute force dtw T W that are originally would you the sort optimum
0:12:50equation in for that the search cost is P times to times be again you have to do it over
0:12:54all three dimensions
0:12:55so that's are sign for this
0:12:57is the greatest
0:12:58and and that you can obviously see that uh in actual experimentation
0:13:02um the uh
0:13:04multiscale just kill target don't is of course the the fastest
0:13:07uh and uh
0:13:09the um and U W the multiscale query when doing followed by our proposed S D
0:13:14W is
0:13:15much slower but still zero point six four seconds per
0:13:19queries
0:13:20wasn't a very bad figure
0:13:22and then we
0:13:23just estimated the amount of time would take if we use P brute force optimum dtw T which she C
0:13:29is
0:13:30much much uh takes much much longer and than any of these given methods
0:13:34um in terms of the uh search performance
0:13:37um
0:13:38we found that
0:13:39uh be uh
0:13:41the uh segmented dtw obviously is going to give the best
0:13:45are performance because again it it doesn't actual dynamic time warping between between queries and can handle rhythmic inconsistencies and
0:13:52all these other
0:13:53variations that's uh the M T W or D but the previous and to W
0:13:57can't
0:13:58um
0:13:59and uh you can see that uh the M are that we got was your point he three eight
0:14:03um this is actually a much lower than the state-of-the-art art
0:14:07uh which is higher than of the a point nine um however um and that particular work
0:14:12um
0:14:13there was a constraint
0:14:14that's
0:14:15that
0:14:15um
0:14:16that all queries
0:14:18charted at the beginning of
0:14:20of melodies or
0:14:21or
0:14:22mel what melodic phrase
0:14:23uh in our case we searched every possible location in all the songs in a database and source search space
0:14:29as is much larger and there's much more room for or
0:14:32so to just to be on an equal footing with the C every the are we also tried constraining all
0:14:37a queries to start at the beginning of melodies
0:14:40and they that gave us an M am or of zero point nine to one which is almost the same
0:14:44as that the of the art
0:14:46and we also tried an a
0:14:48upper real
0:14:49a limit every power of polyphonic music set
0:14:51uh which we just create on real using
0:14:54commercial pop some recordings so the are actual
0:14:56acoustic polyphonic a music recordings from which we did
0:15:00just some kind of automatic um a F zero extraction
0:15:04and then a are test on that
0:15:05and
0:15:06obviously the results are gonna be much lower "'cause" we now we using polyphonic music
0:15:10and uh there's can be lots of errors the pitch transcriptions
0:15:13but again our are
0:15:14our proposed method was still able to give the best for
0:15:19so and this is just a
0:15:21showing the relation between the number of query segments that use verses the M are are and the search time
0:15:26the more segments you have you have basically the close you get to the optimum dtw so W M R
0:15:31is gonna go well but then at the same time you're search times also critical uh
0:15:35so
0:15:36that's pretty much it
0:15:38and channel of this work
0:15:40there any questions
0:15:52i
0:16:00the
0:16:00more efficient way of trying
0:16:03a fine the bias like the the i your
0:16:06we task
0:16:07that is the reigning
0:16:08that's a way that and trying every possible E
0:16:12all
0:16:13while trying every possible be yeah than the computer are large you you uh one of a smaller resolution
0:16:20right so i mean
0:16:22if you just going to try every possible be
0:16:26by definition you're trying every possible be
0:16:28so
0:16:30you're
0:16:33i mean you could try using some kind of heuristics maybe to reduce
0:16:37you're search space ensure the be trying instead of trying all those possible of of the subset
0:16:41um in our work uh we just a a did simplification so that we use some kind of optimum be
0:16:46that's estimate just using the first query segment
0:16:49and that's how we were able to just eliminate that whole search space
0:16:52but uh there probably other more other heuristic
0:16:56but the that you could
0:16:56apply there