OK, so I'll be talking about our work on the search of music pitch contours. First, I'm going to start with a short overview of content-based music search, especially melody-based music search. To begin, we need to define what we mean by "main melody." The exact musicological definition is probably a lot more complicated, but we're just going to use something that's accepted in the music information retrieval community, which is: the single monophonic pitch sequence that a listener might reproduce if asked to whistle or hum a piece of polyphonic music, and that a listener would recognize as being the essence of that music when heard in comparison.

Based on this definition of main melody, we're going to loosen it even further and just assume that "melody" means the dominant pitch contour, which in turn we're going to assume means the dominant fundamental frequency contour. Of course, all three of these concepts are actually subtly different, not only melody versus pitch but also pitch versus fundamental frequency, but here we're just going to loosely assume that the fundamental frequency, under some notion of dominance, defines the main melody.

We know that searching based on melody is a very interesting problem that people have worked on for a long time; query by humming is a very well-known application. For this task there are different methods of representing melodies for music search, and the most well-known is note transcriptions, or note interval transcriptions. If you have some melody, as in the example here, you can represent it as a sequence of notes, something like C4, D4, E4, G3, or even just model the transitions between notes, such as up, down, repeat. Using these transcriptions, people have been quite successful at query by humming.
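As a concrete illustration of these two note-based representations, here is a minimal sketch of my own (the melody and MIDI note numbers are made up, not from the talk):

```python
# Hypothetical melody as MIDI note numbers (C4=60, D4=62, E4=64, G3=55).
melody = [60, 62, 64, 64, 55]

# Note-interval transcription: semitone differences between adjacent notes.
intervals = [b - a for a, b in zip(melody, melody[1:])]

# Coarser transcription of transitions only: up / down / repeat.
def udr(prev, cur):
    return "U" if cur > prev else "D" if cur < prev else "R"

contour = "".join(udr(a, b) for a, b in zip(melody, melody[1:]))

print(intervals)  # [2, 2, 0, -9]
print(contour)    # UURD
```

Both representations are invariant to the key the query is sung in, which is part of why they work well for humming input.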
Another method is to just use the quasi-continuous, or frame-based, pitch contour. Note that I say "quasi-continuous" because it's not really continuous; it's still a sampled digital signal, but we call it continuous because it pretty much resembles a smooth curve. Just to give an example: in the figure here, at the top you see a log-frequency spectrum, which is actually a small segment of a Beatles song, and based on that you can extract some kind of dominant F0 contour, which you can see at the bottom. Again you have a time axis and a frequency axis, and you basically plot the evolution of the dominant fundamental. This is what we're going to call the continuous F0 contour.

As for why this is interesting, I'll get to that in a bit, but first let's see how we would actually do a melody-based music search using it. This is just an example: you have a query melody, which you can see here, and a much longer target music piece, sitting in some kind of music database, and there's a small part of that target that very closely resembles the query. The objective of the music search is, given this query, to match it up with that part of the target. Already you can see a number of problems you have to solve here. For example, the length of the query here is about seventeen seconds, but the corresponding part of the target is maybe a little over ten seconds, so the target has a higher tempo, that is, it's been sung faster than the query, and you have to be able to adjust for that tempo. There can also be differences in the musical key: maybe the query is sung at a higher key or a lower key compared to the target.
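To make the two mismatches concrete, here is a small sketch of my own (all numbers assumed) showing that, on a log-frequency contour, a tempo change rescales the time axis while a key change is just a constant additive offset:

```python
import math

def log2_contour(f0_hz):
    """Pitch contour in octaves (log2 of F0 in Hz)."""
    return [math.log2(f) for f in f0_hz]

def change_tempo(contour, factor):
    """Nearest-neighbour resampling; factor > 1 means sung faster (shorter)."""
    n = max(1, round(len(contour) / factor))
    return [contour[min(len(contour) - 1, int(i * factor))] for i in range(n)]

def change_key(contour, semitones):
    """A key shift is a constant offset in the log-frequency domain."""
    return [v + semitones / 12.0 for v in contour]

query = log2_contour([220.0, 246.9, 261.6, 293.7])     # a short sung query
target_part = change_key(change_tempo(query, 2.0), 3)  # faster, 3 semitones up
print(len(query), len(target_part))  # 4 2
```

A matcher therefore has to undo both an unknown time warp and an unknown constant bias, which is exactly what the rest of the talk is about.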
So when you do a music search, first of all you have to be able to search all possible starting locations within each target, because people may not sing starting from the beginning of a song; they may start singing from the middle. You also want to be able to adjust for differences in speed or tempo, and for differences in key, and people may have inconsistencies in rhythm and pitch within their singing as well. Dynamic time warping is a very popular technique for comparing two different pitch sequences in music search, and there have also been state-of-the-art fingerprinting and hashing techniques that allow very efficient and effective search over note transcription data, like the example I showed you, where you have a sequence of notes or a sequence of transitions.

It has, however, also been suggested that instead of using these note transcriptions, using the continuous pitch contours that we just saw can work better. The primary argument is that note transcriptions can only be reliably obtained from monophonic music; with polyphonic music, note transcription very easily breaks down, and if there are errors in the pitch transcription of polyphonic music, the note transcriptions compound those errors. So people have suggested working directly on the original continuous pitch contour. Of course, the problem is that the amount of data is much larger than when just using the notes, so it's very computationally expensive. In our previous example, if you had maybe five notes, then using the note transcription your input data is just five values, but using the continuous pitch contour it can be hundreds and hundreds of values if you're sampling every few milliseconds.
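The dynamic time warping just mentioned, applied to frame-based contours, looks roughly like this textbook sketch (my own simplified version, with an absolute-difference cost and no key offset yet):

```python
def dtw(q, p):
    """Minimum accumulated |q_i - p_j| cost over all warping paths."""
    INF = float("inf")
    n, m = len(q), len(p)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - p[j - 1])
            # Allow a step from the left, below, or the diagonal.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A query sung at half speed still matches the same shape exactly.
print(dtw([1, 1, 2, 2, 3, 3], [1, 2, 3]))  # 0.0
```

Filling the table costs O(n * m), which already hints at why frame-level contours are so much more expensive than five note values.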
So, in order to use these continuous pitch contours while still doing music search efficiently, we previously proposed a method of indexing melody fragments using a set of what we call time-normalized, key-normalized coefficients, storing these coefficients in a tree to search them efficiently. There was a lot of mathematical machinery that went into this that I won't get into here. The problem with this method is that it compares whole melody fragments too rigidly: it just uses a simple Euclidean distance between melody fragments, so it does not really deal with changes in the rhythm of the query, although it does allow for differences in tempo. To really do this properly, we have to do some kind of dynamic time warping, so let's take a look at exactly how that dynamic time warping can be formulated.

Assume a query sequence Q, which is just a sequence of pitch samples, and a target sequence P. The classic DTW equation is what you see here: you have K warping operations, and you set up your total cost as a summation of costs along the warping path, where the path is defined by the warping functions phi_Q and phi_P. This is just from very traditional speech recognition or pattern recognition texts. Here we define the distance between a query value and a target pitch value as the Euclidean distance, but note that we also have an extra parameter called b, which represents the difference in musical key between the query and the target. In the calculation of the distance you have to take into account that the query can be sung at a different key from the target, so you can't just subtract the two values; you have to add a constant offset as well.
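Written out, the cost being described is roughly the following (a hedged reconstruction from the spoken description; the paper's exact notation may differ):

```latex
% Warping functions \phi_Q(k), \phi_P(k) trace a common path of length K
% through the query/target grid; b is the constant key offset that appears
% because a key change is a constant bias in the log-frequency domain.
D(Q, P) \;=\; \min_{\phi_Q,\,\phi_P,\,b}\;
  \sum_{k=1}^{K} d\!\left(q_{\phi_Q(k)},\, p_{\phi_P(k)}\right),
\qquad
d(q, p) \;=\; \left(q - p - b\right)^{2}
```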
The reason this offset works is that the frequencies are log-frequencies, so a key shift in the log-frequency domain just becomes a constant bias. So b is a bias factor modeling the difference in musical key, and we constrain it to remain roughly constant, which does make sense: if you let b change for every single pitch value, the match wouldn't mean anything at all. You can assume that the query and the target may have different keys from each other, but that each maintains its own key internally. I may sing higher than, say, John does, but I will still maintain my own musical key even though I chose a different one, and I'm free to choose whatever key I want. So you can see that we replace b with a single constant value.

But even in this case, when you actually try to do the dynamic programming to minimize this whole equation, you basically have to compute costs over a three-dimensional space: you have your query axis, your target axis, and then you also have to try every single possible difference in musical key, so you have to compute costs over a Q times P times B space. That cost is just too high; the original study that tried to use musical pitch contours actually did exactly that. To address this concern, people have proposed different approaches, and what we did, to speed up the DTW, was to partition the query into segments, treating each segment as a rhythmically consistent unit. Now, this idea of partitioning queries, or melodies, into segments is not really new; people have done it in one form or another. What we did here, though, involves a number of mathematical issues that you have to handle, in terms of how the bias can change and things like that.
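A toy sketch of such a partitioning (my own; the paper's actual rules are more involved): cut the query into equal-length segments, with a small margin around each segment so that adjacent matches are allowed to overlap a little during matching:

```python
def segment_with_buffers(length, n_segments, buffer):
    """Return (start, end) frame ranges, widened by `buffer` frames."""
    step = length / n_segments
    segs = []
    for k in range(n_segments):
        lo = max(0, int(k * step) - buffer)          # clamp at signal start
        hi = min(length, int((k + 1) * step) + buffer)  # clamp at signal end
        segs.append((lo, hi))
    return segs

print(segment_with_buffers(100, 4, 5))
# [(0, 30), (20, 55), (45, 80), (70, 100)]
```

Each of these ranges would then be treated as one rhythmically consistent unit in the segmented matching.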
These are just more algorithmic details that are in the paper, so I won't really get into them here. What really matters is that we are dividing the query into segments, which drastically reduces the amount of computation, and we propose a framework for doing that. This can actually be solved quite elegantly using a very classical level building scheme, with just some subtle differences. It's a lot like connected word recognition, the classical technique that you may have seen in the Rabiner and Juang speech recognition book, where you build up levels, with each level representing one word segment. Here we have a query that's divided into different segments, but we also introduce what we call buffer zones between the query segments, and these are employed to allow the matched target segments to overlap, which provides much greater flexibility in the DTW. If you don't allow this overlap, you introduce too much rigidity into the system. All of this can be done quite elegantly with a level building scheme.

We implemented this and tested it on the MIREX 2006 test set, and these are the means and standard deviations of the average search time per query that you can achieve using these techniques. What you see at the bottom are the asymptotic search costs associated with each type of search technique. The first is our original work, where we just rigidly compare melody fragments using a simple Euclidean distance; there, each comparison can be done in constant time, because it's just a Euclidean distance. Multi-scale query windowing is just a variation of that, and its search
cost is going to be asymptotically O(N), where N is the number of query segments. The segmented DTW is our proposed method, for which the search cost is N times P, where again N is the number of query segments and P is the length of the target signal. And DTW here is the brute-force DTW from the original optimum equation I showed you, for which the search cost is Q times P times B; again, you have to cover all three dimensions, so its asymptotic cost is the greatest. You can see in the actual experiments that multi-scale query windowing is of course the fastest, followed by our proposed segmented DTW, which is much slower, but at 0.64 seconds per query that still isn't a very bad figure. Then we estimated the amount of time it would take if we used the brute-force optimum DTW, which, as you can see, takes much, much longer than any of the other methods.

In terms of search performance, we found that the segmented DTW obviously gives the best results, because again it does actual dynamic time warping between query and target and can handle rhythmic inconsistencies and all those other variations that the previous methods can't. You can see that the MRR we got was 0.838. This is actually much lower than the state of the art, which is higher than 0.9. However, in that particular work there was a constraint that all queries started at the beginning of melodies, or melodic phrases; in our case we searched every possible location in every song in the database, so our search space is much larger and there's much more room for error.
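To put the asymptotic costs from the table in perspective, here is a back-of-the-envelope count (all of these numbers are assumptions of mine, not figures from the paper):

```python
# Rough operation counts for the two extremes discussed:
# brute-force DTW is O(Q * P * B), the segmented DTW is O(N * P).
Q = 800     # query frames (e.g. ~16 s at 50 frames/s; assumed)
P = 12000   # target frames (assumed)
B = 60      # candidate key offsets b on some grid (assumed)
N = 5       # query segments (assumed)

brute_force = Q * P * B   # all three dimensions
segmented = N * P         # one pass per query segment
print(brute_force // segmented)  # 9600x fewer cells to fill
```

Even with generous assumptions, the segmented formulation removes several orders of magnitude of work, which is what makes sub-second query times plausible.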
So, to be on an equal footing with the state of the art, we also tried constraining all queries to start at the beginning of melodies, and that gave us an MRR of 0.921, which is almost the same as the state of the art. We also tried a more realistic polyphonic music set, which we created using commercial pop song recordings, so these are actual acoustic polyphonic music recordings from which we did some kind of automatic F0 extraction and then ran our tests. Obviously the results are going to be much lower, because now we're using polyphonic music and there are going to be lots of errors in the pitch transcriptions, but again our proposed method was still able to give the best results. And this is just showing the relation between the number of query segments you use versus the MRR and the search time: the more segments you have, the closer you basically get to the optimum DTW, so the MRR goes up, but at the same time your search time also grows. So that's pretty much it; that's the general outline of this work. Are there any questions?

Question: Is there a more efficient way of finding the bias b? Trying every possible b makes the computation very large; could you use a smaller resolution?

Answer: Right. I mean, if you're going to try every possible b, then by definition you're trying every possible b. You could try using some kind of heuristic, maybe, to reduce your search space, so that instead of trying all those possible values of b you try a subset. In our work we just made a simplification, where we use an optimum b that's estimated using just the first query segment, and that's how we were able to eliminate that whole search space. But there are probably other heuristics that you could apply there.
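The simplification mentioned in the answer could look something like this (my own guess at the details; the paper's estimator may differ): estimate one key offset b from the first query segment, for example as the median log-frequency difference, and keep it fixed for the rest of the search:

```python
import statistics

def estimate_bias(query_seg, target_seg):
    """Median pitch difference (log-frequency) over the first segment."""
    n = min(len(query_seg), len(target_seg))
    return statistics.median(t - q for q, t in zip(query_seg[:n], target_seg[:n]))

query_seg = [5.0, 5.2, 5.4, 5.2]   # first query segment (assumed values)
target_seg = [5.3, 5.5, 5.7, 5.5]  # same shape, sung 3/12 of an octave higher
b = estimate_bias(query_seg, target_seg)
print(round(b, 6))  # 0.3
```

The median makes the estimate somewhat robust to a few wrong pitch frames, while collapsing the whole b axis of the search to a single value.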