I'll be talking about our work on efficient search of music pitch contours.
First, I'm going to start with a short overview of content-based music search, especially melody-based music search.
First we need to define what we mean by "main melody." The exact musicological definition is probably a lot more complicated, but we're just going to use one that's accepted in the music information retrieval community: the single monophonic pitch sequence that a listener might reproduce if asked to whistle or hum a piece of polyphonic music, and that a listener would recognize as being the essence of that music when heard in comparison.
Based on this definition of main melody, we're going to loosen it even further and just assume that melody means the dominant pitch contour, which in turn we'll assume means the dominant fundamental frequency contour. Of course, all three of these concepts are subtly different, not only melody versus pitch but also pitch versus fundamental frequency, but here we're just going to loosely assume that the fundamental frequency, under some notion of dominance, defines the main melody.
We know that searching based on melody is a very interesting problem that people have worked on for a long time; query-by-humming is a very well-known application.
In doing this task, there are different methods of representing melodies for music search. The most well-known is note transcriptions, or note-interval transcriptions.
So if you have some melody, as in this example here, you can represent that melody as a sequence of notes, such as C4, D4, E4, G3, or even just model the transitions between notes, such as up, down, repeat.
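As a rough illustration of these two note-based representations, here is a hypothetical sketch in Python; the melody values and helper names are made up for the example, not taken from the talk:

```python
# Hypothetical sketch of the two note-based representations mentioned above.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_name(midi):
    """Convert a MIDI note number to a name like 'C4' (MIDI 60 = C4)."""
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

def to_transitions(midi_notes):
    """Reduce a note sequence to coarse up/down/repeat interval symbols."""
    out = []
    for prev, cur in zip(midi_notes, midi_notes[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "R")
    return out

melody = [60, 62, 64, 64, 67]                  # an example note sequence
names = [midi_to_name(n) for n in melody]      # ['C4', 'D4', 'E4', 'E4', 'G4']
transitions = to_transitions(melody)           # ['U', 'U', 'R', 'U']
```

The transition representation deliberately throws away interval sizes, which makes it more robust to singers who are off-key but preserves less information.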
Using these transcriptions, people have been quite successful at doing query-by-humming.
Another method is to just use the quasi-continuous, or frame-based, pitch contour. Note that I put quotes around "continuous" because it's not really continuous; it's still a sampled digital signal, but we call it continuous because it pretty much resembles a smooth curve.
Just to give an example: at the top you can see a log-frequency spectrum, which is actually a small segment of a Beatles song, and based on that you can extract some kind of dominant F0 contour, which you can see at the bottom. Again you have a time axis and a frequency axis, and you basically plot the evolution of the dominant fundamental; this is what we're going to call the continuous F0 contour. Now, I'll get to the reason why this is interesting in a bit, but first let's see how we would actually do a melody-based music search using it.
This is just an example: you have a query melody, which you can see here, and then you have a much longer target music piece that sits in some kind of music database. You can see that there's a small part of that target that very closely resembles the query; it's just one small segment of the target. The objective of music search is, given this query, to match it up with that part of the target.
Already you can see that there are a number of problems you have to solve here. For example, the length of the query here is about seventeen seconds, but the corresponding part in the target is maybe a little over ten seconds, so the target has a higher tempo; it's been sung faster than the query, and you have to be able to adjust for that tempo. There can also be differences in musical key: maybe the query is sung at a higher key or a lower key compared to the target.
So when you do a music search, first of all you have to be able to search all possible starting locations in each target, because people may not start singing from the beginning of a song; they may start in the middle. You want to be able to adjust for differences in speed or tempo, and adjust for differences in key. People may also have some inconsistencies in rhythm and pitch within their singing.
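For instance, a uniform tempo difference just rescales the contour in time. A minimal sketch of that, assuming simple linear interpolation (the helper name and interpolation choice are my own, not from the talk):

```python
def time_scale(contour, factor):
    """Linearly resample a pitch contour to round(len * factor) frames,
    simulating a uniform tempo change (faster singing => factor < 1)."""
    n_out = max(2, round(len(contour) * factor))
    out = []
    for i in range(n_out):
        x = i * (len(contour) - 1) / (n_out - 1)   # position in the input
        lo = int(x)
        hi = min(lo + 1, len(contour) - 1)
        frac = x - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

contour = [60.0, 61.0, 62.0, 63.0, 64.0]
fast = time_scale(contour, 0.6)   # the same contour sung faster
```

A global rescale like this only handles a constant tempo offset; the local rhythm inconsistencies mentioned above are exactly what dynamic time warping is for.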
Dynamic time warping is a very popular technique used to compare two different pitch sequences for music search, and there have been other state-of-the-art fingerprinting or hashing techniques that have allowed very efficient and effective search using note transcription data like the example I showed you, where you have a sequence of notes or a sequence of transitions.
It has, however, also been suggested that instead of using these note transcriptions, using the continuous pitch contours that we just saw can work better. The primary argument for this is that note transcriptions can only be reliably obtained from monophonic music. If you have polyphonic music, then your note transcription very easily breaks down: if there are errors in the pitch transcription of polyphonic music, the note transcription can compound those errors. So people have suggested just working on the original continuous pitch contours.
Of course, there's a problem: the amount of data is so much larger compared to just using the notes, so it's very computationally expensive. For example, in our previous example, if you had maybe five notes, then using the note transcription your input data is just five values; but using the continuous pitch contour, it can be hundreds and hundreds of values if you're sampling every few milliseconds.
So, in order to use these continuous pitch contours while still doing music search efficiently, we previously proposed a method of indexing melody fragments using a set of what we call time-normalized, key-normalized wavelet coefficients, and storing those coefficients in a tree to search them efficiently. There was a lot of mathematical machinery that went into this, which I won't get into here.
The problem with this method is that it compares melody fragments too rigidly: it just uses a simple Euclidean distance between different melody fragments, so it does not really deal with changes in the query's rhythm, although it does allow for differences in tempo. So to really do this properly, we have to do some kind of dynamic time warping.
So let's take a look at exactly how that dynamic time warping can be formulated.
If you assume some query sequence Q, which is just a sequence of pitch samples, and a target sequence P, the classic DTW equation is what you see here: you have T warping operations, and you set up your total cost as a summation of the per-step costs along the warping path, D = Σ_{t=1..T} d(q_{φ_Q(t)}, p_{φ_P(t)}). The path is defined by the warping functions φ_Q and φ_P; this is just from very traditional speech recognition or other pattern recognition texts.
Here we define the distance between a query value and a target pitch value as the Euclidean distance, but note that we also have an extra parameter called b, which represents the difference in musical key between the query and the target. In the calculation of the distance you have to take into account that the query can be sung at a different key from the target, so you can't just subtract the two values; you have to add some kind of offset. You can add it because the frequencies are log frequencies, so a key shift in the log-frequency domain just becomes a constant bias.
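To see why the offset is additive, consider mapping frequencies onto a semitone (log-frequency) scale: transposing by k semitones multiplies the frequency by 2^(k/12), which the log turns into a constant additive shift. A quick check, using my own illustrative helper:

```python
import math

def semitone(f_hz, ref_hz=440.0):
    """Map frequency in Hz onto a continuous semitone scale (A4 = 69)."""
    return 69.0 + 12.0 * math.log2(f_hz / ref_hz)

f = 440.0
k = 3                                 # transpose up 3 semitones
f_shifted = f * 2 ** (k / 12)         # multiplicative shift in Hz
delta = semitone(f_shifted) - semitone(f)   # additive shift in the log domain
```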
So b is a bias factor modeling the difference in musical key, and of course we constrain it to remain roughly constant, which does make sense: if you just let b change for every single pitch value, it wouldn't make any sense at all. You have to assume that the query and the target may have different keys between each other, but that each maintains its own key internally. So I may sing higher than John Lennon did, but I still maintain my musical key even though I chose a different one; I'm free to choose whatever key I want.
So you can see we've replaced the varying bias with a constant value b. But even in this case, when you actually try to do the dynamic programming to minimize this whole equation, you basically have to compute costs over a three-dimensional space: you have your query axis, you have your target axis, and then you also have to try every single possible difference in musical key. So you have to compute the costs in a Q × P × B space, and that cost is just too high.
um
the on original study that try to use ms cool pitch a pitch contours actually did exactly back
To address this concern, people have proposed different approaches. What we did to speed up the DTW was to partition the query into segments, treating each segment as a rhythmically consistent unit.
queries or or or melodies into to segments
that's not really do you and people have done it in in one form or another
um
what we did here though is uh
there are a number of mathematical issues that you have to um
i H
you know a handle in terms of how the
the the by can change it and that things like that
and uh so these are just more algorithmic details that that are in the paper so i won't really get
it to them
here
What really matters is that we are dividing up the query into segments, and this drastically reduces the amount of computation; we propose a framework for doing that.
This can actually be solved quite elegantly using a very classical level-building approach, with just some subtle differences. It's kind of like connected word recognition, the classical technique that you may have seen in the Rabiner and Juang speech recognition book, where you build up levels, with each level representing one word segment.
Here we have our query divided up into different segments, but we also introduce what we call buffer zones between the query segments. These are employed to allow the matched target segments to overlap, which provides much greater flexibility in the DTW; if you don't allow this overlap, you introduce too much rigidity into the system. All of this can be done quite elegantly with a level-building scheme.
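As a toy illustration of the segmentation idea only (uniform segments plus symmetric buffer zones; the actual segment handling and the level-building search are in the paper, and this helper is my own):

```python
def segment_query(q, n_segments, buffer_frames):
    """Split a query contour into n roughly equal segments, extending each
    by a small buffer zone so adjacent matched regions may overlap."""
    L = len(q)
    bounds = [round(k * L / n_segments) for k in range(n_segments + 1)]
    segments = []
    for k in range(n_segments):
        lo = max(0, bounds[k] - buffer_frames)
        hi = min(L, bounds[k + 1] + buffer_frames)
        segments.append(q[lo:hi])
    return segments

q = list(range(100))                  # a stand-in for 100 pitch frames
segs = segment_query(q, 4, 5)         # 4 segments with 5-frame buffers
```

Each segment can then be matched with plain DTW, and the per-segment results are chained level by level instead of warping the whole query at once.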
We actually implemented this and evaluated it on the MIREX 2006 test set. These are the means and standard deviations of the search time per query that we achieved using these techniques.
What you see at the bottom are the asymptotic search costs associated with each type of search technique. The first is our original work, where we just rigidly compared melody fragments using a simple Euclidean distance; for that, each comparison can be done in constant time, because it's just a Euclidean distance. Multi-scale query windowing is a variation of that, and its search cost is asymptotically O(N), where N is the number of query segments. The segmented DTW is our proposed method, for which the search cost is O(N·P), where again N is the number of query segments and P is the length of the target signal. And the last is the brute-force DTW that I originally showed in the optimal equation; for that, the search cost is O(Q·P·B), because again you have to do it over all three dimensions, so its asymptotic cost is the greatest.
In actual experimentation, you can see that the multi-scale query windowing is of course the fastest, followed by our proposed segmented DTW, which is much slower but still takes only 0.64 seconds per query, which isn't a very bad figure. We then estimated the amount of time it would take if we used the brute-force optimal DTW, which, as you can see, takes much, much longer than any of these methods.
In terms of search performance, we found that the segmented DTW obviously gives the best performance, because it does actual dynamic time warping between queries and targets, and it can handle rhythmic inconsistencies and all the other variations that the previous methods can't.
The MRR that we got was 0.838. This is actually much lower than the state of the art, which is higher than 0.9. However, in that particular work there was a constraint that all queries started at the beginning of melodies, or melodic phrases, whereas in our case we searched every possible location in all the songs in the database, so our search space is much larger and there's much more room for error.
So, just to be on an equal footing with the state of the art, we also tried constraining all queries to start at the beginning of melodies, and that gave us an MRR of 0.921, which is almost the same as the state of the art.
We also tried an evaluation on a real-world polyphonic music set, which we created ourselves using commercial pop song recordings: actual acoustic polyphonic music recordings from which we did some kind of automatic F0 extraction, and then we tested on that. Obviously the results are going to be much lower, because we're now using polyphonic music and there can be lots of errors in the pitch transcriptions, but again, our proposed method was still able to give the best performance.
This last slide shows the relation between the number of query segments used versus the MRR and the search time. The more segments you have, basically the closer you get to the optimal DTW, so the MRR is going to go up, but at the same time your search time also grows. So that's pretty much it; that's an overview of this work. Are there any questions?
Audience: Is there a more efficient way of finding the bias b, rather than trying every possible value? Trying every possible b makes the computation very large, unless you use a smaller resolution.
Right, so I mean, if you're going to try every possible b, then by definition you're trying every possible b. You could try using some kind of heuristic to reduce your search space, so that instead of trying all possible values of b you try a subset. In our work, we just did a simplification: we use an optimal b that's estimated using just the first query segment, and that's how we were able to eliminate that whole search space.
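The talk doesn't spell out the estimator, but a natural choice (my assumption, not necessarily what the paper uses) is the L2-optimal constant offset, which is just the mean frame-wise difference between the first query segment and a candidate target segment:

```python
def estimate_bias(query_seg, target_seg):
    """L2-optimal constant key offset between two aligned log-frequency
    segments: the mean of the frame-wise differences (an assumed estimator,
    not necessarily the one from the paper)."""
    n = min(len(query_seg), len(target_seg))
    return sum(target_seg[i] - query_seg[i] for i in range(n)) / n

q_seg = [60.0, 62.0, 64.0, 62.0]
t_seg = [63.0, 65.0, 67.0, 65.0]
b_hat = estimate_bias(q_seg, t_seg)   # the query is 3 semitones low
```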
But there are probably other heuristics that you could apply there.