I'll be talking about our work on efficient search of music pitch contours.
First, I'm going to start with a short overview of content-based music search, especially melody-based music search.
First we need to define what we mean by "main melody." The exact musicological definition is probably a lot more complicated, but we're just going to use one that's accepted in the music information retrieval community: the single monophonic pitch sequence that a listener might reproduce if asked to whistle or hum a piece of polyphonic music, and that a listener would recognize as being the essence of that music when heard in comparison.
Based on this definition of main melody, we're going to loosen it even further and just assume that melody means the dominant pitch contour, which in turn we'll assume means the dominant fundamental frequency contour. Of course, all three of these concepts are subtly different, not only melody versus pitch but also pitch versus fundamental frequency, but here we're just going to loosely assume that the fundamental frequency, under some notion of dominance, defines the main melody.
We know that searching based on melody is a very interesting problem that people have worked on for a long time; query-by-humming is a very well-known application.
In doing this task, there are different methods of representing melodies for music search. The most well-known is note transcriptions, or note-interval transcriptions.
So if you have some melody, as in this example here, you can represent that melody as a sequence of notes, such as C4, D4, E4, G3, or even just model the transitions between notes, such as up, down, repeat.
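As a rough illustration of these two note-based representations, here is a hypothetical sketch in Python; the melody values and helper names are made up for the example, not taken from the talk:

```python
# Hypothetical sketch of the two note-based representations mentioned above.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_name(midi):
    """Convert a MIDI note number to a name like 'C4' (MIDI 60 = C4)."""
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

def to_transitions(midi_notes):
    """Reduce a note sequence to coarse up/down/repeat interval symbols."""
    out = []
    for prev, cur in zip(midi_notes, midi_notes[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "R")
    return out

melody = [60, 62, 64, 64, 67]                  # an example note sequence
names = [midi_to_name(n) for n in melody]      # ['C4', 'D4', 'E4', 'E4', 'G4']
transitions = to_transitions(melody)           # ['U', 'U', 'R', 'U']
```

The transition representation deliberately throws away interval sizes, which makes it more robust to singers who are off-key but preserves less information.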
Using these transcriptions, people have been quite successful at doing query-by-humming.
Another method is to just use the quasi-continuous, or frame-based, pitch contour. Note that I put quotes around "continuous" because it's not really continuous; it's still a sampled digital signal, but we call it continuous because it pretty much resembles a smooth curve.
Just to give an example: at the top you can see a log-frequency spectrum, which is actually a small segment of a Beatles song, and based on that you can extract some kind of dominant F0 contour, which you can see at the bottom. Again you have a time axis and a frequency axis, and you basically plot the evolution of the dominant fundamental; this is what we're going to call the continuous F0 contour. Now, I'll get to the reason why this is interesting in a bit, but first let's see how we would actually do a melody-based music search using it.
This is just an example: you have a query melody, which you can see here, and then you have a much longer target music piece that sits in some kind of music database. You can see that there's a small part of that target that very closely resembles the query; it's just one small segment of the target. The objective of music search is, given this query, to match it up with that part of the target.
Already you can see that there are a number of problems you have to solve here. For example, the length of the query here is about seventeen seconds, but the corresponding part in the target is maybe a little over ten seconds, so the target has a higher tempo; it's been sung faster than the query, and you have to be able to adjust for that tempo. There can also be differences in musical key: maybe the query is sung at a higher key or a lower key compared to the target.
So when you do a music search, first of all you have to be able to search all possible starting locations in each target, because people may not start singing from the beginning of a song; they may start in the middle. You want to be able to adjust for differences in speed or tempo, and adjust for differences in key. People may also have some inconsistencies in rhythm and pitch within their singing.
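For instance, a uniform tempo difference just rescales the contour in time. A minimal sketch of that, assuming simple linear interpolation (the helper name and interpolation choice are my own, not from the talk):

```python
def time_scale(contour, factor):
    """Linearly resample a pitch contour to round(len * factor) frames,
    simulating a uniform tempo change (faster singing => factor < 1)."""
    n_out = max(2, round(len(contour) * factor))
    out = []
    for i in range(n_out):
        x = i * (len(contour) - 1) / (n_out - 1)   # position in the input
        lo = int(x)
        hi = min(lo + 1, len(contour) - 1)
        frac = x - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

contour = [60.0, 61.0, 62.0, 63.0, 64.0]
fast = time_scale(contour, 0.6)   # the same contour sung faster
```

A global rescale like this only handles a constant tempo offset; the local rhythm inconsistencies mentioned above are exactly what dynamic time warping is for.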
Dynamic time warping is a very popular technique used to compare two different pitch sequences for music search, and there have been other state-of-the-art fingerprinting or hashing techniques that have allowed very efficient and effective search using note transcription data like the example I showed you, where you have a sequence of notes or a sequence of transitions.
It has, however, also been suggested that instead of using these note transcriptions, using the continuous pitch contours that we just saw can work better. The primary argument for this is that note transcriptions can only be reliably obtained from monophonic music. If you have polyphonic music, then your note transcription very easily breaks down: if there are errors in the pitch transcription of polyphonic music, the note transcription can compound those errors. So people have suggested just working on the original continuous pitch contours.
Of course, there's a problem: the amount of data is so much larger compared to just using the notes, so it's very computationally expensive. For example, in our previous example, if you had maybe five notes, then using the note transcription your input data is just five values; but using the continuous pitch contour, it can be hundreds and hundreds of values if you're sampling every few milliseconds.
So, in order to use these continuous pitch contours while still doing music search efficiently, we previously proposed a method of indexing melody fragments using a set of what we call time-normalized, key-normalized wavelet coefficients, and storing those coefficients in a tree to search them efficiently. There was a lot of mathematical machinery that went into this, which I won't get into here.
The problem with this method is that it compares melody fragments too rigidly: it just uses a simple Euclidean distance between different melody fragments, so it does not really deal with changes in the query's rhythm, although it does allow for differences in tempo. So to really do this properly, we have to do some kind of dynamic time warping.
So let's take a look at exactly how that dynamic time warping can be formulated.
If you assume some query sequence Q, which is just a sequence of pitch samples, and a target sequence P, the classic DTW equation is what you see here: you have T warping operations, and you set up your total cost as a summation of the per-step costs along the warping path, D = Σ_{t=1..T} d(q_{φ_Q(t)}, p_{φ_P(t)}). The path is defined by the warping functions φ_Q and φ_P; this is just from very traditional speech recognition or other pattern recognition texts.
Here we define the distance between a query value and a target pitch value as the Euclidean distance, but note that we also have an extra parameter called b, which represents the difference in musical key between the query and the target. In the calculation of the distance you have to take into account that the query can be sung at a different key from the target, so you can't just subtract the two values; you have to add some kind of offset. You can add it because the frequencies are log frequencies, so a key shift in the log-frequency domain just becomes a constant bias.
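To see why the offset is additive, consider mapping frequencies onto a semitone (log-frequency) scale: transposing by k semitones multiplies the frequency by 2^(k/12), which the log turns into a constant additive shift. A quick check, using my own illustrative helper:

```python
import math

def semitone(f_hz, ref_hz=440.0):
    """Map frequency in Hz onto a continuous semitone scale (A4 = 69)."""
    return 69.0 + 12.0 * math.log2(f_hz / ref_hz)

f = 440.0
k = 3                                 # transpose up 3 semitones
f_shifted = f * 2 ** (k / 12)         # multiplicative shift in Hz
delta = semitone(f_shifted) - semitone(f)   # additive shift in the log domain
```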
So b is a bias factor modeling the difference in musical key, and of course we constrain it to remain roughly constant, which does make sense: if you just let b change for every single pitch value, it wouldn't make any sense at all. You have to assume that the query and the target may have different keys between each other, but that each maintains its own key internally. So I may sing higher than John Lennon did, but I still maintain my musical key even though I chose a different one; I'm free to choose whatever key I want.
So you can see we've replaced the varying bias with a constant value b. But even in this case, when you actually try to do the dynamic programming to minimize this whole equation, you basically have to compute costs over a three-dimensional space: you have your query axis, you have your target axis, and then you also have to try every single possible difference in musical key. So you have to compute the costs in a Q × P × B space, and that cost is just too high.
um
the on original study that try to use ms cool pitch a pitch contours actually did exactly back
To address this concern, people have proposed different approaches. What we did to speed up the DTW was to partition the query into segments, treating each segment as a rhythmically consistent unit.
queries or or or melodies into to segments
that's not really do you and people have done it in in one form or another
um
what we did here though is uh
there are a number of mathematical issues that you have to um
i H
you know a handle in terms of how the
the the by can change it and that things like that
and uh so these are just more algorithmic details that that are in the paper so i won't really get
it to them
here
What really matters is that we are dividing up the query into segments, and this drastically reduces the amount of computation; we propose a framework for doing that.
This can actually be solved quite elegantly using a very classical level-building approach, with just some subtle differences. It's kind of like connected word recognition, the classical technique that you may have seen in the Rabiner and Juang speech recognition book, where you build up levels, with each level representing one word segment.
Here we have our query divided up into different segments, but we also introduce what we call buffer zones between the query segments. These are employed to allow the matched target segments to overlap, which provides much greater flexibility in the DTW; if you don't allow this overlap, you introduce too much rigidity into the system. All of this can be done quite elegantly with a level-building scheme.
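As a toy illustration of the segmentation idea only (uniform segments plus symmetric buffer zones; the actual segment handling and the level-building search are in the paper, and this helper is my own):

```python
def segment_query(q, n_segments, buffer_frames):
    """Split a query contour into n roughly equal segments, extending each
    by a small buffer zone so adjacent matched regions may overlap."""
    L = len(q)
    bounds = [round(k * L / n_segments) for k in range(n_segments + 1)]
    segments = []
    for k in range(n_segments):
        lo = max(0, bounds[k] - buffer_frames)
        hi = min(L, bounds[k + 1] + buffer_frames)
        segments.append(q[lo:hi])
    return segments

q = list(range(100))                  # a stand-in for 100 pitch frames
segs = segment_query(q, 4, 5)         # 4 segments with 5-frame buffers
```

Each segment can then be matched with plain DTW, and the per-segment results are chained level by level instead of warping the whole query at once.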
We actually implemented this and evaluated it on the MIREX 2006 test set. These are the means and standard deviations of the search time per query that we achieved using these techniques.
What you see at the bottom are the asymptotic search costs associated with each type of search technique. The first is our original work, where we just rigidly compared melody fragments using a simple Euclidean distance; for that, each comparison can be done in constant time, because it's just a Euclidean distance. Multi-scale query windowing is a variation of that, and its search cost is asymptotically O(N), where N is the number of query segments. The segmented DTW is our proposed method, for which the search cost is O(N·P), where again N is the number of query segments and P is the length of the target signal. And the last is the brute-force DTW that I originally showed in the optimal equation; for that, the search cost is O(Q·P·B), because again you have to do it over all three dimensions, so its asymptotic cost is the greatest.
In actual experimentation, you can see that the multi-scale query windowing is of course the fastest, followed by our proposed segmented DTW, which is much slower but still takes only 0.64 seconds per query, which isn't a very bad figure. We then estimated the amount of time it would take if we used the brute-force optimal DTW, which, as you can see, takes much, much longer than any of these methods.
In terms of search performance, we found that the segmented DTW obviously gives the best performance, because it does actual dynamic time warping between queries and targets, and it can handle rhythmic inconsistencies and all the other variations that the previous methods can't.
The MRR that we got was 0.838. This is actually much lower than the state of the art, which is higher than 0.9. However, in that particular work there was a constraint that all queries started at the beginning of melodies, or melodic phrases, whereas in our case we searched every possible location in all the songs in the database, so our search space is much larger and there's much more room for error.
So, just to be on an equal footing with the state of the art, we also tried constraining all queries to start at the beginning of melodies, and that gave us an MRR of 0.921, which is almost the same as the state of the art.
We also tried an evaluation on a real-world polyphonic music set, which we created ourselves using commercial pop song recordings: actual acoustic polyphonic music recordings from which we did some kind of automatic F0 extraction, and then we tested on that. Obviously the results are going to be much lower, because we're now using polyphonic music and there can be lots of errors in the pitch transcriptions, but again, our proposed method was still able to give the best performance.
This last slide shows the relation between the number of query segments used versus the MRR and the search time. The more segments you have, basically the closer you get to the optimal DTW, so the MRR is going to go up, but at the same time your search time also grows. So that's pretty much it; that's an overview of this work. Are there any questions?
Audience: Is there a more efficient way of finding the bias b, rather than trying every possible value? Trying every possible b makes the computation very large, unless you use a smaller resolution.
Right, so I mean, if you're going to try every possible b, then by definition you're trying every possible b. You could try using some kind of heuristic to reduce your search space, so that instead of trying all possible values of b you try a subset. In our work, we just did a simplification: we use an optimal b that's estimated using just the first query segment, and that's how we were able to eliminate that whole search space.
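The talk doesn't spell out the estimator, but a natural choice (my assumption, not necessarily what the paper uses) is the L2-optimal constant offset, which is just the mean frame-wise difference between the first query segment and a candidate target segment:

```python
def estimate_bias(query_seg, target_seg):
    """L2-optimal constant key offset between two aligned log-frequency
    segments: the mean of the frame-wise differences (an assumed estimator,
    not necessarily the one from the paper)."""
    n = min(len(query_seg), len(target_seg))
    return sum(target_seg[i] - query_seg[i] for i in range(n)) / n

q_seg = [60.0, 62.0, 64.0, 62.0]
t_seg = [63.0, 65.0, 67.0, 65.0]
b_hat = estimate_bias(q_seg, t_seg)   # the query is 3 semitones low
```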
But there are probably other heuristics that you could apply there.