I'm the chair of this session. So let's start with the first presentation.

"Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio", by Yun Wang. Okay, please.

Okay, good morning everyone. I'm presenting my paper, "Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio".

Here first you see the block diagram of a typical separation system for voice and accompaniment. It is made up of two main modules. One is melody extraction, which outputs a pitch contour from the audio signal; then time-frequency masking works on the spectrogram to give estimates of the spectrograms of the voice and the accompaniment. Different systems differ in the techniques they use for these individual modules.

For melody extraction, popular methods include hidden Markov models and non-negative matrix factorization, and for time-frequency masking, there is hard masking and soft masking. Our work is largely based on the work of Durrieu, which is entirely based on non-negative matrix factorization. But we found that for melody extraction, NMF doesn't work very well, so we were also inspired by the work of Hsu, which does the extraction with hidden Markov models.

First, I'll give a brief review of NMF-based melody extraction and the time-frequency masking.

In non-negative matrix factorization, the observed spectrogram of the given audio signal is regarded as a stochastic process, where each element x_ft is a complex number obeying a Gaussian distribution with a variance parameter d_ft; if you put all the d's together, you get the power spectrogram D. The problem of non-negative matrix factorization is to estimate this power spectrogram D so as to maximize the likelihood of the observed spectrogram X.

The power spectrogram of the total signal can be decomposed into two parts: the spectrogram of the voice, and the spectrogram of the music, that is, the accompaniment. Furthermore, the spectrogram of the voice can be decomposed into the product of the spectrograms of the glottal excitation and the vocal tract. In each of these parentheses, the matrix P can be regarded as a codebook, and the matrix A can be regarded as the linear combination coefficients of these basis vectors. Let me show you how this works.

Let's take the glottal excitation matrix P^F as an example. The P^F matrix looks like this: each column is the spectrum of the glottal excitation at a certain fundamental frequency. Fundamental frequencies are expressed as MIDI numbers, which are a log scale of frequency. Here you can see two columns of the matrix: one is for MIDI number 55, and the other for a higher MIDI number. You can see that 55 has a lower fundamental frequency, so the harmonics are placed closer together, while for the higher one they are placed further apart.
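A codebook like the one just described could be constructed roughly as follows; the harmonic-bump construction and all parameter values here are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def excitation_codebook(midi_low=39, midi_high=74, sr=16000, n_fft=1024):
    """Hypothetical sketch of a glottal-excitation codebook P^F.

    Each column is a harmonic comb for one fundamental frequency,
    indexed by MIDI number (a log scale of frequency). Lower MIDI
    numbers give closer-spaced harmonics, hence more of them below
    the Nyquist frequency.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)        # STFT bin centres
    sigma = 2.0 * sr / n_fft                          # fixed bump width (assumed)
    cols = []
    for n in range(midi_low, midi_high + 1):
        f0 = 440.0 * 2.0 ** ((n - 69) / 12.0)         # MIDI -> Hz
        col = np.zeros_like(freqs)
        k = 1
        while k * f0 < sr / 2:
            # place each harmonic as a narrow Gaussian bump
            col += np.exp(-0.5 * ((freqs - k * f0) / sigma) ** 2)
            k += 1
        cols.append(col)
    return np.array(cols).T                           # (n_bins, n_pitches)
```

Note that with similar amplitudes per harmonic, the low-pitch columns accumulate more total energy, which is exactly the normalization issue discussed later in the talk.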

As for the A^F matrix, which contains the combination coefficients for these basis vectors: for example, if we look at this frame, there is a large coefficient for the basis vector at MIDI number 60, and a smaller one for the basis vector an octave higher. If you plot the whole A^F matrix, you can actually see the pitch contour on this matrix, which is the dark line here. The fainter line above it is likely the second harmonic, and the small lines may be the accompaniment.

The procedure for melody extraction and soft masking using NMF is as follows. First, we fix the P^F matrix as shown in the previous slide. Then we solve for the other four matrices using an iterative procedure; we are especially interested in A^F. Next, we find the strongest continuous pitch track on this A^F matrix using dynamic programming, and then we clear the values that are far from this continuous pitch track.
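The dynamic-programming step could be sketched like this; the linear jump penalty is a hypothetical stand-in for whatever smoothness cost the paper actually uses.

```python
import numpy as np

def strongest_track(A, jump_penalty=1.0):
    """Find one pitch index per frame maximizing the summed activations
    A[p, t] minus a penalty on pitch jumps between consecutive frames.
    A sketch of the dynamic-programming step, not the paper's exact costs."""
    P, T = A.shape
    # cost[p_cur, p_prev]: penalty for moving from p_prev to p_cur
    cost = np.abs(np.arange(P)[:, None] - np.arange(P)[None, :]) * jump_penalty
    score = np.zeros((P, T))
    back = np.zeros((P, T), dtype=int)
    score[:, 0] = A[:, 0]
    for t in range(1, T):
        trans = score[:, t - 1][None, :] - cost
        back[:, t] = trans.argmax(axis=1)
        score[:, t] = trans.max(axis=1) + A[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = score[:, -1].argmax()
    for t in range(T - 1, 0, -1):     # backtrack
        path[t - 1] = back[path[t], t]
    return path
```

The pruning step then zeroes every entry of A^F further than some distance from this track before re-solving.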

With this new A^F, we solve for the other matrices again, which gives a more accurate estimate. After solving for all the matrices in the decomposition of the power spectrogram, we can then use Wiener filtering to estimate the complex spectrograms of the voice and the accompaniment, and invert these into the time domain with the overlap-add method. Then we get estimates of the voice and the accompaniment, respectively.
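The Wiener-filtering step amounts to a soft mask given the two estimated power spectrograms; a minimal sketch:

```python
import numpy as np

def wiener_separate(X, D_voice, D_accomp):
    """Soft masking by Wiener filtering: the complex STFT X is split
    in proportion to the estimated power spectrograms of voice and
    accompaniment. The time signals would then be recovered by an
    inverse STFT with overlap-add."""
    eps = 1e-12
    total = D_voice + D_accomp + eps
    X_voice = X * (D_voice / total)       # soft mask, values in [0, 1]
    X_accomp = X * (D_accomp / total)
    return X_voice, X_accomp
```

By construction the two estimates sum back to the mixture spectrogram, unlike a hard (binary) mask, which assigns each bin entirely to one source.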

Now here comes the most important part of my talk: we find that non-negative matrix factorization doesn't work well enough for melody extraction. The A^F matrix shown on the previous slides was an idealized one; the actual A^F we get looks like this. We can see that there is a great imbalance across different frequencies: at high frequencies the values of A^F are large, and at lower frequencies they are small.

We have identified two causes for this imbalance. The first is the nonlinearity of the MIDI number scale we are using. The MIDI number scale is a logarithmic scale of frequency, so for the same amount of energy, in the low-frequency range there are more basis vectors to divide it among, and the coefficients of the individual basis vectors come out smaller than in the high-frequency range. This is one of the reasons why the A^F matrix has smaller values in the lower frequency range.

To compensate for this imbalance, we multiply a term into the A^F matrix. Here f is the frequency in hertz and n is the MIDI number. The first derivative of f with respect to n is the inverse of the density of the basis vectors at a certain frequency, so by dividing by this derivative, that is, multiplying by the density of the basis vectors, we can make the values at the lower frequencies a bit larger.
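As a sketch of this multiplicative compensation (my reading of the talk; the exact form in the paper may differ): since f = 440 · 2^((n − 69)/12), we have df/dn = f · ln(2)/12, which is small at low pitches, so dividing each row of A^F by it boosts exactly the low-frequency rows that come out too small.

```python
import numpy as np

def midi_density_compensation(A, midi_low=39):
    """Scale each row of the activation matrix A (one row per MIDI
    number n) by the density of basis vectors per hertz, dn/df,
    i.e. divide by df/dn = f * ln(2) / 12 where f = 440 * 2**((n-69)/12).
    A hypothetical sketch of the compensation described in the talk."""
    n = midi_low + np.arange(A.shape[0])
    f = 440.0 * 2.0 ** ((n - 69) / 12.0)
    dfdn = f * np.log(2.0) / 12.0
    return A / dfdn[:, None]
```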

The second cause we identified is that the columns of the P^F matrix are not normalized. As you saw, for a lower MIDI number like 55, there are more harmonics. Since the amplitudes of these harmonics are similar, and there are more harmonics in a low-frequency basis vector, its total energy is also higher. This again contributes to the imbalance in A^F.

To compensate for this, we multiply each unit of the A^F matrix by the total energy of the basis vector at the corresponding frequency. These are the two compensations that we came up with.
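This second compensation could be sketched as follows; the choice of squared sum as the "total energy" of a basis vector is an assumption.

```python
import numpy as np

def norm_compensation(A, P):
    """Compensation for unnormalized codebook columns: each activation
    A[p, t] is multiplied by the total energy of the corresponding
    basis vector P[:, p]. Low-pitch columns contain more harmonics and
    hence more energy, so their (too small) activations get boosted.
    The energy measure (sum of squares) is assumed for illustration."""
    energy = (P ** 2).sum(axis=0)
    return A * energy[:, None]
```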

In Durrieu's original paper, he also came up with a compensation, which is not multiplicative like ours but additive. Basically, what it means is that for each unit in the A^F matrix, half of the value at the unit one octave higher is added to the original unit.
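The additive compensation described above (attributed in the talk to the earlier NMF work) could be sketched as:

```python
import numpy as np

def additive_octave_compensation(A, rows_per_octave=12):
    """For each unit of the activation matrix A (rows ordered from low
    to high pitch), add half of the value at the unit one octave
    higher. The row resolution of one per semitone is an assumption."""
    out = A.copy()
    out[:-rows_per_octave] += 0.5 * A[rows_per_octave:]
    return out
```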

But the effect of these compensations is not so good. As you can see, the leftmost figure is the original A^F matrix, the middle one is the A^F matrix compensated with Durrieu's additive compensation, and the rightmost one is with our multiplicative compensation. After applying these compensations, the values at the lower frequencies of the A^F matrix do get larger, but if you look at the pitch contours extracted with dynamic programming, you see that they still lie above the true pitch contour, which is just the result of the remaining imbalance. So our conclusion here is that even if you compensate the A^F matrix, you cannot totally eliminate the imbalance, and that can have a bad effect on the pitch contour that you extract with dynamic programming.

Therefore, we propose our own HMM-based melody extraction. The feature we use is the salience integrated within each semitone; it has 36 dimensions, covering the MIDI numbers from 39 to 74.

The salience function is a weighted sum of the spectrum of the given audio signal. In the figure here, the red parts show large values and the blue parts show small values, and you can actually see the melody on this salience function map. For the signal, we calculate this salience function at a step of 0.1 MIDI numbers.

That gives us a feature with more than 300 dimensions, which is too much for the HMM. Therefore we integrate it into salience features at the 36 semitones.
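The dimension reduction just described can be sketched as a per-semitone summation of the fine salience values (names and the summation choice are illustrative assumptions):

```python
import numpy as np

def semitone_features(salience, midi_low=39, midi_high=74, step=0.1):
    """Reduce a fine-grained salience function (one value per 0.1 MIDI
    step, starting at midi_low) to a 36-dimensional feature: the
    salience is summed (integrated) within each one-semitone segment."""
    n_bins = midi_high - midi_low + 1            # 36 semitones
    per_bin = int(round(1.0 / step))             # 10 fine steps per semitone
    fine = salience[: n_bins * per_bin]
    return fine.reshape(n_bins, per_bin).sum(axis=1)
```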

We also use these semitones as the states of the HMM. The states are fully connected, and the observation probability for each state is modeled with an 8-component GMM. The parameters of this HMM are trained on the MIR-1K database, which is annotated at the frame level with the fundamental frequency.

If you do Viterbi decoding on a piece of audio with this HMM, you get a pitch contour at a granularity of whole semitones. In order to get a fine pitch track at a granularity of 0.1 semitones, we then take the maximum value of the salience function map within a ±0.5-semitone range around the decoded pitch.
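The refinement step just described could be sketched as follows (a hypothetical helper; the paper's exact boundary handling may differ):

```python
import numpy as np

def refine_pitch(semitone_midi, fine_salience, midi_low=39, step=0.1):
    """Refine a semitone-level pitch (from Viterbi decoding of the HMM)
    to 0.1-MIDI resolution: take the argmax of the fine salience
    function (one value per 0.1 MIDI step, starting at midi_low)
    within +/- 0.5 semitones of the decoded pitch."""
    center = int(round((semitone_midi - midi_low) / step))
    half = int(round(0.5 / step))
    lo = max(center - half, 0)
    hi = min(center + half + 1, len(fine_salience))
    return midi_low + (lo + np.argmax(fine_salience[lo:hi])) * step
```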

Now I'll show you how our HMM-based melody tracking contrasts with the NMF-based pitch tracking, and also the effect of NMF-based soft masking in contrast with hard masking.

The evaluation corpora we use are the MIR-1K database and also some other available clips. The items of evaluation include the two separate modules and also the overall performance.

First, for the melody extraction, we compare our system with that of Hsu, which is also based on an HMM and salience features. But their salience features are defined differently from ours, and they use two streams of features while we use only one stream. The performance of the two systems is comparable.

The key result of ours is here: this is the comparison of the pitch tracking of our proposed HMM-based method and Durrieu's NMF-based method. If you look at the accuracy, ours is much higher than the raw NMF, and also higher than the compensated NMF. These plots show the distribution of errors. You can see that for our HMM-based method there aren't very many errors, and they are mostly either one octave higher, at +12 semitones, or one octave lower, at -12 semitones. For the NMF, you can see that the errors are distributed across a large range. This is due to the imbalance in the A^F matrix: if you use dynamic programming, it will always pick something above the true pitch contour, and even if you do the compensation, the imbalance is not completely cleared.

Also worth mentioning is that because the HMM-based pitch tracking method is trained offline and the online part does not involve any iterations, it runs six to seven times faster than the iterative NMF.

For the time-frequency masking, we compare our system with the hard masking system of Hsu, and we evaluate them at three mixing SNRs: -5, 0, and +5 dB.

First, look at the blue squares, where we use the annotated pitch track, so we isolate the time-frequency masking part. You can see that at all the SNRs, our system performs better.

Also worth mentioning is that our performance with the annotated pitch track, which uses soft masking, gets close to or even exceeds that of hard masking with the ideal mask, which is kind of an upper bound for hard masking.

Now, for the overall evaluation, we use the extracted pitch tracks, and we see that our system also performs better than the hard masking system. And here we compare the overall system of ours with Durrieu's, which is completely based on NMF.

I'd like to play you some audio. [audio plays] This is the mixture, and this is the separation result using Durrieu's NMF-based method. [audio plays] You can see that for the last notes, the extracted pitch is twice the true pitch, so you can hear that some of the voice stays in the accompaniment. [audio plays] With our system, the pitch extracted for the last notes is correct, so the accompaniment here is cleaner than with Durrieu's system.

If you look at the full set of results, for some of them our system performs better and for some of them it's worse; the difference is mainly determined by the performance of the melody extraction.

Okay, so for the conclusion: NMF-based melody extraction suffers from an imbalance in the A^F matrix, and an HMM performs better at this task and also runs faster. For the time-frequency masking, NMF-based soft masking is much better than hard masking. So we propose the combination of HMM-based melody extraction and NMF-based soft masking. Thank you. Any questions?

We have time for questions. Yes, please.

Thank you for your talk. I have two questions, actually. The first: your method is supervised, or at least involves some learning, while you compared it to Durrieu's method, which is completely unsupervised. So my question is, in which way is the learning you do generic enough that it can be applied to completely different signals? And my second question: do you have sound samples where your method performs slightly worse than Durrieu's method? If you could play some of them, that would be nice.

Oh, okay. The separated audio for all the clips will be available on a demo web page; the URL is included in our paper. And for the first question: we used this supervised method because we found that the imbalance hurts the results very much. Actually, Durrieu's compensation is an ad hoc, rule-based compensation, like the one shown here, so his method is not completely unsupervised either: he also looked at what the imbalance looks like and designed this rule to compensate for it. Our HMM training is like learning what the imbalance looks like with an automatic learning method.

Okay, let's go to the next talk. Okay, thank you.