Session chair: OK, I'm the chair of this session, so let's start with the first presentation: "Combining HMM-based Melody Extraction and NMF-based Soft Masking for Separating Voice and Accompaniment from Monaural Audio", by Yun Wang.

Presenter: Good morning everyone. I am presenting my paper, "Combining HMM-based Melody Extraction and NMF-based Soft Masking for Separating Voice and Accompaniment from Monaural Audio".

First, here is a block diagram of most separation systems for voice and accompaniment. Such a system is made up of two main modules: melody extraction, which outputs a pitch contour from the audio signal, and time-frequency masking, which works on the spectrogram to give estimates of the spectrograms of the voice and the accompaniment. Different systems differ in the techniques they use for these individual modules: for melody extraction, two popular methods are hidden Markov models (HMMs) and non-negative matrix factorization (NMF), and for time-frequency masking there are hard masking and soft masking.

Our work is largely based on the work of Durrieu, which is built on non-negative matrix factorization, but we found that NMF does not work very well for melody extraction, so we were also inspired by earlier work that does melody extraction with hidden Markov models.

I'll first give a brief review of NMF-based melody extraction and time-frequency masking. In non-negative matrix factorization, the observed spectrogram of the given audio signal is regarded as a stochastic process where each element is a complex number obeying a Gaussian distribution with a variance parameter, and if you put
all these variance parameters together, you get the power spectrum. The problem of non-negative matrix factorization is to estimate this power spectrum so as to maximize the likelihood of the observed spectrogram X. The power spectrum of the total signal can be decomposed into two parts: the spectrogram of the voice and the spectrogram of the music, that is, the accompaniment. Furthermore, the spectrogram of the voice can be decomposed into the product of the spectrograms of the glottal excitation and the vocal tract. In this decomposition, each matrix P can be regarded as a codebook, and each matrix A can be regarded as the linear combination coefficients of these basis vectors.

Let me show you how this works, taking the glottal excitation matrix P^F as an example. The P^F matrix looks like this: each column is the spectrum of the glottal excitation at a certain fundamental frequency, and fundamental frequencies are expressed as MIDI numbers, which is a logarithmic scale of frequency. Here you can see two columns of the P^F matrix, one for MIDI number 55 and the other for MIDI number 70. For MIDI number 55, the fundamental frequency is lower, so the harmonics are placed closer together; for 70, they are placed further apart. The A^F matrix holds the combination coefficients for these basis vectors: for example, in this frame there is a large coefficient for the basis vector at MIDI number 60 and a smaller coefficient at a higher MIDI number. If you plot the whole A^F matrix, you can actually see the pitch contour on it, which is the dark line there. The fainter line above it is the second harmonic, and the small fragments are probably the accompaniment.
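The codebook structure just described can be sketched in a few lines. This is only an illustration: the sampling rate, FFT size, and Gaussian-blurred harmonic peaks below are my assumptions, not values from the paper.

```python
import numpy as np

def midi_to_hz(m):
    """Convert a MIDI note number to frequency in Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

def excitation_codebook(midi_lo=39, midi_hi=74, n_fft=2048, sr=16000):
    """Build a P^F-style codebook: each column is a harmonic comb
    (glottal excitation power spectrum) at one candidate fundamental
    frequency, indexed by MIDI number."""
    n_bins = n_fft // 2 + 1
    freqs = np.arange(n_bins) * sr / n_fft            # bin centre frequencies
    midis = np.arange(midi_lo, midi_hi + 1)
    P = np.zeros((n_bins, len(midis)))
    for j, m in enumerate(midis):
        f0 = midi_to_hz(m)
        for h in range(1, int(sr / 2 / f0) + 1):      # harmonics below Nyquist
            # Gaussian-blurred peak at each harmonic (illustrative shape)
            P[:, j] += np.exp(-0.5 * ((freqs - h * f0) / 10.0) ** 2)
    return P

P = excitation_codebook()
# Lower MIDI numbers have a lower f0, so their columns contain many
# densely packed harmonics; high-MIDI columns contain only a few.
```

Note how the low-MIDI columns end up with far more harmonic peaks, and hence more total energy, than the high-MIDI columns; this becomes relevant later in the talk.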
The procedure for melody extraction and soft masking using NMF is as follows. First, we fix the P^F matrix as shown on the previous slide. Then we solve for the other four matrices using an iterative procedure; we are especially interested in A^F. Next, we find the strongest continuous pitch track on this A^F matrix using dynamic programming, and clear the entries that lie far from this track. With this new A^F we solve for the other matrices again, which gives more accurate estimates. After solving for all the matrices in the decomposition of the power spectrum, we can use Wiener filtering to estimate the complex spectrograms of the voice and accompaniment, and then convert these back to the time domain with the overlap-add method; this gives us estimates of the voice and accompaniment signals respectively.

Now, here is the most important part of my lecture: we find that non-negative matrix factorization does not work well enough for melody extraction. The A^F matrix shown on the previous slide was an idealized one; the actual A^F we get looks like this. You can see that there is a great imbalance across frequencies: for high frequencies the values are large, and for low frequencies they are small.

We have identified two causes for this imbalance. The first is the nonlinearity of the MIDI number scale. The MIDI number scale is a logarithmic scale of frequency, so for the same amount of energy in the low-frequency range there are more basis vectors to divide it among, and the coefficients of the individual basis vectors become smaller than at higher frequencies. This is one reason why the A^F matrix has smaller values in the lower frequency range. To compensate, we multiply a correction term into the A^F matrix, expressed in terms of the frequency f in hertz and the MIDI number m.
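Going back to the extraction procedure above, the "strongest continuous pitch track" search can be sketched as a simple dynamic program over the activation matrix. The continuity constraint `max_jump` is an illustrative assumption of mine, not a value from the paper.

```python
import numpy as np

def strongest_track(A, max_jump=2):
    """Dynamic programming over the activation matrix A (pitch candidates
    x frames): find the continuous track maximising total activation,
    allowing the pitch index to move at most `max_jump` rows per frame."""
    n_pitch, n_frames = A.shape
    score = A[:, 0].copy()
    back = np.zeros((n_pitch, n_frames), dtype=int)
    for t in range(1, n_frames):
        new = np.empty(n_pitch)
        for p in range(n_pitch):
            lo, hi = max(0, p - max_jump), min(n_pitch, p + max_jump + 1)
            prev = lo + int(np.argmax(score[lo:hi]))  # best reachable predecessor
            new[p] = score[prev] + A[p, t]
            back[p, t] = prev
        score = new
    # Backtrack from the best final state to recover the whole track.
    track = np.empty(n_frames, dtype=int)
    track[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        track[t - 1] = back[track[t], t]
    return track
```

This is also the step that goes wrong later in the talk: because DP simply maximises total activation, any systematic imbalance in A^F pulls the track away from the true pitch.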
The first derivative of f with respect to m reflects the spacing of the basis vectors at a certain frequency, and by multiplying the A^F matrix by the density of the basis vectors we can make the values at the lower frequencies somewhat larger.

The second cause we identified is that the columns of the P^F matrix are not normalized. As you can see, for a lower MIDI number like 55 there are more harmonics, and since the amplitudes of the higher harmonics are similar, a basis vector with more harmonics, i.e. a low-frequency one, also has higher total energy. This again contributes to the imbalance in A^F. To compensate, we multiply each unit of the A^F matrix by the total energy of the basis vector at the corresponding frequency. This is the compensation that we came up with.

In Durrieu's original paper he also proposed a compensation, which is not multiplicative like ours but additive: for each unit in the A^F matrix, half of the value at the unit one octave higher is added to the original unit. But the effect of these compensations is not so good. In this figure, the leftmost panel is the original A^F matrix, the middle one is compensated with Durrieu's additive scheme, and the rightmost with our multiplicative scheme. After applying the compensations, the values at the lower frequencies of the A^F matrix do get larger, but if you look at the pitch contours extracted with dynamic programming, you see that they still lie above the true pitch contour, which is just the result of this imbalance.
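The two multiplicative corrections might be sketched as below. Both the direction of the density factor (dm/df, boosting low frequencies) and the use of the column sum of P^F as its "total energy" are my reading of the talk, so treat this as an assumption-laden sketch rather than the paper's exact formula.

```python
import numpy as np

def compensate(A_f, P_f, midi_lo=39):
    """Sketch of the two multiplicative corrections to the A^F matrix.
    Row i of A_f (and column i of P_f) corresponds to MIDI number
    midi_lo + i.  f = 440 * 2**((m-69)/12) is the standard MIDI-to-Hz map."""
    midis = midi_lo + np.arange(A_f.shape[0])
    f = 440.0 * 2.0 ** ((midis - 69) / 12.0)
    density = 12.0 / (f * np.log(2.0))     # dm/df: basis vectors per Hz, larger at low f
    col_energy = P_f.sum(axis=0)           # "total energy" of each codebook column
    return A_f * (density * col_energy)[:, None]
```

Both factors are larger at low frequencies, so the product raises exactly the region of A^F that the raw factorization underweights.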
Our conclusion here is that even if you compensate the A^F matrix, you cannot totally eliminate the imbalance, and it can have a bad effect on the pitch contour that you extract with dynamic programming.

Therefore we propose our own HMM-based melody extraction. The feature we use is called "energy at semitones of interest" (ESI), which is an integral of a salience function within each semitone. We use 36 semitones, namely MIDI numbers 39 to 74. The salience function is a weighted sum of the spectrum of the given audio signal; in the map shown here, the red parts are large values and the blue parts small values, and you can actually see the melody on this salience map of the signal. We calculate the salience function at a step of 0.1 MIDI numbers, which gives a feature of more than 300 dimensions, too many for a GMM, so we integrate it into the ESI features at the 36 semitones. These semitones also serve as the states of the HMM; the states are fully connected, and the observation probability of each state is modeled with an 8-component GMM. The parameters of this HMM are trained on the MIR-1K database, which is annotated at the frame level with the fundamental frequency. Viterbi decoding with this HMM yields a coarse pitch contour quantized to semitones; to obtain a fine pitch track with a granularity of 0.1 semitones, we then take the maximum of the salience map within a ±0.5-semitone range around the coarse pitch.

Next I will show you how our HMM-based pitch tracking contrasts with the NMF-based pitch tracking, and also the effect of NMF-based soft masking in contrast with hard masking. The evaluation corpora we use are the MIR-1K database and some other clips.
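The feature computation and the coarse-to-fine refinement described above can be sketched as follows, assuming the salience of one frame is given as a vector sampled every 0.1 MIDI numbers from 39 to 74 (351 points, matching the "more than 300 dimensions" mentioned in the talk):

```python
import numpy as np

MIDI_LO, MIDI_HI, STEP = 39, 74, 0.1
# Fine pitch grid: 351 points at 0.1-MIDI resolution.
grid = np.round(np.arange(MIDI_LO, MIDI_HI + STEP / 2, STEP), 1)

def semitone_features(salience):
    """Integrate a fine-grained salience vector into 36 'energy at
    semitones of interest' features: each feature sums the salience
    within +/-0.5 semitone of one MIDI note."""
    feats = np.empty(MIDI_HI - MIDI_LO + 1)
    for i, m in enumerate(range(MIDI_LO, MIDI_HI + 1)):
        feats[i] = salience[np.abs(grid - m) <= 0.5].sum()
    return feats

def refine_pitch(salience, semitone):
    """After Viterbi decoding picks a semitone state, refine the pitch to
    0.1-MIDI resolution by taking the salience maximum within +/-0.5
    semitone of the decoded note."""
    in_band = np.abs(grid - semitone) <= 0.5
    return grid[in_band][np.argmax(salience[in_band])]
```

For example, a salience peak at MIDI 52.3 makes the HMM's best state semitone 52, and `refine_pitch` then recovers 52.3 from the fine grid.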
We evaluate both the separate modules and the overall performance.

First, for melody extraction, we compare our system with an existing system that is also based on an HMM and salience features, but whose features are defined differently from ours and use two streams where we use only one. The performance of the two systems is comparable.

The key result is here: a comparison of the pitch tracking of our proposed HMM-based method and Durrieu's NMF-based method. Looking at the accuracy, ours is much higher than the raw NMF and also higher than the compensated NMF. These plots show the distribution of the errors: for our HMM-based method there are not very many errors, and they are mostly one octave high at +12 semitones or one octave low at -12 semitones, while for the NMF the errors are distributed across a large range. This is due to the imbalance in the A^F matrix: dynamic programming will always pick something above the true pitch contour, and even with compensation these errors are not completely cleared. Also worth mentioning is that, because our HMM-based pitch tracker is trained offline and the online part involves no iterations, it runs six to seven times faster than the iterative NMF.

Second, for the time-frequency masking, we compare our system with Hsu's hard masking system, evaluating both at three mixing SNRs: -5, 0 and 5 dB. First look at the blue squares, where we use the annotated pitch tracks so as to isolate the time-frequency masking part.
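The contrast between the two masking styles being compared here fits in two short functions; `Sv` and `Sa` are names I've chosen for the estimated voice and accompaniment power spectrograms.

```python
import numpy as np

def soft_mask(Sv, Sa, eps=1e-12):
    """Wiener-style soft mask: each time-frequency bin receives the
    fraction of power attributed to the voice, a value in [0, 1]."""
    return Sv / (Sv + Sa + eps)

def hard_mask(Sv, Sa):
    """Binary mask: each bin is assigned entirely to the voice if and
    only if the voice estimate dominates that bin."""
    return (Sv > Sa).astype(float)

# A bin where voice and accompaniment overlap: the soft mask keeps a
# proportional share, while the hard mask makes an all-or-nothing call.
Sv = np.array([[3.0, 0.5]])
Sa = np.array([[1.0, 1.5]])
```

Multiplying either mask element-wise with the mixture STFT and resynthesising by overlap-add gives the separated voice; the soft mask avoids the musical-noise artifacts of all-or-nothing bin assignment.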
You can see that at all SNRs our system performs better. Worth mentioning is that our performance with the annotated pitch tracks, which uses soft masking, gets close to or even exceeds hard masking with ideal masks, which is a kind of upper bound for hard masking. For the overall evaluation we use the extracted pitch tracks, and our system again performs better than the hard masking system.

Now we compare our whole system with Durrieu's, which is completely NMF-based. Let me play you an example. [plays audio] This is the mixture, and this is the separation result of Durrieu's NMF-based method; you can see that for the last note the extracted pitch is twice the true pitch, so you hear some of the voice leaking into the accompaniment. [plays audio] With our system, the pitch for the last note is correct, and the accompaniment is cleaner than with Durrieu's system. Across all the test clips, our system performs better on some and worse on others, and the difference is mainly determined by the performance of the melody extraction.

So, to conclude: NMF-based melody extraction suffers from the imbalance in the A^F matrix, and for melody extraction HMMs are more accurate and also run faster; for time-frequency masking, NMF-based soft masking is much better than hard masking. We therefore propose the combination of HMM-based melody extraction and NMF-based soft masking. Thank you. Any questions?

Questioner: Thank you for your talk. I have one question, well, two questions actually. The first: your method is supervised, whereas you compared it to Durrieu's method, which is
completely unsupervised. So my question is, in what way is the learning you do generic, so that it can be applied to completely different signals? And my second question: do you have sound samples where your method performs slightly worse than Durrieu's? If you could play some of them, that would be nice.

Presenter: All the separated audio will be available on a demo web page, whose URL is included in our paper. As for the first question, we use this supervised method because we found that the imbalance hurts the results very much. Actually, Durrieu's compensation is an ad hoc, rule-based compensation, so his method is not completely unsupervised either: he also looked at what the imbalance looks like and designed a rule to compensate for it. Our HMM training learns what the melody looks like by an automatic learning method.

Session chair: OK, thank you.