0:00:13 | (inaudible)

0:00:18 | oh |

0:00:19 | please |

0:00:20 | (inaudible)

0:00:26 | okay |

0:00:31 | I'm chairing this session (inaudible).

0:00:35 | Okay.

0:00:36 | So let's start.

0:00:37 | So, the first presentation:

0:00:40 | (inaudible)

0:00:44 | "Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio,"

0:01:00 | by Yun Wang and (inaudible).

0:01:04 | Okay, please.

0:01:07 | Okay, good morning everyone.

0:01:09 | I'm presenting my paper,

0:01:11 | "Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio."

0:01:22 | So here you first see a block diagram of most separation systems for voice and accompaniment.

0:01:29 | It is made up of two main modules: the melody extraction, which outputs a pitch contour from the audio signal,

0:01:37 | and then the time-frequency masking, which works on the spectrogram to give an estimation of the spectrograms of the voice and the accompaniment.

0:01:45 | Different systems differ in the techniques they use for these individual modules.

0:01:51 | For melody extraction, two popular methods are hidden Markov models and non-negative matrix factorization;

0:01:59 | and for the time-frequency masking, there is hard masking and soft masking.

0:02:04 | Our work is largely based on the work of Durrieu, which is entirely based on non-negative matrix factorization.

0:02:12 | But we find that for melody extraction the NMF doesn't work very well,

0:02:20 | so we are also inspired by the work of Hsu, which does the melody extraction with hidden Markov models,

0:02:28 | to form our work.

0:02:30 | So first, I'll give a brief review of the NMF-based melody extraction and also the time-frequency masking.

0:02:39 | In the non-negative matrix factorization, the observed spectrogram of the given audio signal is regarded as a stochastic process,

0:02:50 | where each element is a complex number obeying a Gaussian distribution with a variance parameter, and if you put all the variances together you get the power spectrogram.

0:03:03 | The problem of the non-negative matrix factorization is to estimate this power spectrogram so as to maximize the likelihood of the observed spectrogram X.
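The probabilistic model just described, with each spectrogram element a zero-mean complex Gaussian whose variance is the corresponding power-spectrogram entry, can be sketched as follows. This is a minimal illustration; the function names are mine, and the equivalence to the Itakura-Saito divergence is a standard fact, not something stated in the talk:

```python
import numpy as np

def neg_log_likelihood(X, S):
    """Negative log-likelihood of a complex spectrogram X under the model
    x_ft ~ complex Gaussian, zero mean, variance s_ft.  Up to a constant
    that depends only on X, this equals the Itakura-Saito divergence
    between the observed power |X|^2 and the model power S."""
    P = np.abs(X) ** 2
    return np.sum(np.log(np.pi * S) + P / S)

def itakura_saito(P, S):
    """IS divergence D_IS(P || S); minimized by the same S = P."""
    return np.sum(P / S - np.log(P / S) - 1.0)
```

Both quantities are minimized when the model power S equals the observed power, which is why IS-NMF is the natural factorization for this Gaussian model.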

0:03:17 | The power spectrogram of the total signal can be decomposed into two parts: the spectrogram of the voice and the spectrogram of the music,

0:03:28 | that is, the accompaniment.

0:03:30 | Furthermore, the spectrogram of the voice can be decomposed into the product of the spectrograms of the glottal excitation and the vocal tract.

0:03:41 | Inside each of these parentheses, the matrix P can be regarded as a codebook, and the matrix A can be regarded as the linear combination coefficients of these basis vectors.

0:03:54 | Let me show you how this works.
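The decomposition just described can be sketched as below: the voice is the element-wise product of an excitation part and a vocal-tract (filter) part, and an accompaniment part is added on top. All names and shapes here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Hypothetical shapes, for illustration:
# P_F: (n_freq, n_f0)    codebook of glottal-excitation spectra
# A_F: (n_f0, n_frames)  activation of each fundamental frequency per frame
# P_K: (n_freq, n_filt)  codebook of vocal-tract (filter) shapes
# A_K: (n_filt, n_frames)
# P_M, A_M:              accompaniment codebook and activations

def model_power_spectrogram(P_F, A_F, P_K, A_K, P_M, A_M):
    """Power spectrogram of a source-filter NMF model: the voice is the
    element-wise product of the excitation part and the filter part, and
    the total is the voice plus the accompaniment."""
    S_voice = (P_F @ A_F) * (P_K @ A_K)
    S_accomp = P_M @ A_M
    return S_voice + S_accomp, S_voice, S_accomp
```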

0:03:57 | Let's take the glottal excitation matrix P^F for example.

0:04:02 | The P^F matrix looks like this:

0:04:04 | each column is the spectrum of the glottal excitation at a certain fundamental frequency.

0:04:12 | The fundamental frequencies are expressed as MIDI numbers, which is a log scale of the frequency.
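The MIDI number scale mentioned here is the standard log-frequency scale, with 12 numbers per octave and the number 69 assigned to A4 = 440 Hz; a small sketch of the conversion:

```python
import numpy as np

def hz_to_midi(f):
    """MIDI note number for frequency f in Hz: 69 at A4 = 440 Hz,
    12 numbers per octave (a logarithmic scale of frequency)."""
    return 69.0 + 12.0 * np.log2(np.asarray(f) / 440.0)

def midi_to_hz(n):
    """Inverse mapping: frequency in Hz for MIDI number n."""
    return 440.0 * 2.0 ** ((np.asarray(n) - 69.0) / 12.0)
```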

0:04:19 | Here you can see two of the columns of the P^F matrix: one is for the MIDI number 55, and the other for 70.

0:04:26 | You can see that 55 has a lower fundamental frequency, so its harmonics are placed closer together,

0:04:33 | and for 70 they are placed further apart.

0:04:38 | And then there is the A^F matrix, which holds the combination coefficients for these basis vectors.

0:04:45 | For example, if we look at this column,

0:04:52 | we see what is activated:

0:04:55 | there is a large coefficient for the basis vector at MIDI number 60, and a smaller one for the basis vector one octave higher.

0:05:06 | So if you plot the whole A^F matrix, you can actually see the pitch contour on it, which is the dark line there.

0:05:16 | The line above it is likely the second harmonic, and the smaller lines may be the accompaniment.

0:05:27 | So the procedure for melody extraction and soft masking using NMF is as follows.

0:05:33 | First, we fix the P^F matrix as shown in the previous slide,

0:05:38 | and then we solve, using an iterative procedure, for the other matrices, and we are especially interested in A^F.

0:05:49 | Next, we find the strongest continuous pitch track on this A^F matrix using dynamic programming,

0:05:57 | and then we clear the other values that are far from the continuous pitch track.

0:06:05 | With this new A^F we solve for the other matrices again, which gives a more accurate estimate.

0:06:12 | After solving for all the matrices in the decomposition of the power spectrogram, we can then use Wiener filtering to estimate the complex spectrograms of the voice and the accompaniment,

0:06:23 | and then we convert these into the time domain with the overlap-add method.

0:06:29 | Then we get estimates of the voice and the accompaniment, respectively.
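The Wiener-filtering (soft masking) step of this procedure can be sketched as follows; the function name and the `eps` guard are illustrative, and the final ISTFT/overlap-add step is omitted:

```python
import numpy as np

def wiener_separate(X, S_voice, S_accomp, eps=1e-12):
    """Soft (Wiener) masking: each source's estimated power spectrogram,
    divided by the total, gives a mask in [0, 1] that is applied to the
    complex mixture spectrogram X.  ISTFT with overlap-add would then
    give the two waveforms."""
    total = S_voice + S_accomp + eps
    voice = (S_voice / total) * X
    accomp = (S_accomp / total) * X
    return voice, accomp
```

By construction the two masked spectrograms sum back to the mixture, which is the key property of soft masking compared with a binary (hard) mask.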

0:06:34 | So here is the most important part of my lecture:

0:06:38 | we find that the non-negative matrix factorization doesn't work well enough for the melody extraction.

0:06:47 | The A^F matrix I showed in the previous slides was just like an ideal one; the actual A^F I get looks like this.

0:06:55 | We can see that there is a great imbalance between different frequencies:

0:07:00 | for high frequencies the A^F values are large, and for the lower ones they are small.

0:07:09 | We have identified two causes for this imbalance. The first is the nonlinearity of the MIDI number scale we are using.

0:07:17 | The MIDI number scale is a logarithmic scale of the frequency,

0:07:38 | so for the same amount of energy in the low frequency range,

0:07:44 | we have more basis vectors to divide it among, and the coefficients for individual basis vectors get smaller than in the higher frequency range.

0:07:53 | This is one of the reasons why the A^F matrix has smaller values in the lower frequency range.

0:08:02 | To compensate for this imbalance, we multiply a term into the A^F matrix.

0:08:11 | Here f is the frequency in hertz, and n is the MIDI number.

0:08:16 | The first derivative of f with respect to n is like the inverse density of the basis vectors at a certain frequency,

0:08:27 | and by dividing by this, or in other words multiplying by the density of the basis vectors,

0:08:33 | we can make the values at the lower frequencies a bit larger.
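A sketch of this first multiplicative compensation, assuming the rows of A^F are indexed by MIDI numbers; the exact form in the paper may differ:

```python
import numpy as np

def density_compensation(A_F, midi_numbers):
    """Compensate for the nonlinearity of the MIDI scale.  With
    f(n) = 440 * 2**((n - 69) / 12), the derivative df/dn = f * ln(2) / 12
    is small at low frequencies, where many basis vectors share the same
    energy; dividing each row of A_F by df/dn boosts those rows."""
    f = 440.0 * 2.0 ** ((np.asarray(midi_numbers, dtype=float) - 69.0) / 12.0)
    dfdn = f * np.log(2.0) / 12.0
    return A_F / dfdn[:, None]
```

Since f doubles per octave, a row one octave lower gets boosted by exactly a factor of two relative to the row above it.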

0:08:43 | Now, the second cause we identified is that the columns of the P^F matrix are not normalized.

0:08:50 | As you can see, for a lower MIDI number like 55, there are more harmonics,

0:08:57 | and since the amplitudes of these high harmonics are similar,

0:09:02 | because there are more harmonics in the low-frequency basis vector, its total energy is also higher.

0:09:07 | Therefore this again contributes to the imbalance in A^F.

0:09:13 | To compensate for this, for each unit in the A^F matrix,

0:09:19 | we multiply by the total energy of the basis vector at the corresponding frequency.

0:09:27 | This is the total compensation that we came up with.
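A sketch of this second compensation as described, with each activation multiplied by the total energy of its basis vector; again the exact normalization in the paper may differ:

```python
import numpy as np

def energy_compensation(A_F, P_F):
    """Columns of P_F are not normalized: low-f0 columns contain more
    harmonics and hence more total energy, so their activations come out
    smaller.  Multiply each row of A_F by the total energy of the
    corresponding basis vector to undo this."""
    col_energy = np.sum(P_F, axis=0)   # total energy of each basis vector
    return A_F * col_energy[:, None]
```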

0:09:33 | In Durrieu's original paper, he also came up with a compensation, which is not multiplicative like ours but additive.

0:09:43 | Basically, what this means is that for each unit in the A^F matrix,

0:09:49 | half of the value at the unit one octave higher is added to the original unit.
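The additive compensation attributed to Durrieu, as summarized here, could be sketched like this; it assumes rows spaced one MIDI number apart and is a hypothetical rendering of the rule, not Durrieu's code:

```python
import numpy as np

def additive_octave_compensation(A_F, steps_per_octave=12):
    """Additive compensation: half of the value at the unit one octave
    higher is added to each unit.  Rows that have no unit one octave
    above them are left unchanged."""
    out = A_F.copy()
    n = A_F.shape[0]
    out[: n - steps_per_octave] += 0.5 * A_F[steps_per_octave:]
    return out
```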

0:09:57 | But the effect of these compensations is not so good.

0:10:01 | As you can see, the leftmost figure is the original A^F matrix,

0:10:06 | the middle one is the A^F matrix compensated using Durrieu's additive compensation,

0:10:11 | and the rightmost one uses our multiplicative compensation.

0:10:16 | You can see that after applying these compensations, the values at the lower frequencies of the A^F matrix do get larger,

0:10:27 | but if you look at the pitch contours extracted with dynamic programming,

0:10:33 | you see that they lie above the true pitch contour,

0:10:37 | which is just the result of the imbalance that remains.

0:10:41 | So our conclusion here is:

0:10:43 | even if you compensate the A^F matrix, you cannot totally eliminate the imbalance, and that can have a bad effect on the pitch contour that you extract with dynamic programming.

0:10:57 | Therefore, we propose our own HMM-based melody extraction.

0:11:02 | The feature we use is called the "energies at semitones of interest" (ESI),

0:11:06 | which is the integral of the salience function within each semitone, and we use

0:11:12 | thirty-six dimensions,

0:11:16 | that is, the MIDI numbers from 39 to 74.

0:11:19 | The salience function is a weighted sum of the spectrum of the given audio signal.

0:11:28 | An example is shown here:

0:11:31 | the red parts show the large values and the blue parts the small values, and you can actually see the melody

0:11:42 | on this salience function map.

0:11:47 | For the signal, we calculate this salience function at a step of 0.1 MIDI numbers.

0:11:54 | That gives us a feature of more than three hundred dimensions,

0:11:59 | which is too much for the HMM;

0:12:02 | therefore we integrate it into the ESI features at the thirty-six semitones.
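The salience function is only loosely specified in the talk; a generic harmonic-sum salience of the kind described might look like this, where the harmonic count and decay weights are assumptions for illustration:

```python
import numpy as np

def salience(power_spectrum, freqs, f0, n_harmonics=10, decay=0.8):
    """Harmonic-sum salience at candidate pitch f0: sum the power at the
    harmonics h*f0, weighted by decay**(h-1), reading off the nearest
    spectrum bin.  Harmonics above the analyzed band are skipped."""
    s = 0.0
    for h in range(1, n_harmonics + 1):
        fh = h * f0
        if fh > freqs[-1]:
            break
        k = int(np.argmin(np.abs(freqs - fh)))   # nearest frequency bin
        s += decay ** (h - 1) * power_spectrum[k]
    return s
```

Evaluated on a 0.1-MIDI-number grid over the range of interest, this yields a salience map like the one on the slide, with high values along the melody and its harmonics.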

0:12:07 | and |

0:12:09 | we also use these sent ones at the states of the hmm they are fully connected is and the all |

0:12:14 | core probability you for each hmm is |

0:12:16 | models with a |

0:12:17 | eight component gmm |

0:12:20 | the parameters of this M is trained from the M my are one K database base it a his annotated |

0:12:25 | with the at frame level with the |

0:12:28 | a |

0:12:29 | a fundamental frequency |

0:12:30 | If you do Viterbi decoding on a piece of audio with this HMM, it will yield a

0:12:38 | pitch contour with a granularity of one semitone.

0:12:43 | In order to get a fine pitch track, with a resolution of 0.1 semitones,

0:12:49 | we then take the maximum value of the salience function map

0:12:54 | in a ±0.5-semitone range around the coarse pitch.
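Viterbi decoding over the semitone states can be sketched generically as follows; the observation scores would come from the per-state GMMs described above, which are not reproduced here:

```python
import numpy as np

def viterbi(log_obs, log_trans):
    """Viterbi decoding.  log_obs: (n_frames, n_states) log observation
    scores (e.g. GMM log-likelihoods of the ESI features per semitone
    state); log_trans: (n_states, n_states) log transition probabilities
    of the fully connected HMM.  Returns the best state sequence."""
    T, N = log_obs.shape
    delta = np.full((T, N), -np.inf)     # best score ending in each state
    back = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans     # (prev, cur)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):       # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path
```

The decoded semitone sequence is the coarse pitch track; the ±0.5-semitone salience maximization then refines it to 0.1-semitone resolution.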

0:13:00 | Now I will show you how our HMM-based melody tracking

0:13:06 | contrasts with the NMF-based pitch tracking, and also

0:13:12 | the effect of the NMF soft masking in contrast with the hard masking.

0:13:17 | The evaluation corpora we use are the MIR-1K database and also some other publicly available clips.

0:13:26 | The items of evaluation include the separate modules and also the overall performance.

0:13:34 | So first, for the melody extraction, we compare our system with Hsu's, which is also based on an HMM and ESI features,

0:13:46 | but their ESI features are defined differently from ours, and they use two streams of features while we use only one stream.

0:13:53 | The performance of the two systems is comparable.

0:14:00 | The key result of our paper is here: this is the comparison of the pitch tracking of our proposed HMM-based method and Durrieu's NMF-based method.

0:14:11 | If you look at the accuracy, ours is much higher than the raw NMF, and also higher than the compensated NMF.

0:14:21 | And these plots show the distribution of errors. You can see that for our HMM-based method there aren't very many errors, and they are mostly at one octave higher, at the +12 semitones,

0:14:36 | or one octave lower, at the -12 semitones.

0:14:39 | For the NMF, you can see that the errors are distributed across a large range.

0:14:53 | This is due to the imbalance in the A^F matrix:

0:14:56 | if you use DP, it will always pick something above the true pitch contour,

0:15:01 | and even if you do the compensation,

0:15:05 | the imbalance is not completely cleared.

0:15:13 | Also worth mentioning is that, because our HMM-based pitch tracking method is trained offline and the online part does not involve any iterations,

0:15:24 | it runs six to seven times faster than the iterative NMF.

0:15:32 | For the time-frequency masking, we compare our system with the hard masking system of Hsu,

0:15:39 | and we evaluate them at three mixing SNRs:

0:15:44 | -5, 0, and +5 dB.

0:15:48 | Now first, look at the blue squares, where we use the annotated pitch track, so we isolate the

0:15:59 | time-frequency masking part,

0:16:00 | and you can see that at all the SNRs our system performs better.

0:16:07 | Also worth mentioning is that our performance with the

0:16:12 | annotated pitch track, which uses soft masking,

0:16:15 | gets close to or even exceeds that of the hard masking

0:16:19 | with ideal masks, which is

0:16:21 | kind of an upper bound for the hard masking.

0:16:26 | Now for the overall evaluation, we use the extracted pitch tracks

0:16:31 | for all the systems, and we see that

0:16:33 | ours also performs better than the hard masking system.

0:16:39 | And now we compare the overall system of ours with Durrieu's, which is completely based on NMF.

0:16:48 | I'd like to show you some audio examples.

0:16:58 | So this is the mixture,

0:17:00 | and this is the separation result

0:17:02 | using Durrieu's NMF-based method.

0:17:08 | (audio plays)

0:17:14 | You can see that for the last note of the pitch contour, the extracted pitch is about twice the true pitch.

0:17:21 | (audio plays)

0:17:29 | And you can hear that some of the voice remains in the accompaniment.

0:17:36 | (audio plays)

0:17:42 | Our system extracts the correct pitch for the last note.

0:17:48 | (audio plays)

0:17:54 | So the accompaniment here is cleaner than with Durrieu's system.

0:17:58 | and |

0:17:59 | And if you look at these results as a whole, for some of them our system performs better, and for some of them it performs worse;

0:18:06 | the reason is that the overall result is mainly determined by the performance of the melody extraction.

0:18:15 | Okay, so for the conclusion:

0:18:18 | our conclusion is that NMF-based melody extraction suffers from the imbalance in the A^F matrix,

0:18:26 | and for this task the HMM performs better and also runs faster;

0:18:29 | and for the TF masking, NMF-based soft masking is much better than hard masking.

0:18:34 | So we propose the combination of HMM-based melody extraction and NMF-based soft masking.

0:18:40 | Thank you.

0:18:47 | Any questions?

0:18:48 | There is time for questions.

0:18:51 | Yes, please.

0:18:55 | Yes, so thank you for your talk.

0:18:57 | I have one question, or two questions actually. One question is:

0:19:01 | your method is supervised, or it has some learning,

0:19:06 | while the method of Durrieu that you compared to is completely unsupervised.

0:19:11 | So my question is, in which way is the learning you do

0:19:16 | generic enough that it can be applied to

0:19:19 | completely different signals? And my second question would be:

0:19:22 | do you have sound samples where

0:19:23 | your method performs slightly

0:19:25 | worse than Durrieu's method?

0:19:28 | Yes.

0:19:29 | If you can play some of them,

0:19:30 | that would be nice.

0:19:32 | Oh, okay.

0:19:35 | For the second question: the separated audio of all the other clips

0:19:38 | will be available on a demo web page,

0:19:42 | whose URL is included in our paper.

0:19:45 | And for the first question:

0:19:48 | we use this supervised method because we find that the imbalance hurts the results very much.

0:19:57 | And actually,

0:20:00 | Durrieu's compensation is like some ad hoc rule-based compensation,

0:20:06 | like this one, so it is not completely unsupervised either:

0:20:11 | he also looks at

0:20:15 | what the imbalance looks like, and designs this rule to

0:20:19 | compensate for it;

0:20:21 | our HMM training is like learning what the imbalance looks like

0:20:29 | by an automatic learning method.

0:20:32 | Okay, let's move on.

0:20:34 | Okay, thank you.