Speech Transcript - COMBINING HMM-BASED MELODY EXTRACTION AND NMF-BASED SOFT MASKING FOR SEPARATING VOICE AND ACCOMPANIMENT FROM MONAURAL AUDIO

0:00:13	i L one
0:00:18	oh
0:00:19	please
0:00:20	oh copy
0:00:22	a a number
0:00:24	two are and
0:00:25	number five so
0:00:26	okay
0:00:31	I'm the chair of this session, [name unclear] from the University of [unclear].
0:00:35	Okay,
0:00:36	so let's start.
0:00:37	So, the first presentation.
0:00:40	read three yeah
0:00:42	yeah
0:00:44	So: "Combining HMM-based melody extraction and
0:00:48	NMF-based
0:00:50	soft masking for
0:00:51	separating
0:00:52	voice and
0:00:54	accompaniment
0:00:56	from monaural audio."
0:00:59	okay
0:01:00	So, Yun
0:01:01	Wang and, um,
0:01:03	oh
0:01:04	okay please
0:01:07	Okay, good morning everyone.
0:01:09	I'm presenting my paper about
0:01:11	combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio.
0:01:22	So here, first, you see a block diagram
0:01:26	of
0:01:26	most separation systems for voice and accompaniment.
0:01:29	It is made up of two main modules. One is the melody extraction, which outputs a pitch contour from
0:01:35	the audio signal,
0:01:37	and then the time-frequency masking works on the spectrogram to give an estimate of the spectrograms
0:01:43	of the voice and the accompaniment.
0:01:45	So the different systems differ in the techniques they use for these individual modules.
0:01:51	So for melody extraction, the popular
0:01:54	methods are hidden Markov models and non-negative matrix factorization,
0:01:58	and for
0:01:59	the time-frequency masking
0:02:01	there is hard masking and soft masking.
0:02:04	Our work is largely based on the work of Durrieu,
0:02:07	which is completely based on non-negative matrix factorization.
0:02:12	But we find that
0:02:14	for melody extraction the NMF doesn't work very well, so instead
0:02:20	we are inspired by the work of Hsu, which does
0:02:23	the melody extraction with hidden Markov models, and that
0:02:28	forms our work.
0:02:30	So first I'll give a brief review of the NMF-based
0:02:35	melody extraction and also the time-frequency masking.
0:02:39	So in non-negative matrix factorization, the
0:02:43	observed
0:02:44	spectrogram of the given
0:02:45	audio signal is
0:02:47	regarded as a stochastic process,
0:02:50	where each element is a complex number obeying a
0:02:55	Gaussian
0:02:56	distribution with a
0:02:58	variance parameter D, and if you put all the D's together you get the power spectrum.
0:03:03	And
0:03:05	the problem of non-negative matrix factorization is to estimate this
0:03:09	power spectrum D to
0:03:12	maximize the likelihood of the observed spectrum X.
0:03:17	The power spectrum D of the total signal can be
0:03:21	decomposed into two parts: the spectrogram of the voice and
0:03:26	the spectrogram of the music,
0:03:28	or the accompaniment.
0:03:30	And furthermore, the
0:03:31	spectrum of the voice can be
0:03:34	decomposed into the product
0:03:36	of the spectrograms of the glottal excitation and the vocal tract.
0:03:41	Now, in each of these parentheses,
0:03:45	the matrix P can be regarded as a codebook, and the matrix A can be regarded as
0:03:50	the linear combination coefficients of these
0:03:53	uh
0:03:54	basis vectors. So let me show you how this works.
0:03:57	Let's take the glottal excitation matrices P_F and A_F, for example.
0:04:02	The P_F matrix looks like this.
0:04:04	So each column is the
0:04:07	spectrum
0:04:08	of the glottal excitation
0:04:10	at a certain fundamental frequency.
0:04:12	Fundamental frequencies are expressed in MIDI numbers, which are a log scale of the frequency.
0:04:19	And
0:04:20	here
0:04:20	you can see two columns of the P_F matrix: one is for MIDI number fifty-five and
0:04:25	the other for seventy.
0:04:26	You can see that for fifty-five
0:04:28	it has a lower fundamental frequency, so the harmonics are placed closer,
0:04:33	and for seventy they are placed further apart.
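
The relation between MIDI number, fundamental frequency and harmonic spacing described here can be sketched as follows. This is a minimal illustration, not code from the talk: it assumes the standard tuning convention (MIDI 69 = 440 Hz), a hypothetical 4 kHz cutoff `f_max`, and MIDI 70 as the higher of the two example pitches.

```python
def midi_to_hz(n):
    """Convert a MIDI number to Hz, assuming A4 = MIDI 69 = 440 Hz."""
    return 440.0 * 2.0 ** ((n - 69.0) / 12.0)


def harmonic_frequencies(n, f_max=4000.0):
    """Harmonics k * f0 of MIDI number n up to f_max (f_max is an assumed cutoff)."""
    f0 = midi_to_hz(n)
    count = int(f_max // f0)
    return [k * f0 for k in range(1, count + 1)]


low = harmonic_frequencies(55)   # lower pitch: more harmonics, closer together
high = harmonic_frequencies(70)  # higher pitch: fewer harmonics, further apart

# Number of harmonics below the cutoff and the spacing between them (= f0).
print(len(low), round(low[1] - low[0], 1))
print(len(high), round(high[1] - high[0], 1))
```

The spacing between neighbouring harmonics equals the fundamental frequency, which is why the column for the lower MIDI number looks denser.
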
0:04:38	And for
0:04:39	the A_F matrix, which holds the combination coefficients for these basis vectors,
0:04:45	for example, if we look at
0:04:46	this,
0:04:52	... are activated.
0:04:54	So
0:04:55	there is a large coefficient for the basis vector at MIDI number sixty and a smaller coefficient
0:05:01	for the
0:05:03	basis vector at
0:05:04	MIDI number seventy.
0:05:05	And
0:05:06	so if you plot
0:05:08	all of this A_F matrix out, you can
0:05:10	actually visualize the pitch contour
0:05:12	on this matrix, which is the dark line there.
0:05:15	So
0:05:16	the line above is like
0:05:19	the second harmonic, and these small lines may be the accompaniment.
0:05:27	So the procedure for melody extraction and soft masking using NMF is as follows.
0:05:33	First we fix the P_F matrix as shown in the previous slide,
0:05:38	and then we solve, using an iterative procedure, for the other five matrices,
0:05:43	and we are especially interested in A_F.
0:05:49	Next we find the
0:05:50	strongest continuous pitch track on this A_F matrix using dynamic programming,
0:05:57	and then we clear the other values that are far from the
0:06:02	continuous pitch track,
0:06:04	and
0:06:05	with this new A_F we solve for the other four matrices, which can give a more accurate estimate.
0:06:12	After solving for all the matrices in the decomposition of
0:06:16	the power spectrum D, we can then use
0:06:19	Wiener filtering to
0:06:21	estimate the
0:06:23	individual spectra of the voice and the accompaniment, and then we invert these into the time domain with the overlap-add
0:06:28	method.
0:06:29	Then we get an estimate of the voice and the accompaniment,
0:06:32	respectively.
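
The Wiener filtering (soft masking) step at the end of this procedure can be sketched as below. This is only an illustration under stated assumptions: the power spectrograms `voice_power` and `accomp_power` stand in for the NMF estimates, the toy numbers are made up, and the function names and the small `eps` guard are hypothetical rather than taken from the talk.

```python
def wiener_masks(voice_power, accomp_power, eps=1e-12):
    """Soft masks per time-frequency bin from two estimated power spectrograms.

    Each mask value is that source's share of the total estimated power;
    eps (an assumption) only guards against division by zero.
    """
    voice_mask, accomp_mask = [], []
    for v_row, m_row in zip(voice_power, accomp_power):
        v_gain = [v / (v + m + eps) for v, m in zip(v_row, m_row)]
        voice_mask.append(v_gain)
        accomp_mask.append([1.0 - g for g in v_gain])
    return voice_mask, accomp_mask


def apply_mask(mask, mixture_stft):
    """Multiply the complex mixture spectrogram by a real-valued mask."""
    return [[g * x for g, x in zip(g_row, x_row)]
            for g_row, x_row in zip(mask, mixture_stft)]


# Toy 2x2 example (made-up values): rows are frequency bins, columns are frames.
voice_power = [[9.0, 1.0], [0.0, 4.0]]
accomp_power = [[1.0, 1.0], [4.0, 0.0]]
mixture = [[1 + 1j, 2 + 0j], [3 + 0j, 0 + 4j]]

voice_mask, accomp_mask = wiener_masks(voice_power, accomp_power)
voice = apply_mask(voice_mask, mixture)
accomp = apply_mask(accomp_mask, mixture)
```

Because the two masks sum to one in every bin, the two estimates add back up to the mixture. A hard mask would instead assign each bin entirely to one source. Converting the masked spectrograms back to waveforms (overlap-add) is not shown.
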
0:06:34	So here is the most important part of my talk,
0:06:38	which is that we find that non-negative matrix factorization doesn't work well enough for
0:06:44	melody extraction.
0:06:46	So
0:06:47	the A_F matrix I showed in the previous slides is just like the ideal one, but the
0:06:52	actual A_F I get looks like this.
0:06:55	So we can see that
0:06:56	there is a great imbalance
0:06:59	across different frequencies:
0:07:00	for high frequencies the A_F values are large, and for the lower ones they are small.
0:07:09	So we have identified two causes for this imbalance. The first is the nonlinearity
0:07:14	of the MIDI number scale we're using.
0:07:17	So the MIDI number scale is a logarithmic
0:07:21	scale of the
0:07:22	frequency, and if we
0:07:38	for
0:07:38	the same amount of energy in the low frequency range,
0:07:44	we have more basis vectors to divide it among, so the coefficients for individual
0:07:50	basis vectors get smaller than in the higher frequency range.
0:07:53	This is one of the reasons why
0:07:55	the
0:07:56	A_F matrix has
0:07:57	smaller values in the lower frequency range.
0:08:02	And
0:08:03	to compensate for this imbalance
0:08:07	we multiply one term into the A_F matrix.
0:08:11	Here f is the
0:08:13	frequency in hertz and n is the MIDI number.
0:08:16	So
0:08:17	the first derivative of f with respect to n
0:08:21	is like the density of
0:08:24	the basis vectors at a certain frequency,
0:08:27	and by dividing by this,
0:08:30	or actually it is more like multiplying by the
0:08:33	density of the basis vectors, we can make the
0:08:37	values at the lower frequencies
0:08:39	a bit larger.
0:08:43	Now the second cause we identified is that the columns of the P_F matrix
0:08:48	are not normalized.
0:08:49	So
0:08:50	as you can see, for a lower MIDI number like fifty-five
0:08:55	there are more harmonics,
0:08:56	and
0:08:57	since the amplitudes of these
0:08:58	harmonics are similar,
0:09:00	because there are more
0:09:02	harmonics in the low-frequency basis vector, the total energy is also higher.
0:09:07	Therefore this again contributes to the imbalance in A_F.
0:09:11	So to compensate for this,
0:09:15	for each
0:09:17	unit in the A_F matrix
0:09:19	we multiply by the
0:09:20	total energy in the basis vector
0:09:24	of the corresponding frequency,
0:09:26	and
0:09:27	this is the total compensation that we came up with.
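
One way to read the two multiplicative factors described above is sketched here. The exact term shown on the slide is not in the transcript, so everything specific is an assumption: the density weight is taken as dn/df (the reciprocal of df/dn = f ln 2 / 12 on the MIDI scale), the "total energy" of a basis vector is taken as the sum of its squared entries, and all function names and toy values are hypothetical.

```python
import math


def midi_to_hz(n):
    """MIDI number to Hz, assuming A4 = MIDI 69 = 440 Hz."""
    return 440.0 * 2.0 ** ((n - 69.0) / 12.0)


def basis_density(n):
    """dn/df at MIDI number n: basis vectors per Hz (assumed form of the first factor)."""
    return 12.0 / (math.log(2.0) * midi_to_hz(n))


def column_energy(basis_column):
    """Total energy of one basis vector, assumed here to be the sum of squares."""
    return sum(x * x for x in basis_column)


def compensate(activations, midi_numbers, basis_columns):
    """Scale each row of the activation matrix by density * basis energy."""
    out = []
    for row, n, column in zip(activations, midi_numbers, basis_columns):
        weight = basis_density(n) * column_energy(column)
        out.append([weight * a for a in row])
    return out


# Toy example: the lower pitch (57) has more harmonics in its basis column.
midi_numbers = [57, 69]
basis_columns = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
activations = [[0.1, 0.2], [0.1, 0.2]]
compensated = compensate(activations, midi_numbers, basis_columns)
```

With equal raw activations, the row for the lower pitch ends up larger after compensation, which is the intended direction of the correction.
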
0:09:33	In Durrieu's original paper he also came up with a compensation, which is not
0:09:39	multiplicative like ours but additive.
0:09:43	So basically what this means is that
0:09:46	for each unit in the A_F matrix,
0:09:48	half of the value
0:09:49	at the unit one octave higher is added to the
0:09:53	original unit.
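
The additive rule as stated in the talk (add half of the value one octave higher to each unit) can be sketched like this. The matrix layout (one row per pitch unit, one column per frame), the parameter `units_per_octave`, and the choice to read from the unmodified matrix so the additions do not cascade are assumptions for illustration.

```python
def additive_octave_compensation(activations, units_per_octave):
    """Add half of the activation one octave above to each pitch unit.

    Values are read from the original matrix, so the update is not cascaded
    (an assumption; the talk does not say which variant is used).
    """
    out = [list(row) for row in activations]
    for u in range(len(activations) - units_per_octave):
        upper = activations[u + units_per_octave]
        out[u] = [x + 0.5 * y for x, y in zip(activations[u], upper)]
    return out


# Toy example with 2 units per octave and a single frame.
toy = [[1.0], [2.0], [4.0]]
boosted = additive_octave_compensation(toy, units_per_octave=2)
```

Rows that have no unit one octave above them are left unchanged.
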
0:09:57	But the effect of these compensations is not so good.
0:10:01	As you can see,
0:10:02	the leftmost
0:10:03	figure is the original A_F matrix,
0:10:06	in the middle is the A_F matrix compensated using Durrieu's
0:10:11	additive compensation, and the rightmost is our multiplicative compensation.
0:10:16	So you can see that after applying these compensations
0:10:20	the
0:10:22	values at lower frequencies of the A_F matrix do become larger, but if you look at the
0:10:28	pitch contours extracted with dynamic programming,
0:10:31	you see that
0:10:33	they
0:10:34	lie above the true pitch contour,
0:10:37	which is just the result of this imbalance in the A_F matrix.
0:10:41	So our conclusion here is:
0:10:43	even if you compensate the A_F matrix, you cannot totally eliminate the imbalance, and that can have
0:10:50	a bad effect on the pitch contour that you
0:10:52	extract with dynamic programming.
0:10:57	Therefore we propose our own HMM-based melody extraction.
0:11:02	The feature we use is called energy at semitones of interest (ESI),
0:11:06	which is an integral of the salience function within each semitone, and we use
0:11:10	thirty-six
0:11:13	dimensions,
0:11:14	that is, the MIDI numbers from
0:11:16	thirty-nine to seventy-four.
0:11:19	The salience function is a
0:11:21	weighted sum
0:11:23	of the spectrum of
0:11:25	the given audio signal, and
0:11:28	it is drawn here.
0:11:29	So here
0:11:31	the red parts show the large values and the blue parts the small values, and you can actually see
0:11:36	the
0:11:42	on this salience function map.
0:11:47	For the signal, we calculate this salience function
0:11:51	at a step of zero point one MIDI numbers.
0:11:54	So
0:11:55	that gives us a feature of more than three hundred dimensions,
0:11:59	which is
0:12:00	too much for the HMM.
0:12:02	Therefore we integrate it into the ESI features at the thirty-six semitones,
0:12:07	and
0:12:09	we also use these semitones as the states of the HMM. They are fully connected, and the output
0:12:14	probability for each HMM state is
0:12:16	modeled with an
0:12:17	eight-component GMM.
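
The integration of the fine-grained salience function into one ESI value per semitone can be sketched as below. The grid layout is an assumption for illustration: 360 salience samples at 0.1-semitone steps, starting at MIDI 38.5, so that each group of ten samples covers one semitone from 39 to 74. The talk only says "more than three hundred dimensions", and the function name is hypothetical.

```python
def esi_features(salience, bins_per_semitone=10):
    """Sum the salience function within each semitone (one frame).

    Assumes the salience grid is aligned so that each consecutive group of
    bins_per_semitone samples covers exactly one semitone of interest.
    """
    if len(salience) % bins_per_semitone != 0:
        raise ValueError("salience length must be a multiple of bins_per_semitone")
    return [sum(salience[i:i + bins_per_semitone])
            for i in range(0, len(salience), bins_per_semitone)]


# Toy frame: flat salience over the assumed 360-point grid (MIDI 38.5 to 74.4).
frame = [1.0] * 360
features = esi_features(frame)
```

Each resulting 36-dimensional vector would then be scored by the per-state GMMs; that scoring is not shown here.
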
0:12:20	The parameters of this HMM are trained from the MIR-1K database, which is annotated
0:12:25	at the frame level with the
0:12:29	fundamental frequency,
0:12:30	and if you do Viterbi decoding
0:12:33	on
0:12:34	a piece of audio with this HMM, it will give you a
0:12:38	pitch contour
0:12:38	with an accuracy of one semitone.
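
Viterbi decoding over semitone states can be sketched generically as follows. This is a textbook log-domain implementation on a two-state toy model, not the authors' system: the probabilities are made up, and in the setting described in the talk the per-frame observation scores would come from the GMMs over 36 semitone states.

```python
import math


def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence; log_obs[t][s] is the log-likelihood of frame t in state s."""
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s given the scores of the previous frame.
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(rows):
    """Element-wise natural log of a nested list of probabilities."""
    return [[math.log(p) for p in row] for row in rows]


# Toy model (made-up numbers): two states with "sticky" transitions.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = logs([[0.9, 0.1], [0.1, 0.9]])
smooth_path = viterbi(log_init, log_trans, logs([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]))
switch_path = viterbi(log_init, log_trans, logs([[0.99, 0.01], [0.01, 0.99], [0.01, 0.99]]))
```

In the first toy sequence the weak evidence in the middle frame is overridden by the transition model; in the second the evidence is strong enough to switch state.
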
0:12:43	In order to get a fine pitch track, which is accurate down
0:12:46	to zero point one semitones,
0:12:48	um
0:12:49	we then take the maximum value of the salience function map
0:12:54	in a zero-point-five-semitone range around the coarse pitch.
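
The refinement from a coarse semitone to a 0.1-semitone pitch can be sketched as a local peak search on the salience function. The grid parameters are assumptions (0.1-semitone steps starting at MIDI 38.5), and the "0.5-semitone range" is interpreted here as plus or minus 0.5 semitones around the coarse pitch; the function name is hypothetical.

```python
def refine_pitch(salience_frame, coarse_midi, grid_start=38.5, step=0.1, half_range=0.5):
    """Return the MIDI value of the salience peak near the coarse pitch (one frame)."""
    center = round((coarse_midi - grid_start) / step)
    radius = round(half_range / step)
    first = max(0, center - radius)
    last = min(len(salience_frame) - 1, center + radius)
    best = max(range(first, last + 1), key=lambda i: salience_frame[i])
    return grid_start + best * step


# Toy frame: a local peak near MIDI 60.3 and a larger peak far away at 68.5.
frame = [0.0] * 360
frame[218] = 1.0
frame[300] = 5.0
fine = refine_pitch(frame, coarse_midi=60.0)
```

Restricting the search window keeps the larger but distant peak from overriding the HMM's coarse decision.
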
0:13:00	And then I'll show you how our HMM-
0:13:04	based melody tracking
0:13:06	contrasts with the NMF-based
0:13:09	pitch tracking, and also
0:13:12	the effect of the NMF soft masking
0:13:15	in contrast with hard masking.
0:13:17	So the evaluation corpora we use are the MIR-1K
0:13:21	database and also some of the clips available on Durrieu's website.
0:13:26	The items of evaluation include the
0:13:29	separate modules and also the overall
0:13:32	performance.
0:13:34	So first, for the melody extraction, we first compare
0:13:40	our system with Hsu's, which is also based on an HMM and ESI
0:13:45	features,
0:13:46	but their ESI features are defined differently from ours, and they use two streams of features
0:13:51	while we use only one stream,
0:13:53	and
0:13:54	the performance of the two systems is comparable.
0:14:00	The result is here. So this is the comparison of the pitch tracking of
0:14:05	our proposed HMM-based method and Durrieu's NMF-based method.
0:14:11	So if you look at the accuracy, ours is much higher than the raw
0:14:16	NMF and also higher than the
0:14:18	compensated NMF.
0:14:21	And these plots show the breakdown of errors. So you can see, for our HMM-based
0:14:27	method there aren't very many errors, and they are mostly like one octave higher, at
0:14:34	twelve semitones, and one
0:14:36	octave lower, at minus twelve semitones.
0:14:39	And for the NMF you can see that there is always
0:14:47	but distributed across a large range of
0:14:51	errors.
0:14:51	So this is
0:14:53	related to the imbalance in the A_F matrix.
0:14:56	So if you use DP it will always, like, pick something
0:14:59	above the true pitch contour,
0:15:01	and even if you do the compensation,
0:15:05	this kind of error is not completely
0:15:10	cleared.
0:15:13	Also worth mentioning is that because our
0:15:17	HMM-based
0:15:18	pitch tracking method is trained offline and the online part does not involve any iterations,
0:15:24	it runs six to seven times faster than the
0:15:28	iterative NMF procedure.
0:15:32	For the time-frequency masking, we compare our system with the hard masking system of Hsu,
0:15:37	and
0:15:39	we evaluate them at three mixing
0:15:43	SNRs:
0:15:44	minus five, zero and five dB.
0:15:47	um
0:15:48	Now first
0:15:50	look at the blue,
0:15:51	the blue squares, where we use the annotated pitch tracks, so we isolate the
0:15:56	TF masking part,
0:15:59	and
0:16:00	you see that at all the SNRs
0:16:04	our system
0:16:05	performs better.
0:16:06	And
0:16:07	worth mentioning is that
0:16:10	our performance with the
0:16:12	annotated pitch tracks, which uses soft masking,
0:16:15	gets close to or even exceeds the hard masking
0:16:19	with ideal masks, which is
0:16:21	kind of an upper bound
0:16:22	for hard masking.
0:16:26	Now for the overall evaluation we use the extracted pitch tracks,
0:16:31	and we see that
0:16:33	it also performs better than the hard masking system.
0:16:39	And now we
0:16:41	compare the overall system of ours with Durrieu's, which is completely based on NMF.
0:16:47	I'd
0:16:48	like to show you
0:16:50	yeah
0:16:51	i
0:16:52	i i a
0:16:58	So this is a mixture,
0:17:00	and this is the separation result
0:17:02	using Durrieu's NMF-based method.
0:17:08	i don't know oh
0:17:10	oh
0:17:11	oh
0:17:14	You can see that for the last note, the pitch contour,
0:17:17	the pitch is like
0:17:19	twice the true pitch.
0:17:21	i
0:17:23	i
0:17:25	i
0:17:29	So you hear that some of the voice is left in the accompaniment.
0:17:36	oh you i no no oh
0:17:42	So with ours, the pitch extracted for the last note is correct.
0:17:48	i
0:17:48	i
0:17:51	i
0:17:53	i
0:17:54	So the accompaniment here is cleaner than with Durrieu's system.
0:17:58	And
0:17:59	if you look at these results,
0:18:03	for some of them our system performs better and for some of them
0:18:05	it's worse.
0:18:06	The reason is that it is mainly determined by the performance of the
0:18:12	melody extraction.
0:18:15	Okay, so for the conclusion:
0:18:18	our conclusion is that NMF-based melody extraction suffers from an imbalance in the A_F
0:18:24	matrix, and
0:18:26	for this matter the HMM performs better and also runs faster,
0:18:29	and for the TF masking,
0:18:32	NMF-based soft masking is much better than hard masking.
0:18:34	So we propose the combination of HMM-based melody extraction and NMF-based soft masking.
0:18:40	Thank you.
0:18:47	Any questions?
0:18:48	We have time for a few.
0:18:51	Yes, please.
0:18:55	Yes, so thank you for your talk.
0:18:57	I have one question, I mean two questions actually. The first question is:
0:19:01	your method is supervised, or you have some learning,
0:19:06	while you compare it to a method, Durrieu's, which is completely unsupervised.
0:19:11	So my question is: in which way
0:19:14	is
0:19:14	the learning you do
0:19:16	generic, so that it can be applied to
0:19:19	completely different signals? And my second question would be:
0:19:22	do you have some sound samples where
0:19:23	your method is slightly
0:19:25	less performant than Durrieu's method?
0:19:28	Yes,
0:19:29	if you can also play them,
0:19:30	that would be nice.
0:19:32	Oh, okay.
0:19:35	The other
0:19:38	clips and the separated audio are available on a demo web page,
0:19:42	where the URL is included in our paper.
0:19:45	And for the first question:
0:19:48	we use this supervised method because we find that the imbalance
0:19:54	affects the results very much, and
0:19:57	actually
0:20:00	Durrieu's compensation is like
0:20:02	some ad hoc, rule-based
0:20:04	compensation,
0:20:06	like this one. So
0:20:08	this is not completely unsupervised:
0:20:11	he also looks at
0:20:15	what the imbalance looks like and designs this rule to
0:20:19	compensate for it, and
0:20:21	our HMM training is like learning
0:20:26	what the imbalance looks like by an
0:20:29	automatic learning method.
0:20:32	okay let's go to
0:20:34	okay thank you

COMBINING HMM-BASED MELODY EXTRACTION AND NMF-BASED SOFT MASKING FOR SEPARATING VOICE AND ACCOMPANIMENT FROM MONAURAL AUDIO

Acoustic Source Separation

Presented by: Yun Wang, Author(s): Yun Wang, Zhijian Ou, Tsinghua University, China