0:00:26 Hi. This is joint work with Francis Bach from the Sierra team at INRIA and with Cédric Févotte from the statistics team at Télécom ParisTech.
0:00:39 I am going to talk about Itakura-Saito non-negative matrix factorization with group sparsity. There have been several talks about Itakura-Saito non-negative matrix factorization already, and we have been working on adding priors within this framework.
0:00:59 I will quickly go over non-negative matrix factorization in the next few slides. Here you can see a simple example of a music signal: it is composed of piano notes. At each bar you can see that first four notes of the piano are played alone, and then combinations of two notes which are harmonically related, one an octave above the other.
0:01:32 This is an example of a difficult single-channel source separation problem. What you can see here is the data, and non-negative matrix factorization learns a basis dictionary, with basis spectra, together with time activations. Here you can see the dictionary and the time activations, and you can see very clearly that the notes are separated; you can read off the activations very easily: the four notes are played one by one and then in combinations of two. There are still two components left: one that explains the noise, and one that you can see here; if we could listen to it, it sounds like the hammer of the piano.
0:02:21 So this is an example where Itakura-Saito non-negative matrix factorization works really well. Here is the same example with non-negative matrix factorization using another loss, the Euclidean loss. These are the same type of plots, except you can see that for the first notes, which are the top components here, the top component gets split across other components, so the separation is not as good as before. That is explained by the fact that the Itakura-Saito divergence is more sensitive to high frequencies and to low-energy content, so it seems better suited for this task.
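For reference, a minimal statement of the two losses being compared here, in the usual NMF notation (V is the power spectrogram, W the dictionary, H the activations):

$$
d_{\mathrm{IS}}(x \mid y) = \frac{x}{y} - \log\frac{x}{y} - 1,
\qquad
D_{\mathrm{IS}}(\mathbf{V} \mid \mathbf{W}\mathbf{H}) = \sum_{f,t} d_{\mathrm{IS}}\!\left(v_{ft} \mid [\mathbf{W}\mathbf{H}]_{ft}\right),
$$

versus the squared Euclidean loss $\sum_{f,t}\left(v_{ft} - [\mathbf{W}\mathbf{H}]_{ft}\right)^2$. The Itakura-Saito divergence is scale-invariant, $d_{\mathrm{IS}}(\lambda x \mid \lambda y) = d_{\mathrm{IS}}(x \mid y)$, which is why it weights low-energy time-frequency bins as heavily as high-energy ones.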
0:03:14 Now, if we move to more complicated audio signals, a problem appears: even if you have only two sources, each source can emit several different spectra. For example, when I speak there are several spectra that you can associate with my voice; I am not always saying the same thing. So there is the problem of grouping the components into sources, of assigning several components to each source.
0:03:47 For instance, you can simply run an NMF and look at the activation coefficients, the matrix H. In this very simple example, where you have a bass and another instrument that overlap in this region, you can see that for some components it is already very clear that these components should be assigned to the bass, and other components to the other source.
0:04:16 One approach is to look at the dictionary and be guided by a statistic or a heuristic: an engineer can design the best grouping of components into sources by hand. The problem is that as the tracks get longer, as you get more tracks, and also as the dictionary gets larger, this becomes more complicated for the engineer, because there is a lot more work to do. And if you use a heuristic, this heuristic will involve considering all permutations of the components of your dictionary. If you have five components, you have factorial-five permutations, which is still small; but if you want ten or twenty components in the dictionary, this becomes way too long: you would run an NMF for thirty seconds and then spend a day considering all the permutations of the sources. So what we want to do is to include the grouping in the learning of the dictionary.
0:05:22 One way of thinking about how to group the components is to think about the sound levels of each source at a given time. Here, for a given track, I have plotted the volume for each source: the bass, the guitar, and the voice. You can see that there are some cues you can use. For instance, at this time the bass is at a very low level compared to the other sources, so you could say that at some points one source is inactive while the others are active. Another idea is to exploit the fact that the shapes of these volume activations are very different.
0:06:07 Now, coming back to the problem, let me set up the notations a little. What we are looking at is V, the power spectrogram. You can consider that, in an additive generative model, the observed complex spectrogram is the sum of several components, and each component of the complex spectrum is distributed as a Gaussian with diagonal covariance; non-negative matrix factorization then consists in computing a factorization of the parameters of the model. In the case of the Itakura-Saito divergence, this corresponds to a zero-mean Gaussian model, which means that we have a truly additive model for the power spectrogram: even if additivity does not hold exactly for the observed power spectrogram, it does hold for the parameters we want to estimate, and this is the only model for which this is true.
0:07:21 So the zero-mean Gaussian assumption, when you come down to looking at the power spectrogram, means that the power spectrogram is distributed as an exponential with mean [WH], where W is the basis dictionary and H the time activations.
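A compact statement of this composite model, in the notation standard for Itakura-Saito NMF ($x_{ft}$ is the complex STFT coefficient and $v_{ft} = |x_{ft}|^2$ the power spectrogram):

$$
x_{ft} = \sum_{k=1}^{K} c_{k,ft},
\qquad
c_{k,ft} \sim \mathcal{N}_c\!\left(0,\; w_{fk} h_{kt}\right)
\;\Longrightarrow\;
v_{ft} \sim \mathrm{Exponential}\!\left(\text{mean} = [\mathbf{W}\mathbf{H}]_{ft}\right),
$$

and maximum-likelihood estimation of W and H in this model is equivalent to minimizing $D_{\mathrm{IS}}(\mathbf{V} \mid \mathbf{W}\mathbf{H})$.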
0:07:39 In my notation H has several rows, and you want to find a partition of the rows of H into, say, two groups; this generalizes to an arbitrary number of groups. So here H would be split into two groups with the same number of rows.
0:08:07 Now, coming back to the previous slide: what is the volume, the sound level of each source, in such a model? Well, if you assume that the entries of each column of W sum to one, then the sound level of one source will be the sum of the activation coefficients of group one, which corresponds to source one. So what we want to model is these coefficients.
0:08:38 The inference we propose is to learn the grouping at the same time as the factorization. This corresponds to doing something close to an NMF, where we propose adding a prior that favors sparsity across the groups, that is, across the different sources. Since you have non-negative coefficients, this l1 norm is just the sum of the coefficients of H for one source, that is, one group, at a given time. As for psi, here we only assume that it is a concave function. So what this optimization problem trades off is that you want a good fit to the data, but at the same time you have a prior saying that at a given time only a few sources are active.
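Written out, the optimization problem described here has the form (with G groups, $\mathbf{h}_{g,t}$ the subvector of the t-th column of H indexed by group g, and $\lambda$ a regularization weight):

$$
\min_{\mathbf{W} \ge 0,\, \mathbf{H} \ge 0}\;
D_{\mathrm{IS}}(\mathbf{V} \mid \mathbf{W}\mathbf{H})
\;+\; \lambda \sum_{t=1}^{T} \sum_{g=1}^{G} \psi\!\left(\lVert \mathbf{h}_{g,t} \rVert_1\right),
\qquad
\lVert \mathbf{h}_{g,t} \rVert_1 = \sum_{k \in g} h_{kt},
$$

with $\psi$ concave. Because the columns of W sum to one, $\lVert \mathbf{h}_{g,t} \rVert_1$ is exactly the sound level of source g at time t.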
0:09:38 We have a particular choice for psi; if you look at the paper, you will see that it comes from a graphical model with two layers. I will not say too much about this, but it corresponds to maximum-likelihood inference of the parameters of a hierarchical model of the data with a prior on H.
0:10:14 About the inference of the parameters: in the Itakura-Saito case exact inference is very hard, so, as in related methods, we resort to multiplicative updates for the parameter inference, because they go much faster. Here is an example, in the right-hand window, of running the algorithm with either gradient methods or the multiplicative update method: the multiplicative algorithm goes much faster and actually converges to a better local optimum.
0:10:50 Our algorithm does not change significantly from the standard, classical Itakura-Saito NMF; we just add terms which correspond to our prior. Since psi is a concave function, the term psi-prime decreases with the sound level of source one. So what the algorithm does is that at each step you update H so as to get a better fit to the data, corresponding to the classical multiplicative update; and the lower the volume of source one, the more its coefficients will be shrunk at that time. This means that the algorithm will push low-amplitude sources to zero and keep high-amplitude sources.
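A minimal sketch of the kind of update scheme just described, assuming psi = log (so psi'(u) = 1/u) and folding the penalty gradient into the denominator of the standard Itakura-Saito multiplicative update; function and variable names are mine, not the paper's:

```python
import numpy as np

def is_nmf_group_sparse(V, groups, lam=0.1, n_iter=1000, eps=1e-12, seed=0):
    """Itakura-Saito NMF with a group-sparsity penalty on H (illustrative sketch).

    V      : (F, T) non-negative power spectrogram
    groups : list of component-index lists, one per source (a partition)
    lam    : penalty weight; psi is taken here to be log, a concave function.
    """
    F, T = V.shape
    K = sum(len(g) for g in groups)
    rng = np.random.default_rng(seed)
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        # standard IS-NMF multiplicative update for W (beta-divergence, beta = 0)
        Vhat = W @ H + eps
        W *= ((V / Vhat**2) @ H.T) / ((1.0 / Vhat) @ H.T + eps)
        # keep each column of W summing to one, rescaling H so W @ H is unchanged
        scale = W.sum(axis=0) + eps
        W /= scale
        H *= scale[:, None]
        # update for H: the penalty gradient lam * psi'(group volume) enters the denominator
        Vhat = W @ H + eps
        P = np.zeros_like(H)
        for g in groups:
            vol = H[g].sum(axis=0) + eps   # sound level of this source at each time
            P[g] = lam / vol               # psi' = 1/u: quiet sources are shrunk hardest
        H *= (W.T @ (V / Vhat**2)) / (W.T @ (1.0 / Vhat) + P + eps)
    return W, H
```

Usage would be, for example, `W, H = is_nmf_group_sparse(V, groups=[list(range(5)), list(range(5, 10))])` for two sources with five components each.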
0:11:54 One nice fact is that even though we add this prior, it does not change the speed of the algorithm: it converges in approximately a thousand iterations, and the time per iteration is roughly the same as for the classical algorithm.
0:12:13 Now, one complicated aspect of having this prior is that you must perform model selection for the hyperparameters: the parameter lambda in our model, and also the choice of psi. Given that we actually have a graphical model that explains the choice of this prior, we could resort to Bayesian tools to estimate those parameters, but instead we devised a statistic that is much simpler to evaluate. The principle of this statistic is that if you knew all the right parameters, then V given these parameters should be exponentially distributed. So if you compute this statistic, replacing the parameters by the estimates of W and H, and look at this random variable, it should be distributed as a unit exponential. And you have a lot of samples of it, because you have as many samples as there are time-frequency bins. Then computing a Kolmogorov-Smirnov statistic becomes very attractive, because it is very cheap: you can just run a whole grid of experiments and look at the parameter values for which you get the lowest Kolmogorov-Smirnov statistic.
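A sketch of how this check could look in practice, assuming the normalized residuals v_{ft} / [WH]_{ft} (which are unit exponential under the model) are compared against Exponential(1) with SciPy's Kolmogorov-Smirnov test; the function name is illustrative:

```python
import numpy as np
from scipy import stats

def ks_model_selection_stat(V, W, H, eps=1e-12):
    """Kolmogorov-Smirnov distance between the normalized residuals and the
    Exponential(1) distribution; smaller means a better-calibrated fit."""
    resid = (V / (W @ H + eps)).ravel()  # one sample per time-frequency bin
    return stats.kstest(resid, "expon").statistic
```

Model selection is then a grid search: run the factorization for each candidate (lambda, psi) and keep the setting with the smallest statistic.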
0:13:46 We did that on synthetic data generated from the model, with two sources. You can look at different numbers of training samples for the model. At the top, the value of our statistic is shown in blue, and in red a measure of the goodness of fit to the model: because in this setting we generated the synthetic data from a known model, we can actually compute the divergence between the estimated parameters and the true ones. You can also see classification scores: classification of source one counts as correct if source one is recovered exactly, and likewise for source two. When there are only a hundred observations, the classification accuracy is already good, but it is difficult to find a minimum of the statistic. As you increase the number of data points, the estimates get better, you can see the minimum of the statistic more clearly, and the divergence to the true model also decreases. This simply means that the more data you have, the better our prior will estimate the parameters.
0:15:17 That was on synthetic data; let us now turn to experimental results. A first idea is to try this on a simple segmentation task, where you know that at any given time only one source is active. A good baseline to compare our algorithm against is the simple idea of running an NMF and then finding the best permutation given a heuristic. So the heuristic is: compute an NMF, then find the grouping of the components that minimizes this same penalty; this gives a fair comparison.
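For comparison, a brute-force sketch of that post-hoc baseline, scoring every assignment of components to sources with the same concave penalty (here psi = log(eps + .)); names are illustrative:

```python
import itertools
import numpy as np

def best_post_hoc_grouping(H, n_sources, lam=1.0, eps=1e-12):
    """Assign each of the K components of H to one of n_sources groups so as to
    minimize the group-sparsity penalty sum_t sum_g log(eps + ||h_{g,t}||_1)."""
    K = H.shape[0]
    best_assign, best_cost = None, np.inf
    for assign in itertools.product(range(n_sources), repeat=K):  # n_sources**K cases
        cost = 0.0
        for g in range(n_sources):
            idx = [k for k in range(K) if assign[k] == g]
            if idx:
                cost += lam * np.log(eps + H[idx].sum(axis=0)).sum()
        if cost < best_cost:
            best_assign, best_cost = assign, cost
    return best_assign
```

With twenty components and two sources this already enumerates about a million assignments, which is the combinatorial blow-up mentioned earlier in the talk.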
0:15:57 This is the result of an NMF with the heuristic grouping. You can hear the mixture first, then the speech, then the other source. In the result of the NMF with heuristic grouping, you can hear that there is still a lot of mixing left; the sources are not separated well. And this is the result with our algorithm, which learns the grouping at the same time as the NMF: you can hear that the separation gets a lot closer to the original.
0:16:38 The second experiment we ran was on real audio signals. We took two songs from the SiSEC database, and we evaluated the quality of the separation as we vary the degree of overlap between the sources: the more they overlap, the more difficult the separation becomes. And I insist on the fact that we have no prior on the sources, so you cannot hope for perfect separation; but you will see that the less overlap you have in the mix, the better the separation. You can see that at around thirty-three percent overlap you get a very good separation in terms of SDR, and as the overlap increases it works less and less.
0:17:40 So what our prior is useful for is this: when not all sources are active at the same time, it improves the separation. We can now listen to a few examples, one where it works and one where it does not.
0:18:16 (audio example plays)
0:18:56 OK, let us skip this; that was the first mix.
0:19:11 (audio example plays)
0:19:33 This is the second source; let me skip directly to the results. First the source, and next you will hear the estimate we obtain of it.
0:19:46oh
0:19:54or
0:19:56a
0:19:57oh
0:19:57oh
0:20:05a
0:20:09yeah
0:20:10i
0:20:12 OK, so we have ten seconds left for the conclusion. We have proposed a simple sparsity prior to perform the grouping of the sources and solve the permutation problem in the single-channel source separation case. We showed that the algorithm performs better than carrying out the grouping as a post-processing step. In future work we will try to incorporate smoothness priors to take into account the temporal dynamics of H. Thank you.
0:20:50 We have time for only one quick question.
0:21:02 [Question] The pieces you played are mostly low-pitched components, and they mix very much alike because the instruments play according to the same chords, so they sound alike. I am wondering how much the sampling rate affects these results in terms of SDR. I mean, if you went to a higher FFT resolution, would it be different? Would the separation that you showed change much?
0:21:35 [Answer] OK, so we do not talk about this in the article, but from my experience the method is not sensitive to the sampling rate that you choose. For this experiment I chose a sampling rate of twenty-two kilohertz, simply because of computing-time concerns.
0:22:00 I guess for the example you heard, with the piano and the voice, since they play approximately in the mid and high range of the spectrogram, this would not have too much effect, because in that range the frequencies are pretty well separated. But if you have a bass and another source, then clearly having a good resolution will help: since we have no model for the sources, since we learn everything from the data, having a good sampling rate will help, because then you can afford a longer time window and a better resolution in the frequency range that is particularly important, the low-frequency range.
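As a rough sanity check of that trade-off (my numbers, not the speaker's): the frequency spacing of an N-point STFT at sampling rate $f_s$ is

$$
\Delta f = \frac{f_s}{N},
$$

so at 22.05 kHz a 1024-sample window gives $\Delta f \approx 21.5$ Hz, which is coarse relative to the spacing of bass fundamentals (a semitone near 60 Hz spans only about 3.5 Hz); resolving them requires a longer window.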
0:22:48 All the same, I would say that the results are very robust from this point of view.
0:22:54 OK, thank you.