Speech Transcript - MAXIMUM MARGINAL LIKELIHOOD ESTIMATION FOR NONNEGATIVE DICTIONARY LEARNING

0:00:13	one
0:00:17	hello a one uh i'm a thing and uh i'm a structure that at telecom partake take i in paris
0:00:24	and this the joint work with uh sit able
0:00:27	uh the to of the uh a presentation is uh maximum likelihood a much margin like to the estimation for
0:00:33	nonnegative dictionary
0:00:34	learning
0:00:36	uh the outline of the talk will be as follows first i'll you the uh
0:00:41	problem definition and that is a learning the a dictionary parameters in the nmf problem from the marginal uh model
0:00:49	uh then i'll uh describe our model uh in which we it defined a prior for the uh
0:00:56	prior distribution for D activation code
0:00:59	uh than i'll describe to estimators uh or
0:01:03	we we which we apply on this model uh first one is the mike uh
0:01:07	max some joint likelihood uh estimator uh we which is which corresponds to the a a a standard uh nmf
0:01:14	F uh
0:01:15	uh that
0:01:16	estimates estimates are uh on the uh model that we use
0:01:19	and the other one is uh maximum marginal likelihood estimator are uh maybe which tries to learn the dictionary from
0:01:26	the uh marginal model
0:01:29	then i would be a a Q results uh
0:01:32	on two data sets the first one is a a real data set and the other one is a synthetic
0:01:36	of set
0:01:37	uh we will compare to dictionaries of your thing uh
0:01:41	uh which used to estimate as down i'll conclude
0:01:45	uh so as of now uh everyone one knows them about uh the problem we have a we we try
0:01:50	to uh
0:01:52	i approximate uh and non-negative matrix uh re a as a product of two other a non-negative matrix is W
0:01:59	and H
0:02:00	here double use uh
0:02:02	i can be seen as a dictionary parameters and H a matrix is the activation uh
0:02:08	i parameters
0:02:10	uh we try to do this approximation uh a a minimum minimizing at a die just uh
0:02:16	between these two matrices
0:02:18	and this type or just matrix can be anything uh like kullback-leibler divergence or or cross like to die versions
0:02:24	or euclidean distance
0:02:26	uh the generalized kullback-leibler uh
0:02:29	dive urgency uh is given like this
0:02:32	uh
0:02:33	hmmm
0:02:33	so
0:02:34	you what we want to learn W uh which is the dictionary uh and it's uh
0:02:40	the size of uh this dictionary constant while the uh H the expansion coefficients
0:02:46	at the size of this matrix increases as we it all are more samples
0:02:51	uh so uh the samples are in call as column vectors and uh B is of size F by and
0:02:59	and it a we have a F features and uh the data is represented by and so we as we
0:03:06	uh increase samples uh we have
0:03:08	uh the i and get
0:03:10	a a larger
0:03:11	so on this problem we have uh efficient algorithms which depend on make majorization unionisation uh which is a
0:03:20	uh optimization by uh also that of functions
0:03:23	uh so we have a a multiplicative update rules for this uh
0:03:28	uh a problem which i which converts fast and it which it sure uh non negativity
0:03:34	uh the same update rules can be also a also a paint by uh
0:03:39	i
0:03:39	expectation maximisation algorithm on a that's sickle uh model uh low
0:03:45	a a the kl about minimizing the yeah that versions corresponds to you
0:03:49	a maximum likelihood on reports an observation model
0:03:53	um
0:03:55	so our uh
0:03:57	problem is a
0:04:00	as
0:04:00	uh we we
0:04:02	have more samples
0:04:03	we V
0:04:04	have more parameters to learn
0:04:06	because the age uh the size of H also increases
0:04:10	but this size of the would you is a constant a a uh independent of the data
0:04:15	so uh for example uh a maximum like likelihood uh
0:04:19	estimator is a consistent estimator uh
0:04:22	that means
0:04:24	if you use all sort
0:04:26	uh infinite a number of samples your or uh
0:04:29	the
0:04:29	uh
0:04:31	the is
0:04:32	estimates you get are are uh are the same as the true parameters
0:04:36	but
0:04:37	since in this case as we also uh
0:04:40	a more samples we also have
0:04:42	more parameters
0:04:43	done this it consistency is not to show
0:04:47	so
0:04:48	are are he here is to uh
0:04:50	no the dictionary parameters from the marginal model
0:04:54	uh
0:04:54	the uh from the uh but by maximizing the uh margin like little of the
0:05:00	uh a dictionary terms
0:05:02	and this can be obtained by uh integrating out think i'll the A activation parameters from the model
0:05:08	so that's why we
0:05:10	uh so this is the joint
0:05:12	uh
0:05:13	uh
0:05:14	probability of
0:05:15	uh
0:05:16	B and H so we we define it uh a prior distribution on H and marginalise a we try to
0:05:23	a my the marginal lies
0:05:25	a the model by integrating out H so the
0:05:28	uh this
0:05:29	uh
0:05:30	is the marginal likelihood here uh the I so this is not if if it don't of data
0:05:35	uh so this is
0:05:36	still conditional on a dictionary
0:05:39	uh but of course this uh integration is not tractable uh and we will be uh
0:05:44	using variational approximation to
0:05:47	uh approximate this
0:05:48	margin like
0:05:51	um
0:05:51	so the model we have is i
0:05:54	oh uh uh here i'm going uh i'm presenting the us that's to model uh
0:05:59	with the on of the racial model and uh did this is actually equal to
0:06:04	uh minimize a so uh
0:06:07	maximum likelihood to the estimation on this model is uh equal to minimizing uh tick K the divergence between to
0:06:13	used to meet assist
0:06:15	so our observation
0:06:17	uh uh at the or at a single element of the observation matrix is a on distributed with the
0:06:23	with this
0:06:25	excitation parameters
0:06:26	uh
0:06:27	and we we can i introduce some latent variables C which are
0:06:33	sound to be uh
0:06:34	the observation model
0:06:36	and uh
0:06:37	the D these are also pause on uh
0:06:41	distributed with uh all these individuals some three
0:06:44	i've here
0:06:45	so on this model if we integrate out this latent variables C every also be in this
0:06:50	marginal model
0:06:52	i have so introducing this C variables uh
0:06:56	uh enables us to
0:06:57	to uh
0:06:58	have an em got someone this model
0:07:00	so
0:07:01	the uh posterior distribution of these C variables uh
0:07:06	has the standard font there it there a multinomial distributed
0:07:09	and uh since we have the posterior exact posterior of D distribution we can uh
0:07:16	uh we can calculate the uh
0:07:19	expectation in the e-step of the em algorithm uh and we will lead uh that the lead us uh to
0:07:25	the uh
0:07:27	uh
0:07:28	multiplicative updates rules that i've mentioned
0:07:31	so uh
0:07:32	the model is a a pretty standard but a we we also introduce a
0:07:36	prior distribution on H and this is a common distribution with scale parameters off five and shape shape parameters are
0:07:43	fine scale it just be to
0:07:45	so gamma distribution it is kinda to get for the uh a some of the racial models and this this
0:07:50	gives us a
0:07:52	uh
0:07:53	uh so we we use such a prior is so that are are go to algorithms will be uh easier
0:07:59	to saying
0:08:00	uh so uh a in this
0:08:03	uh model uh the conditional approval to use all uh
0:08:08	each H really able will be a also got model
0:08:12	come gamma distributed and in in our model we we don't have a prior distribution for
0:08:16	W this is the W is a deterministic variable and we are interested in pointed
0:08:24	uh
0:08:24	so be the first estimator is maximum joint likelihood estimator
0:08:30	uh so here we uh
0:08:32	i i as we do in the standard nmf problem we we try to uh
0:08:38	maximise the uh
0:08:40	joint likelihood with just back to W and H
0:08:42	here uh we have another term for the prior
0:08:46	of H
0:08:47	and uh we we can
0:08:48	i
0:08:49	by using all of the data functions uh
0:08:52	uh and major majorization minimization minimisation we can get this update rules which are uh
0:08:57	again again pretty uh
0:09:00	uh effective
0:09:01	so the black the terms in black are the standard H update
0:09:05	and the prior gives us these terms
0:09:09	uh and with uh i'll for a greater done or cool to one we we have a a non negativity
0:09:15	it should here uh and you fee set for to one
0:09:18	so that's it
0:09:19	a gamma distribution with skip and to one
0:09:22	i
0:09:23	then it this becomes a multiplicative update
0:09:26	uh
0:09:27	so the uh so this is
0:09:29	uh
0:09:30	and nmf problem
0:09:32	only difference is we have a prior on H
0:09:35	uh and the update rules for W are the say
0:09:39	so as the
0:09:41	maximum
0:09:42	uh as the uh maximum marginal likelihood estimator
0:09:46	so we as i mentioned that we we try to uh maximise disk want to but this quite this integral
0:09:52	is intractable and
0:09:53	uh we we don't have a form for this
0:09:55	so we you want to uh go with the em algorithm
0:09:59	uh a in an iterative a
0:10:01	uh
0:10:02	so in the yeah my and uh
0:10:05	the e-step we we try to uh calculate this uh
0:10:09	expectation of the uh a joint
0:10:12	uh distribution on there uh the posterior distributions of all the latent variables we have in the model
0:10:18	uh but
0:10:19	uh
0:10:21	we you don't have a form for this posterior as well so we can to exact em uh uh on
0:10:26	this smart model as well
0:10:28	so we uh we try to approximate this uh
0:10:32	e-step step of the em algorithm with
0:10:34	by using approximate a the sri uh
0:10:38	this uh approximate a posterior distributions
0:10:41	and we will uh we we can do this in a several ways and uh like we uh beige a
0:10:47	as uh like variational bayes or more want to call low mark mark chain that matter
0:10:53	uh
0:10:55	in this paper we we uh work with uh a variational bayes
0:10:59	so we really show ways
0:11:01	um
0:11:02	we try to approximate to be a posterior distribution of the latent variables using
0:11:06	some variational distributions which have simpler fonts uh
0:11:10	for example we selected uh
0:11:13	uh a factor as a well uh
0:11:16	variational distribution like this so for each
0:11:19	uh component variables and H variables we have a uh
0:11:24	independent independent of distributions and for comp uh for
0:11:28	components we have a multinomial distributions as i said before uh
0:11:33	uh do this
0:11:34	uh makes our uh so since T posterior is C variables and on the original model are uh a multinomial
0:11:41	will make our a calculations is easier
0:11:43	and for the a H variables we have
0:11:46	i
0:11:47	a uh variational distributions so
0:11:49	we try to minimize the
0:11:52	could like the divergence between to variational a distributions we set
0:11:56	and the actual posterior but uh of course we don't know of the actual posterior but this time can be
0:12:01	expanded like
0:12:03	uh this uh the margin like build of the uh
0:12:07	dictionary uh a just W and
0:12:10	the kullback-leibler divergence between
0:12:13	uh
0:12:13	the variational distributions and the joint uh distribution
0:12:17	uh
0:12:18	so
0:12:20	uh minimizing this also uh
0:12:23	which is back to the uh a variational distributions uh is also people to
0:12:28	uh
0:12:29	minimizing this because this is a independent of the uh a variational distribution
0:12:35	uh and since this the it is
0:12:38	uh greater than zero uh
0:12:41	so this
0:12:43	a the this edition is greater than zero so i
0:12:46	this the negative of this quantity here is that lower bound on the marginal likelihood
0:12:51	because this will be greater than signal
0:12:54	so
0:12:55	uh minimizing the codec libel live buttons here
0:12:58	which corresponds to minimizing this
0:13:01	i
0:13:01	can be done with these fixed point point creations because we have a
0:13:06	uh independent uh distributions for variational distributions
0:13:10	uh
0:13:11	this could like the diver which respect to one of these distributions here
0:13:15	um
0:13:16	corresponds to
0:13:19	well uh
0:13:20	minimizing minimising but like that i've buttons in that racial distribution and this expectation we have so this is done
0:13:26	by basically
0:13:27	by uh
0:13:29	equalising the uh hyper parameters of these distributions
0:13:33	i
0:13:34	so
0:13:35	uh
0:13:36	we have
0:13:37	and approximate for the uh posterior distributions which is given as the variational distributions
0:13:42	so we we have and approximate is that like this
0:13:46	and
0:13:46	here this integral can be calculated this time
0:13:49	uh because we have a a factor as well uh
0:13:53	variational distribution here
0:13:55	uh
0:13:56	and
0:13:59	as the end step one we a maximise a
0:14:02	this quantity which the spectra W we get some uh multiplicative updates
0:14:07	update rules
0:14:08	as we had a uh with the original model
0:14:12	uh
0:14:14	uh
0:14:15	and we also have a a a a and estimation for uh for the a
0:14:19	a a a lower bound for the marginal like
0:14:22	um
0:14:24	so uh
0:14:25	this
0:14:26	uh that are some related works uh
0:14:29	uh
0:14:30	with this work uh
0:14:32	so the model we uh
0:14:34	use here is uh
0:14:36	used by uh
0:14:38	can he uh for it for text classification purposes uh
0:14:42	is a gamma plus some model uh
0:14:44	so we introduce uh the gamma
0:14:49	um
0:14:51	uh
0:14:51	it's the
0:14:52	uh
0:14:54	so on this model that has been some work and that has been some variational approximations uh
0:15:00	uh
0:15:02	maybe i can just get to uh experiments so first three a we tried to
0:15:08	uh learn the dictionary uh
0:15:11	the J used for all my uh
0:15:14	uh
0:15:14	a time-frequency representation of that piano excerpt
0:15:18	so that we we have a a fifteen second uh
0:15:21	long uh piano excerpt and we use to make them to uh a spectrum
0:15:25	spectrogram
0:15:27	uh and in this uh signal we have a four four notes playing a the combinations of four notes are
0:15:33	playing for uh
0:15:34	it's seven times so
0:15:40	or
0:15:42	i
0:15:45	oh
0:15:47	and
0:15:51	so for three we compared the uh
0:15:54	behaviours because of these two estimators uh
0:15:57	as we increase the number of components
0:16:00	with the joint likelihood estimator as we increase the number of are components
0:16:04	the joint like bill
0:16:06	keeps on increasing and if you set
0:16:08	uh the the compound number two hundred we still will have a a a a a a high are uh
0:16:14	joint like to uh
0:16:16	uh we have you
0:16:17	well we did
0:16:18	do the same experiment with the marginal likelihood uh estimator
0:16:23	uh
0:16:24	first first uh
0:16:25	much like build increases and after some point
0:16:28	i it it starts increasing it's
0:16:30	stagnate nights
0:16:32	uh
0:16:34	and uh
0:16:35	since we have a lower bound here uh we we we cannot be sure about uh whether this lower bound
0:16:42	is tight three we compared this lower bound with uh other marginal like loop
0:16:47	uh
0:16:48	estimation techniques and we we so that at this uh lower bound is tight
0:16:52	uh
0:16:53	so
0:16:55	up there six
0:16:56	a components here uh we also are that some of the uh
0:17:01	columns of the dictionary a
0:17:04	a a a a a a a and converge to zero so they is that the dictionary is thing i
0:17:10	uh with
0:17:11	uh with the joint likelihood estimator we uh we uh
0:17:15	run the uh
0:17:18	i'll go them with time components
0:17:19	we get time components as we so in the neck
0:17:22	in the previous slide
0:17:23	with D margin like to the estimator we have six
0:17:27	a non-zero columns and four
0:17:29	are uh
0:17:31	pruned out uh
0:17:32	uh and the these
0:17:34	components
0:17:39	oh
0:17:41	i
0:17:48	he's not once uh corresponding the individual not actually so the first
0:17:54	for R D
0:18:11	we have a another a points for the hammer ons uh the it's range and of D P a piano
0:18:17	and we have another one for the noise so and
0:18:21	uh
0:18:23	so
0:18:23	since we observe this uh pruning effect to get this uh
0:18:28	i F more B V uh worked on it
0:18:30	a a data set we are we we know the uh
0:18:35	uh
0:18:36	a a dictionary big and the ground truth of the dictionary here is not so we we
0:18:40	a
0:18:41	did the same experiments on this they to set so these are
0:18:44	uh
0:18:47	rough sketches of is to remote which has for a uh
0:18:50	joints and each a joint has uh
0:18:53	you can have a a
0:18:55	for angle so we have a a dictionary of uh sixteen here
0:18:59	and this is the ground truth
0:19:00	so we
0:19:02	performed from these same experiments on this dataset and with the uh a joint likelihood estimator we have
0:19:08	for example here ten
0:19:10	non-zero components and we have some
0:19:12	a because of the dictionary
0:19:15	but as with a maximum
0:19:18	martin like load estimator we have a sixteen
0:19:21	a different components and we have for uh
0:19:24	uh
0:19:25	pruned uh
0:19:26	dig dictionary is the
0:19:28	uh is or the through model is at both component here because it i it is cool X is than
0:19:34	in all of the uh
0:19:36	uh
0:19:36	all all of the data so uh
0:19:39	it will be
0:19:40	so it is that taste to the right
0:19:43	a a leg all this to matt and so with just dictionary it's
0:19:47	a a in all of the
0:19:48	reconstructions as well
0:19:51	so as conclusions
0:19:52	uh in this work we tried to compare uh
0:19:55	uh_huh the dictionary real obtain with uh
0:19:58	two different estimators the joint likelihood estimator is
0:20:02	uh core corresponds to the uh
0:20:04	standard nmf met does and marginal likelihood is that uh
0:20:09	then we learn it on the marginal models
0:20:12	uh uh actually and we set the a
0:20:15	uh
0:20:16	compound numbers to to a to the optimal number
0:20:19	these two uh uh estimate as um
0:20:21	power from similarly but
0:20:23	we uh set at high number for K the marginal likelihood estimator are uh
0:20:28	automatic to the
0:20:30	uh uh relevant uh
0:20:32	columns of the dictionary uh
0:20:35	so it does uh and automatic of mother or order selection
0:20:39	and this way of doing model order selection is uh
0:20:44	more efficient than for example a leading
0:20:47	the evidence of data on different number different a values of K
0:20:53	and selecting the best model
0:20:55	uh so in this case we we only select that
0:20:58	large enough to really you for uh K and B you get the uh
0:21:03	uh
0:21:05	affective teaching
0:21:06	thank
0:21:30	i can understand this
0:21:40	uh
0:21:46	actually be all of these uh expense we uh we we have it
0:21:51	uh
0:21:52	but lower bound which is tight
0:21:54	uh we we
0:21:55	uh compared it to other uh
0:21:58	estimates as well
0:22:00	uh a and like that that's that's so uh i have never seen
0:22:04	a a a case where we we couldn't uh
0:22:07	i will like to
0:22:09	uh
0:22:10	uh with enough
0:22:14	maybe to this should be in a ski
0:22:26	oh
0:22:37	was a much like to door
0:22:40	uh
0:22:41	so in those works uh no no no one uh does that
0:22:45	marginal likelihood
0:22:46	uh
0:22:47	uh estimation
0:22:49	so this is the first work
0:22:50	and we compared it to other methods so
0:22:53	and we are pretty sure that
0:22:55	the estimation is

MAXIMUM MARGINAL LIKELIHOOD ESTIMATION FOR NONNEGATIVE DICTIONARY LEARNING

Non-negative Tensor Factorization and Blind Separation

Presented by: Onur Dikmen, Author(s): Onur Dikmen, Cédric Févotte, CNRS LTCI / Télécom ParisTech, France