0:00:17hello a one uh i'm a thing and uh i'm a structure that at telecom partake take i in paris
0:00:24and this the joint work with uh sit able
0:00:27uh the to of the uh a presentation is uh maximum likelihood a much margin like to the estimation for
0:00:33nonnegative dictionary
0:00:36uh the outline of the talk will be as follows first i'll you the uh
0:00:41problem definition and that is a learning the a dictionary parameters in the nmf problem from the marginal uh model
0:00:49uh then i'll uh describe our model uh in which we it defined a prior for the uh
0:00:56prior distribution for D activation code
0:00:59uh than i'll describe to estimators uh or
0:01:03we we which we apply on this model uh first one is the mike uh
0:01:07max some joint likelihood uh estimator uh we which is which corresponds to the a a a standard uh nmf
0:01:14F uh
0:01:15uh that
0:01:16estimates estimates are uh on the uh model that we use
0:01:19and the other one is uh maximum marginal likelihood estimator are uh maybe which tries to learn the dictionary from
0:01:26the uh marginal model
0:01:29then i would be a a Q results uh
0:01:32on two data sets the first one is a a real data set and the other one is a synthetic
0:01:36of set
0:01:37uh we will compare to dictionaries of your thing uh
0:01:41uh which used to estimate as down i'll conclude
0:01:45uh so as of now uh everyone one knows them about uh the problem we have a we we try
0:01:50to uh
0:01:52i approximate uh and non-negative matrix uh re a as a product of two other a non-negative matrix is W
0:01:59and H
0:02:00here double use uh
0:02:02i can be seen as a dictionary parameters and H a matrix is the activation uh
0:02:08i parameters
0:02:10uh we try to do this approximation uh a a minimum minimizing at a die just uh
0:02:16between these two matrices
0:02:18and this type or just matrix can be anything uh like kullback-leibler divergence or or cross like to die versions
0:02:24or euclidean distance
0:02:26uh the generalized kullback-leibler uh
0:02:29dive urgency uh is given like this
0:02:34you what we want to learn W uh which is the dictionary uh and it's uh
0:02:40the size of uh this dictionary constant while the uh H the expansion coefficients
0:02:46at the size of this matrix increases as we it all are more samples
0:02:51uh so uh the samples are in call as column vectors and uh B is of size F by and
0:02:59and it a we have a F features and uh the data is represented by and so we as we
0:03:06uh increase samples uh we have
0:03:08uh the i and get
0:03:10a a larger
0:03:11so on this problem we have uh efficient algorithms which depend on make majorization unionisation uh which is a
0:03:20uh optimization by uh also that of functions
0:03:23uh so we have a a multiplicative update rules for this uh
0:03:28uh a problem which i which converts fast and it which it sure uh non negativity
0:03:34uh the same update rules can be also a also a paint by uh
0:03:39expectation maximisation algorithm on a that's sickle uh model uh low
0:03:45a a the kl about minimizing the yeah that versions corresponds to you
0:03:49a maximum likelihood on reports an observation model
0:03:55so our uh
0:03:57problem is a
0:04:00uh we we
0:04:02have more samples
0:04:03we V
0:04:04have more parameters to learn
0:04:06because the age uh the size of H also increases
0:04:10but this size of the would you is a constant a a uh independent of the data
0:04:15so uh for example uh a maximum like likelihood uh
0:04:19estimator is a consistent estimator uh
0:04:22that means
0:04:24if you use all sort
0:04:26uh infinite a number of samples your or uh
0:04:31the is
0:04:32estimates you get are are uh are the same as the true parameters
0:04:37since in this case as we also uh
0:04:40a more samples we also have
0:04:42more parameters
0:04:43done this it consistency is not to show
0:04:48are are he here is to uh
0:04:50no the dictionary parameters from the marginal model
0:04:54the uh from the uh but by maximizing the uh margin like little of the
0:05:00uh a dictionary terms
0:05:02and this can be obtained by uh integrating out think i'll the A activation parameters from the model
0:05:08so that's why we
0:05:10uh so this is the joint
0:05:14probability of
0:05:16B and H so we we define it uh a prior distribution on H and marginalise a we try to
0:05:23a my the marginal lies
0:05:25a the model by integrating out H so the
0:05:28uh this
0:05:30is the marginal likelihood here uh the I so this is not if if it don't of data
0:05:35uh so this is
0:05:36still conditional on a dictionary
0:05:39uh but of course this uh integration is not tractable uh and we will be uh
0:05:44using variational approximation to
0:05:47uh approximate this
0:05:48margin like
0:05:51so the model we have is i
0:05:54oh uh uh here i'm going uh i'm presenting the us that's to model uh
0:05:59with the on of the racial model and uh did this is actually equal to
0:06:04uh minimize a so uh
0:06:07maximum likelihood to the estimation on this model is uh equal to minimizing uh tick K the divergence between to
0:06:13used to meet assist
0:06:15so our observation
0:06:17uh uh at the or at a single element of the observation matrix is a on distributed with the
0:06:23with this
0:06:25excitation parameters
0:06:27and we we can i introduce some latent variables C which are
0:06:33sound to be uh
0:06:34the observation model
0:06:36and uh
0:06:37the D these are also pause on uh
0:06:41distributed with uh all these individuals some three
0:06:44i've here
0:06:45so on this model if we integrate out this latent variables C every also be in this
0:06:50marginal model
0:06:52i have so introducing this C variables uh
0:06:56uh enables us to
0:06:57to uh
0:06:58have an em got someone this model
0:07:01the uh posterior distribution of these C variables uh
0:07:06has the standard font there it there a multinomial distributed
0:07:09and uh since we have the posterior exact posterior of D distribution we can uh
0:07:16uh we can calculate the uh
0:07:19expectation in the e-step of the em algorithm uh and we will lead uh that the lead us uh to
0:07:25the uh
0:07:28multiplicative updates rules that i've mentioned
0:07:31so uh
0:07:32the model is a a pretty standard but a we we also introduce a
0:07:36prior distribution on H and this is a common distribution with scale parameters off five and shape shape parameters are
0:07:43fine scale it just be to
0:07:45so gamma distribution it is kinda to get for the uh a some of the racial models and this this
0:07:50gives us a
0:07:53uh so we we use such a prior is so that are are go to algorithms will be uh easier
0:07:59to saying
0:08:00uh so uh a in this
0:08:03uh model uh the conditional approval to use all uh
0:08:08each H really able will be a also got model
0:08:12come gamma distributed and in in our model we we don't have a prior distribution for
0:08:16W this is the W is a deterministic variable and we are interested in pointed
0:08:24so be the first estimator is maximum joint likelihood estimator
0:08:30uh so here we uh
0:08:32i i as we do in the standard nmf problem we we try to uh
0:08:38maximise the uh
0:08:40joint likelihood with just back to W and H
0:08:42here uh we have another term for the prior
0:08:46of H
0:08:47and uh we we can
0:08:49by using all of the data functions uh
0:08:52uh and major majorization minimization minimisation we can get this update rules which are uh
0:08:57again again pretty uh
0:09:00uh effective
0:09:01so the black the terms in black are the standard H update
0:09:05and the prior gives us these terms
0:09:09uh and with uh i'll for a greater done or cool to one we we have a a non negativity
0:09:15it should here uh and you fee set for to one
0:09:18so that's it
0:09:19a gamma distribution with skip and to one
0:09:23then it this becomes a multiplicative update
0:09:27so the uh so this is
0:09:30and nmf problem
0:09:32only difference is we have a prior on H
0:09:35uh and the update rules for W are the say
0:09:39so as the
0:09:42uh as the uh maximum marginal likelihood estimator
0:09:46so we as i mentioned that we we try to uh maximise disk want to but this quite this integral
0:09:52is intractable and
0:09:53uh we we don't have a form for this
0:09:55so we you want to uh go with the em algorithm
0:09:59uh a in an iterative a
0:10:02so in the yeah my and uh
0:10:05the e-step we we try to uh calculate this uh
0:10:09expectation of the uh a joint
0:10:12uh distribution on there uh the posterior distributions of all the latent variables we have in the model
0:10:18uh but
0:10:21we you don't have a form for this posterior as well so we can to exact em uh uh on
0:10:26this smart model as well
0:10:28so we uh we try to approximate this uh
0:10:32e-step step of the em algorithm with
0:10:34by using approximate a the sri uh
0:10:38this uh approximate a posterior distributions
0:10:41and we will uh we we can do this in a several ways and uh like we uh beige a
0:10:47as uh like variational bayes or more want to call low mark mark chain that matter
0:10:55in this paper we we uh work with uh a variational bayes
0:10:59so we really show ways
0:11:02we try to approximate to be a posterior distribution of the latent variables using
0:11:06some variational distributions which have simpler fonts uh
0:11:10for example we selected uh
0:11:13uh a factor as a well uh
0:11:16variational distribution like this so for each
0:11:19uh component variables and H variables we have a uh
0:11:24independent independent of distributions and for comp uh for
0:11:28components we have a multinomial distributions as i said before uh
0:11:33uh do this
0:11:34uh makes our uh so since T posterior is C variables and on the original model are uh a multinomial
0:11:41will make our a calculations is easier
0:11:43and for the a H variables we have
0:11:47a uh variational distributions so
0:11:49we try to minimize the
0:11:52could like the divergence between to variational a distributions we set
0:11:56and the actual posterior but uh of course we don't know of the actual posterior but this time can be
0:12:01expanded like
0:12:03uh this uh the margin like build of the uh
0:12:07dictionary uh a just W and
0:12:10the kullback-leibler divergence between
0:12:13the variational distributions and the joint uh distribution
0:12:20uh minimizing this also uh
0:12:23which is back to the uh a variational distributions uh is also people to
0:12:29minimizing this because this is a independent of the uh a variational distribution
0:12:35uh and since this the it is
0:12:38uh greater than zero uh
0:12:41so this
0:12:43a the this edition is greater than zero so i
0:12:46this the negative of this quantity here is that lower bound on the marginal likelihood
0:12:51because this will be greater than signal
0:12:55uh minimizing the codec libel live buttons here
0:12:58which corresponds to minimizing this
0:13:01can be done with these fixed point point creations because we have a
0:13:06uh independent uh distributions for variational distributions
0:13:11this could like the diver which respect to one of these distributions here
0:13:16corresponds to
0:13:19well uh
0:13:20minimizing minimising but like that i've buttons in that racial distribution and this expectation we have so this is done
0:13:26by basically
0:13:27by uh
0:13:29equalising the uh hyper parameters of these distributions
0:13:36we have
0:13:37and approximate for the uh posterior distributions which is given as the variational distributions
0:13:42so we we have and approximate is that like this
0:13:46here this integral can be calculated this time
0:13:49uh because we have a a factor as well uh
0:13:53variational distribution here
0:13:59as the end step one we a maximise a
0:14:02this quantity which the spectra W we get some uh multiplicative updates
0:14:07update rules
0:14:08as we had a uh with the original model
0:14:15and we also have a a a a and estimation for uh for the a
0:14:19a a a lower bound for the marginal like
0:14:24so uh
0:14:26uh that are some related works uh
0:14:30with this work uh
0:14:32so the model we uh
0:14:34use here is uh
0:14:36used by uh
0:14:38can he uh for it for text classification purposes uh
0:14:42is a gamma plus some model uh
0:14:44so we introduce uh the gamma
0:14:51it's the
0:14:54so on this model that has been some work and that has been some variational approximations uh
0:15:02maybe i can just get to uh experiments so first three a we tried to
0:15:08uh learn the dictionary uh
0:15:11the J used for all my uh
0:15:14a time-frequency representation of that piano excerpt
0:15:18so that we we have a a fifteen second uh
0:15:21long uh piano excerpt and we use to make them to uh a spectrum
0:15:27uh and in this uh signal we have a four four notes playing a the combinations of four notes are
0:15:33playing for uh
0:15:34it's seven times so
0:15:51so for three we compared the uh
0:15:54behaviours because of these two estimators uh
0:15:57as we increase the number of components
0:16:00with the joint likelihood estimator as we increase the number of are components
0:16:04the joint like bill
0:16:06keeps on increasing and if you set
0:16:08uh the the compound number two hundred we still will have a a a a a a high are uh
0:16:14joint like to uh
0:16:16uh we have you
0:16:17well we did
0:16:18do the same experiment with the marginal likelihood uh estimator
0:16:24first first uh
0:16:25much like build increases and after some point
0:16:28i it it starts increasing it's
0:16:30stagnate nights
0:16:34and uh
0:16:35since we have a lower bound here uh we we we cannot be sure about uh whether this lower bound
0:16:42is tight three we compared this lower bound with uh other marginal like loop
0:16:48estimation techniques and we we so that at this uh lower bound is tight
0:16:55up there six
0:16:56a components here uh we also are that some of the uh
0:17:01columns of the dictionary a
0:17:04a a a a a a a and converge to zero so they is that the dictionary is thing i
0:17:10uh with
0:17:11uh with the joint likelihood estimator we uh we uh
0:17:15run the uh
0:17:18i'll go them with time components
0:17:19we get time components as we so in the neck
0:17:22in the previous slide
0:17:23with D margin like to the estimator we have six
0:17:27a non-zero columns and four
0:17:29are uh
0:17:31pruned out uh
0:17:32uh and the these
0:17:48he's not once uh corresponding the individual not actually so the first
0:17:54for R D
0:18:11we have a another a points for the hammer ons uh the it's range and of D P a piano
0:18:17and we have another one for the noise so and
0:18:23since we observe this uh pruning effect to get this uh
0:18:28i F more B V uh worked on it
0:18:30a a data set we are we we know the uh
0:18:36a a dictionary big and the ground truth of the dictionary here is not so we we
0:18:41did the same experiments on this they to set so these are
0:18:47rough sketches of is to remote which has for a uh
0:18:50joints and each a joint has uh
0:18:53you can have a a
0:18:55for angle so we have a a dictionary of uh sixteen here
0:18:59and this is the ground truth
0:19:00so we
0:19:02performed from these same experiments on this dataset and with the uh a joint likelihood estimator we have
0:19:08for example here ten
0:19:10non-zero components and we have some
0:19:12a because of the dictionary
0:19:15but as with a maximum
0:19:18martin like load estimator we have a sixteen
0:19:21a different components and we have for uh
0:19:25pruned uh
0:19:26dig dictionary is the
0:19:28uh is or the through model is at both component here because it i it is cool X is than
0:19:34in all of the uh
0:19:36all all of the data so uh
0:19:39it will be
0:19:40so it is that taste to the right
0:19:43a a leg all this to matt and so with just dictionary it's
0:19:47a a in all of the
0:19:48reconstructions as well
0:19:51so as conclusions
0:19:52uh in this work we tried to compare uh
0:19:55uh_huh the dictionary real obtain with uh
0:19:58two different estimators the joint likelihood estimator is
0:20:02uh core corresponds to the uh
0:20:04standard nmf met does and marginal likelihood is that uh
0:20:09then we learn it on the marginal models
0:20:12uh uh actually and we set the a
0:20:16compound numbers to to a to the optimal number
0:20:19these two uh uh estimate as um
0:20:21power from similarly but
0:20:23we uh set at high number for K the marginal likelihood estimator are uh
0:20:28automatic to the
0:20:30uh uh relevant uh
0:20:32columns of the dictionary uh
0:20:35so it does uh and automatic of mother or order selection
0:20:39and this way of doing model order selection is uh
0:20:44more efficient than for example a leading
0:20:47the evidence of data on different number different a values of K
0:20:53and selecting the best model
0:20:55uh so in this case we we only select that
0:20:58large enough to really you for uh K and B you get the uh
0:21:05affective teaching
0:21:30i can understand this
0:21:46actually be all of these uh expense we uh we we have it
0:21:52but lower bound which is tight
0:21:54uh we we
0:21:55uh compared it to other uh
0:21:58estimates as well
0:22:00uh a and like that that's that's so uh i have never seen
0:22:04a a a case where we we couldn't uh
0:22:07i will like to
0:22:10uh with enough
0:22:14maybe to this should be in a ski
0:22:37was a much like to door
0:22:41so in those works uh no no no one uh does that
0:22:45marginal likelihood
0:22:47uh estimation
0:22:49so this is the first work
0:22:50and we compared it to other methods so
0:22:53and we are pretty sure that
0:22:55the estimation is