Hello, I am from Télécom ParisTech in Paris, and this is joint work with Cédric Févotte. The title of the presentation is "Maximum Marginal Likelihood Estimation for Nonnegative Dictionary Learning".

The outline of the talk is as follows. First I will give the problem definition, which is learning the dictionary parameters of the NMF problem from the marginal model. Then I will describe our model, in which we define a prior distribution for the activation coefficients. Then I will describe the two estimators we apply on this model: the first is the maximum joint likelihood estimator, which corresponds to the standard NMF estimates on the model that we use, and the other is the maximum marginal likelihood estimator, which tries to learn the dictionary from the marginal model. Then I will give results on two data sets, one real and one synthetic, where we compare the dictionaries obtained with the two estimators, and finally I will conclude.

As everyone here knows, in this problem we try to approximate a nonnegative matrix V as the product of two other nonnegative matrices, W and H. W can be seen as the dictionary parameters and H as the activation parameters. We do this approximation by minimizing a divergence between the two matrices, and this divergence can be many things, such as the Kullback-Leibler divergence or the Euclidean distance; the generalized Kullback-Leibler divergence is the one shown here.

What we want to learn is W, the dictionary, and the size of this dictionary is constant, whereas H, the expansion coefficients, is a matrix whose size grows as we observe more samples. The samples are stored as column vectors, and V is of size F by N, so we have F features and the data is represented by N samples; as we add samples, H gets larger.

For this problem we have efficient algorithms based on majorization-minimization, that is, optimization through auxiliary functions. These give multiplicative update rules which converge fast and which ensure nonnegativity (a small sketch of these updates follows below). The same update rules can also be obtained by an expectation-maximization algorithm on a statistical model: minimizing the KL divergence corresponds to maximum likelihood estimation under a Poisson observation model.

So our problem is that as we get more samples we have more parameters to learn, because the size of H also increases, while the size of W stays constant, independent of the data. The maximum likelihood estimator, for example, is a consistent estimator, meaning that if you observe an infinite number of samples, the estimates you get coincide with the true parameters. But since in this case more samples also mean more parameters, this consistency is no longer assured.
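To make the standard baseline concrete, here is a minimal NumPy sketch of the multiplicative updates for NMF under the generalized KL divergence, as referred to above. It is an illustrative implementation under the usual conventions (V is F x N, W is F x K, H is K x N), not the authors' own code; the function name, random initialization, and the small constant eps are my choices.

```python
import numpy as np

def kl_nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative-update NMF minimizing the generalized KL divergence
    D(V || WH) = sum_{f,n} [ V log(V / (WH)) - V + WH ]."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps          # dictionary, fixed size F x K
    H = rng.random((K, N)) + eps          # activations, grow with the data (K x N)
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)   # H update
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)   # W update
    return W, H
```

Each update is guaranteed not to increase the divergence and keeps both factors nonnegative, which is exactly the majorization-minimization behaviour mentioned above.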
Our aim here is therefore to learn the dictionary parameters from the marginal model, that is, by maximizing the marginal likelihood of the dictionary. This marginal likelihood is obtained by integrating the activation parameters out of the model: this is the joint probability of V and H, we define a prior distribution on H, and we marginalize the model by integrating out H, which gives the marginal likelihood here. Note that this is not the evidence of the data, it is still conditional on the dictionary. Of course this integration is not tractable, and we will be using a variational approximation to approximate this marginal likelihood.

So the model we have is the following; here I am presenting the statistical model with the Poisson observation model, on which maximum likelihood estimation is equivalent to minimizing the KL divergence between the two matrices. A single element of the observation matrix is Poisson distributed with this excitation parameter, and we can introduce latent variables C which sum to the observation; these are also Poisson distributed, with the individual summands here as their intensities. On this model, if we integrate out the latent variables C, we recover the original observation model. Introducing the C variables enables us to run an EM algorithm on this model: the posterior distribution of the C variables has a standard form, it is multinomial, and since we have the exact posterior we can compute the expectation in the E-step of the EM algorithm, which leads us to the multiplicative update rules I mentioned.

So the model is pretty standard, but we also introduce a prior distribution on H, a Gamma distribution with shape parameter alpha and scale parameter beta. The Gamma distribution is conjugate for the Poisson observation model, and we use such a prior so that our algorithms are easier to derive. In this model the conditional distribution of each H variable is also Gamma distributed. We do not put a prior distribution on W: W is a deterministic variable and we are interested in point estimates of it.

The first estimator is the maximum joint likelihood estimator. Here, as in the standard NMF problem, we try to maximize the joint likelihood with respect to W and H, except that we have an additional term for the prior of H. By using auxiliary functions and majorization-minimization we obtain these update rules, which are again quite effective: the terms in black are the standard H update and the prior contributes these extra terms. With alpha greater than or equal to one the update preserves nonnegativity, and if we set alpha to one, that is, a Gamma distribution with shape equal to one, it becomes a purely multiplicative update. So this is just an NMF problem where the only difference is the prior on H, and the update rules for W stay the same; a sketch of this H update is given below.
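As a hedged illustration of what that prior term looks like in practice, here is a small NumPy sketch of one majorization-minimization step on H for the joint criterion, assuming a Poisson/KL data term and independent Gamma(shape alpha, scale beta) priors on the entries of H. The function name and array conventions are my own; the W update is the standard one shown earlier.

```python
import numpy as np

def mjle_update_H(V, W, H, alpha=1.0, beta=1.0, eps=1e-12):
    """One MM step on H for the maximum joint likelihood criterion:
    Poisson/KL data term plus a Gamma(shape=alpha, scale=beta) prior on each h_kn.
    With alpha >= 1 the update stays nonnegative; alpha = 1 makes it purely
    multiplicative, matching the remark in the talk."""
    WH = W @ H + eps
    numer = H * (W.T @ (V / WH)) + (alpha - 1.0)   # standard KL term plus prior shape term
    denom = W.T @ np.ones_like(V) + 1.0 / beta     # sum_f w_fk plus inverse scale
    return numer / (denom + eps)
```

The only change relative to the plain NMF update is the extra (alpha - 1) in the numerator and the 1/beta in the denominator.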
As for the maximum marginal likelihood estimator, as I mentioned, we try to maximize this quantity, but the integral is intractable and we do not have a closed form for it, so we proceed iteratively with an EM algorithm. In the E-step we would compute the expectation of the joint log-distribution under the posterior of all the latent variables in the model, but we do not have a closed form for this posterior either, so exact EM is not possible on this marginal model. We therefore approximate the E-step by using approximate posterior distributions. This can be done in several ways, for instance with variational Bayes or with Markov chain Monte Carlo methods; in this paper we work with variational Bayes.

In variational Bayes we approximate the posterior distribution of the latent variables with variational distributions that have simpler forms. For example, we select a factorized variational distribution like this one: the component variables C and the H variables get independent distributions. For the components we use multinomial distributions, as I said before, since the posterior of the C variables in the original model is multinomial, which makes our calculations easier; for the H variables we use Gamma variational distributions.

We then try to minimize the Kullback-Leibler divergence between the variational distribution we chose and the actual posterior. Of course we do not know the actual posterior, but this term can be expanded as shown: the marginal log-likelihood of the dictionary W plus a Kullback-Leibler term between the variational distribution and the joint distribution. Minimizing the divergence to the posterior with respect to the variational distribution is equivalent to minimizing this second term, because the marginal likelihood is independent of the variational distribution; and since a KL divergence is nonnegative, the negative of this second quantity is a lower bound on the marginal likelihood. The minimization can be carried out with fixed-point equations, because the variational distributions are independent: minimizing the KL divergence with respect to one of them amounts to matching it to the expectation of the log joint under the others, which in practice is done by equating the hyperparameters of these distributions.

So we obtain an approximation of the posterior, given by the variational distributions, and with it an approximate E-step; this time the integral can be computed because the variational distribution is factorized. In the M-step we maximize this quantity with respect to W and we get multiplicative update rules, as with the original model, and we also obtain an estimate of the lower bound on the marginal likelihood. A rough sketch of one such variational EM sweep is given below.
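To make the fixed-point updates more concrete, here is a rough NumPy/SciPy sketch of one variational EM sweep for this Poisson model with a Gamma(shape alpha, scale beta) prior on H and a deterministic W. The parameterization (Gamma q with shape a and rate b, multinomial q for C handled implicitly through its expected counts) is my reading of the talk, not the authors' reference code, and the variable names are assumptions.

```python
import numpy as np
from scipy.special import digamma

def vb_em_sweep(V, W, a, b, alpha=1.0, beta=1.0, eps=1e-12):
    """One variational EM sweep: q(h_kn) = Gamma(shape a_kn, rate b_kn),
    q(C) multinomial (used only through its expected counts), W deterministic.
    Returns updated (W, a, b); an illustrative sketch only."""
    # Variational E-step: geometric expectation exp(E_q[log h_kn]) = exp(digamma(a)) / b
    H_geo = np.exp(digamma(a)) / (b + eps)
    WHg = W @ H_geo + eps
    # Expected latent counts E[c_fkn] = v_fn * w_fk h_geo_kn / (W H_geo)_fn,
    # summed over f, update the Gamma shape; the Poisson rate and the prior
    # inverse scale update the Gamma rate.
    a_new = alpha + H_geo * (W.T @ (V / WHg))
    b_new = 1.0 / beta + W.T @ np.ones_like(V)
    # M-step: point update of W from the expected complete-data log-likelihood,
    # which again has a multiplicative form.
    H_mean = a_new / b_new
    W_new = W * ((V / WHg) @ H_geo.T) / (np.ones_like(V) @ H_mean.T + eps)
    return W_new, a_new, b_new
```

In practice one iterates such sweeps and monitors the variational lower bound; columns of W whose activations collapse toward zero are the ones that get pruned in the experiments described next.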
There is some related work: the model we use here was used by Canny for text classification purposes, it is a Gamma-Poisson model, and there has been some work on this model, including variational approximations. But let me move on to the experiments.

First we tried to learn a dictionary on a time-frequency representation of a piano excerpt. We have a fifteen-second-long piano recording and we computed its spectrogram. In this signal four notes are playing, and combinations of these four notes are played seven times.

We first compared the behaviour of the two estimators as we increase the number of components. With the joint likelihood estimator, as we increase the number of components the joint likelihood keeps on increasing, and even if we set the component number to a hundred we still get a higher joint likelihood. We did the same experiment with the marginal likelihood estimator: at first the marginal likelihood increases, and after some point it stagnates. Since what we compute is a lower bound, we cannot be sure a priori that it is tight, so we compared it with other marginal likelihood estimation techniques and we saw that the bound is indeed tight. Beyond six components we also observe that some of the columns of the dictionary converge to zero, so the dictionary is pruned.

With the joint likelihood estimator, when we run the algorithm with ten components we get ten active components, whereas, as we saw on the previous slide, with the marginal likelihood estimator we have six nonzero columns and four are pruned out. These six components do not all correspond to individual notes: the first four are the four notes, there is another component for the sound of the hammers and strings of the piano, and another one for the noise.

Since we observed this pruning effect on this data set, where the ground-truth dictionary is not known, we also worked on a data set where we do know the ground truth, and we ran the same experiments there. These are rough sketches of a swimmer, which has four limbs, and each limb can take four angles, so the ground-truth dictionary has sixteen components, shown here. With the joint likelihood estimator we get, for example, ten nonzero components here, and the recovered dictionary contains some artifacts, whereas with the maximum marginal likelihood estimator we get sixteen different components and four pruned dictionary columns. The torso is an ambiguous component here, because it is present in all of the data, so it ends up attached to the right leg in the estimated dictionary, and consequently it appears in all of the reconstructions as well.

So, as conclusions, in this work we compared the dictionaries obtained with two different estimators: the joint likelihood estimator, which corresponds to the standard NMF methods, and the marginal likelihood estimator, where we learn the dictionary on the marginal model.
When we set the number of components to the optimal number, the two estimators perform similarly, but when we set a high number for K, the marginal likelihood estimator automatically prunes the irrelevant columns of the dictionary, so it performs automatic model order selection. This way of doing model order selection is more efficient than, for example, evaluating the evidence of the data for several different values of K and selecting the best model; here we only have to pick a large enough value for K and we obtain the effective dictionary. Thank you.

Actually, in all of these experiments we have a lower bound that is tight; we compared it to other estimators as well, and I have never seen a case where we could not estimate the marginal likelihood well enough. In those works no one does marginal likelihood estimation, so this is the first such work; we compared it to other methods, and we are pretty sure that the estimation is accurate.