
Hello everyone. I'm a researcher at Télécom ParisTech in Paris, and this is joint work with my colleague. The title of the presentation is "Maximum marginal likelihood estimation for nonnegative dictionary learning".

The outline of the talk will be as follows. First I'll give the problem definition, which is learning the dictionary parameters of the NMF problem from the marginal model. Then I'll describe our model, in which we define a prior distribution for the activation coefficients. Then I'll describe the two estimators which we apply on this model: the first one is the maximum joint likelihood estimator, which corresponds to standard NMF estimation on the model that we use, and the other one is the maximum marginal likelihood estimator, which tries to learn the dictionary from the marginal model. Then I will show some results on two datasets, the first one a real dataset and the other one a synthetic dataset; we will compare the dictionaries obtained with the two estimators, and then I'll conclude.

So, as everyone here knows, in this problem we try to approximate a nonnegative matrix V as the product of two other nonnegative matrices, W and H. Here W can be seen as the dictionary parameters, and the H matrix holds the activation parameters. We do this approximation by minimizing a divergence between these two matrices, and this divergence can be anything, for example the Kullback-Leibler divergence or the Euclidean distance.

The generalized Kullback-Leibler divergence is given by D(V | WH) = sum over f and n of V_fn log(V_fn / [WH]_fn) - V_fn + [WH]_fn.

So, what we want to learn is W, which is the dictionary; the size of this dictionary is constant, while for H, the expansion coefficients, the size of the matrix increases as we observe more samples. The samples are encoded as column vectors, and V is of size F by N: we have F features, and the data is represented by N samples, so as we add more samples, H gets larger.

For this problem we have efficient algorithms which depend on majorization-minimization, which is optimization via auxiliary functions. We have multiplicative update rules for this problem, which converge fast and which ensure nonnegativity. The same update rules can also be obtained by an expectation-maximization algorithm on a statistical model: minimizing the generalized KL divergence corresponds to maximum likelihood under a Poisson observation model.
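As a concrete illustration, the multiplicative KL-NMF updates can be sketched in a few lines of NumPy (my own minimal sketch, not the authors' code; the function name, initialization, and iteration count are assumptions):

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative updates minimizing the generalized KL divergence
    D(V || WH), as obtained by majorization-minimization (equivalently,
    EM under a Poisson observation model)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps   # nonnegative random initialization
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # update H: numerator W^T (V / WH), denominator column sums of W
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # update W: numerator (V / WH) H^T, denominator row sums of H
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

Because each factor stays nonnegative and each update comes from an auxiliary (majorizing) function, the divergence is non-increasing over iterations.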

So our problem is that as we have more samples, we have more parameters to learn, because the size of H also increases, whereas the size of the dictionary is constant, independent of the data.

For example, the maximum likelihood estimator is a consistent estimator; that means if you observe an infinite number of samples, the estimates you get are the same as the true parameters. But since in this case, as we observe more samples, we also have more parameters, this consistency is not assured.

So our idea here is to learn the dictionary parameters from the marginal model, by maximizing the marginal likelihood with respect to the dictionary; this can be obtained by integrating out the activation parameters from the model.

So this is the joint probability of V and H: we define a prior distribution on H, and we try to marginalize the model by integrating out H. This is the marginal likelihood here; it no longer depends on H, but it is still conditional on the dictionary. But of course this integration is not tractable, and we will be using a variational approximation to approximate this marginal likelihood.
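In symbols, the marginalization just described reads (my notation, reconstructed from the description rather than copied from the slides):

```latex
% Joint distribution of the data V and the activations H, given the dictionary W
p(V, H \mid W) \;=\; p(V \mid W, H)\, p(H)
% Marginal likelihood of the dictionary: integrate out the activations
p(V \mid W) \;=\; \int p(V \mid W, H)\, p(H)\, \mathrm{d}H
```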

So the model we have is as follows; here I'm presenting the statistical model corresponding to the Poisson observation model, and maximum likelihood estimation on this model is equal to minimizing the generalized KL divergence between the two matrices. A single element of the observation matrix is Poisson distributed, with the corresponding element of WH as its intensity parameter.

We can introduce some latent variables C, which sum to the observation, and these are also Poisson distributed, with the individual summands as their intensities. On this model, if we integrate out these latent variables C, we obtain the same marginal model.

Introducing these C variables enables us to have an EM algorithm on this model. The posterior distribution of these C variables has a standard form: it is multinomial. And since we have the exact posterior distribution, we can calculate the expectation in the E-step of the EM algorithm, and that will lead us to the multiplicative update rules that I mentioned.
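The superposition model and its multinomial posterior, as just described, can be written as:

```latex
% Each observation is a sum of K latent Poisson sources
v_{fn} \;=\; \sum_{k=1}^{K} c_{fkn},
\qquad c_{fkn} \sim \mathcal{P}\!\left(w_{fk}\, h_{kn}\right)
% Posterior of the sources given the observation: multinomial
p\!\left(c_{f1n},\dots,c_{fKn} \mid v_{fn}, W, H\right)
\;=\; \mathcal{M}\!\left(v_{fn};\; \frac{w_{f1}h_{1n}}{[WH]_{fn}},\dots,\frac{w_{fK}h_{Kn}}{[WH]_{fn}}\right)
```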

So the model is pretty standard, but we also introduce a prior distribution on H, and this is a gamma distribution with shape parameter α and scale parameter β. The gamma distribution is conjugate for such Poisson superposition models, and we use such a prior so that our algorithms will be easier to design.

In this model, the conditional distribution of each H variable will also be gamma distributed. In our model we don't have a prior distribution for W: W is a deterministic variable, and we are interested in point estimates of it.

So the first estimator is the maximum joint likelihood estimator. Here, as we do in the standard NMF problem, we try to maximize the joint likelihood with respect to W and H; the difference is that here we have an additional term for the prior of H.

By using auxiliary functions and majorization-minimization, we can get these update rules, which are again pretty effective. The terms in black are the standard H update, and the prior gives us these additional terms. With α greater than or equal to one, nonnegativity is ensured here, and if we set α to one, that is, a gamma distribution with shape equal to one, then this becomes a multiplicative update.

So this is an NMF problem; the only difference is that we have a prior on H, and the update rules for W are the same.
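The augmented H update just described can be sketched as follows (my reconstruction of an MM-style MAP update under a Gamma(shape α, scale β) prior, not the exact expressions from the slides): the standard multiplicative numerator and denominator pick up the prior terms (α − 1) and 1/β, so α ≥ 1 keeps the update nonnegative and α = 1 makes it exactly multiplicative.

```python
import numpy as np

def mjle_update_H(V, W, H, alpha=1.2, beta=1.0, eps=1e-10):
    """One MAP update of H for KL-NMF with a Gamma(alpha, beta) prior
    (shape/scale). Hypothetical sketch: the standard KL update is
    augmented by (alpha - 1) in the numerator and 1/beta in the
    denominator, both coming from the log-prior."""
    WH = W @ H + eps
    num = (alpha - 1.0) + H * (W.T @ (V / WH))    # prior term + standard term
    den = W.sum(axis=0)[:, None] + 1.0 / beta     # standard term + prior term
    return num / den
```

With alpha equal to one and a very large scale beta, the prior terms vanish and the update reduces to the standard multiplicative H update.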

Next, the maximum marginal likelihood estimator. As I mentioned, we try to maximize this quantity, but this integral is intractable and we don't have a closed form for it, so we want to go with the EM algorithm, in an iterative way.

In the E-step of the EM algorithm, we try to calculate the expectation of the joint distribution under the posterior distributions of all the latent variables we have in the model. But we don't have a closed form for this posterior either, so we cannot do exact EM on this marginal model.

So we try to approximate the E-step of the EM algorithm by using approximate posterior distributions. We can do this in several ways, like variational Bayes or Markov chain Monte Carlo methods; in this paper we work with variational Bayes.

We try to approximate the posterior distribution of the latent variables using some variational distributions which have simpler forms. For example, we selected a factorized variational distribution like this: for each of the component variables C and the H variables we have independent distributions. For the components we have multinomial distributions, as I said before; since the posteriors of the C variables in the original model are multinomial, this makes our calculations easier. And for the H variables we have gamma variational distributions.

We try to minimize the Kullback-Leibler divergence between the variational distributions we set and the actual posterior. Of course we don't know the actual posterior, but this term can be expanded as the marginal log-likelihood of the dictionary W plus the Kullback-Leibler divergence between the variational distributions and the joint distribution.

Minimizing this with respect to the variational distributions is thus equal to minimizing this second term, because the first term is independent of the variational distributions. And since a KL divergence is greater than or equal to zero, the negative of this second quantity is a lower bound on the marginal likelihood. So minimizing the Kullback-Leibler divergence here, which corresponds to maximizing that lower bound, can be done with fixed-point equations, because we have independent variational distributions.
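The bound decomposition just described can be written as:

```latex
% Decomposition of the marginal log-likelihood
\log p(V \mid W)
\;=\; \underbrace{\mathbb{E}_{q}\!\left[\log \frac{p(V, C, H \mid W)}{q(C, H)}\right]}_{\text{lower bound}}
\;+\; \mathrm{KL}\!\left(q(C, H)\,\middle\|\,p(C, H \mid V, W)\right)
% KL >= 0, so the first term is a lower bound on log p(V | W)
```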

The Kullback-Leibler divergence with respect to one of these distributions corresponds to the divergence between that variational distribution and an expectation of the joint model under the other factors, so the minimization is basically done by matching the hyperparameters of these distributions.

So we have an approximation for the posterior distribution, which is given by the variational distributions, and we have an approximate E-step like this. This time the integral can be calculated, because we have a factorized variational distribution here.

Then, in the M-step, we maximize this quantity with respect to W, and we get some multiplicative update rules, as we had with the original model. We also obtain an estimate of the lower bound on the marginal likelihood.

There are some related works. The model we use here was used by Canny for text classification purposes; it is a gamma-Poisson model. On this model there has been some work, and there have been some variational approximations.

Maybe I can just get to the experiments. First, we tried to learn the dictionary for the magnitude time-frequency representation of a piano excerpt. We have a fifteen-second-long piano excerpt, and we computed its magnitude spectrogram. In this signal we have four notes playing; combinations of these four notes are played seven times.

First we compared the behaviors of these two estimators as we increase the number of components. With the joint likelihood estimator, as we increase the number of components, the joint likelihood keeps on increasing; if you set the component number to two hundred, you will still get a higher joint likelihood. We did the same experiment with the marginal likelihood estimator: at first the marginal likelihood increases, and after some point it stops increasing and stagnates.

Since we have a lower bound here, we cannot be sure whether this lower bound is tight, so we compared this lower bound with other marginal likelihood estimation techniques, and we saw that this lower bound is tight.

After six components here, we also observe that some of the columns of the dictionary converge to zero, so the dictionary is pruned.

With the joint likelihood estimator, we ran the algorithm with ten components, and we get ten components, as we saw in the previous slide. With the marginal likelihood estimator we have six non-zero columns, and four are pruned out. These six components are meaningful ones: four correspond to the individual notes, another accounts for the hammer strikes of the piano, and another one for the noise.

Since we observed this pruning effect, to study it more carefully we worked on a dataset where the ground truth of the dictionary is known. These are rough sketches of a stick figure which has four limbs, and each limb can take four angles, so we have a dictionary of sixteen elements here, and this is the ground truth.

We performed the same experiments on this dataset. With the joint likelihood estimator we have, for example here, ten non-zero components, and some of them are mixtures of dictionary elements. But with the maximum marginal likelihood estimator we have sixteen distinct components, and four are pruned out. The dictionary we obtain differs from the ground truth in that the torso is present in every component here, because it exists in all of the data samples; so with this dictionary the torso appears in all of the reconstructions as well.

So, as conclusions: in this work we compared the dictionaries obtained with two different estimators. The joint likelihood estimator corresponds to the standard NMF methods, and the marginal likelihood estimator learns the dictionary on the marginal model. When we set the number of components to the optimal number, these two estimators perform similarly, but when we set a high number for K, the marginal likelihood estimator automatically prunes the irrelevant columns of the dictionary, so it performs automatic model order selection. This way of doing model order selection is more efficient than, for example, evaluating the evidence of the data for different values of K and selecting the best model: in our case we only select a value for K that is large enough, and we obtain the effective dictionary.

Thank you.

If I understand the question correctly: actually, in all of these experiments we obtained a lower bound which is tight. We compared it to other estimators as well, and I have never seen a case where we couldn't get a tight bound on the marginal likelihood.

In those works, no one does marginal likelihood estimation, so this is the first work on that; we compared it to other methods, and we are pretty sure that the estimation is accurate.