
0:00:17 | hello everyone, i'm from Telecom ParisTech in Paris |

0:00:24 | and this is joint work with Cédric Févotte |

0:00:27 | the title of the presentation is maximum marginal likelihood estimation for |

0:00:33 | nonnegative dictionary learning |

0:00:36 | the outline of the talk will be as follows: first i'll give the |

0:00:41 | problem definition, that is, learning the dictionary parameters in the NMF problem from the marginal model |

0:00:49 | then i'll describe our model, in which we define a |

0:00:56 | prior distribution for the activation coefficients |

0:00:59 | then i'll describe two estimators |

0:01:03 | which we apply on this model. the first one is the |

0:01:07 | maximum joint likelihood estimator, which corresponds to the standard NMF |

0:01:16 | estimates on the model that we use |

0:01:19 | and the other one is the maximum marginal likelihood estimator, which tries to learn the dictionary from |

0:01:26 | the marginal model |

0:01:29 | then i will give a few results |

0:01:32 | on two data sets: the first one is a real data set and the other one is a synthetic |

0:01:36 | data set |

0:01:37 | we will compare the dictionaries obtained |

0:01:41 | with these two estimators, and then i'll conclude |

0:01:45 | so, as everyone knows, in the NMF problem we try |

0:01:52 | to approximate a nonnegative matrix V as a product of two other nonnegative matrices, W |

0:01:59 | and H |

0:02:00 | here W |

0:02:02 | can be seen as the dictionary parameters, and the H matrix is the activation |

0:02:08 | parameters |

0:02:10 | we do this approximation by minimizing a divergence |

0:02:16 | between these two matrices |

0:02:18 | and this divergence can be anything, like the Kullback-Leibler divergence, or the Itakura-Saito divergence, |

0:02:24 | or the Euclidean distance |

0:02:26 | the generalized Kullback-Leibler |

0:02:29 | divergence is given like this |
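The generalized KL divergence referred to on the slide is, in the usual NMF notation, D(V‖V̂) = Σ_fn (v_fn log(v_fn/v̂_fn) − v_fn + v̂_fn). A minimal NumPy sketch of it (my own illustration, not code from the talk):

```python
import numpy as np

def gkl(V, Vhat, eps=1e-12):
    """Generalized KL divergence D(V || Vhat) = sum(V*log(V/Vhat) - V + Vhat),
    with the convention 0*log(0) = 0. Nonnegative; zero iff V == Vhat."""
    V = np.asarray(V, dtype=float)
    Vhat = np.asarray(Vhat, dtype=float)
    term = np.where(V > 0, V * np.log((V + eps) / (Vhat + eps)), 0.0)
    return float(np.sum(term - V + Vhat))
```

Unlike the Euclidean distance, this cost is not symmetric in its arguments, which is why it pairs naturally with the Poisson model discussed later in the talk.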

0:02:34 | so, what we want to learn is W, which is the dictionary, and |

0:02:40 | the size of this dictionary is constant, while for H, the expansion coefficients, |

0:02:46 | the size of this matrix increases as we observe more samples |

0:02:51 | so the samples are encoded as column vectors, and V is of size F by N |

0:02:59 | so we have F features and the data is represented by N samples, so as we |

0:03:06 | increase the samples, |

0:03:08 | H gets |

0:03:10 | larger |

0:03:11 | for this problem we have efficient algorithms which depend on majorization-minimization, which is |

0:03:20 | optimization by auxiliary functions |

0:03:23 | so we have multiplicative update rules for this |

0:03:28 | problem, which converge fast and which ensure nonnegativity |

0:03:34 | the same update rules can also be obtained by an |

0:03:39 | expectation-maximization algorithm on a statistical model, |

0:03:45 | because minimizing the KL divergence corresponds to |

0:03:49 | maximum likelihood under a Poisson observation model |
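The multiplicative updates mentioned here can be sketched as follows (a hedged reconstruction of the standard KL-NMF updates, with illustrative variable names; not the authors' implementation):

```python
import numpy as np

def klnmf(V, K, n_iter=500, seed=0):
    """KL-NMF via the standard multiplicative updates (majorization-
    minimization). Each update multiplies by a ratio of nonnegative
    terms, so W and H stay nonnegative and the generalized KL
    divergence D(V || WH) is nonincreasing."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + 0.1
    H = rng.random((K, N)) + 0.1
    eps = 1e-12
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H
```

On data that really is low-rank nonnegative, a few hundred iterations usually bring the reconstruction very close to V.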

0:03:55 | so our problem is that as we |

0:04:02 | have more samples, |

0:04:04 | we have more parameters to learn, |

0:04:06 | because the size of H also increases |

0:04:10 | but the size of the dictionary is constant, independent of the data |

0:04:15 | so, for example, the maximum likelihood |

0:04:19 | estimator is a consistent estimator |

0:04:22 | that means |

0:04:24 | if you observe |

0:04:26 | an infinite number of samples, |

0:04:32 | the estimates you get are the same as the true parameters |

0:04:36 | but |

0:04:37 | since in this case, as we observe |

0:04:40 | more samples, we also have |

0:04:42 | more parameters, |

0:04:43 | this consistency is not assured |

0:04:47 | so |

0:04:48 | our idea here is to |

0:04:50 | learn the dictionary parameters from the marginal model, |

0:04:54 | by maximizing the marginal likelihood of the |

0:05:00 | dictionary terms |

0:05:02 | and this can be obtained by integrating out the activation parameters from the model |

0:05:10 | so this is the joint |

0:05:14 | probability of |

0:05:16 | V and H. so we define a prior distribution on H, and we try to |

0:05:25 | marginalize the model by integrating out H. so this |

0:05:30 | is the marginal likelihood here. this is not the evidence of the data; |

0:05:35 | this is |

0:05:36 | still conditional on the dictionary |

0:05:39 | but of course this integration is not tractable, and we will be |

0:05:44 | using a variational approximation to |

0:05:47 | approximate this |

0:05:48 | marginal likelihood |

0:05:51 | so the model we have is — |

0:05:54 | here i'm presenting the statistical model |

0:05:59 | with the Poisson observation model, and this is actually equivalent: |

0:06:07 | maximum likelihood estimation on this model is equal to minimizing the KL divergence between the two |

0:06:13 | matrices |

0:06:15 | so our observation — |

0:06:17 | a single element of the observation matrix — is Poisson distributed with |

0:06:25 | this excitation parameter |

0:06:27 | and we can introduce some latent variables C, which |

0:06:33 | sum to |

0:06:34 | the observation |

0:06:36 | and |

0:06:37 | these are also Poisson |

0:06:41 | distributed, with the individual terms of the sum |

0:06:44 | here |

0:06:45 | so on this model, if we integrate out these latent variables C, we obtain this |

0:06:50 | marginal model |
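The composite construction described here — each latent component c_fkn drawn as Poisson(w_fk·h_kn), with v_fn = Σ_k c_fkn — can be checked numerically via the Poisson superposition property (a small sketch with made-up dimensions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
F, K, N = 4, 3, 1000
W = rng.random((F, K))          # illustrative dictionary
H = rng.random((K, N))          # illustrative activations

# composite model: c_fkn ~ Poisson(w_fk * h_kn), and v_fn = sum_k c_fkn
C = rng.poisson(W[:, :, None] * H[None, :, :])   # shape (F, K, N)
V = C.sum(axis=1)                                # shape (F, N)

# Poisson superposition: marginally v_fn ~ Poisson([WH]_fn),
# so the empirical mean of V should match the mean of W @ H
assert np.allclose(V.mean(), (W @ H).mean(), rtol=0.05)
```

This is exactly why integrating out C gives back the plain Poisson observation model: a sum of independent Poissons is Poisson with the summed rate.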

0:06:52 | so introducing these C variables |

0:06:56 | enables us to |

0:06:58 | have an EM algorithm on this model |

0:07:00 | so |

0:07:01 | the posterior distribution of these C variables |

0:07:06 | has a standard form: they are multinomial distributed |

0:07:09 | and since we have the exact posterior of this distribution, we can |

0:07:16 | calculate the |

0:07:19 | expectation in the E-step of the EM algorithm, and that will lead us to |

0:07:28 | the multiplicative update rules that i've mentioned |

0:07:31 | so |

0:07:32 | the model is pretty standard, but we also introduce a |

0:07:36 | prior distribution on H, and this is a gamma distribution with shape parameters alpha |

0:07:43 | and scale parameters beta |

0:07:45 | the gamma distribution is conjugate for the Poisson observation model, and this |

0:07:52 | — so we use such a prior so that our algorithms will be easier |

0:07:59 | to design |

0:08:00 | so in this |

0:08:03 | model, the conditional probabilities of |

0:08:08 | each H variable will also be |

0:08:12 | gamma distributed. and in our model we don't have a prior distribution for |

0:08:16 | W; W is a deterministic variable and we are interested in point estimates |
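The gamma/Poisson conjugacy the speaker relies on can be verified for a single entry: with v ~ Poisson(w·h) and h ~ Gamma(α, β) in shape/scale form, the conditional p(h | v) is Gamma(α + v, 1/(1/β + w)). A quick numerical check (my sketch, with arbitrary test values):

```python
import numpy as np
from math import lgamma

# single-entry check: v ~ Poisson(w*h), h ~ Gamma(alpha, beta) (shape, scale)
# => p(h | v) = Gamma(alpha + v, 1 / (1/beta + w))   [arbitrary values below]
alpha, beta, w, v = 2.0, 1.5, 0.8, 3

def log_post_unnorm(h):
    # log prior + log likelihood, dropping terms constant in h
    return (alpha - 1) * np.log(h) - h / beta + v * np.log(w * h) - w * h

def log_gamma_pdf(h, shape, scale):
    return (shape - 1) * np.log(h) - h / scale - shape * np.log(scale) - lgamma(shape)

h = np.linspace(0.1, 5.0, 50)
diff = log_post_unnorm(h) - log_gamma_pdf(h, alpha + v, 1.0 / (1.0 / beta + w))
# the two curves differ only by a normalizing constant
assert np.allclose(diff, diff[0])
```

This closed-form conditional is what makes both the Gibbs-style conditionals and the variational factors for H tractable later in the talk.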

0:08:24 | so the first estimator is the maximum joint likelihood estimator |

0:08:30 | so here, |

0:08:32 | as we do in the standard NMF problem, we try to |

0:08:38 | maximize the |

0:08:40 | joint likelihood with respect to W and H |

0:08:42 | here we have another term for the prior |

0:08:46 | of H |

0:08:47 | and |

0:08:49 | by using auxiliary functions |

0:08:52 | and majorization-minimization, we can get these update rules, which are |

0:08:57 | again pretty |

0:09:00 | effective |

0:09:01 | so the terms in black are the standard H update |

0:09:05 | and the prior gives us these terms |

0:09:09 | and with alpha greater than or equal to one we have nonnegativity |

0:09:15 | assured here, and if we set the shape to one — |

0:09:18 | that is, |

0:09:19 | a gamma distribution with shape equal to one — |

0:09:23 | then this becomes a multiplicative update |

0:09:27 | so this is |

0:09:30 | an NMF problem; |

0:09:32 | the only difference is we have a prior on H |

0:09:35 | and the update rules for W are the same |
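Under a Gamma(α, β) prior the MAP update for H described here takes the form H ← ((α−1) + H ⊙ Wᵀ(V/WH)) / (1/β + Wᵀ1); for α = 1 the additive term vanishes and the update becomes purely multiplicative. A sketch of one such step (my reconstruction of the update form, not the authors' code):

```python
import numpy as np

def map_h_update(V, W, H, alpha, beta, eps=1e-12):
    """One majorization-minimization step for H under independent
    Gamma(alpha, beta) priors on its entries (joint-likelihood / MAP
    objective with the Poisson/KL data term). The (alpha - 1) and
    1/beta terms come from the log-prior; alpha >= 1 keeps the
    numerator, hence H, nonnegative."""
    WH = W @ H + eps
    num = (alpha - 1.0) + H * (W.T @ (V / WH))
    den = 1.0 / beta + W.T @ np.ones_like(V)
    return num / den
```

Setting α = 1 and letting β grow large recovers the standard KL-NMF update for H, which matches the remark that this is just NMF with an extra prior term.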

0:09:39 | next, |

0:09:42 | the maximum marginal likelihood estimator |

0:09:46 | so, as i mentioned, we try to maximize this quantity, but this integral |

0:09:52 | is intractable, and |

0:09:53 | we don't have a closed form for it |

0:09:55 | so we want to go with the EM algorithm, |

0:09:59 | in an iterative way |

0:10:02 | so in the EM algorithm, in |

0:10:05 | the E-step, we try to calculate this |

0:10:09 | expectation of the joint |

0:10:12 | distribution under the posterior distributions of all the latent variables we have in the model |

0:10:18 | but |

0:10:21 | we don't have a closed form for this posterior either, so we cannot do exact EM on |

0:10:26 | this marginal model |

0:10:28 | so we try to approximate this |

0:10:32 | E-step of the EM algorithm |

0:10:34 | by using approximate |

0:10:38 | posterior distributions |

0:10:41 | and we can do this in several ways, |

0:10:47 | like variational Bayes, or Markov chain Monte Carlo methods |

0:10:55 | in this paper we work with variational Bayes |

0:10:59 | so, in variational Bayes, |

0:11:02 | we try to approximate the posterior distribution of the latent variables using |

0:11:06 | some variational distributions which have simpler forms |

0:11:10 | for example, we selected a |

0:11:13 | factorized |

0:11:16 | variational distribution like this. so for each |

0:11:19 | of the component variables C and the H variables we have |

0:11:24 | independent distributions, and for the |

0:11:28 | components we have multinomial distributions, as i said before |

0:11:34 | — since the posterior of the C variables in the original model is multinomial, |

0:11:41 | this will make our calculations easier |

0:11:43 | and for the H variables we have |

0:11:47 | gamma variational distributions |

0:11:49 | so we try to minimize the |

0:11:52 | Kullback-Leibler divergence between the variational distributions we set |

0:11:56 | and the actual posterior. of course we don't know the actual posterior, but this term can be |

0:12:01 | expanded like |

0:12:03 | this: the marginal likelihood of the |

0:12:07 | dictionary W, and |

0:12:10 | the Kullback-Leibler divergence between |

0:12:13 | the variational distributions and the joint distribution |

0:12:18 | so |

0:12:20 | minimizing this |

0:12:23 | with respect to the variational distributions is also equal to |

0:12:29 | minimizing this, because this term is independent of the variational distribution |

0:12:35 | and since this divergence is |

0:12:38 | greater than zero, |

0:12:43 | the negative of this quantity here is a lower bound on the marginal likelihood, |

0:12:51 | because this will be greater than zero |

0:12:54 | so |

0:12:55 | minimizing the Kullback-Leibler divergence here, |

0:12:58 | which corresponds to minimizing this, |

0:13:01 | can be done with these fixed-point equations, because we have |

0:13:06 | independent variational distributions |

0:13:11 | minimizing this KL divergence with respect to one of these distributions here |

0:13:16 | corresponds to |

0:13:20 | minimizing the KL divergence between that variational distribution and this expectation we have. so this is done |

0:13:26 | basically |

0:13:29 | by equating the hyperparameters of these distributions |

0:13:34 | so |

0:13:36 | we have |

0:13:37 | an approximation for the posterior distributions, which is given by the variational distributions |

0:13:42 | so we have an approximate E-step like this |

0:13:46 | and |

0:13:46 | here, this integral can be calculated this time, |

0:13:49 | because we have a factorized |

0:13:53 | variational distribution here |

0:13:56 | and |

0:13:59 | in the M-step, when we maximize |

0:14:02 | this quantity with respect to W, we get some multiplicative |

0:14:07 | update rules, |

0:14:08 | as we had with the original model |

0:14:15 | and we also have an estimate of the |

0:14:19 | lower bound on the marginal likelihood |
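The bound being used here is the usual variational identity: log p(V | W) = ELBO(q) + KL(q ‖ posterior), so the ELBO is a lower bound that is tight exactly when q matches the posterior. The identity is easy to verify numerically on a toy discrete model (my own illustrative example, unrelated to the NMF model itself):

```python
import numpy as np

# toy discrete model: latent z in {0, 1}, observed x = 1   [made-up numbers]
p_z = np.array([0.3, 0.7])           # prior p(z)
p_x_given_z = np.array([0.9, 0.2])   # p(x=1 | z)
p_x = np.sum(p_z * p_x_given_z)      # evidence p(x=1)
post = p_z * p_x_given_z / p_x       # exact posterior p(z | x=1)

q = np.array([0.5, 0.5])             # an arbitrary variational distribution

elbo = np.sum(q * (np.log(p_z * p_x_given_z) - np.log(q)))  # E_q[log p(x,z) - log q]
kl = np.sum(q * (np.log(q) - np.log(post)))                 # KL(q || posterior)

assert np.isclose(elbo + kl, np.log(p_x))  # log-evidence = ELBO + KL
assert elbo < np.log(p_x)                  # strict lower bound here, since q != posterior
```

Since the KL term is nonnegative, maximizing the ELBO over q tightens the bound, which is what the fixed-point updates on the variational hyperparameters are doing.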

0:14:24 | so |

0:14:26 | there are some related works |

0:14:32 | the model we |

0:14:34 | use here was |

0:14:38 | used by Canny for text classification purposes; |

0:14:42 | it is a gamma-Poisson model |

0:14:54 | so on this model there has been some work, and there have been some variational approximations |

0:15:02 | maybe i can just get to the experiments. so first we tried to |

0:15:08 | learn the dictionary — |

0:15:11 | the data used for this is a |

0:15:14 | time-frequency representation of a piano excerpt |

0:15:18 | so we have a fifteen-second-long piano excerpt, and we compute its |

0:15:25 | spectrogram |

0:15:27 | and in this signal we have four notes playing — combinations of four notes are |

0:15:33 | playing, |

0:15:34 | seven times |


0:15:51 | so first, we compared the |

0:15:54 | behaviours of these two estimators |

0:15:57 | as we increase the number of components |

0:16:00 | with the joint likelihood estimator, as we increase the number of components, |

0:16:04 | the joint likelihood |

0:16:06 | keeps on increasing, and if you set |

0:16:08 | the component number to a hundred, we still have a higher |

0:16:14 | joint likelihood |

0:16:17 | then we did |

0:16:18 | the same experiment with the marginal likelihood estimator |

0:16:24 | at first, the |

0:16:25 | marginal likelihood increases, and after some point |

0:16:28 | it stops increasing; it |

0:16:30 | stagnates |

0:16:34 | and |

0:16:35 | since we have a lower bound here, we cannot be sure about whether this lower bound |

0:16:42 | is tight, so we compared this lower bound with other marginal likelihood |

0:16:47 | estimation techniques, and we saw that this lower bound is tight |

0:16:53 | so |

0:16:55 | after six |

0:16:56 | components here, we also observe that some of the |

0:17:01 | columns of the dictionary |

0:17:04 | converge to zero, so the dictionary is pruned |

0:17:10 | with |

0:17:11 | the joint likelihood estimator, when we |

0:17:15 | run the |

0:17:18 | algorithm with ten components, |

0:17:19 | we get ten components, as we saw |

0:17:22 | in the previous slide |

0:17:23 | with the marginal likelihood estimator we have six |

0:17:27 | non-zero columns, and four |

0:17:31 | are pruned out |

0:17:32 | and these |

0:17:34 | components |

0:17:48 | — these are the ones corresponding to the individual notes, actually. so the first |

0:17:54 | four are the notes, |

0:18:11 | we have another component for the hammers hitting the strings of the piano, |

0:18:17 | and we have another one for the noise |

0:18:23 | so |

0:18:23 | since we observe this uh pruning effect to get this uh |

0:18:28 | i F more B V uh worked on it |

0:18:30 | a a data set we are we we know the uh |

0:18:35 | uh |

0:18:36 | a a dictionary big and the ground truth of the dictionary here is not so we we |

0:18:40 | a |

0:18:41 | did the same experiments on this they to set so these are |

0:18:44 | uh |

0:18:47 | rough sketches of is to remote which has for a uh |

0:18:50 | joints and each a joint has uh |

0:18:53 | you can have a a |

0:18:55 | for angle so we have a a dictionary of uh sixteen here |

0:18:59 | and this is the ground truth |

0:19:00 | so we |

0:19:02 | performed the same experiments on this dataset, and with the joint likelihood estimator we have, |

0:19:08 | for example here, ten |

0:19:10 | non-zero components, and we have some |

0:19:12 | duplicates in the dictionary |

0:19:15 | but with the maximum |

0:19:18 | marginal likelihood estimator we have sixteen |

0:19:21 | different components, and we have four |

0:19:25 | pruned |

0:19:26 | dictionary columns |

0:19:28 | the torso of the stick man is a pruned component here, because it exists in |

0:19:34 | all of the |

0:19:36 | data, so |

0:19:40 | it is attached to the |

0:19:43 | limbs of the stick man, and so this dictionary element |

0:19:47 | appears in all of the |

0:19:48 | reconstructions as well |

0:19:51 | so, as conclusions: |

0:19:52 | in this work we tried to compare the |

0:19:55 | dictionaries we obtain with |

0:19:58 | two different estimators. the joint likelihood estimator |

0:20:02 | corresponds to the |

0:20:04 | standard NMF methods, and with the marginal likelihood estimator |

0:20:09 | we learn the dictionary on the marginal model |

0:20:12 | actually, when we set the |

0:20:16 | component number to the optimal number, |

0:20:19 | these two estimators |

0:20:21 | perform similarly, but |

0:20:23 | if we set a high number for K, the marginal likelihood estimator |

0:20:28 | automatically prunes the |

0:20:30 | irrelevant |

0:20:32 | columns of the dictionary |

0:20:35 | so it does an automatic model order selection |

0:20:39 | and this way of doing model order selection is |

0:20:44 | more efficient than, for example, evaluating |

0:20:47 | the evidence of the data for different values of K |

0:20:53 | and selecting the best model |

0:20:55 | so in this case we only select a |

0:20:58 | large enough value for K, and we get the |

0:21:05 | effective dictionary |

0:21:06 | thank you |

0:21:30 | [audience question, partly inaudible] |

0:21:46 | actually, in all of these experiments we have a |

0:21:52 | lower bound which is tight |

0:21:54 | we |

0:21:55 | compared it to other |

0:21:58 | estimates as well |

0:22:00 | and, well, i have never seen |

0:22:04 | a case where we couldn't |

0:22:07 | evaluate the likelihood |

0:22:10 | with enough... |

0:22:14 | maybe this should be investigated |

0:22:26 | [audience question about the marginal likelihood, partly inaudible] |

0:22:41 | so in those works, no one does this |

0:22:45 | marginal likelihood |

0:22:47 | estimation |

0:22:49 | so this is the first work, |

0:22:50 | and we compared it to other methods |

0:22:53 | and we are pretty sure that |

0:22:55 | the estimation is... |