0:01:04So the outline will be like this: I am going to give a short introduction to
0:01:09Restricted Boltzmann Machines.
0:01:10And then I will talk a little bit about deep and sparse Boltzmann Machines.
0:01:15Then I am going to propose some topologies that are relevant to speaker recognition.
0:01:22And some experiments will follow.
0:01:28So, the RBM, as you already know from the keynote speaker, is a bipartite undirected graphical model
0:01:34with visible and hidden layers.
0:01:37It is the building block of deep belief nets and deep Boltzmann machines.
0:01:45Of course they are generative models.
0:01:49You can turn them into discriminative ones, but we won't do that.
0:01:56Another key thing you have to know is that the joint distribution forms an
0:02:03exponential family.
0:02:04And that is why you are going to see many expressions that look very familiar to you.
0:02:12The main thing to see here is that there are no connections between nodes of
0:02:19the same layer.
0:02:22This allows a very fast training algorithm, namely block Gibbs sampling,
0:02:28meaning that we can sample a whole layer at once
0:02:35rather than node by node.
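The layer-at-once sampling described here can be sketched in a few lines of NumPy. This is a minimal illustration, not the speaker's implementation: the layer sizes, the random weights, and the names `sample_hidden` and `sample_visible` are assumptions made for the example, and the units are taken to be binary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, c, rng):
    """Sample every hidden unit in one shot, given the visible layer."""
    p_h = sigmoid(v @ W + c)  # one probability per hidden unit
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible(h, W, b, rng):
    """Sample every visible unit in one shot, given the hidden layer."""
    p_v = sigmoid(h @ W.T + b)  # one probability per visible unit
    return (rng.random(p_v.shape) < p_v).astype(float), p_v

# Toy sizes and random weights, assumed purely for illustration.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

v = rng.integers(0, 2, size=n_visible).astype(float)
h, p_h = sample_hidden(v, W, c, rng)       # whole hidden layer at once
v_new, p_v = sample_visible(h, W, b, rng)  # whole visible layer at once
```

No loop over individual nodes is needed because, with no connections inside a layer, the units of one layer are conditionally independent given the other layer.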
0:02:42The main thing to know here is that although you don't have such connections in the
0:02:50visible layer,
0:02:51correlation is still present
0:02:54when you consider the marginal likelihood of your data,
0:02:58of your incomplete data, the v.
0:03:02As a feature extractor, you can see that the hidden variables capture higher information, higher
0:03:11level information, more structured information.
0:03:16Here are two examples, you have MNIST digits, the standard database.
0:03:23The W below is called the receptive field.
0:03:29They are not good to say that the Eigen vectors seem to be an analysis.
0:03:35This is the higher-order information that it is able to capture.
0:03:40As a generative model, what you need to reconstruct is the p, considered the pixel
0:03:49You simply project the h onto the i-th row.
0:03:55This transpose, by the way, is unnecessary.
0:03:58This gives you the p i; p i is the parameter of a Bernoulli distribution,
0:04:06so there is no need to go back and binarize it by hand. Simply sample from a Bernoulli
0:04:14distribution with this p i.
0:04:17The g is the logistic function, the sigmoid, which maps continuously onto the interval from zero to one.
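The reconstruction step can be sketched as follows. This is an illustrative sketch under assumptions: `W` is taken to have one row per pixel, biases are ignored, and the function name `reconstruct` and the toy sizes are made up for the example.

```python
import numpy as np

def logistic(x):
    """The g of the talk: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(h, W, rng):
    """Project h onto each row of W, squash with g, then sample pixels."""
    p = logistic(W @ h)     # p[i] = g(i-th row of W times h)
    v = rng.binomial(1, p)  # one Bernoulli draw per pixel
    return p, v

# Toy example: 5 pixels, 3 hidden units (sizes are assumptions).
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
h = np.array([1.0, 0.0, 1.0])
p, v = reconstruct(h, W, rng)
```

Here `p` holds the Bernoulli parameters and `v` is one binary sample drawn from them, so no separate thresholding step is involved.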
0:04:28Some useful expressions: the joint distribution looks like this. I denote with a star, p-star,
0:04:35the unnormalized density. Zeta is the so-called partition function, as you can see very
0:04:45And it forms an exponential family.
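For reference, the standard textbook form of these expressions for a binary RBM is written out below. The slide itself is not in the transcript, so the symbols are assumptions: `W` is the weight matrix, `b` and `c` are the visible and hidden biases, and `Z` is the partition function the speaker calls zeta.

```latex
\begin{align}
E(v, h; \theta) &= -v^{\top} W h - b^{\top} v - c^{\top} h \\
p^{*}(v, h; \theta) &= \exp\bigl(-E(v, h; \theta)\bigr) \\
p(v, h; \theta) &= \frac{p^{*}(v, h; \theta)}{Z(\theta)},
\qquad Z(\theta) = \sum_{v} \sum_{h} p^{*}(v, h; \theta)
\end{align}
```

The exponent is linear in the parameters, with the products `v_i h_j`, `v_i`, and `h_j` as sufficient statistics, which is what makes this an exponential family.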
0:04:58So consider binary units and forget about the biases; assume they are zero.
0:05:05You see that the conditionals of both v and h
0:05:11have this nice product form. This is not an approximation; it is due to the restricted
0:05:20structure of the RBM.
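Written out for binary units with zero biases, the product form reads as follows. This is the standard result, shown here as a reference sketch; the index convention (`i` over visible units, `j` over hidden units) is an assumption.

```latex
\begin{align}
p(h \mid v) &= \prod_{j} p(h_j \mid v),
& p(h_j = 1 \mid v) &= g\Bigl(\sum_{i} W_{ij} v_i\Bigr) \\
p(v \mid h) &= \prod_{i} p(v_i \mid h),
& p(v_i = 1 \mid h) &= g\Bigl(\sum_{j} W_{ij} h_j\Bigr)
\end{align}
```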
0:05:22And this is a very useful result when you do regularity.
0:05:27So how do you do learning?
0:05:29How do you do it? You simply maximize the log likelihood of theta given some observations.
0:05:40Simply consider that you want to estimate, for example, the W matrix, assuming that the biases
0:05:46are all zero.
0:05:48If you differentiate here, you end up with this familiar expression.
0:05:53So, we have the data dependent term and the data independent term.
0:05:58In the case of the RBM, you have exactly this product form,
0:06:04so it is trivial to calculate the first term, the data-dependent term.
0:06:08You have your data, the empirical distribution. All you have to do is to complete
0:06:13it based on the conditional of the h; given the product form, that is trivial.
0:06:23The second term, that is the model-dependent term,
0:06:28is really hard to compute. By the way, what does the term mean?
0:06:33This term is simply a different expression, a different parameterization of W.
0:06:42So you have a current estimate of your w, of your model, but it is
0:06:48defined on a different space, it is defined on the canonical space for the ?
0:06:54What you want to do is to map it to the expectation space, that is
0:06:59where your sufficient statistics are defined.
0:07:02So all you need for the training here is nothing more than to map
0:07:08this W to a different space, the space of the sufficient statistics, to form the
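The two terms discussed here correspond to the standard gradient of the average log likelihood with respect to W. It is reproduced below as a reference sketch; the notation `P_data` for the empirical distribution completed with `p(h | v)` and `P_model` for the model's joint distribution is an assumption.

```latex
\begin{equation}
\frac{\partial}{\partial W} \, \frac{1}{N} \sum_{n=1}^{N} \ln p(v_n; \theta)
= \mathbb{E}_{P_{\text{data}}}\bigl[v h^{\top}\bigr]
- \mathbb{E}_{P_{\text{model}}}\bigl[v h^{\top}\bigr]
\end{equation}
```

The first expectation is the data-dependent term. The second is the expected sufficient statistic under the current model, that is, the same W expressed in the expectation parameterization.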
0:07:19So, Contrastive Divergence.
0:07:21First of all, how do you proceed? You have a batch; you split it into minibatches,
0:07:28say of size one hundred each, a typical size.
0:07:34You proceed with one minibatch at a time, and you repeat for several epochs.
0:07:40A momentum term is there to make it more smooth, and it decreases with the epoch
0:07:49So contrastive divergence goes like this.
0:07:53What Hinton found was in fact that if you start, not randomly, but at
0:08:00each data point,
0:08:01then you can simply sample by successive conditioning,
0:08:08and you can sample just one step. And if you do so, you have a
0:08:13pretty nice algorithm that trains very fast.
0:08:15But that's not what we do actually; that's good for initialization.
0:08:23If you want to do serious Gibbs sampling, then you have to start from a random state
0:08:31and let the chain run for many steps.
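A one-step contrastive divergence update along the lines described above can be sketched as follows. This is a minimal illustration, assuming binary units, zero biases, no momentum, and a fixed learning rate; the toy data, the sizes, and the name `cd1_update` are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, lr, rng):
    """One CD-1 step on a minibatch V of shape (n_samples, n_visible)."""
    # Data-dependent term: the chain starts at the data points themselves.
    ph_data = sigmoid(V @ W)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    # One step of successive conditioning: h -> v -> h.
    pv = sigmoid(h @ W.T)
    v_recon = (rng.random(pv.shape) < pv).astype(float)
    ph_recon = sigmoid(v_recon @ W)
    # Difference of the two sufficient-statistic estimates.
    grad = (V.T @ ph_data - v_recon.T @ ph_recon) / V.shape[0]
    return W + lr * grad

# Toy binary data, split into minibatches of one hundred.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 6)).astype(float)
W0 = 0.01 * rng.standard_normal((6, 4))
W = W0.copy()
for epoch in range(3):
    for start in range(0, data.shape[0], 100):
        W = cd1_update(data[start:start + 100], W, lr=0.1, rng=rng)
```

Running the chain for many steps from a random state, as mentioned in the talk, would replace the single reconstruction step with a longer loop over the same two conditionals.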
0:08:35So that completes this short introduction to the RBM.
0:08:41Deep Boltzmann Machines and Deep Belief Nets.
0:08:46I'm looking at ? about the belief nets.
0:08:50As you see, the RBM is the main building block for both.
0:08:56The Boltzmann Machines are completely undirected; they are MRFs actually, with hidden variables.
0:09:01And both of them are constructed from the information of higher levels.
0:09:10Here is the typical Deep Boltzmann Machine.
0:09:13So if you want to train the thing,
0:09:15you start with the conventional greedy layer-by-layer pretraining,
0:09:21and then you refine it with the so-called Persistent Contrastive Divergence.
0:09:27What you have to know here is that
0:09:29this nice product form of the conditional breaks down here.
0:09:35So you have to apply a mean-field approximation
0:09:38to approximate the first term.
0:09:42The second term which is the model term, it is the same, all you have
0:09:47to do is transform it from one space to another.
0:09:54So here is the log likelihood.
0:09:57It's very straightforward.
0:10:08You have also this l that connects visible with visible.
0:10:13And this j that connects hidden with hidden layers.
0:10:19So there are three matrices, plus the biases, that you want to train.
0:10:26With respect to every other node.
0:10:29The g, recall, is just the logistic function.
0:10:33These are the closed-form expressions.
0:10:36However, the full conditional over the layer is no longer the product of these conditionals.
0:10:49So that's the way to proceed. Assume a factorization, again this is a standard ?
0:10:56based. Next, assume a factorized posterior of this form.
0:11:00And recall that the log likelihood is like this.
0:11:03And simply consider the variational Bayesian lower bound.
0:11:06This is typical if you do that. H is the entropy of this posterior.
0:11:14This is Bayes based posterior.
0:11:17It replaces h with the expectation, that is the mu.
0:11:24The other is just the formula for the entropy.
0:11:28You iterate
0:11:31this formula to estimate the q of h.
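The fixed-point iteration for the factorized posterior can be sketched as below. This is an illustrative sketch, not the speaker's code: it assumes a single hidden layer with hidden-to-hidden weights `J` (symmetric, zero diagonal), zero biases, and the usual mean-field update in which each `mu[j]` is the logistic function of its total input with the other hidden units replaced by their means. The names `mean_field` and `entropy` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_iter=50, tol=1e-6):
    """Iterate mu[j] = g(sum_i W[i, j] v[i] + sum_m J[m, j] mu[m])."""
    mu = np.full(W.shape[1], 0.5)
    for _ in range(n_iter):
        mu_old = mu.copy()
        for j in range(mu.size):  # update one unit at a time
            mu[j] = sigmoid(v @ W[:, j] + J[:, j] @ mu)
        if np.max(np.abs(mu - mu_old)) < tol:
            break
    return mu

def entropy(mu, eps=1e-12):
    """Entropy of the factorized Bernoulli posterior with means mu."""
    return -np.sum(mu * np.log(mu + eps) + (1 - mu) * np.log(1 - mu + eps))

# Toy example with assumed sizes.
rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=6).astype(float)
W = 0.1 * rng.standard_normal((6, 4))
A = 0.1 * rng.standard_normal((4, 4))
J = (A + A.T) / 2.0
np.fill_diagonal(J, 0.0)  # no self-connections
mu = mean_field(v, W, J)
```

When `J` is all zeros the hidden units decouple and the iteration returns the exact RBM conditional, which is a quick sanity check on the update.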
0:11:37So during training, what you have to do is to complete your data.
0:11:44You have the visible; you have to complete the data.
0:11:47You complete your data with the estimate of h. So what must you do?
0:11:52Apply the approximation. And when you evaluate,
0:11:58you use the variational lower bound instead of the marginal log likelihood.
0:12:05So this is how Persistent Contrastive Divergence works; this is the complete picture.
0:12:10You first initialize with ?. You might have initialized the visible already with some contrastive
0:12:19divergence training, pretraining.
0:12:21And for each batch and minibatch and epoch, repeat until convergence.
0:12:27First, do the variational approximation. You need that in order to approximate the first term,
0:12:32So that you complete your data.
0:12:36So you do this iteratively until it converges.
0:12:42And then you have the stochastic approximation.
0:12:44That is to transform the current estimation to the expectation parameterization.
0:12:50How do you do that? With Gibbs Sampling.
0:12:53That's how you do that.
0:12:55And then you do the parameter update.
0:12:58There is a W here, but the other matrices also have updates of
0:13:03the same form.
0:13:05You see here, the first term is approximated using this, and the other using
0:13:12this. That's stochastic approximation.
0:13:14And of course you have a learning rate that decreases with the number of epochs.
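The overall loop (approximate the data-dependent term, advance persistent Gibbs chains for the model term, then update with a decreasing learning rate) can be sketched as follows. To keep it short, this sketch is written for a plain RBM with zero biases, where the mean-field step is exact, rather than for a full deep Boltzmann machine. The decay schedule, the chain count, the toy data, and the name `pcd_step` are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_step(V, W, v_chain, lr, rng):
    """One persistent-chain update of W for a zero-bias binary RBM."""
    # Data-dependent term: complete the data with the expected h.
    mu = sigmoid(V @ W)
    pos = V.T @ mu / V.shape[0]
    # Model term: advance the persistent chains by one block-Gibbs sweep.
    ph = sigmoid(v_chain @ W)
    h_chain = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(h_chain @ W.T)
    v_chain = (rng.random(pv.shape) < pv).astype(float)
    neg = v_chain.T @ sigmoid(v_chain @ W) / v_chain.shape[0]
    return W + lr * (pos - neg), v_chain

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 6)).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
# Chains start from random states and are never reset to the data.
v_chain = rng.integers(0, 2, size=(50, 6)).astype(float)
lr0 = 0.05
for epoch in range(5):
    lr = lr0 / (1 + epoch)  # assumed schedule: decreases with the epoch
    for start in range(0, data.shape[0], 100):
        W, v_chain = pcd_step(data[start:start + 100], W, v_chain, lr, rng)
```

The difference from one-step contrastive divergence is only in where the negative-phase chain starts: here it continues from its previous state across minibatches and epochs.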
0:13:23So, how can you do classification? Some examples.
0:13:29Here is the Boltzmann Machine, you can use the outermost layer for the labels.
0:13:34You may consider that as your data.
0:13:44You want to evaluate this thing; how can you do it?
0:13:47Well, like hypothesis testing. So you have a v and you want to classify
0:13:57Try placing this hypothesis using one for the occurrence of the nodes of your all
0:14:07And that's why you calculate the largest ? for each class.
0:14:13The point here is that you are not required to estimate zeta, the normalizer, which is
0:14:20really hard. You know why? It cancels in the likelihood ratio, so the ratio does not depend on it at all.
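The cancellation can be written out explicitly. In this sketch, `l` denotes the label layer, `k` and `k'` are two candidate classes, and `p*(v, l)` is the unnormalized density with the hidden units summed out; these symbols are assumptions, since the slide is not part of the transcript.

```latex
\begin{align}
\frac{p(v, l = k; \theta)}{p(v, l = k'; \theta)}
&= \frac{p^{*}(v, l = k; \theta) / Z(\theta)}{p^{*}(v, l = k'; \theta) / Z(\theta)}
= \frac{p^{*}(v, l = k; \theta)}{p^{*}(v, l = k'; \theta)} \\
p(l = k \mid v; \theta)
&= \frac{p^{*}(v, l = k; \theta)}{\sum_{k'} p^{*}(v, l = k'; \theta)}
\end{align}
```

Both classes share the same model and therefore the same `Z`, so only unnormalized quantities are needed to pick the most likely class.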
0:14:27PLDA, this is another approximation, you have the simple RBM.
0:14:37We are going to represent it like this.
0:14:43This is the typical example on how you can do PLDA.
0:14:59So this is the model we will examine.
0:15:02It's called the Siamese twin.
0:15:07What does it model? The first model, on your right, is the h0 hypothesis,
0:15:13which says that the two speakers are not the same.
0:15:18We model the RBM, the distribution of the complement supervisor.
0:15:56How do you train this model?
0:15:57You first train this, which simply models the distribution of the i-vectors.
0:16:08And then the h1, the Siamese twin,
0:16:12which tries to capture correlation between the layers.
0:16:19These are symmetric matrices, I mean the x and y are symmetric matrices, and we
0:16:25try to capture correlation.
0:16:27The h0 hypothesis completely relies on the statistical independence assumption.
0:16:34We don't try to model the h0 hypothesis using negative examples.
0:16:43We are going to compute this statistical.
0:16:51So, how do we train that? As I told you, we first train the singleton
0:16:59model; this is simply an RBM.
0:17:02And then you collect pairs of i-vectors of the same speaker.
0:17:07And then split them into minibatches.
0:17:11And then, based on the w0 of your singleton model, you initialize your twin model.
0:17:19Apply several epochs of this contrastive divergence algorithm.
0:17:25To evaluate it, similar in other layers, and use variational Bayesian lower bound for both
0:17:34Partition functions are not required, the threshold will absorb them.
0:17:43They are data independent.
0:17:45So there is no reason actually to compute the partition function; that will be absorbed by
0:17:52your threshold.
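Why the threshold absorbs the partition functions can be seen by writing out the verification score for a trial pair. In this sketch, `x` and `y` are the two i-vectors, `p_1` is the twin model for h1 with partition function `Z_1`, `p_0` is the singleton model for h0 with partition function `Z_0`, and `tau` is the decision threshold; these symbols are assumptions. In the talk the log unnormalized terms are themselves replaced by variational lower bounds.

```latex
\begin{align}
\mathrm{LLR}(x, y)
&= \ln \frac{p_1(x, y)}{p_0(x) \, p_0(y)} \\
&= \ln p_1^{*}(x, y) - \ln p_0^{*}(x) - \ln p_0^{*}(y)
 - \bigl(\ln Z_1 - 2 \ln Z_0\bigr)
\end{align}
```

The bracketed term does not depend on the trial, so testing `LLR(x, y) > tau` is the same as testing the unnormalized score against `tau + ln Z_1 - 2 ln Z_0`.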
0:17:53So, experiment.
0:19:24So, this is the configuration.
0:19:28It's a standard, we applied the standard like this, unfortunately we tried multiply, but we
0:19:36So, in such a case, let's at least use this standard ?, so we are
0:19:46having better cosine distance.
0:19:48That's what we are doing.
0:19:52To replace that, we covered our work in Interspeech that somehow tries to make some
0:20:02of these supervised learning approaches using, again, Deep Boltzmann Machines.
0:20:10The results in terms of that are like this.
0:20:15That notation means, for the Boltzmann Machine, how many nodes in the first layer, how
0:20:22many nodes in the second layer.
0:20:24So you see that the configuration of two hundred computers to ? was the best,
0:20:31and this is the cosine distance.
0:21:24These are the results of the evaluation on the female portion.
0:21:35I think in terms of error rate, they are quite comparable.
0:21:54Conclusions: Boltzmann Machines form a fully fledged framework for combining generative and
0:22:05discriminative models.
0:22:07They are ideal when a large amount of unlabeled data is available together with some limited
0:22:14amount of labeled data.
0:22:16It's an alternative way to introduce hierarchies and extract higher level representations, and maybe Bayesian
0:22:25inference can be applied, although you have some ? approach.