So the outline will be like this: I am going to have a short introduction to Restricted Boltzmann Machines.

Then I will talk a little bit about deep and sparse Boltzmann Machines.

Then I am going to propose some topologies that are relevant to speaker recognition.

And some experiments will follow.

So, an RBM, as you already know from the keynote speaker, is a bipartite undirected graphical model

with visible and hidden layers.

RBMs are the building blocks of deep belief nets and deep Boltzmann machines.

Of course, they are generative models.

Although you can turn them into discriminative ones, we won't do that here.

Another key thing to know is that the joint distribution forms an exponential family.

That is why many of the expressions are going to look very familiar to you.

The main point to see is that there are no connections between nodes of the same layer.

This allows a very fast training algorithm, namely block Gibbs sampling,

meaning that we can sample a whole layer at once rather than node by node.
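As a rough illustration, here is a minimal numpy sketch of one block-Gibbs step for a binary RBM; the names W, b, c and the sigmoid helper are my own assumptions, not taken from the slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs_step(v, W, b, c, rng=np.random.default_rng(0)):
    # Sample the whole hidden layer given v, then the whole visible layer given h.
    p_h = sigmoid(v @ W + c)                         # p(h_j = 1 | v), all j at once
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b)                       # p(v_i = 1 | h), all i at once
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h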

The main thing to know here is that although there are no such connections within the visible layer,

correlations are still present

when you consider the marginal likelihood of your data,

of your incomplete data, the v.

Used as a feature extractor, you can see that the hidden variables capture higher-level, more structured information.

Here are two examples on MNIST digits, the standard database.

The rows of W shown below are called receptive fields.

Unlike the eigenvectors of an eigen-analysis, they illustrate the higher-order information that the model is able to capture.

As a generative model, what you need in order to reconstruct pixel i is the probability p_i.

You simply project h onto the i-th row of W (this transpose, by the way, is unnecessary).

This gives you p_i, the parameter of a Bernoulli distribution, so there is no need to go back and binarize the result: you simply sample from a Bernoulli distribution with this p_i.

The g is the logistic function, the sigmoid, which maps continuously to the interval from zero to one.
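In symbols (the exact notation here is my assumption, following the usual binary-RBM setup):

$$p_i = p(v_i = 1 \mid h) = g\Big(b_i + \sum_j W_{ij} h_j\Big), \qquad v_i \sim \mathrm{Bernoulli}(p_i), \qquad g(x) = \frac{1}{1 + e^{-x}}.$$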

Some useful expressions: the joint distribution looks like this. I denote by a star, p-star, the unnormalized density; Z is the so-called partition function, as you can see very clearly.

And it forms an exponential family.

So consider binary units and forget about the biases, assume they are zero.

You see that the conditionals, on both v and h,

have this nice product form. This is not an approximation; it is due to the restricted structure of the RBM.

And this is a very useful result when you do the learning.
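For reference, the standard binary-RBM expressions these statements refer to (my notation; biases b, c kept for completeness):

$$p^*(v, h) = \exp\!\big(v^\top W h + b^\top v + c^\top h\big), \qquad p(v, h) = \frac{p^*(v, h)}{Z}, \qquad Z = \sum_{v, h} p^*(v, h),$$

$$p(h \mid v) = \prod_j p(h_j \mid v), \qquad p(h_j = 1 \mid v) = g\Big(c_j + \sum_i W_{ij} v_i\Big),$$

and symmetrically for $p(v \mid h)$.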

So how do you do learning?

You simply maximize the log-likelihood of theta given some observations.

Consider that you want to estimate, for example, the W matrix, assuming that the biases are all zero.

Taking the derivative, you end up with this familiar expression.

So, we have the data dependent term and the data independent term.
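That familiar expression, written out for the case above (again in my assumed notation):

$$\frac{\partial \log p(v)}{\partial W_{ij}} \;=\; \underbrace{\mathbb{E}_{\text{data}}\big[v_i h_j\big]}_{\text{data-dependent term}} \;-\; \underbrace{\mathbb{E}_{\text{model}}\big[v_i h_j\big]}_{\text{data-independent term}}.$$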

In the case of the RBM, since you have exactly this product form,

it is trivial to calculate the first term, the data-dependent term.

You have your data, the empirical distribution; all you have to do is complete them

based on the conditional of h, which, given the product form, is again trivial.

However,

the second term, that is the model-dependent term,

is really hard to compute. By the way, what does that term mean?

It is simply a different expression, a different parameterization of W.

You have a current estimate of your W, of your model, but it is defined on a different space: the canonical parameterization space.

What you want to do is to map it to the expectation space, which is

where your sufficient statistics are defined.

So all you need for the training is nothing more than mapping this W to a different space, the space of the sufficient statistics, in order to form the difference.

So, Contrastive Divergence.

First of all, how do you proceed? You take your batch and split it into minibatches,

say of one hundred samples each, a typical size.

You process one minibatch at a time and repeat for several epochs.

A momentum term is used to make the updates smoother, and the learning rate decreases with the epoch count.
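As a small sketch of what such a smoothed update typically looks like (the decay schedule and the constants are placeholders, not values from the talk):

import numpy as np

def momentum_update(W, velocity, grad_W, epoch, base_lr=0.01, momentum=0.9):
    # The 1/(1 + epoch) decay is a placeholder schedule.
    lr = base_lr / (1.0 + epoch)                    # learning rate shrinks with the epoch count
    velocity = momentum * velocity + lr * grad_W    # momentum smooths the updates
    return W + velocity, velocity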

So contrastive divergence goes like this.

What was found is that if you start, not randomly, but at each data point,

then you can simply sample by successive conditioning,

and you can take just one sampling step. If you do so, you have a pretty nice algorithm that trains very fast.
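A minimal sketch of that one-step (CD-1) gradient, reusing numpy and the sigmoid helper from the block-Gibbs sketch above; the variable names and the batch averaging are my assumptions, not the exact recipe from the slides.

def cd1_gradient(v_data, W, b, c, rng=np.random.default_rng(0)):
    # Positive phase: complete the data with p(h | v).
    p_h_data = sigmoid(v_data @ W + c)
    h_data = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # Single reconstruction step starting at the data (instead of a long chain).
    p_v_model = sigmoid(h_data @ W.T + b)
    v_model = (rng.random(p_v_model.shape) < p_v_model).astype(float)
    p_h_model = sigmoid(v_model @ W + c)
    # Data-dependent term minus (approximate) data-independent term.
    return (v_data.T @ p_h_data - v_model.T @ p_h_model) / v_data.shape[0]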

But that is only an approximation; if you want to do serious Gibbs sampling, you have to start from a random state

and let the chain run for many steps.

So, that completes this short introduction to RBMs.

Now, Deep Boltzmann Machines and Deep Belief Nets.

I am not going to say much about the belief nets.

As you can see, the RBM is the main building block for both.

The Boltzmann Machines are completely undirected; they are actually MRFs with hidden variables.

And both of them are constructed to capture information at higher levels.

Here is the typical Deep Boltzmann Machine.

So, to train this thing,

you start with the conventional greedy layer-by-layer pretraining,

and then you refine it with the so-called Persistent Contrastive Divergence.

What you have to know here is that

the nice product form of the conditional breaks down here.

So you have to apply a mean-field approximation

to approximate the first term.

The second term, the model term, is the same as before: all you have to do is transform it from one space to the other.

So here is the log-likelihood.

It's very straightforward.

You also have this L, which connects visible with visible units,

and this J, which connects hidden units with hidden units.

So there are three matrices, plus the biases, that you want to train.

Each node has a closed-form conditional with respect to every other node;

the g here is again just the logistic function.

These are the closed-form expressions.

However, the layer-wise conditionals are no longer the product of these single-node conditionals.

So this is the way to proceed: assume a factorization; again, this is the standard variational, mean-field approach.

So, assume a factorized posterior of this form.

And recall that the log-likelihood looks like this.

Now simply consider the variational Bayesian lower bound.

This is what you typically get if you do that; H is the entropy of this approximate posterior.

Under the mean-field posterior, each h is replaced by its expectation, that is, the mu.

The remaining part is just the formula for the entropy,

that is, the formula to compute H of q.
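A rough numpy sketch of those mean-field updates for a DBM with two hidden layers, reusing the sigmoid helper from above; the weight names W1, W2, the omission of biases, and the fixed iteration count are my own simplifying assumptions.

def mean_field(v, W1, W2, n_iters=10):
    # mu1, mu2 are the expectations that stand in for h1, h2 in the factorized posterior.
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)   # layer 1 sees the data and layer 2
        mu2 = sigmoid(mu1 @ W2)              # layer 2 sees layer 1 only
    return mu1, mu2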

So during training, what you have to do is to complete your data.

You have the visible units, and you complete the data

with the estimate of h; that is, you use the mean-field approximation.

And when you evaluate,

you use the variational lower bound instead of the marginal log-likelihood.

So this is the complete picture of Persistent Contrastive Divergence.

You first initialize the model; you might have already initialized the visible-layer weights with some contrastive divergence training, that is, pretraining.

Then, for each minibatch and each epoch, you repeat the following until convergence.

First, do the variational approximation; you need that in order to approximate the first term,

so that you complete your data.

You do this iteratively until it converges.

Then you have the stochastic approximation,

which transforms the current estimate to the expectation parameterization.

How do you do that? With Gibbs sampling.

And then you apply the parameter update.

I show the W here, but the other matrices have analogous update formulas.

You see: the first term is approximated using this quantity, and the other using that one. That is the stochastic approximation.

And of course you have a learning rate that decreases with the epoch count.
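Putting those steps together, here is a compressed sketch of one persistent-CD update for the visible-to-first-layer matrix (call it W1); it reuses the sigmoid and mean_field helpers sketched above, and the persistent fantasy particles v_f, h1_f, h2_f are my own names. This is an illustration of the idea, not the exact algorithm from the slides.

def pcd_step(v_batch, fantasy, W1, W2, lr, rng=np.random.default_rng(0)):
    v_f, h1_f, h2_f = fantasy
    # 1) Variational approximation: complete the observed data (data-dependent term).
    mu1, mu2 = mean_field(v_batch, W1, W2)
    # 2) Stochastic approximation: one Gibbs sweep on the persistent fantasy particles.
    p_h1 = sigmoid(v_f @ W1 + h2_f @ W2.T)
    h1_f = (rng.random(p_h1.shape) < p_h1).astype(float)
    p_h2 = sigmoid(h1_f @ W2)
    h2_f = (rng.random(p_h2.shape) < p_h2).astype(float)
    p_v = sigmoid(h1_f @ W1.T)
    v_f = (rng.random(p_v.shape) < p_v).astype(float)
    # 3) Parameter update: data-dependent term minus model-dependent term.
    grad_W1 = v_batch.T @ mu1 / len(v_batch) - v_f.T @ h1_f / len(v_f)
    return W1 + lr * grad_W1, (v_f, h1_f, h2_f)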

So, how can you do classification? Some examples.

Here is the Boltzmann Machine; you can use the outermost layer for the labels.

You may consider the labels as part of your data.

You want to evaluate this thing; how can you do it?

Well, like hypothesis testing: you have a v and you want to classify it.

You form one hypothesis per class by setting the corresponding label node to one,

and then you calculate the likelihood ratio for each class.

The point here is that you are not required to estimate Z, the normalizer, which is really hard. You know why? Because the likelihood ratio does not involve it at all.
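In other words (again in my assumed notation), for any two label hypotheses:

$$\log \frac{p(v, \ell = 1)}{p(v, \ell = 2)} = \log \frac{p^*(v, \ell = 1)/Z}{p^*(v, \ell = 2)/Z} = \log p^*(v, \ell = 1) - \log p^*(v, \ell = 2),$$

so the partition function $Z$ cancels and never needs to be estimated.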

PLDA: this is another application. You have the simple RBM,

and we are going to represent it like this.

This is a typical example of how you can do PLDA.

So this is the model we will examine.

It's called the Siamese twin.

What does it model? The first model, on your right, is the H0 hypothesis,

which says that the two speakers are not the same.

With the RBM, we model the distribution of each input separately.

{Q&A}

How do you train this model?

You first train this one, which simply models the marginal distribution of the i-vectors,

and then the H1 model, the Siamese twin,

which captures the correlation between the layers.

These are symmetric matrices (I mean the x and y branches have symmetric matrices), and we try to capture the correlation between them.

The H0 hypothesis relies completely on the statistical-independence assumption.

We do not try to model the H0 hypothesis using negative examples;

we simply rely on this statistical independence.

So, how do we train it? As I told you, we first train the singleton model, which is simply an RBM.

Then you collect pairs of i-vectors from the same speaker

and split them into minibatches.

Then, based on W0, your singleton model, you initialize your twin model

and apply several epochs of the contrastive divergence algorithm.

To evaluate it, you proceed similarly and use the variational Bayesian lower bound for both hypotheses.

The partition functions are not required; the threshold will absorb them,

since they are data independent.

So there is actually no reason to compute the partition functions; they will be absorbed by your threshold.
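So, schematically (my notation), the verification score is the difference of the two lower bounds:

$$s(v_1, v_2) = \mathcal{B}_{H_1}(v_1, v_2) - \mathcal{B}_{H_0}(v_1, v_2) \;\gtrless\; \theta,$$

where each $\mathcal{B}$ lower-bounds the corresponding log-likelihood up to its $\log Z$ term; since the $\log Z$ terms do not depend on the data, they are absorbed into the threshold $\theta$.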

So, the experiments.

So, this is the configuration.

We applied the standard configuration like this; unfortunately, we tried several alternatives, but they failed.

So, in that case, let us at least use this standard setup and see whether we do better than the cosine distance.

That's what we are doing.

On that point, I refer you to our Interspeech work, which tries to make some of these approaches supervised, again using Deep Boltzmann Machines.

The results are like this.

That notation means: a Boltzmann Machine with that many nodes in the first layer and that many nodes in the second layer.

So you see that the configuration with two hundred nodes in the first layer was the best, and this is the cosine distance baseline.

These are the results when evaluating on the female portion.

I think in terms of error rate, they are quite comparable.

To conclude: Boltzmann Machines form a fully fledged framework for combining generative and discriminative models.

They are ideal when a large amount of unlabeled data is available together with some limited amount of labeled data.

They are an alternative way to introduce hierarchies and extract higher-level representations, and perhaps Bayesian inference can be applied, although some approximations would be needed.