The next talk is "Variational Bayes Logistic Regression as Regularized Fusion for NIST SRE 2010".

OK. My name is Hautamäki, and the topic is fusion; I think it is likely the only fusion topic this time for speaker recognition.

This time, I have tried variational Bayes on the NIST SRE evaluation corpora.

OK, let's start with fusion. Why do we do fusion at all; why don't we just use the single best system? The motivation is that fusion works better than the single best system, since we can combine multiple classifiers and so on.

On the other hand, when some classifiers are not well behaved on the development data, fusion can help to smooth that out.

There is a piece of conventional wisdom in fusion: that complementary classifiers should be selected into the fusion pool. This is the main question in our work here: if we are going to do fusion instead of the single best system, how do we select the classifiers for the fusion pool, or ensemble?

The usual thinking is that there is some complementarity in the feature sets, the classifiers, or something else, but it is difficult to quantify. What does complementarity actually mean?

So we can first look at the mutual information between the classifier outputs and the class label, through Fano's inequality: maximizing mutual information means minimizing the classification error.
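One standard form of Fano's inequality (notation mine, not from the slides): for predicting the class Y from the classifier output X,

H(P_e) + P_e \log(|\mathcal{Y}| - 1) \ge H(Y \mid X) = H(Y) - I(X; Y),

so pushing the mutual information I(X; Y) up lowers the bound on the achievable error probability P_e.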

Gavin Brown, in the 2009 paper "An Information Theoretic Perspective on Multiple Classifier Systems", showed that the multi-way mutual information, where we take all classifiers from 1 to L (the potential pool), can be decomposed into three different terms.
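Written out, the decomposition from my reading of Brown's paper looks roughly like this (notation mine), where I(.) over a set of variables denotes their interaction information:

I(X_{1:L}; Y) = \sum_{i=1}^{L} I(X_i; Y) \; - \; \sum_{k=2}^{L} \sum_{S \subseteq \{1..L\},\ |S|=k} I(X_S) \; + \; \sum_{k=2}^{L} \sum_{S \subseteq \{1..L\},\ |S|=k} I(X_S \mid Y).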

The first term is very familiar to us: it is basically the sum of the individual classifier accuracies. You usually try to maximize this term, and indeed maximizing it pushes the mutual information up.

But we have to subtract the second term, which is not very nice. This term, which I am not going to discuss in much detail, is also a kind of mutual information, except that it involves only the classifier outputs; the class label does not appear. It is basically a correlation term, taken over all subsets of the classifiers. So minimizing the correlation within the subsets while maximizing the first term leads to maximizing the mutual information.

The last term is the interesting one: you take all the subsets and compute the same interaction information again, but conditioned on the class label. This term is additive.

The conclusion is that we should not only minimize the correlation between classifiers at this higher level, over groups of increasing ensemble size, but we should also have strong class-conditional correlation.

I think this is hard to interpret in general, but it gives some idea that the conventional wisdom on complementarity might not be so accurate.

The topic of this talk is that we use sparseness to do this kind of task automatically. We do not consider the decomposition explicitly, I do not try to hand-optimize the ensemble, and we do not base the selection on optimizing any diversity measure. We don't try to do anything like that.

Instead, I use sparse regression to optimize the ensemble size. But regularized regression introduces an extra parameter, the regularization parameter, and that is not very nice. I would like to get rid of this parameter, so I treat it as a hyperparameter and optimize it at the same time as we optimize the actual fusion device.

The attempt here was to use variational Bayes, because it is a nice framework: we can integrate the hyperparameter estimation into the same objective. We attempt to integrate over all parameters to get the posterior.

OK, the motivation is this: why use regularization as sparseness in the fusion?

Here we have two classifiers, and C_wlr is the cost function. The x-axis is the weight of classifier one and the y-axis the weight of classifier two. The figure on the left is for the training set, the figure in the middle is for the development set, and the figure on the right is for the final evaluation set.

If we use the L2 norm as the regularizer, we can see the following. The red round point is the optimum that we find, the red cross is the unconstrained optimum, and the black square is the L1 optimum. On the training data the L2 optimum is closer to the unconstrained optimum, and for the development set this is also true. But on the evaluation set, you can see that the weight of classifier w2 has been zeroed out by the L1 norm, and that actually gives a better cost relative to the true minimum of that set.

The minimum has moved: the C_wlr surface has changed on the SRE 2010 set. Of course, we were lucky in this case.

Suppose this happens with real data. If you optimized to the unconstrained minimum, the solution would land over here, whereas the sparse solution is much closer to the true minimum. So it tells us that zeroing out a classifier can give a real benefit.

Basically, we still have to find which Lagrange coefficient value to use, and this is done by cross-validation.

We have a discriminative probabilistic framework here, and we are using it to optimize the fusion weights. Optimizing the logistic sigmoid model leads to a cross-entropy cost.

What Niko Brümmer has proposed is to take the whole logistic regression cost and weight it by the proportions of target and non-target trials in the actual training set, and in addition there is an additive offset term, the logit of the effective prior pi.
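As a sketch of that cost (up to notation; s_i is the score vector of trial i and \pi the effective prior):

C_{\mathrm{wlr}}(\mathbf{w}) = \frac{\pi}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{tar}} \log\left(1 + e^{-\mathbf{w}^\top \mathbf{s}_i - \mathrm{logit}\,\pi}\right) + \frac{1-\pi}{N_{\mathrm{non}}} \sum_{j \in \mathrm{non}} \log\left(1 + e^{\mathbf{w}^\top \mathbf{s}_j + \mathrm{logit}\,\pi}\right).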

The idea of regularization is that we do a MAP estimate, where the LASSO prior is a double exponential. Basically, we place a prior with a regularization parameter on the weight of each classifier j, and here we assume that the parameter is the same for all dimensions. In the case of ridge regression, the prior is an isotropic Gaussian.
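Concretely, the two priors give the familiar penalties (a sketch in my own notation): a double-exponential prior p(w_j \mid \lambda) = \frac{\lambda}{2} e^{-\lambda |w_j|} makes the negative log posterior C_{\mathrm{wlr}}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1 + \mathrm{const} (the LASSO), while an isotropic Gaussian prior \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}) gives the ridge penalty \frac{\lambda}{2} \lVert \mathbf{w} \rVert_2^2.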

For variational Bayes, we follow the treatment in Bishop's book, chapter 10. Now we are not estimating the MAP point but the whole posterior, which is approximated as q(w) q(alpha) q(t); here we factorize over all the hidden variables.

But we have an additional problem: the logistic likelihood terms have to be approximated by a bound h(w, z). We have one scalar z for each training score vector, and it has to be optimized in the same VB loop.
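The bound in question is, I believe, the one from Bishop section 10.6 (his \xi written as z to match the talk):

\sigma(a) \ge \sigma(z) \exp\left\{ \frac{a - z}{2} - \lambda(z)(a^2 - z^2) \right\}, \qquad \lambda(z) = \frac{1}{4z} \tanh\frac{z}{2},

which is tight at a = \pm z and makes the expectations over q(w) tractable.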

The natural choice of distribution for alpha (which we saw on the previous slide) is a Gamma, and here the Gamma is made non-informative.

The interesting point here is that the mean of the predictive density reduces to an inner product, so it is consistent for us to keep using normal linear scoring.
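If the approximate posterior is q(\mathbf{w}) = \mathcal{N}(\mathbf{m}, \mathbf{\Sigma}), the usual probit-style approximation (Bishop section 4.5.2; notation mine) gives

p(t = 1 \mid \mathbf{s}) \approx \sigma\left(\kappa(\sigma_a^2)\, \mathbf{m}^\top \mathbf{s}\right), \qquad \sigma_a^2 = \mathbf{s}^\top \mathbf{\Sigma}\, \mathbf{s}, \quad \kappa(\sigma^2) = (1 + \pi \sigma^2 / 8)^{-1/2},

so the mean activation is the inner product \mathbf{m}^\top \mathbf{s}, and scoring with the posterior mean stays a linear operation.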

As I explained earlier, we are interested in having a sparse solution, so I tried automatic relevance determination (ARD) for p(w | alpha). Here A is a diagonal matrix with one precision per classifier, and we have a product of Gammas instead of just one Gamma. The general idea is that the classifiers that do not play any role get pushed down to zero.
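The prior is then p(\mathbf{w} \mid \boldsymbol{\alpha}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \mathbf{A}^{-1}) with \mathbf{A} = \mathrm{diag}(\alpha_1, \dots, \alpha_L) and p(\boldsymbol{\alpha}) = \prod_j \mathrm{Gam}(\alpha_j \mid a_0, b_0). Below is a minimal sketch of the whole VB-ARD loop, assuming the Bishop-style updates; this is my reconstruction (function name and toy data mine, no calibration offset), not the actual implementation:

```python
import numpy as np

def vb_logistic_ard(S, t, a0=1e-6, b0=1e-6, n_iter=100):
    """VB logistic regression fusion with an ARD prior (sketch).

    S : (N, D) array of per-trial classifier scores.
    t : (N,) array of labels in {0, 1} (target / non-target).
    Returns the posterior mean m and covariance Sigma of the fusion weights.
    """
    N, D = S.shape
    E_alpha = np.ones(D)   # E[alpha_j]: one ARD precision per classifier
    z = np.ones(N)         # one Jaakkola-Jordan parameter per trial
    for _ in range(n_iter):
        lam = np.tanh(z / 2.0) / (4.0 * z)
        # q(w) = N(m, Sigma) given the current bound and precisions
        Sigma = np.linalg.inv(np.diag(E_alpha) + 2.0 * (S.T * lam) @ S)
        m = Sigma @ (S.T @ (t - 0.5))
        # re-tighten the bound: z_n^2 = s_n^T E[w w^T] s_n
        M2 = Sigma + np.outer(m, m)
        z = np.sqrt(np.einsum('nd,de,ne->n', S, M2, S))
        # q(alpha_j) = Gam(a0 + 1/2, b0 + E[w_j^2] / 2)
        E_alpha = (a0 + 0.5) / (b0 + 0.5 * (m**2 + np.diag(Sigma)))
    return m, Sigma

# toy check: the third "classifier" is pure noise and should get weight ~0
rng = np.random.default_rng(0)
t = rng.integers(0, 2, 500).astype(float)
S = np.column_stack([t + 0.5 * rng.standard_normal(500),
                     t + 0.8 * rng.standard_normal(500),
                     rng.standard_normal(500)])
m, _ = vb_logistic_ard(S, t)
print(np.round(m, 3))
```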

Our setup uses NIST SRE 2008: the extended trial list is split into two lists, one for training the fusion device and the other as the cross-validation set. Then we evaluate on the NIST SRE 2010 core set.

Here we can see the results of variational Bayes logistic regression. I forgot to mention that in this setting no cross-validation is needed at all, so the complexity of the operation is about the same as the standard FoCal-style recipe.

In the itv-itv condition my result is the best one: minDCF gets a slight improvement, but the actual DCF shows that the result is not calibrated.

Unfortunately, only two classifier weights were driven to zero; for reasons I can't explain, the others did not get anywhere near zero.

Searching the literature, there are many comments about ARD, and some people complain that ARD tends to under-fit the data. I guess the solution is that, instead of ARD, I have to use another prior with stronger regularization ability.

Looking at the other conditions, in some cases the standard logistic regression actually performs better. On the other hand, in the tel-tel case I got some improvement in equal error rate (EER) just with the standard method, which was an unintended consequence.

The interesting point here is that there is quite a big problem with calibration: at least this variational Bayes logistic regression does not calibrate well. So we would need an extra calibration step. I have not tried that yet; I am more interested in changing the prior than in working on this baseline.

But on the other hand, we produce scores, so we can of course add subset selection on top of this result. This is a bit ad hoc in the case of variational Bayes: I can impose an L0 norm on the result, which forces a sparse solution.

You will see here what happens in standard logistic regression when we scan over the ensemble size: there is a clear minimum between sizes 8 and 9, although our predicted subset size is actually smaller than that.

This is the logistic regression baseline. We see a gain from 3.55 to 3.40 in EER. The ORACLE, on the other hand, tells us the correct size and the correct subset, and there we have a large gain.

But the interesting point is that when we apply variational Bayes, the behavior is much closer to our prediction; we can follow the ORACLE bound closely. Unfortunately, because the actual DCF is not well calibrated, it does not behave as nicely, but we could do post-calibration.

Here is the comparison table; I only did this for the itv-itv set. We get an improvement in equal error rate, from 3.48 for the full set to 3.37 for the subset, and now only 6 classifiers are left in our pool.

Overall, the subset search has exponential time complexity in the size of the original classifier pool. I have 12 classifiers in total, so there are only 2^12 = 4096 subsets and it does not cost that much, but it is still impractical in real applications.
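For illustration, such an exhaustive L0-style scan could look like the sketch below (toy code of mine, not the actual evaluation pipeline; in the real setup the fusion would be trained on one trial list and the EER measured on a held-out list):

```python
import itertools
import numpy as np

def fit_lr(X, t, lr=0.5, n_iter=500):
    """Plain (unregularized) logistic regression fusion by gradient ascent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w += lr * X.T @ (t - p) / len(t)
        b += lr * np.mean(t - p)
    return X @ w + b                      # fused scores

def eer(scores, t):
    """Equal error rate: where miss and false-alarm rates cross."""
    thr = np.sort(scores)
    fa = np.array([(scores[t == 0] >= c).mean() for c in thr])
    miss = np.array([(scores[t == 1] < c).mean() for c in thr])
    k = np.argmin(np.abs(fa - miss))
    return (fa[k] + miss[k]) / 2.0

def oracle_scan(S, t):
    """Exhaustive subset scan: 2^D - 1 fusion fits, hence exponential in D."""
    D = S.shape[1]
    subsets = (c for k in range(1, D + 1)
                 for c in itertools.combinations(range(D), k))
    return min((eer(fit_lr(S[:, list(c)], t), t), c) for c in subsets)
```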

Now I will reintroduce my regularization parameter here, since it can bring some benefits. The gap between standard logistic regression and the VB version is not great, but...

It is possible to add extra parameters to our regularization. One option is the elastic net; now we are back to the MAP estimate. The elastic net is basically a convex combination of the L1 and L2 penalties.
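As a sketch (the mixing weight \rho is my notation), the elastic-net penalty is

R(\mathbf{w}) = \lambda \left[ \rho \lVert \mathbf{w} \rVert_1 + \frac{1 - \rho}{2} \lVert \mathbf{w} \rVert_2^2 \right], \qquad \rho \in [0, 1],

which interpolates between the LASSO (\rho = 1) and ridge regression (\rho = 0).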

One possibility we haven't talked about yet is that sometimes the LASSO regularizes too hard, so we can restrict the LASSO so that one classifier is not regularized. This method is called the restricted LASSO.
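As I understand it, the restriction simply drops one chosen classifier, say k, out of the penalty (notation mine):

R(\mathbf{w}) = \lambda \sum_{j \ne k} |w_j|,

so w_k can grow freely while the remaining weights are still pushed toward zero.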

Here we see the results for the complete set. For the itv-itv condition the ensemble size of the LASSO is 6, and it is a computationally efficient method. For interview-telephone we observe similar performance, with the restricted LASSO getting the best result.

The interesting thing about the restricted LASSO is this: we leave one classifier unregularized, and we can see that the restricted LASSO produces a smaller ensemble than the original LASSO for the interview-telephone sub-condition. So we selected one classifier that does not participate in the regularization, yet it still causes other classifiers to go to zero. I can't explain this behavior, but it produces an interesting ensemble size.

Here is the telephone-telephone sub-condition; the EER shown is for variational Bayes. This is the only condition where sparsity did not help in terms of actual DCF; in the other conditions, sparsity does help.

It is surprising that ARD was not able to drive more classifiers to zero than the standard variational Bayes. On the other hand, it is possible that the prior simply has to be stronger.

The elastic net shows the most promise, but we were not able to estimate its parameters efficiently. In future work we will study methods to automatically learn the elastic-net prior hyperparameters. Of course we will try to do this with variational Bayes; we then get hyperprior parameters, but we can set those to be non-informative.

The comparison between standard logistic regression and the VB method is then totally fair. There are studies in the literature, in different fields, where people do this kind of thing: they observe that tuning the regularization parameter by cross-validation brings better performance, while this kind of Bayesian approach brings stable performance. It is not quite as good, but it is more predictable, and that is my goal.