0:00:15 The next talk is "Variational Bayes Logistic Regression as Regularized Fusion for NIST SRE".
0:00:42 My name is Hautamaki, and
0:00:46 the topic is fusion;
0:00:51 I think this is the only fusion talk this time for speaker recognition.
0:01:01 This time, I have tried variational Bayes on the NIST SRE evaluation corpora.
0:01:12 OK, let's start with fusion: why do we do fusion, why don't we just use the single
0:01:21 best system? The motivation is that fusion works better than the single best system,
0:01:30 since we can combine multiple classifiers.
0:01:35 And on the other hand, when some classifiers are not well behaved on the development data,
0:01:46 fusion can help to smooth that out.
0:01:51 There is a conventional wisdom in fusion that complementary classifiers should be selected for the fusion
0:02:00 pool. How to select these systems is the main question in our work here.
0:02:08 So, if we are going to do fusion instead of one single best system, how
0:02:14 do we select the classifiers for the fusion pool, or ensemble?
0:02:20 We usually work with some notion that there is complementarity in the
0:02:27 feature sets, the classifiers, or something else.
0:02:31 But it is difficult to quantify:
0:02:35 what does complementarity actually mean?
0:02:39 We can first look at the mutual information between the classifier outputs and the class
0:02:49 label, through Fano's inequality:
0:02:53 maximizing mutual information minimizes a bound on the classification error.
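The connection via Fano's inequality can be made explicit; a standard weak form (not shown on the slide) for predicting class Y from classifier outputs Ŷ is:

```latex
P_e \;\ge\; \frac{H(Y) - I(\hat{Y};Y) - 1}{\log_2 |\mathcal{Y}|}
```

so increasing the mutual information I(Ŷ;Y) lowers this bound on the error probability P_e.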
0:03:03 Gavin Brown, in the 2009 paper "An Information Theoretic Perspective on Multiple Classifier Systems", showed that
0:03:12 the multi-way mutual information, where we take all classifiers from 1 to L (the
0:03:21 potential pool), can be decomposed into three different terms.
0:03:27 The first term is very familiar to us: it is basically the sum of the individual classifiers'
0:03:38 mutual information with the class label. We usually try to maximize this term,
0:03:42 and indeed maximizing it pushes the total mutual information up.
0:03:48 But we have to subtract the second term, which is not so nice.
0:03:55 This term, which I am not going to discuss in much detail, is a
0:04:07 kind of mutual information too,
0:04:11 but here we only have the classifier outputs; the class label does not appear.
0:04:18 It is basically a correlation term, taken over all subsets of the classifiers.
0:04:26 So minimizing this subset correlation while maximizing the first term
0:04:40 increases the total mutual information.
0:04:43 The last term is the interesting one: you again take all the subsets and
0:04:52 compute the same mutual information, but now conditioned on the class label.
0:05:00 This term is additive.
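Brown's decomposition can be written as follows (my rendering of the three terms described above, with I({X_S}) the interaction information among the outputs in subset S):

```latex
I(X_{1:L};Y) \;=\;
\underbrace{\sum_{i=1}^{L} I(X_i;Y)}_{\text{relevance}}
\;-\;
\underbrace{\sum_{k=2}^{L}\sum_{|S|=k} I(\{X_S\})}_{\text{redundancy}}
\;+\;
\underbrace{\sum_{k=2}^{L}\sum_{|S|=k} I(\{X_S\}\mid Y)}_{\text{conditional redundancy}}
```

The second term is subtracted and the third is added, which is what drives the conclusion below.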
0:05:04 The conclusion is that we should not only minimize the correlation between classifiers at this
0:05:14 higher level, over groups of increasing ensemble size, but also
0:05:24 maximize the conditional correlation, given the class label.
0:05:29 This is hard to act on directly, but it gives some kind
0:05:38 of idea that the conventional wisdom on complementarity might not be so accurate.
0:05:46 The topic of this talk is that we use sparsity to do this kind
0:05:56 of selection task automatically. We do not consider the decomposition explicitly,
0:06:02 and I do not try to hand-optimize the ensemble.
0:06:17 We do not base the selection on optimizing any diversity measure; we don't try to
0:06:25 do anything like that.
0:06:28 Instead, I use sparse regression to optimize the ensemble size.
0:06:35 But regularized regression introduces an extra parameter, the regularization parameter, which is not very nice.
0:06:45 I would like to get rid of this parameter, so I treat it as a hyper-
0:06:50 parameter and optimize it at the same time as the actual fusion device.
0:06:57 The attempt here was to use variational Bayes, because it is a nice framework:
0:07:06 we can integrate the hyperparameter estimation into the same procedure.
0:07:13 We attempt to integrate over all parameters to get the posterior.
0:07:32 OK, the motivation is roughly this: why do we use regularization, and sparsity, in
0:07:39 the fusion?
0:07:41 Here we have two classifiers, and CWLR is the cost function.
0:07:51 The x-axis is the weight of classifier one, the y-axis the weight of classifier two. The figure on the left is for the
0:07:56 training set, the figure in the middle for the development set, and the figure on
0:08:02 the right for the final evaluation set.
0:08:06 If we use the L2 norm as regularization, we can see the following.
0:08:12 The red round point is the optimum that we find, the red cross
0:08:18 is the unconstrained optimum, and the black square is the L1 optimum. We can see
0:08:25 that on the training data the L2 optimum is close to the unconstrained optimum,
0:08:31 and for the development set this is also true. But when we move to the evaluation set,
0:08:39 you can see that the weight of classifier two has been driven to zero by L1, which is actually closer
0:08:46 to the true minimum for that set.
0:08:51 The minimum has moved; the CWLR function has changed on the evaluation set. Of course,
0:09:11 we were lucky here,
0:09:17 but suppose this happens on real data.
0:09:21 If you had optimized the unconstrained minimum, the solution would be here,
0:09:28 while the L1 solution is much closer to the true minimum.
0:09:35 So it tells us that zeroing
0:09:38 out a classifier can give us
0:09:41 a real benefit.
0:09:44 Basically, we have to find what value of the Lagrange coefficient to use. This is done by
0:09:54 cross-validation.
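This cross-validation over the L1 strength can be sketched with scikit-learn (my own illustration of the procedure, not the authors' actual setup; C = 1/lambda is the inverse regularization strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_l1_fusion(scores, labels):
    """Pick the L1 strength by cross-validation, then fit the fusion
    weights; the L1 penalty may zero out some classifiers entirely."""
    grid = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid={"C": np.logspace(-2, 2, 9)},
        cv=5, scoring="neg_log_loss",
    )
    grid.fit(scores, labels)
    return grid.best_estimator_.coef_.ravel(), grid.best_params_["C"]
```

On synthetic data with one informative and one noise classifier, the noise weight ends up much smaller (often exactly zero).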
0:10:00 We have a discriminative probabilistic framework here, and we use it to optimize the
0:10:09 fusion weights.
0:10:20 Optimizing the logistic sigmoid model leads to a cross-entropy cost,
0:10:35 and what Niko has proposed is to take the whole cross-entropy
0:10:40 and weight it by the
0:10:44 target and non-target proportions of the
0:10:47 actual training set,
0:10:51 and we also have an
0:10:53 additive term,
0:10:56 the logit of the effective prior pi.
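The weighted cross-entropy being described can be sketched as follows; this is the common prior-weighted form used in SRE score fusion (in bits, so perfectly uninformative zero scores give 1.0 at p_eff = 0.5), and the exact weighting in the talk may differ:

```python
import numpy as np

def weighted_cross_entropy(tar, non, p_eff=0.5):
    """Prior-weighted logistic cross-entropy of fused LLR scores.
    tar / non: fused scores for target and non-target trials;
    the logit of the effective prior p_eff is the additive offset."""
    offset = np.log(p_eff / (1.0 - p_eff))
    c_tar = np.mean(np.log1p(np.exp(-(tar + offset))))   # misses on targets
    c_non = np.mean(np.log1p(np.exp(non + offset)))      # false alarms
    return (p_eff * c_tar + (1.0 - p_eff) * c_non) / np.log(2.0)
```

The per-class means make the cost independent of the target/non-target counts in the training set, which is the point of the weighting.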
0:11:04 The idea of the regularization is that we do a MAP estimate where the LASSO prior
0:11:16 is the double exponential.
0:11:20 Basically, we place a prior on the regularized parameter of each
0:11:30 classifier j.
0:11:36 Here we can assume that the prior parameter is the same for all dimensions. In the case
0:11:47 of ridge regression, we have an isotropic Gaussian.
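In symbols, these are the standard forms of the two priors:

```latex
p(w_j \mid \lambda) = \frac{\lambda}{2}\, e^{-\lambda |w_j|}
\;\Rightarrow\; -\log p(\mathbf{w}) \propto \lambda \|\mathbf{w}\|_1 \quad (\text{LASSO}),
\qquad
p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})
\;\Rightarrow\; -\log p(\mathbf{w}) \propto \tfrac{\alpha}{2} \|\mathbf{w}\|_2^2 \quad (\text{ridge}).
```

So the MAP estimate under the double-exponential (Laplace) prior is exactly L1-regularized fusion, and under the isotropic Gaussian it is L2-regularized fusion.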
0:11:53 For variational Bayes, we follow the treatment in Bishop's book, chapter 10.
0:11:59 Now we do not estimate the MAP but the whole posterior. It is approximated
0:12:09 as q(w) q(a) q(t); here we factorize over all hidden parameters.
0:12:19 But we have an additional problem: the likelihood terms have to be lower-bounded by h(w, z).
0:12:30 We have one scalar z for each training score vector, and it has
0:12:39 to be optimized in the same VB loop.
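The likelihood bound referred to here is the Jaakkola-Jordan bound, in Bishop's notation (with ξ playing the role of the per-trial scalar called z above):

```latex
\sigma(x) \;\ge\; \sigma(\xi)\,
\exp\!\left( \frac{x-\xi}{2} - \lambda(\xi)\,(x^2 - \xi^2) \right),
\qquad
\lambda(\xi) = \frac{1}{4\xi}\tanh\frac{\xi}{2},
```

which makes the bounded log-likelihood quadratic in w, so q(w) stays Gaussian; each ξ_n is re-estimated inside the same VB loop.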
0:12:45 The natural choice of distribution for alpha (seen on the previous slide)
0:13:03 is a Gamma; here the Gamma is made non-informative.
0:13:13 The interesting point is that the mean of the predictive density is an inner
0:13:20 product, so it is consistent with the normal linear scoring.
0:13:31 As I explained earlier, we are interested in having a sparse solution,
0:13:46 so I tried Automatic Relevance Determination for p(w|alpha).
0:13:51 A is a diagonal matrix, so we have one precision for each classifier,
0:14:03 and a product of Gammas instead of just one Gamma.
0:14:16 The general idea is that the classifiers that do not play
0:14:25 any role are driven down to zero.
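A minimal numpy sketch of VB logistic regression with an ARD prior, along the lines of Bishop chapter 10; this is my own simplified implementation (no bias term, posterior-mean alpha updates), not the exact system from the talk:

```python
import numpy as np

def vb_logistic_ard(X, t, n_iter=50, a0=1e-6, b0=1e-6):
    """Variational Bayes logistic regression with an ARD prior:
    one Gamma-distributed precision alpha_j per input dimension,
    likelihood handled with the Jaakkola-Jordan bound."""
    N, D = X.shape
    alpha = np.ones(D)        # ARD precisions, prior Gamma(a0, b0)
    xi = np.ones(N)           # per-trial variational parameters
    for _ in range(n_iter):
        lam = np.tanh(xi / 2.0) / (4.0 * xi)
        # Gaussian posterior q(w) = N(m, S) under the quadratic bound
        S = np.linalg.inv(np.diag(alpha) + 2.0 * (X.T * lam) @ X)
        m = S @ (X.T @ (t - 0.5))
        # re-estimate each xi_n from the second moment of w
        M2 = S + np.outer(m, m)
        xi = np.clip(np.sqrt(np.einsum('nd,de,ne->n', X, M2, X)), 1e-6, None)
        # posterior-mean update of each ARD precision
        alpha = (a0 + 0.5) / (b0 + 0.5 * (m ** 2 + np.diag(S)))
    return m, S, alpha
```

With a0 = b0 near zero the Gamma is non-informative, and the precision of an irrelevant input grows large, shrinking its posterior-mean weight toward zero.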
0:14:37 Our setup uses NIST SRE 2008: the extended trial list is split into
0:14:54 two trial lists,
0:14:58 one set for training the fusion device and the other as a cross-validation set.
0:15:05 Then we evaluate on the NIST SRE 2010 core set.
0:15:13 Here we see the results of variational Bayes logistic regression.
0:15:19 I forgot to mention that in this setup no cross-validation is
0:15:27 needed at all, so the complexity of the operation is the same as the
0:15:36 standard approach.
0:15:41 In the itv-itv condition my result is the best:
0:15:52 minDCF gets a slight improvement,
0:15:56 but the actual DCF result is not calibrated.
0:16:03 Unfortunately, for some reason I
0:16:15 cannot explain, only two classifiers ended up at zero, and nothing more than that.
0:16:25 Searching the literature, there are many comments about using ARD, and some people
0:16:34 complain that ARD tends to under-fit the data.
0:16:39 I guess the solution here is that instead of ARD I have to
0:16:46 use another, stronger prior with a stronger regularization ability.
0:16:53 Looking at the other conditions, in some cases the standard logistic regression actually performs better.
0:17:07 On the other hand, in the tel-tel case I got some improvement in
0:17:16 equal error rate (EER) just with the standard version, which was an unintended consequence.
0:17:27 So the interesting point here is that there is quite a big problem with calibration: at
0:17:39 least this variational Bayes logistic regression does not calibrate well.
0:17:48 I think we need an extra calibration step; I have not tried that yet, as
0:17:56 I am more interested in changing the prior than in working on this baseline.
0:18:04 On the other hand, we do produce scores, so we can of course add
0:18:10 subset selection on top of this result.
0:18:14 This is a bit ad hoc in the case of variational Bayes:
0:18:20 I can impose an L0 norm on this result, which
0:18:27 forces a sparse solution.
0:18:34 You will see here what happens with standard logistic regression: when we scan
0:18:46 the ensemble size, there is definitely a minimum between 8 and 9.
0:19:19 That is actually smaller than our predicted subset size.
0:19:26 This is the logistic regression baseline. We see a gain from 3.55 to
0:19:36 3.40 in EER. On the other hand, the oracle tells us the correct
0:19:46 size and the correct subset, and there we have a large gain.
0:20:03 The interesting point is that when we apply variational Bayes, the behavior is much
0:20:12 closer to our prediction: we can follow the oracle bound closely.
0:20:22 Unfortunately, because the actual DCF is not well calibrated, the actual DCF
0:20:27 does not perform so nicely,
0:20:31 but we could do a post-calibration.
0:20:37 Here is the comparison table; I only did this for the itv-itv set. We get an
0:20:46 improvement in equal error rate, from 3.48 for the full set to 3.37 for the subset, and
0:20:54 now we have only 6 classifiers left in our pool.
0:21:00 Of course, the exhaustive search has exponential time complexity in the size of the original classifier pool. Here
0:21:09 I have 12 classifiers in total, so it does not cost so much, but it is
0:21:18 still impractical in a real application.
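The exponential-cost oracle search amounts to fitting a fusion on every non-empty subset and keeping the one with the lowest development error; a toy sketch of that search (my own illustration, using plain accuracy rather than the DCF/EER metrics of the talk):

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def oracle_subset(scores, labels, dev_scores, dev_labels):
    """Exhaustively search all 2^L - 1 classifier subsets and keep the
    one whose fused output has the lowest error on the development set."""
    L = scores.shape[1]
    best, best_err = None, np.inf
    for k in range(1, L + 1):
        for subset in itertools.combinations(range(L), k):
            clf = LogisticRegression().fit(scores[:, subset], labels)
            err = 1.0 - clf.score(dev_scores[:, subset], dev_labels)
            if err < best_err:
                best, best_err = subset, err
    return best, best_err
```

With L = 12 this is 4095 fusion fits, which is feasible offline but not in a deployed system.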
0:21:22 Note that this reintroduces a regularization parameter, although it can bring some benefit:
0:21:33 the gap between standard logistic regression and the VB regression is not great, but still.
0:21:48 It is possible to add extra parameters to our regularization. One option is the
0:21:59 elastic-net; there we are back to a MAP estimate. The elastic-net is basically a convex combination of the L1
0:22:10 and L2 penalties.
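This convex combination can be tried directly in scikit-learn (an illustrative MAP fit, not the authors' implementation; l1_ratio interpolates between pure L2 at 0 and pure L1 at 1, and C = 1/lambda is the overall strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_elastic_net_fusion(scores, labels, l1_ratio=0.5, C=1.0):
    """Elastic-net regularized logistic regression fusion (MAP estimate)."""
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=l1_ratio, C=C, max_iter=5000)
    clf.fit(scores, labels)
    return clf.coef_.ravel()
```

Note the two hyperparameters (l1_ratio and C) that now have to be chosen, which is exactly the estimation problem mentioned later in the talk.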
0:22:12 Another possibility, which we have not talked about, is that sometimes LASSO regularizes too
0:22:21 hard, so we can restrict LASSO so that one chosen classifier is not regularized.
0:22:27 This method is called restricted LASSO.
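Restricted LASSO can be sketched with proximal gradient descent, soft-thresholding every weight except the unpenalized one; this is my own illustrative implementation, as the talk does not specify the optimizer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def restricted_lasso(X, y, lam, free_idx, lr=0.01, n_iter=3000):
    """Logistic loss with an L1 penalty on all weights except free_idx.
    ISTA step: gradient descent on the smooth loss, then soft-threshold
    only the penalized coordinates."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
        mask = np.ones_like(w, dtype=bool)
        mask[free_idx] = False          # exempt one classifier from the penalty
        w[mask] = np.sign(w[mask]) * np.maximum(np.abs(w[mask]) - lr * lam, 0.0)
    return w
```

The exempted weight can grow freely, which changes the residuals the penalized weights have to explain, so the selected subset can differ from plain LASSO.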
0:22:31 Here are the results for the complete set. For the itv-itv condition the ensemble
0:22:44 size with LASSO is 6, and it is a computationally efficient method.
0:22:58 For interview-telephone we observe similar performance; restricted LASSO gets the best result.
0:23:14 Now, the interesting thing is in the restricted LASSO: we force one classifier to be un-
0:23:26 regularized, and we see that restricted LASSO finds a smaller ensemble than the original LASSO for the
0:23:39 interview-telephone sub-condition. So the one classifier exempted from regularization still
0:23:52 affects the others, causing more of them to go to zero.
0:24:01 I cannot fully explain this behavior, but the ensemble sizes produced here are interesting.
0:24:13 Here is the telephone-telephone sub-condition; the EER here is from variational Bayes. This is the only
0:24:27 condition where sparsity did not help in terms of actual DCF;
0:24:37 otherwise, in the other conditions, sparsity does help.
0:24:53 It is surprising that ARD was not able to drive more classifiers
0:25:02 to zero than the standard variational Bayes.
0:25:07 On the other hand, it is possible that the prior has to be stronger.
0:25:15 The elastic-net shows the most promise, but we are not yet able to estimate its parameters efficiently.
0:25:22 In future work we will study methods to automatically learn the elastic-net prior hyperparameters.
0:26:16 Of course, we will try to do this with variational Bayes: we have hyperprior parameters, but then
0:26:28 we can set those hyperprior parameters to be non-informative.
0:27:01 The comparison between standard logistic regression and the VB method is completely fair.
0:27:11 There are some studies in the literature, in different fields, where people do this
0:27:19 kind of thing. They observe that optimizing the regularization parameter by cross-validation brings better performance,
0:27:27 but using this kind of Bayesian approach brings more stable performance.
0:27:33 It is not as good, but it is more predictable. That is my goal.