0:00:18 | So, they asked me to do the introduction for the opening plenary talk here. And

0:00:23 | luckily, it's very easy to do, since we have Niko, who, as everyone knows,

0:00:29 | has been part of the Odyssey Workshops; he has become part of the institution of the

0:00:33 | Odyssey Workshops itself. He's been involved in the area of speaker and language recognition for

0:00:38 | over twenty years. He started off working at Spescom DataVoice and now he's the

0:00:44 | chief scientist at AGNITIO. He received his Ph.D. in two thousand ten from the University

0:00:49 | of Stellenbosch, where he also received

0:00:51 | his undergraduate and Master's degrees.

0:00:53 | He's been involved in various aspects of speaker and language recognition,

0:00:56 | from

0:00:57 | working on the core technologies of the

0:01:01 | classifiers themselves, from generative models to discriminatively trained models, to working on the other side:

0:01:07 | calibration and how you evaluate. And in today's talk, Niko is going to address one

0:01:12 | area where he's had a lot of contributions over the years: how we can

0:01:16 | go about evaluating the systems

0:01:18 | that we build: How do we know how well they're working, and how can we

0:01:21 | do this in a way that's going to show their utility for downstream applications? So,

0:01:25 | with that, I hand it over to Niko to begin his talk.

0:01:39 | Thank you very much, Doug.

0:01:41 | It's a pleasure to be here, thank you.

0:01:45 | So, when Haizhou invited me, he asked me to say something about calibration and fusion,

0:01:54 | which I've been doing for many

0:01:56 | years. So, I'll do so by discussing proper scoring rules, the basic principle that underlies

0:02:05 | all of this work. So, fusion you can do in many ways. Proper scoring rules

0:02:09 | are a good way to do fusion, but they're not essential for fusion. But, in

0:02:14 | my view, if you're talking about calibration, you do need proper scoring rules.

0:02:20 | So, they've been around since nineteen fifty, when the Brier score was

0:02:26 | proposed for

0:02:29 | evaluating the goodness of probabilistic weather forecasting. Since then, they've been in the

0:02:36 | statistics literature, even up to the present. In pattern recognition, machine learning and speech processing they're

0:02:43 | not that well known, but in fact, if you use maximum likelihood for generative training, |

0:02:50 | or cross-entropy for discriminative training, you are in practice using the logarithmic scoring rule.

0:02:57 | So, you've probably all used it already |
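To make that connection concrete, here is a minimal sketch of the logarithmic scoring rule and how averaging it over a labelled set gives the familiar cross-entropy objective. The prediction values and labels are invented for illustration.

```python
import math

def log_scoring_rule(q: float, event: bool) -> float:
    """Logarithmic proper scoring rule for a binary event: the cost is
    minus the log of the probability assigned to what actually happened."""
    return -math.log(q if event else 1.0 - q)

# Averaging this cost over a labelled database is exactly the
# cross-entropy objective of discriminative training.
preds = [0.9, 0.2, 0.7]          # hypothetical predicted probabilities of "rain"
labels = [True, False, True]     # what actually happened
mean_cost = sum(log_scoring_rule(q, y) for q, y in zip(preds, labels)) / len(preds)
```

Confident correct predictions cost little; confident wrong ones cost a lot, which is what makes the rule calibration-sensitive.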

0:03:01 | In the future, we may be seeing more of these

0:03:05 | scoring rules in machine learning. We've got these new restricted Boltzmann machines and

0:03:13 | other energy-based models, which are now becoming very popular. They're very difficult to train,

0:03:19 | because you can't work out the likelihood. Hyvärinen proposed a proper scoring rule as a way

0:03:26 | to attack that problem, and if you google you'll find some recent papers

0:03:30 | on that as well.

0:03:32 | So, I'll concentrate

0:03:35 | on our own application of proper scoring rules in our field, and

0:03:41 | try to promote better understanding of the concept of calibration itself, and of how to form

0:03:48 | training algorithms and evaluation measures which are calibration-sensitive.

0:03:55 | So, I'll start by outlining the problem that we are trying to solve. And then,

0:04:02 | I don't know if you can see the grey... but then I'll introduce proper scoring rules

0:04:07 | and then, the last section will be how to design proper scoring rules: there are

0:04:12 | several different ones, and how to design them to do what you want them to do.

0:04:17 | So, not all pattern recognition needs to be probabilistic. You can build a nice recognizer

0:04:24 | with an SVM classifier and you don't need to think about probabilities even to do |

0:04:29 | that. But in this talk, we're interested in probabilistic pattern recognition, where the output is |

0:04:37 | a probability, or a likelihood, or a likelihood ratio. So, if you can get the |

0:04:44 | calibration right, that form of output is more useful than just hard decisions.

0:04:51 | In machine learning and also in speech recognition, if you do probabilistic recognition, you might |

0:04:57 | be used to seeing a posterior probability as an output. An example is a phone |

0:05:02 | recognizer where there will be forty or so posterior probabilities given the

0:05:08 | input frames. But in speaker and language, there are good reasons why we want to |

0:05:13 | use class likelihoods rather than posteriors. And if there are two classes, as in speaker |

0:05:19 | recognition, then the likelihood ratio is the most convenient. So, for what I'm about to

0:05:26 | show... we can do all of those things; it doesn't really matter which of those

0:05:31 | forms we use.

0:05:33 | so, we're interested |

0:05:35 | in a pattern recognizer that takes some form of input, maybe the acoustic feature vectors,

0:05:42 | maybe an i-vector, or maybe just even a score. And then the output will be |

0:05:47 | a small number of discrete classes, for example: target and non-target in speaker recognition, or

0:05:52 | in language recognition a number of language classes.

0:05:55 | So, the output of the recognizer might be in likelihood form, so you have,

0:06:02 | given one piece of data, a likelihood for each of the classes

0:06:09 | here |

0:06:10 | if you also have a prior, and for the purposes here you can consider the

0:06:15 | prior as given, a prior distribution over the classes, then it's easy: we just plug

0:06:20 | that into Bayes' rule and you get the posterior. So, you can go from the

0:06:24 | posterior to the likelihoods or the other way round. They're equivalent, they carry the same

0:06:28 | information.
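A minimal sketch of this equivalence, assuming nothing beyond Bayes' rule itself; the numbers are invented for illustration.

```python
def posterior_from_likelihoods(likelihoods, prior):
    """Bayes' rule: combine per-class likelihoods P(data|class) with a
    given prior P(class) to obtain the posterior P(class|data)."""
    joint = [lik * pi for lik, pi in zip(likelihoods, prior)]
    z = sum(joint)                      # normalizer P(data)
    return [j / z for j in joint]

# Two classes (target, non-target), likelihood ratio 4, flat prior.
post = posterior_from_likelihoods([4.0, 1.0], [0.5, 0.5])

# Dividing posterior odds by prior odds recovers the likelihood ratio,
# so posterior and likelihoods carry the same information.
lr_recovered = (post[0] / post[1]) / (0.5 / 0.5)
```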

0:06:30 | Also, if the one side is well calibrated, we can say the other side is

0:06:34 | well calibrated as well. So, it doesn't really matter on which side of Bayes' rule

0:06:38 | we look at calibration. So,

0:06:42 | the recognizer output, for the purposes of this presentation, will be the likelihoods, and we'll

0:06:49 | look at measuring calibration on the other side of Bayes' rule.

0:06:54 | So, why is calibration necessary? |

0:06:58 | Because our models are imperfect models of the data. |

0:07:02 | Even when the model manages to extract information that could in principle discriminate with high accuracy

0:07:08 | between the classes, the probabilistic representation might not be optimal. For example, it might be

0:07:15 | overconfident. The probabilities might all be very close to zero and one,

0:07:19 | whereas the accuracy doesn't warrant that kind of high confidence. So, that's the calibration problem.

0:07:27 | So, calibration analysis will help you to detect that problem and also to fix it. |

0:07:35 | So, calibration can have two meanings: as a measure of goodness, how good is the |

0:07:39 | calibration, and also as a... as a transformation. |

0:07:42 | So, this is |

0:07:45 | what the typical transformation might look like. We have a pattern recognizer, which outputs likelihoods. |

0:07:53 | That recognizer might be based on some probabilistic model. The joint probability here, by which |

0:08:00 | I want to indicate the model can be generative. Probability of data given class or |

0:08:06 | the other way round, discriminative, probability of class given data; it doesn't matter. You're probably going to

0:08:14 | do better if you recalibrate that output, and again you could do that. This

0:08:21 | time we're modeling the scores. The likelihoods that come out of this model, we call

0:08:27 | them scores; features, if you like.

0:08:30 | In another probabilistic model the scores are simpler, of lower dimension than the original input, so they're

0:08:35 | easier to model. Again, you can do generative or discriminative modeling of the scores.

0:08:42 | What I'm about to show is going to be mostly about discriminative modeling, but you |

0:08:45 | can do generative as well. |

0:08:49 | so |

0:08:51 | Why can we call the likelihoods that come out of the second stage calibrated? Because

0:08:57 | we're going to measure them, we're going to measure how well they're calibrated and moreover, |

0:09:01 | we're going to force them to be... to be well calibrated. |

0:09:05 | so |

0:09:09 | If you send the likelihoods through Bayes' rule, then you get the posterior, and that's where

0:09:13 | we're going to measure the calibration with the proper scoring rule. |

0:09:19 | so |

0:09:20 | Obviously, you need to do this kind of measurement with a supervised evaluation database. So,

0:09:27 | you apply the proper scoring rule to every example in the database and then you

0:09:31 | average the values of the proper scoring rule. That's your measure of goodness of

0:09:36 | your recognizer on this database. And you plug that into the training algorithm as

0:09:42 | your objective function, and you can adjust the calibration parameters, and that's the way you

0:09:47 | force your calibrator to

0:09:52 | produce calibrated likelihoods.

0:09:54 | So, you can use the same assembly for fusion, if you have multiple systems feeding

0:09:59 | a final fusion point; or, more generally, you can just train your whole

0:10:05 | recognizer

0:10:08 | with the same principle.

0:10:11 | So, in summary of this part: calibration is most easily applied to the likelihoods;

0:10:16 | simple affine transforms work very well in the log-likelihood domain. But the measurement is based on

0:10:24 | the posteriors, and it's going to be done with proper scoring rules, so

0:10:28 | let's introduce proper scoring rules |

0:10:33 | I'll first talk about the classical definition of a proper scoring rule; then, a more engineering

0:10:39 | viewpoint:

0:10:42 | how you can define them via decision theory. It is also very useful to look

0:10:46 | at them via information theory; that will tell you how much information the recognizer is

0:10:51 | delivering to the user. But that won't be directly relevant to this talk, so I'll

0:10:56 | just refer you to this reference.

0:11:00 | So, we start with the |

0:11:03 | classical definition, and the sort of canonical example is weather forecasting.

0:11:11 | so |

0:11:13 | We have a weather forecaster. He predicts whether it will rain tomorrow or not and |

0:11:17 | he has a probabilistic prediction, he gives us a probability for rain. |

0:11:22 | The next day, it rains or it doesn't. How do we decide whether that was |

0:11:27 | a good probability or not? |

0:11:29 | So, it's reasonable to choose some kind of a cost function. So, you put the |

0:11:34 | probability, the prediction in there, as well as the fact whether it rained or not. |

0:11:38 | So, what should this cost function look like? |

0:11:44 | It's not so obvious how this cost function should look. If, for example, temperature was |

0:11:50 | predicted, it's easy, you can compare the predicted against the actual temperature and just compute |

0:11:57 | some kind of squared difference. But in this case, it's a probabilistic prediction, and on

0:12:03 | the day it either rains or not; there's no true probability for rain, so we

0:12:08 | can't do that kind of

0:12:09 | direct comparison.

0:12:11 | So, the solution to forming such a cost function is the family of cost functions |

0:12:19 | for proper scoring rules |

0:12:22 | and they have |

0:12:25 | two nice properties: first of all, they force the prediction to be as accurate as

0:12:30 | possible, but subject to honesty. You can't pretend that your

0:12:37 | prediction is more accurate than it actually is. So,

0:12:42 | you need these two things to work together.

0:12:47 | so |

0:12:49 | this is a simple picture of how weather forecasting might be done. You've got

0:12:55 | the data, which comes from satellites and other sensors |

0:12:59 | and the probabilistic model, and then you compute the probability for rain, given the observations |

0:13:05 | and the model: the posterior probability. So, the weather forecaster might ask himself: Do I predict

0:13:11 | what I calculated or do I output |

0:13:16 | some warping or reinterpretation of this probability, maybe that would be more useful for my |

0:13:22 | users? Maybe my boss will be happier with me if I pretend that my predictions |

0:13:27 | are more accurate than they really are? So, |

0:13:31 | If the weather forecaster trusts his model and his data,

0:13:37 | then we can't really do better than the weather forecaster; we're not weather forecasters.

0:13:42 | So, what we do want is his best probability, p, the one that he calculated, |

0:13:47 | not something else. So, how do we force him to do that? |

0:13:52 | so, we tell the weather forecaster: Tomorrow |

0:13:56 | when you've predicted some q, which might be different from p, which we really want; |

0:14:01 | we are going to evaluate you with the proper scoring rule, with this type of |

0:14:05 | cost function. Then, the weather forecaster, he doesn't know whether it's going to rain or |

0:14:10 | not. The best information he has is his prediction, p. So, he forms an expected |

0:14:16 | value for the way he's going to be evaluated tomorrow. What's my expected cost that |

0:14:23 | I'm going to be evaluated with tomorrow? So, |

0:14:26 | A proper scoring rule satisfies this expectation requirement. So,

0:14:34 | this probability, p, forms the expectation;

0:14:38 | q is what he submits; and with a

0:14:40 | proper scoring rule, you're always going to do better if you submit p instead of

0:14:48 | q. So that is the way that the proper scoring rule motivates honesty.
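The honesty argument can be checked numerically. A small sketch, using the logarithmic rule as the example: with true belief p, the expected cost of reporting q is minimized at q = p.

```python
import math

def log_rule(q, event):
    """Logarithmic proper scoring rule for a binary event."""
    return -math.log(q if event else 1.0 - q)

def expected_cost(p, q, rule):
    """Expected cost, under belief p, of reporting q tomorrow."""
    return p * rule(q, True) + (1.0 - p) * rule(q, False)

p = 0.3                                  # the forecaster's honest belief
honest = expected_cost(p, p, log_rule)   # expected cost of reporting p itself

# Any other report has a higher expected cost, so honesty is optimal.
grid = [i / 100 for i in range(1, 100)]
dishonest_best = min(expected_cost(p, q, log_rule) for q in grid if abs(q - p) > 1e-9)
```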

0:14:55 | The same mechanism also motivates him to make it more accurate. So, |

0:15:03 | he might sit down and think: If I have a bigger computer, if I launch |

0:15:07 | more satellites, I could get a better prediction. And even though I don't have the |

0:15:12 | better prediction, if I had it, I would form my expectation with the better prediction. |

0:15:19 | And the same mechanism then says: well, we would do better with the better prediction, |

0:15:24 | it's kind of obvious, but the proper scoring rule makes that obvious statement work mathematically. |

0:15:33 | Here's another view. It turns out if you form the... if you look at the |

0:15:39 | expected cost of the proper scoring rule, as a function of the predicted probability, then |

0:15:44 | you get the minima at the vertices of the probability simplex. So, |

0:15:50 | this is very much like the entropy function. In fact, if you use the logarithmic

0:15:54 | scoring rule, this is just the Shannon entropy of p. So minimizing expected cost is the same

0:16:00 | as

0:16:02 | minimizing entropy, uncertainty. So,

0:16:06 | driving down expected cost tends to favour |

0:16:13 | sharper predictions, as they sometimes call it. But it has to be subject to

0:16:18 | calibration as well. |

0:16:21 | so |

0:16:22 | why are we going on about what humans might do? Because we can motivate machines the same way. That

0:16:29 | is called discriminative training. And we can expect the same benefits.

0:16:36 | Some examples. There are many different proper scoring rules; the very well known ones are

0:16:42 | the Brier score, which has this quadratic form...

0:16:45 | I'll show a graph just now... and also the logarithmic score.

0:16:50 | In both cases it's really easy to show that they satisfy this expectation requirement.

0:16:58 | So, here's an example, in

0:17:01 | the top left. If it does rain, we're looking at the green curve. If you

0:17:07 | predicted zero probability for rain, that's bad, so the cost is high. If you predicted |

0:17:13 | one, probability one, that's good, so the cost is low. If it doesn't rain it |

0:17:18 | works the other way round. So, that's the Brier score. This is the logarithmic, very |

0:17:21 | similar, except it goes out to |

0:17:23 | infinity here. |
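The two curves can be written down directly; a small sketch (the same rules as in the text, with the event being "it rains"):

```python
import math

def brier(q, event):
    """Brier (quadratic) score: squared distance from the outcome."""
    return (1.0 - q) ** 2 if event else q ** 2

def logarithmic(q, event):
    """Logarithmic score: zero cost at the good end, unbounded at the bad end."""
    return -math.log(q if event else 1.0 - q)

# If it rains: predicting q = 1 is free, q = 0 is maximally bad.
# The Brier cost saturates at 1, while the logarithmic cost diverges.
```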

0:17:25 | If you take another view, you can do a log-odds transformation on the probability, and then you see

0:17:30 | they look very different.

0:17:33 | The logarithmic one turns out to form nice convex objective functions, which are easier

0:17:40 | to optimize numerically;

0:17:42 | the Brier score is a little bit harder to optimize.

0:17:48 | so now, let's switch to the |

0:17:51 | engineering view of proper scoring rule. So, we're building these recognizers because we actually want |

0:17:57 | to use them for some useful purpose, we want to |

0:18:03 | do whatever we're doing in a cost effective way. We want to minimize expected cost. |

0:18:08 | So, if you ask: what are the consequences of the Bayes decisions that I can make

0:18:15 | with some probabilistic prediction? Then you've really already constructed the proper scoring rule;

0:18:21 | you just have to ask that very natural question. So, all proper scoring rules can

0:18:26 | be interpreted in that way.

0:18:29 | so |

0:18:32 | I'm assuming everybody knows this. This is the example of the NIST detection cost function. |

0:18:40 | You make some decision to accept or reject and it's a target or a non-target. |

0:18:44 | And if you get it wrong there's some cost, if you get it right everything |

0:18:48 | is good, the cost is zero. So, that's the consequence |

0:18:51 | now we are using the probabilistic recognizer, which gives us this probability distribution, q, for |

0:18:58 | target, One minus q for non target. And we want to make, we want to |

0:19:03 | use that to make a decision, so we are making a minimum expected cost Bayes |

0:19:08 | decision. We are assuming that input is well calibrated, that we can use it directly |

0:19:14 | in the minimum expected cost Bayes decision. So, on the two sides of the inequality,

0:19:19 | we've got the expected costs. You choose the lowest expected cost and then you put

0:19:23 | it into the cost function. So the cost function is used twice. You see, I |

0:19:28 | highlighted the cost parameters that are used twice and the end result is then the |

0:19:33 | proper scoring rule. So, you're comparing the probability distribution over the hypotheses with the true

0:19:40 | hypothesis, and the proper scoring rule tells you how well these two match.

0:19:46 | So, this is exactly how NIST this year will form their new evaluation criterion. In all

0:19:54 | the evaluations up to two thousand and ten, they used just the DCF as is,

0:20:00 | with hard output decisions. This year they'll do the proper scoring rule

0:20:07 | and they'll ask for likelihood ratios. Of course, they have to put those through Bayes'

0:20:12 | rule to get posteriors, and then they go into the proper scoring rule.
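The construction on this slide can be sketched in a few lines. This is an illustration of the recipe, not NIST's exact normalization: make the minimum-expected-cost Bayes decision using the reported posterior q, then charge the DCF cost of that decision against the true class.

```python
def dcf_scoring_rule(q, is_target, c_miss, c_fa):
    """Proper scoring rule built from a detection cost function.
    q is the reported posterior P(target | data)."""
    # Expected cost of each decision under the reported posterior:
    exp_accept = (1.0 - q) * c_fa      # accepting risks a false alarm
    exp_reject = q * c_miss            # rejecting risks a miss
    accept = exp_accept <= exp_reject  # minimum-expected-cost Bayes decision
    # Charge the actual cost of that decision (the cost parameters are used twice):
    if is_target:
        return 0.0 if accept else c_miss
    return c_fa if accept else 0.0
```

With c_miss = c_fa the Bayes threshold on q is one half; skewing the costs moves the threshold, which is how the cost parameters enter twice, once in the decision and once in the charge.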

0:20:17 | So, we can generalise this to more than two classes, and you can really use

0:20:22 | any cost function; you can use more complicated cost functions. This recipe works:

0:20:26 | there's this trivial inequality that shows that this type of construction of a proper scoring rule

0:20:33 | satisfies the expectation requirement.

0:20:39 | so |

0:20:40 | in summary of this part, this Bayes decision interpretation tells us: if you need a

0:20:44 | proper scoring rule, take your favourite cost function,

0:20:47 | apply this recipe, apply Bayes decisions, and you'll have a proper scoring rule, and then

0:20:53 | that will measure and optimize the cost effectiveness of your recognizer.

0:21:02 | So, just a last word about the discrimination/calibration decomposition...

0:21:09 | The Bayes decision cost measures the full cost of using the probabilistic recognizer to make decisions.

0:21:17 | So, often it's useful to decompose this cost into two components. The first might be |

0:21:24 | the underlying inability of the recognizer to perfectly discriminate between the two classes. Even if |

0:21:32 | you get the calibration optimal, you still can't recognize the classes perfectly. And then, the |

0:21:38 | second component is the additional cost due to bad calibration. |

0:21:44 | So, we've all been looking, for... in my case more than a decade, at

0:21:51 | NIST's actual DCF versus minimum DCF. That's very much the same kind of decomposition, but

0:22:00 | in this case, calibration refers only to setting your decision threshold. So, if we move

0:22:06 | to probabilistic output of the recognizer, that's a more general type of calibration, so does

0:22:14 | that same recipe... can we do that same kind of decomposition? My answer is yes.

0:22:21 | I've tried it over the last few years with speaker and language recognition and in |

0:22:26 | my opinion it's a useful thing to do. So, the recipe is |

0:22:31 | at the output end of your recognizer you isolate a few parameters that you call |

0:22:37 | the calibration parameters or you might add an extra stage and call that a calibration |

0:22:42 | stage. If it's multiclass, maybe there's some debate about how to choose these parameters |

0:22:48 | once you've done that, you choose whatever proper scoring rule you're going to use for |

0:22:55 | your evaluation metric. |

0:22:57 | and you use that over your supervised evaluation database; that's then called the actual cost.

0:23:05 | Then, the evaluator goes and, using the true class labels, minimizes just those calibration parameters,

0:23:12 | and that reduces the cost

0:23:17 | somewhat. Let's call that the minimum cost, and then you can compare the

0:23:21 | actual to the minimum cost. If they are very close, you can say: my calibration

0:23:25 | was good. Otherwise, let's go back and see what went wrong.
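The actual-versus-minimum decomposition above can be sketched as follows. The scores and the crude grid search over an affine calibration are illustrative assumptions; in practice one would use a proper numerical optimizer.

```python
import math

def mean_log_cost(llrs, labels, prior=0.5):
    """Average logarithmic scoring rule on posteriors obtained by pushing
    log-likelihood ratios through Bayes' rule at the given prior."""
    total = 0.0
    for llr, is_target in zip(llrs, labels):
        log_odds = llr + math.log(prior / (1.0 - prior))
        p = 1.0 / (1.0 + math.exp(-log_odds))
        total += -math.log(p if is_target else 1.0 - p)
    return total / len(llrs)

def min_log_cost(llrs, labels, prior=0.5):
    """'Minimum' cost: re-optimize an affine calibration (scale, offset)
    using the true labels; a coarse grid search for illustration."""
    best = float("inf")
    for k in range(1, 17):
        scale = 0.25 * k
        for j in range(25):
            offset = -3.0 + 0.25 * j
            cal = [scale * s + offset for s in llrs]
            best = min(best, mean_log_cost(cal, labels, prior))
    return best

# Overconfident scores with one confident false alarm: the gap between
# actual and minimum cost is the calibration loss.
llrs = [5.0, -6.0, 3.0, -5.0, 4.0]
labels = [True, False, True, False, False]
gap = mean_log_cost(llrs, labels) - min_log_cost(llrs, labels)
```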

0:23:32 | So, in the last part of the talk we're going to play around with proper

0:23:36 | scoring rules a bit.

0:23:39 | I proposed this Cllr for use in speaker recognition in two thousand and four,

0:23:48 | but |

0:23:49 | what I want to show here is that that's not the only option:

0:23:53 | you can adjust the proper scoring rule to target your

0:24:01 | application. So, |

0:24:04 | I'll show how to do that |

0:24:06 | so, the mechanism |

0:24:09 | for, let's call them, binary proper scoring rules, is the fact that you can combine

0:24:16 | proper scoring rules:

0:24:18 | just a weighted summation of proper scoring rules, and once you do that, it's

0:24:24 | still a proper scoring rule. So, you might have multiple different proper scoring rules representing

0:24:28 | slightly different applications, applications that work in different operating points |

0:24:35 | If you do this kind of combination of those proper scoring rules, you get

0:24:40 | a new proper scoring rule that represents a mixture of applications. A real application probably

0:24:46 | is going to be used just at a single operating point. But if

0:24:50 | it's a probabilistic output, you can hope to apply it to a range of different

0:24:54 | operating points. So, this type of

0:24:58 | combination of proper scoring rules is then a nice way to

0:25:02 | evaluate that kind of more generally applicable recognizer.

0:25:08 | So, NIST is also going to do that this year in SRE twelve. They will

0:25:13 | use a combination of two discrete operating points in a proper scoring rule. So you can

0:25:20 | do discrete combinations, or continuous combinations also. The interesting thing is that all

0:25:29 | binary, two-class proper scoring rules can be described in this way; I'll show how

0:25:35 | that is done. So this DCF turns out to be the fundamental building block for

0:25:42 | two-class proper scoring rules.

0:25:45 | This is the same picture I had before. I've just normalized the cost function:

0:25:52 | having both a cost of miss and a cost of false alarm is redundant. You don't really need those

0:25:56 | two costs; we can reduce it to one parameter,

0:25:59 | because the magnitude of the proper scoring rule doesn't really tell us anything. So if

0:26:04 | you normalize it like this, then the expected cost at the decision threshold is always

0:26:09 | going to be one, no matter what the parameters, no matter what the

0:26:13 | operating point. So, the parameter that we're using

0:26:18 | is the Bayes decision threshold: we take the

0:26:20 | posterior probability for the target and

0:26:23 | we compare that to this parameter t, which is the threshold.

0:26:27 | The cost of a miss is one over t, and the cost of a false

0:26:31 | alarm is one over one minus t. You see, if t is close to

0:26:34 | zero, the one cost goes to infinity; if it's close to one, the other cost

0:26:37 | goes to infinity. So, you're covering the whole range of cost ratios just by varying this

0:26:44 | parameter t. So, we'll call this the normalized DCF scoring rule; I've got the

0:26:51 | c star notation for it, and the operating point is t.
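The normalized rule described here can be written directly; a sketch, with t the posterior Bayes decision threshold:

```python
def c_star(q, is_target, t):
    """Normalized DCF scoring rule at operating point t in (0, 1):
    miss cost 1/t, false-alarm cost 1/(1-t), and the Bayes decision
    threshold on the posterior q is then exactly t."""
    if is_target:
        return 0.0 if q >= t else 1.0 / t        # miss below the threshold
    return 0.0 if q < t else 1.0 / (1.0 - t)     # false alarm at or above it
```

At the threshold itself the expected cost is t · (1/t) = 1, so every operating point is normalized to the same height, as stated above.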

0:26:59 | So, what does it look like? |

0:27:02 | It's a very simple step function. |

0:27:04 | If your posterior probability for the target is too low, below the threshold t, you're going to miss the

0:27:09 | target, and you get hit with the miss cost.

0:27:14 | Suppose it really is the target: if p is

0:27:18 | high enough, past t, the cost is zero. If it's not the target, it's

0:27:23 | the other way round; in the step function, the red line goes up.

0:27:28 | so |

0:27:29 | Here we have four different values of t. If you adjust the parameter, the

0:27:35 | cost of miss is high and the cost of false alarm is low; if you adjust it,

0:27:38 | they change.

0:27:43 | In comparison, I've got the logarithmic scoring rule,

0:27:47 | and you'll see, it looks very similar. It tends to follow the way that the

0:27:54 | miss and false alarm costs change. So you'll find, indeed, that if you integrate over all

0:27:59 | values of t,

0:28:01 | then you will get the logarithmic scoring rule.
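This integral can be checked numerically; a sketch using simple midpoint quadrature (the c* rule is repeated here so the snippet is self-contained):

```python
import math

def c_star(q, is_target, t):
    """Normalized DCF step rule: miss costs 1/t, false alarm 1/(1-t)."""
    if is_target:
        return 0.0 if q >= t else 1.0 / t
    return 0.0 if q < t else 1.0 / (1.0 - t)

def flat_weighted_cost(q, is_target, n=20000):
    """Integrate c* over a flat weighting of t in (0, 1) by midpoint rule;
    the result approaches the logarithmic scoring rule."""
    return sum(c_star(q, is_target, (k + 0.5) / n) for k in range(n)) / n

# For a target with posterior 0.7 the integral approaches -log(0.7).
approx = flat_weighted_cost(0.7, True)
```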

0:28:07 | so |

0:28:08 | All binary proper scoring rules can be expressed as an expectation over operating

0:28:13 | points. So, the integrand here is the step function, the c star guy, as well

0:28:18 | as some weighting distribution.

0:28:21 | so |

0:28:25 | The weighting distribution is like a probability distribution: it tends to be non-negative and to integrate to

0:28:30 | one,

0:28:31 | and it determines

0:28:34 | the nature of your proper scoring rule. Several properties depend on this weighting distribution, and

0:28:39 | it also tells you what relative importance I place on different operating points.

0:28:48 | There is a rich variety of things you can do. You can make the weighting function

0:28:53 | an impulse; I shouldn't say function, it's a distribution;

0:28:57 | in mathematics an impulse is not really a function.

0:29:00 | In any case, if it's an impulse, we're looking at a single operating point. If it's

0:29:05 | a sum of impulses, we're looking at multiple operating points, discrete operating points. Or, if

0:29:12 | it's a smooth probability distribution, then we're looking at a continuous range of operating points.

0:29:20 | So, examples of the discrete ones: there's the SRE ten operating point, which

0:29:25 | is an impulse at point nine; or in SRE

0:29:33 | twelve we'll have...

0:29:37 | you're looking at two operating points, a mixture of two impulses.

0:29:42 | If you do smooth weighting, this quadratic form over here gives the Brier score,

0:29:48 | and the logarithmic score just uses a very simple constant weighting. So, the weighting matters a

0:29:54 | lot. The Brier score, if you use it for discriminative training, forms a non-

0:30:01 | convex optimization objective, which also tends not to generalize that well. If you train on

0:30:09 | this data and then use the recognizer on that data, it doesn't generalize that well, whereas

0:30:17 | the logarithmic one

0:30:19 | has a little bit of natural regularisation built in, so you can expect to do better on

0:30:24 | new data.

0:30:28 | so |

0:30:30 | You can go through this in your own time; it's just an example of how the integral works out.

0:30:35 | The step function causes the probability that you submit to the proper scoring rule to

0:30:40 | appear in the

0:30:43 | boundary of the integral. It's very simple; you get this logarithmic form.

0:30:49 | So now, let's do a case study,

0:30:52 | and let's design a proper scoring rule to target the low false alarm region for...

0:30:59 | of course, for

0:31:00 | speaker recognition.

0:31:02 | In detection, there's

0:31:04 | a range of thresholds;

0:31:07 | that's the threshold you place on the posterior probability,

0:31:10 | and that corresponds to

0:31:12 | an operating point on the DET curve.

0:31:15 | So, we can use this weighting function to tailor the proper scoring rule to target

0:31:23 | only a part of the DET curve if we want. So, George Doddington recently proposed

0:31:31 | another way to achieve the same thing. He called it Cllr and ten;

0:31:37 | it's mentioned in the new NIST evaluation plan. There's also an

0:31:41 | upcoming Interspeech paper. So,

0:31:45 | he used the standard logarithmic scoring rule, which is essentially just the same as the Cllr

0:31:50 | that I proposed. And then, he just kept the scores above some threshold, so that

0:31:59 | you can target that low false alarm region;

0:32:03 | he omitted the scores below the threshold |

0:32:07 | So, unfortunately, Cllr and ten does not quite fit into this framework of a proper scoring

0:32:13 | rule, because it's got a threshold that's dependent on the miss rate of every system, so

0:32:17 | the threshold is slightly different for different systems.

0:32:21 | To make it a proper scoring rule, I'm just saying: let's use a fixed threshold, and

0:32:26 | then let's just call it the truncated Cllr.

0:32:29 | And then you can also express

0:32:31 | the truncated Cllr just with a weighting function.

0:32:35 | So, the original Cllr, the logarithmic score, has a flat weighting distribution. Truncated Cllr uses a

0:32:43 | unit step,

0:32:44 | which steps up at wherever you want to

0:32:47 | threshold the scores.
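Integrating the c* step rule against this unit-step weighting (flat on [t0, 1], zero below, normalized to integrate to one) has a simple closed form; a sketch of that "truncated Cllr", with the caveat that under this weighting a target score below t0 incurs a constant saturated cost rather than being omitted:

```python
import math

def truncated_cllr_cost(p, is_target, t0):
    """Scoring rule whose operating-point weighting is flat on [t0, 1]
    and zero below t0 (normalized). p is the posterior for the target.
    With t0 = 0 this reduces to the plain logarithmic rule."""
    norm = 1.0 / (1.0 - t0)
    if is_target:
        # saturates at norm * log(1/t0) once p drops below t0
        return norm * math.log(1.0 / max(p, t0))
    if p <= t0:
        return 0.0   # non-target scores below the threshold cost nothing
    return norm * math.log((1.0 - t0) / (1.0 - p))
```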

0:32:51 | so |

0:32:53 | there are several different things you can do. Let's call them variations of

0:33:00 | Cllr. The original one is just the logarithmic

0:33:08 | proper scoring rule, which you need to apply to a probability. To

0:33:13 | go from a log-likelihood ratio to a probability, we need to have some prior and then

0:33:18 | apply Bayes' rule; the prior that defines Cllr is just one half.

0:33:25 | You can shift Cllr by using some other prior, and I'll show you in what

0:33:29 | sense it's shifted in a graph just after this.

0:33:34 | That mechanism has been in the FoCal toolkit, and most of us have probably

0:33:38 | used it to do calibration and fusion,

0:33:44 | but I never explicitly recommended it as an evaluation criterion.

0:33:51 | And then there's this truncated Cllr, which is very close to what George proposed; it

0:33:57 | uses unit-step weighting.

0:34:00 | So, there's this transformation between log-likelihood ratio and posterior. I'm going to show

0:34:08 | a plot where the threshold is a log-likelihood-ratio threshold, so there's this transformation,

0:34:15 | and the

0:34:16 | prior is also involved; the prior just shifts you along the x-axis. And you have

0:34:22 | to remember, this transformation has a Jacobian associated with it, since on the right you have the

0:34:28 | posterior threshold domain.

0:34:35 | So, this is what the graph looks like

0:34:37 | the

0:34:39 | x-axis is the log-likelihood-ratio threshold

0:34:43 | the y-axis is the relative weighting that the proper scoring rule assigns to different operating

0:34:49 | points. So, in this view, the weighting function is a probability distribution. It looks almost

0:34:54 | like a Gaussian, but it's not quite a Gaussian

0:35:00 | and |

0:35:01 | now what we do is... we just change the prior, and then you get the shifted

0:35:05 | Cllr, which is the green curve, which is shifted to the right. So, that's

0:35:11 | shifted towards the low false alarm region. So, I've labelled the regions here... the middle

0:35:18 | one we can call the equal error rate region, close to log-likelihood-ratio zero

0:35:23 | then the low miss rate region. If your threshold is low, you're not gonna

0:35:29 | miss so many

0:35:31 | targets. And that's the low false alarm region

0:35:35 | and the blue curve is truncated Cllr, so

0:35:42 | you basically ignore all scores on this side of the threshold and, of course, you

0:35:47 | have to scale it by a factor so that it integrates to one

0:35:53 | So now, let's look at a different option, one final option.

0:35:56 | There's this beta family of proper scoring rules that was proposed by Buja.

0:36:04 | It uses the beta distribution as the weighting distribution. It has two adjustable parameters.

0:36:12 | Why this... why not just a Gaussian? The answer is that the integrals work out if

0:36:16 | you use the beta.

0:36:18 | It's also general enough, so by adjusting these parameters we can get the Brier, the logarithmic

0:36:23 | and also the C* which we've been using here
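To see how those special cases fall out, here is a numerical sketch of the Schervish-style construction, where the two partial losses are integrals of a Beta-shaped weighting over cost thresholds (my own parametrization and function names; Buja's version may differ by constant factors):

```python
import math

def beta_losses(alpha, beta, q, n=100000):
    """Partial losses built from the weighting w(c) = c**(alpha-1) * (1-c)**(beta-1)
    over cost thresholds c, by midpoint quadrature:
      l1(q) = integral_q^1 (1-c) w(c) dc   (cost when the trial is a target)
      l0(q) = integral_0^q  c    w(c) dc   (cost when it is a non-target)"""
    def integrate(f, a, b):
        h = (b - a) / n
        return h * sum(f(a + (i + 0.5) * h) for i in range(n))
    w = lambda c: c ** (alpha - 1) * (1.0 - c) ** (beta - 1)
    l1 = integrate(lambda c: (1.0 - c) * w(c), q, 1.0)
    l0 = integrate(lambda c: c * w(c), 0.0, q)
    return l1, l0

# alpha = beta = 1 gives the Brier rule up to scale: ((1-q)**2 / 2, q**2 / 2)
# alpha = beta = 0 gives the logarithmic rule: (-ln q, -ln(1-q))
```

For the beta family these integrals are incomplete beta functions, which is exactly why the beta shape is the convenient choice over a Gaussian.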

0:36:27 | and |

0:36:29 | So it's a comfortable family to use for this purpose

0:36:34 | For this presentation I've chosen the parameters to be equal to ten and one, and

0:36:40 | that's gonna then be

0:36:43 | very similar to the truncated Cllr

0:36:46 | This is what the proper scoring rule looks like |

0:36:49 | I like this logarithm here. If p goes close to one

0:36:55 | the polynomial term doesn't do very much anymore, it's more or less constant. So, then

0:37:03 | at the very low false alarm region this just becomes the logarithmic scoring rule again.

0:37:10 | So that's what this new beta looks like. The red curve over here |

0:37:14 | it has its peak in the same place as the truncated one or the

0:37:19 | shifted one. But, compared to, for example, the shifted one, it

0:37:27 | more effectively ignores

0:37:30 | the one side of the DET curve. So, if you believe this is the way

0:37:36 | to go forward, you really do want to ignore that side of the DET curve.

0:37:40 | You can tailor your proper scoring rule to do that. So, I've not tried the |

0:37:47 | blue or the red version here myself numerically |

0:37:51 | so I cannot promise you that you're going to do well in SRE twelve

0:37:56 | if you use one of these curves. It's up to you to experiment, so I'd

0:38:00 | just like to point out: Cllr is not the only proper scoring rule

0:38:06 | They're very general, you can tailor them |

0:38:10 | play with them, see what you can get. |

0:38:16 | these guys are saying |

0:38:19 | we have to say something about multiclass |

0:38:21 | so I've got one slide on multiclass

0:38:25 | Multiclass turns out to be a lot more difficult to analyze |

0:38:31 | it's amazing, the complexity, if you go from two to three classes, the trouble you |

0:38:35 | can get into |

0:38:37 | But, it's useful to know that some of the same rules still apply. You can |

0:38:44 | construct |

0:38:45 | the proper scoring rule |

0:38:48 | choose some cost function and construct a proper scoring rule via the Bayes decision recipe. |

0:38:55 | You can also combine them, so the same rules apply. |

0:38:58 | And then, the logarithmic scoring rule is just very nice, it behaves nicely |

0:39:05 | it also turns out to be an expectation of weighted misclassification errors, very

0:39:12 | similar to what I've shown before. The integral is a lot harder to show, but it

0:39:17 | works like that. And then, the logarithmic scoring rule does form a nice evaluation criterion

0:39:23 | and a nice discriminative training criterion and

0:39:28 | it will be used as such in the Albayzin two thousand and twelve language recognition |

0:39:32 | evaluation. Nicholas here will be telling us more about that later this week. |
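For the plain (cost-uniform) multiclass case, the logarithmic rule is just the average information cost of the posterior assigned to the true class; a minimal sketch:

```python
import math

def multiclass_log_score(posteriors, labels):
    """Multiclass logarithmic scoring rule, in bits per trial: the average
    of -log2 of the posterior probability each trial assigns to its true
    class. Cost-weighted variants fit the same Bayes-decision recipe."""
    return sum(-math.log2(p[y]) for p, y in zip(posteriors, labels)) / len(labels)

print(multiclass_log_score([[1/3, 1/3, 1/3]], [0]))  # uninformative 3-class posterior → log2(3) ≈ 1.585 bits
```

As in the two-class case, an uninformative flat posterior scores the prior entropy, and confidently wrong posteriors are punished without bound.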

0:39:40 | so in conclusion |

0:39:41 | in my view, proper scoring rules are essential if you want recognizers with

0:39:47 | probabilistic output

0:39:51 | They do work well for discriminative training

0:39:54 | you have to choose the right proper scoring rule for your training, but some of

0:39:58 | them do work very well; they have a rich structure, they can be tailored, there's

0:40:02 | not only one |

0:40:04 | and in future maybe we'll see them used more generally in machine learning, even for |

0:40:11 | generative training. |

0:40:15 | Some selected references. The first one, my Ph.D. dissertation, has a lot more material about

0:40:21 | proper scoring rules and many more references

0:40:26 | and |

0:40:28 | a few questions |

0:41:08 | well |

0:41:10 | we've had a bit of a discussion |

0:41:13 | in the context of... of |

0:41:17 | a recognizer that has to recognize the age of the... of the speaker, and then if

0:41:23 | you look at the age as a... as a continuous variable, then the nature

0:41:28 | of the proper scoring rule changes.

0:41:29 | and |

there's a... there's a lot of literature on that type of proper scoring rule

0:41:37 | there are extra issues |

0:41:42 | for example, you... you have to ask |

0:41:46 | even in the multiclass case. In the multiclass case

0:41:51 | you have to ask: is there some association between the classes, are some of them

0:41:56 | closer, so that if you make an error, that error is... is

0:42:02 | well, let's take an example. If the language is really English and you... no, let's |

0:42:07 | say if the language is really one of the Chinese languages and your recognizer says it's

0:42:13 | one of the other Chinese languages, that error is not as bad as saying it's |

0:42:17 | English. |

0:42:19 | So, the logarithmic scoring rule, for example, doesn't do that. Any error is as bad

0:42:26 | as

0:42:27 | any other error.

0:42:30 | if you have a continuous range like age |

0:42:32 | if |

0:42:36 | if you... if the age is really thirty and you say it's thirty-one, that's

0:42:39 | not such a bad error. So, the logarithmic... there's a logarithmic version of the continuous

0:42:45 | scoring rule

0:42:48 | That one will not tell you that that error is excusable.

0:42:52 | So, there are ways to design scoring rules to take into account |

0:42:58 | some... some structure in the way you define your classes. |

0:43:28 | I like to think we've thought more about the problem |

0:43:34 | and I |

I think one of the reasons for that is the NIST evaluations and specifically

0:43:44 | the ... the DCF that we've been using in the NIST evaluation. |

0:43:50 | In machine learning they like to just do error rates

0:43:53 | so, by going from... from error rate to DCF it's a simple step, we're just weighting

0:43:59 | the errors

0:44:38 | You were never speaking about the constraints

0:44:41 | concerning the datasets

0:44:45 | if we are targeting

0:44:49 | some part of the

0:44:54 | curve, like the low false alarm region, we will certainly have some constraints on the

0:45:01 | dataset, to have a balanced dataset. That's my first question

0:45:05 | the second one a whole lot easier so you get to be |

0:45:08 | dataset |

0:45:11 | maybe we should start now to speak also about the quantity of information we have in the

0:45:19 | is it... I'm coming back to your example in language recognition. Is it the same |

0:45:23 | error if you

0:45:26 | choose one Chinese language when it was a different Chinese language, compared to deciding

0:45:31 | it's English |

0:45:35 | in speaker recognition is it the same error if you decide it's |

0:45:40 | it's not a target |

0:45:43 | you have nothing in the speech file, no information in the speech file, when you

0:45:48 | decide that, compared with a very good speech file, with a lot of information

0:45:57 | Here, let me answer the first ... first question, if I understood it correctly |

0:46:04 | you asked about the ... the size of your evaluation database. So, of course, that's... |

0:46:11 | that's very important |

0:46:14 | in |

0:46:16 | my presentation at the... this SRE analysis workshop in December last year

0:46:22 | I addressed that... that issue. So, if you look at this view of the proper

0:46:29 | scoring rule as an integral of error rates |

0:46:33 | then |

0:46:36 | if you move to... if you... if you run at an operating point where the error

0:46:41 | rate is going to be low, which does happen in the low false alarm region,

0:46:45 | then you need |

0:46:47 | enough data, so that you actually do get

0:46:50 | errors. If you don't have errors, how can you measure the error rate? So, one |

0:46:56 | has to be very careful |

0:47:00 | not to push your evaluation outside of the range |

0:47:03 | that the data can cover
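As a back-of-envelope illustration (my own addition, not from the talk): if the true error rate at an operating point is p, a binomial estimate from N trials has relative standard deviation sqrt((1-p)/(N*p)), so the rarer the error, the more trials you need before the estimate means anything:

```python
import math

def trials_needed(p_err, rel_std=0.3):
    """Rough number of trials so a binomial estimate of an error rate
    p_err has relative standard deviation rel_std:
    N >= (1 - p_err) / (p_err * rel_std**2)."""
    return math.ceil((1.0 - p_err) / (p_err * rel_std ** 2))

print(trials_needed(0.01))  # ~1100 trials to pin a 1% error rate down to ±30% (one sigma)
```

Pushing an evaluation toward a low false alarm region therefore multiplies the number of non-target trials the dataset must contain.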

0:47:07 | and the second question is |

0:47:13 | the case that I covered is just the basics |

0:47:17 | if you want more complicated cost functions

0:47:21 | where you want to assign different costs to different flavours of errors, that does fit |

0:47:29 | into this framework, so |

you can take any cost function

0:47:33 | as long as it doesn't do something pathological again |

0:47:38 | In the two-class case the cost function is simple; in multiclass you have to think really carefully how

0:47:43 | to

0:47:44 | construct a cost function that doesn't contradict itself

0:47:49 | once you've formed a nice cost function |

0:47:53 | you can apply this recipe |

0:47:55 | just plug it into the Bayes decision, back into the cost function, and you'll have a

0:48:00 | proper scoring rule. So, this framework does cover that. |

0:48:33 | are you dealing with people who are real? |

0:48:59 | okay |

0:49:00 | and going to tell us some more |