0:00:18So, they asked me to do the introduction for the opening plenary talk here. And
0:00:23luckily, it's very easy to do, since we have Niko, who, as everyone knows,
0:00:29has been part of the Odyssey Workshops; he's become part of the institution of the
0:00:33Odyssey Workshops itself. He's been involved in the area of speaker and language recognition for
0:00:38over twenty years. He started off working at Spescom DataVoice and now he's the
0:00:44chief scientist at AGNITIO. He received his Ph.D. in two thousand ten from the University
0:00:49of Stellenbosch, where he also received
0:00:51his undergraduate and Master's degrees.
0:00:53He's been involved in various aspects of speaker and language recognition,
0:00:56from
0:00:57working on the core technologies of the
0:01:01classifiers themselves, from generative models to discriminatively trained models, to working on the other side:
0:01:07calibration and how you evaluate. And today Niko is going to talk about one
0:01:12area that he's had a lot of contributions in over the years: how we can
0:01:16go about evaluating these systems.
0:01:18That is: how do we know how well they're working, and how can we
0:01:21report their outputs in a way that's going to show their utility for downstream applications? So,
0:01:25with that, I hand it over to Niko to begin his talk.
0:01:39Thank you very much, Doug.
0:01:41It's a pleasure to be here, thank you.
0:01:45So, when Haizhou invited me, he asked me to say something about calibration and fusion,
0:01:54which is what I've been doing for many
0:01:56years. So, I'll do so by discussing proper scoring rules, the basic principle that underlies
0:02:05all of this work. So, fusion you can do in many ways. Proper scoring rules
0:02:09are a good way to do fusion, but they're not essential for fusion. But, in
0:02:14my view, if you're talking about calibration, you do need proper scoring rules.
0:02:20So, they've been around since nineteen fifty, when the Brier score
0:02:26was proposed for
0:02:29evaluating the goodness of probabilistic weather forecasting. Since then, they've been in the
0:02:36statistics literature, even up to the present. In pattern recognition, machine learning and speech processing they're
0:02:43not that well known, but in fact, if you use maximum likelihood for generative training,
0:02:50or cross-entropy for discriminative training, you are in practice using the logarithmic scoring rule.
0:02:57So, you've probably all used it already
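The equivalence claimed here can be sketched in a few lines of Python (the predictions and labels below are made up for illustration): averaging the logarithmic scoring rule over a labelled set is exactly the cross-entropy objective used in discriminative training.

```python
import math

def log_score(q, label):
    """Logarithmic proper scoring rule: the cost of having predicted
    probability q for the event, once the truth (label 1 or 0) is known."""
    return -math.log(q) if label == 1 else -math.log(1.0 - q)

# hypothetical predictions and true labels
preds = [0.9, 0.2, 0.7]
labels = [1, 0, 1]

# the average logarithmic score over the set is the cross-entropy objective
avg_cost = sum(log_score(q, y) for q, y in zip(preds, labels)) / len(preds)
```

A confident, correct prediction (q close to 1 for an event that happens) contributes almost nothing to the average; a confident, wrong one contributes a very large cost.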
0:03:01In the future, we may be seeing more of proper
0:03:05scoring rules in machine learning. We've got these new restricted Boltzmann machines and
0:03:13other energy-based models, which are now becoming very popular. They're very difficult to train,
0:03:19because you can't work out the likelihood. Hyvarinen proposed a proper scoring rule
0:03:26to attack that problem, and if you google you'll find some recent papers
0:03:30on that as well.
0:03:32So, I'll concentrate
0:03:35on our own application of proper scoring rules in our field, and
0:03:41on promoting better understanding of the concept of calibration itself and how to form
0:03:48training algorithms and evaluation measures which are calibration-sensitive.
0:03:55So, I'll start by outlining the problem that we are trying to solve. And then,
0:04:02I don't know if you can see the grey, but then I'll... I'll introduce proper scoring rules,
0:04:07and then the last section will be how to design proper scoring rules; there are
0:04:12several different ones, and I'll show how to design them to do what you want them to do.
0:04:17So, not all pattern recognition needs to be probabilistic. You can build a nice recognizer
0:04:24with an SVM classifier and you don't need to think about probabilities even to do
0:04:29that. But in this talk, we're interested in probabilistic pattern recognition, where the output is
0:04:37a probability, or a likelihood, or a likelihood ratio. So, if you can get the
0:04:44calibration right, that probabilistic output is more useful than just hard decisions.
0:04:51In machine learning and also in speech recognition, if you do probabilistic recognition, you might
0:04:57be used to seeing a posterior probability as an output. An example is a phone
0:05:02recognizer where there will be forty or so posterior probabilities given
0:05:08input frames. But in speaker and language, there are good reasons why we want to
0:05:13use class likelihoods rather than posteriors. And if there are two classes, as in speaker
0:05:19recognition, then the likelihood ratio is the most convenient. So, with what I'm about
0:05:26to show, we can do all of those things; it doesn't really matter which of those
0:05:31forms we use.
0:05:33So, we're interested
0:05:35in a pattern recognizer that takes some form of input, maybe the acoustic feature vectors,
0:05:42maybe an i-vector, or maybe just even a score. And then the output will be
0:05:47a small number of discrete classes, for example: target and non-target in speaker recognition, or,
0:05:52in language recognition, a number of language classes.
0:05:55So, the output of the recognizer might be in likelihood form, so you have,
0:06:02given one piece of data, a likelihood for each of the classes.
0:06:10If you also have a prior (and for the purposes here you can consider the
0:06:15prior as given, a prior distribution over the classes), then it's easy: we just plug
0:06:20that into Bayes' rule and you get the posterior. So, you can go from the
0:06:24posterior to the likelihoods or the other way round. They're equivalent; they carry the same information.
0:06:30Also, if the one side is well calibrated, we can say the other side is
0:06:34well calibrated as well. So, it doesn't really matter on which side of Bayes' rule
0:06:38we look at calibration. So,
0:06:42the recognizer output, for the purposes of this presentation, will be here, and we'll
0:06:49look at measuring calibration on the other side of Bayes' rule.
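The two-way traffic through Bayes' rule can be sketched as follows (the likelihood and prior values are hypothetical). Note that scaling all likelihoods by a common factor leaves the posterior unchanged, which is one sense in which the two forms carry the same information.

```python
def posteriors_from_likelihoods(likelihoods, prior):
    """Plug class likelihoods and a given class prior into Bayes' rule."""
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# two classes (target, non-target) with a given prior of one half
post = posteriors_from_likelihoods([0.8, 0.1], [0.5, 0.5])   # [8/9, 1/9]
# scaling both likelihoods by the same factor gives the same posterior
same = posteriors_from_likelihoods([8.0, 1.0], [0.5, 0.5])
```

Going the other way, from posteriors back to likelihood ratios, is just a division by the prior odds, which is why calibration can be discussed on either side of Bayes' rule.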
0:06:54So, why is calibration necessary?
0:06:58Because our models are imperfect models of the data.
0:07:02Even when the model manages to extract information that could in principle discriminate with high accuracy
0:07:08between the classes, the probabilistic representation might not be optimal. For example, it might be
0:07:15overconfident. The probabilities might all be very close to zero and one,
0:07:19whereas the accuracy doesn't warrant that kind of high confidence. So, that's the calibration problem.
0:07:27So, calibration analysis will help you to detect that problem and also to fix it.
0:07:35So, calibration can have two meanings: as a measure of goodness, how good is the
0:07:39calibration, and also as a... as a transformation.
0:07:42So, this is
0:07:45what the typical transformation might look like. We have a pattern recognizer, which outputs likelihoods.
0:07:53That recognizer might be based on some probabilistic model. By the joint probability here
0:08:00I want to indicate that the model can be generative (probability of data given class) or
0:08:06the other way round, discriminative (probability of class given data); it doesn't matter. You're probably going to
0:08:14do better if you recalibrate that output, and again you could do that. This
0:08:21time we're modeling the scores. The likelihoods that come out of this model, we call
0:08:27them scores; features, if you like. We apply
0:08:30another probabilistic model: the scores are simpler, of lower dimension than the original input, so they're
0:08:35easier to model. Again, you can do generative or discriminative modeling of the scores.
0:08:42What I'm about to show is going to be mostly about discriminative modeling, but you
0:08:45can do generative as well.
0:08:51Why can we call the likelihoods that come out of the second stage calibrated? Because
0:08:57we're going to measure them, we're going to measure how well they're calibrated and, moreover,
0:09:01we're going to force them to be... to be well calibrated.
0:09:09If you send the likelihoods through Bayes' rule, then you get the posterior, and that's where
0:09:13we're going to measure the calibration with the proper scoring rule.
0:09:20Obviously, you need to do this kind of measurement with a supervised evaluation database. So,
0:09:27you apply the proper scoring rule to every example in the database and then you
0:09:31average the values of the proper scoring rule. That's your measure of goodness of
0:09:36your recognizer on this database, and you plug that into the training algorithm as
0:09:42your objective function, and you can adjust the calibration parameters, and that's the way you
0:09:47force your calibrator to
0:09:52produce calibrated likelihoods.
0:09:54So, you can use the same assembly for fusion, if you have multiple systems to
0:09:59combine at a fusion point; or, more generally, you can just train your whole system
0:10:08with the same principle.
0:10:11So, in summary of this part: calibration is easiest applied to the likelihoods;
0:10:16simple, affine transforms work very well in the log-likelihood domain; but the measurement is based on
0:10:24the posteriors, and it's done with proper scoring rules. So
0:10:28let's introduce proper scoring rules.
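The assembly just summarized, an affine calibration stage whose two parameters are trained by minimizing the average logarithmic scoring rule of the resulting posteriors, can be sketched as below. The scores are synthetic and deliberately overconfident, and the plain gradient-descent optimizer is a stand-in for whatever numerical optimizer one would really use.

```python
import math, random

def avg_log_score(params, llrs, labels):
    """Average logarithmic scoring rule of affinely calibrated LLRs,
    with the target posterior formed at a prior of one half."""
    a, b = params
    total = 0.0
    for llr, tar in zip(llrs, labels):
        z = a * llr + b                      # calibrated log-likelihood-ratio
        # -log(posterior of the true class), written stably with log1p
        total += math.log1p(math.exp(-z)) if tar else math.log1p(math.exp(z))
    return total / len(llrs)

def train_calibrator(llrs, labels, lr=0.05, steps=2000):
    """Fit the calibration parameters (a, b) by plain gradient descent
    on the average log score (a sketch, not a production optimizer)."""
    a, b = 1.0, 0.0
    n = len(llrs)
    for _ in range(steps):
        ga = gb = 0.0
        for llr, tar in zip(llrs, labels):
            p = 1.0 / (1.0 + math.exp(-(a * llr + b)))   # posterior for target
            err = p - (1.0 if tar else 0.0)              # gradient of the log score
            ga += err * llr
            gb += err
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

random.seed(1)
# hypothetical overconfident scores: well-calibrated LLRs scaled up by 3
llrs = [3 * random.gauss(1, 1) for _ in range(100)] + \
       [3 * random.gauss(-1, 1) for _ in range(100)]
labels = [True] * 100 + [False] * 100
a, b = train_calibrator(llrs, labels)
```

Training shrinks the scale parameter back down, which is exactly the "affine transform in the log-likelihood domain" mentioned above: the proper scoring rule penalizes the overconfidence, and the calibrator removes it.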
0:10:33I'll first talk about the classical definition of proper scoring rules; then a more engineering view:
0:10:42how you can define them via decision theory. It is also very useful to look
0:10:46at them in information theory; that will tell you how much information the recognizer is
0:10:51delivering to the user. But that won't be directly relevant to this talk, so I'll
0:10:56just refer you to this reference.
0:11:00So, we start with the
0:11:03classical definition, and the sort of canonical example is weather forecasting.
0:11:13We have a weather forecaster. He predicts whether it will rain tomorrow or not and
0:11:17he has a probabilistic prediction, he gives us a probability for rain.
0:11:22The next day, it rains or it doesn't. How do we decide whether that was
0:11:27a good probability or not?
0:11:29So, it's reasonable to choose some kind of a cost function. So, you put the
0:11:34probability, the prediction in there, as well as the fact whether it rained or not.
0:11:38So, what should this cost function look like?
0:11:44It's not so obvious how this cost function should look. If, for example, temperature were
0:11:50predicted, it's easy: you can compare the predicted against the actual temperature and just compute
0:11:57some kind of squared difference. But in this case, it's a probabilistic prediction, and on
0:12:03the day it rains or it doesn't; there's no true probability for rain, so we
0:12:08can't do that kind of
0:12:09direct comparison.
0:12:11So, the solution to forming such a cost function is the family of cost functions
0:12:19called proper scoring rules,
0:12:22and they have
0:12:25two nice properties: first of all, they force predictions to be as accurate as
0:12:30possible, but subject to honesty. You can't pretend that your
0:12:37prediction is more accurate than it actually is. So,
0:12:42you need these two things to work together.
0:12:49this is a simple picture of how weather forecasting might... might be done. You've got
0:12:55the data, which comes from satellites and other sensors
0:12:59and the probabilistic model, and then you compute the probability for rain, given the observations
0:13:05and the model: a posterior probability. So, the weather forecaster might ask himself: Do I predict
0:13:11what I calculated or do I output
0:13:16some warping or reinterpretation of this probability, maybe that would be more useful for my
0:13:22users? Maybe my boss will be happier with me if I pretend that my predictions
0:13:27are more accurate than they really are? So,
0:13:31If the weather forecaster trusts his model and his data,
0:13:37then we can't really do better than the weather forecaster; we're not weather forecasters.
0:13:42So, what we do want is his best probability, p, the one that he calculated,
0:13:47not something else. So, how do we force him to do that?
0:13:52so, we tell the weather forecaster: Tomorrow
0:13:56when you've predicted some q, which might be different from p, which we really want;
0:14:01we are going to evaluate you with the proper scoring rule, with this type of
0:14:05cost function. Then, the weather forecaster, he doesn't know whether it's going to rain or
0:14:10not. The best information he has is his prediction, p. So, he forms an expected
0:14:16value for the way he's going to be evaluated tomorrow. What's my expected cost that
0:14:23I'm going to be evaluated with tomorrow? So,
0:14:26A proper scoring rule satisfies this expectation requirement. So,
0:14:34this probability, p, forms the expectation,
0:14:38q is what he submits, and with a
0:14:40proper scoring rule you're always going to do better if you submit p instead of q.
0:14:48So that is the way that the proper scoring rule motivates honesty.
0:14:55The same mechanism also motivates him to make it more accurate. So,
0:15:03he might sit down and think: If I have a bigger computer, if I launch
0:15:07more satellites, I could get a better prediction. And even though I don't have the
0:15:12better prediction, if I had it, I would form my expectation with the better prediction.
0:15:19And the same mechanism then says: well, we would do better with the better prediction,
0:15:24it's kind of obvious, but the proper scoring rule makes that obvious statement work mathematically.
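This honesty mechanism can be checked numerically for the logarithmic scoring rule: the true belief p forms the expectation, and any submitted q different from p has a strictly higher expected cost (p = 0.7 below is an arbitrary illustrative belief).

```python
import math

def expected_log_cost(p, q):
    """Forecaster's expected logarithmic score: belief p forms the
    expectation, q is the probability he actually submits."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p = 0.7                       # hypothetical honest prediction
honest = expected_log_cost(p, p)

# every dishonest submission, even one very close to p, costs more in expectation
for q in (0.1, 0.3, 0.5, 0.69, 0.71, 0.9):
    assert expected_log_cost(p, q) > honest
```

The minimum of the expected cost at q = p is precisely the expectation requirement; for the logarithmic rule that minimum value is the Shannon entropy of the belief.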
0:15:33Here's another view. It turns out that if you look at the
0:15:39expected cost of the proper scoring rule as a function of the predicted probability, then
0:15:44you get the minima at the vertices of the probability simplex. So,
0:15:50this is very much like the entropy function. In fact, if you use the logarithmic
0:15:54scoring rule, this is just the Shannon entropy of p. So minimizing expected cost is the same
0:16:00as... as
0:16:02minimizing entropy, uncertainty. So,
0:16:06driving down expected cost tends to favour
0:16:13sharper predictions, as they sometimes call it. But it has to be subject to
0:16:18calibration as well.
0:16:22Why are we going on about what humans might do? Because we can motivate machines the same way. That
0:16:29is called discriminative training. And we can expect the same benefits.
0:16:36Some examples: there are many different proper scoring rules; the very well known ones are
0:16:42the Brier score, which has this form...
0:16:45I'll show... I'll show a graph just now... and also the logarithmic score.
0:16:50In both cases it's really easy to show that they do satisfy this expectation requirement.
0:16:58So, here's an example:
0:17:01at the top left, if it does rain, we're looking at the green curve. If you
0:17:07predicted zero probability for rain, that's bad, so the cost is high. If you predicted
0:17:13one, probability one, that's good, so the cost is low. If it doesn't rain it
0:17:18works the other way round. So, that's the Brier score. This is the logarithmic, very
0:17:21similar, except it goes out to
0:17:23infinity here.
0:17:25If you take another view, you can do a log-odds transformation on the probability, and then you see
0:17:30they look very different.
0:17:33The logarithmic one turns out to form nice convex objective functions, which are easier
0:17:40to optimize numerically;
0:17:42the Brier score is a little bit harder to optimize.
0:17:48So now, let's switch to the
0:17:51engineering view of proper scoring rules. So, we're building these recognizers because we actually want
0:17:57to use them for some useful purpose; we want to
0:18:03do whatever we're doing in a cost-effective way. We want to minimize expected cost.
0:18:08So, if you ask: what are the consequences of the Bayes decisions that I can make
0:18:15with some probabilistic prediction? then you've really already constructed the proper scoring rule;
0:18:21you just have to ask that very natural question. So, all proper scoring rules can
0:18:26be interpreted in that way.
0:18:32I'm assuming everybody knows this. This is the example of the NIST detection cost function.
0:18:40You make some decision to accept or reject and it's a target or a non-target.
0:18:44And if you get it wrong there's some cost, if you get it right everything
0:18:48is good, the cost is zero. So, that's the consequence
0:18:51Now we are using the probabilistic recognizer, which gives us this probability distribution: q for
0:18:58target, one minus q for non-target. And we want to
0:19:03use that to make a decision, so we are making a minimum-expected-cost Bayes
0:19:08decision. We are assuming that the input is well calibrated, so that we can use it directly
0:19:14in the minimum-expected-cost Bayes decision. So, on the two sides of the inequality,
0:19:19we've got the expected costs. You choose the lowest expected cost and then you put
0:19:23it into the cost function. So the cost function is used twice. You see, I
0:19:28highlighted the cost parameters that are used twice, and the end result is then the
0:19:33proper scoring rule. So, you're comparing the probability distribution over the hypotheses with the true
0:19:40hypothesis, and the proper scoring rule then tells you how well these two match.
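The construction just described, make the minimum-expected-cost Bayes decision using the submitted probability, then pay the cost of that decision against the true hypothesis, can be sketched as follows. The miss and false-alarm costs here are hypothetical illustration values, not the official NIST parameters.

```python
def dcf_scoring_rule(q, is_target, c_miss=10.0, c_fa=1.0):
    """Proper scoring rule built from a detection cost function.
    q is the submitted posterior probability for the target hypothesis."""
    # the cost function is used once to form the expected costs of each decision...
    expected_if_accept = (1.0 - q) * c_fa   # false alarm, if it's a non-target
    expected_if_reject = q * c_miss         # miss, if it's a target
    accept = expected_if_accept < expected_if_reject
    # ...and once more to score the hard Bayes decision against the truth
    if is_target:
        return 0.0 if accept else c_miss
    return c_fa if accept else 0.0
```

With these costs the Bayes threshold on q sits at c_fa / (c_fa + c_miss) = 1/11, so a target scored with q = 0.05 gets rejected and pays the full miss cost, while anything above the threshold gets accepted.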
0:19:46So this is exactly how NIST this year will form their new evaluation criterion. In all
0:19:54the evaluations up to two thousand and ten, they used just the DCF as is,
0:20:00with hard output decisions. This year they'll use the proper scoring rule
0:20:07and they'll ask for likelihood ratios. Of course, we have to put those through Bayes'
0:20:12rule to get posteriors, and then they go into the proper scoring rule.
0:20:17So, we can generalise this to more than two classes, and you can really use
0:20:22any cost function; you can use more complicated cost functions. This recipe works:
0:20:26there's a trivial inequality that shows that this type of construction of proper scoring rule
0:20:33satisfies the expectation requirement.
0:20:40In summary of this part, this Bayes decision interpretation tells us: if you need a
0:20:44proper scoring rule, take your favourite cost function,
0:20:47apply this recipe, apply Bayes decisions, and you'll have a proper scoring rule, and then
0:20:53that will measure and optimize the cost effectiveness of your recognizer.
0:21:02So, just a last word about the discrimination/calibration decomposition...
0:21:09The Bayes decision measures the full cost of using the probabilistic recognizer to make decisions.
0:21:17So, often it's useful to decompose this cost into two components. The first might be
0:21:24the underlying inability of the recognizer to perfectly discriminate between the two classes. Even if
0:21:32you get the calibration optimal, you still can't recognize the classes perfectly. And then, the
0:21:38second component is the additional cost due to bad calibration.
0:21:44So, we've all been looking at, in my case for more than a decade,
0:21:51NIST's actual DCF versus minimum DCF. That's very much the same kind of decomposition, but
0:22:00in that case, calibration refers only to setting your decision threshold. So, if we move
0:22:06to probabilistic output of the recognizer, that's a more general type of calibration, so does
0:22:14that same recipe... can we do that same kind of decomposition? My answer is yes.
0:22:21I've tried it over the last few years with speaker and language recognition and in
0:22:26my opinion it's a useful thing to do. So, the recipe is
0:22:31at the output end of your recognizer you isolate a few parameters that you call
0:22:37the calibration parameters or you might add an extra stage and call that a calibration
0:22:42stage. If it's multiclass, maybe there's some debate about how to choose these parameters.
0:22:48Once you've done that, you choose whatever proper scoring rule you're going to use for
0:22:55your evaluation metric,
0:22:57and you use that over your supervised evaluation database; that's then called the actual cost.
0:23:05Then, the evaluator goes and, using the true class labels, minimizes just those calibration parameters,
0:23:12and that reduces the cost
0:23:17somewhat. And then, let's call that the minimum cost, and then you can compare the
0:23:21actual to the minimum cost. If they are very close, you can say: My calibration
0:23:25was good. Otherwise, let's go back and see what went wrong.
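The evaluator's recipe can be sketched as below: compute the actual cost with a Cllr-style average logarithmic score, then minimize over the calibration parameters only. The scores are synthetic and deliberately overconfident, and the crude grid search over an affine map stands in for a proper numerical optimizer.

```python
import math, random

def cllr(llrs, labels):
    """Average logarithmic scoring rule in bits, balanced over the two
    classes (the usual Cllr form, with an effective prior of one half)."""
    tars = [l for l, y in zip(llrs, labels) if y]
    nons = [l for l, y in zip(llrs, labels) if not y]
    c_tar = sum(math.log1p(math.exp(-l)) for l in tars) / len(tars)
    c_non = sum(math.log1p(math.exp(l)) for l in nons) / len(nons)
    return (c_tar + c_non) / (2.0 * math.log(2.0))

def min_cllr_affine(llrs, labels):
    """Evaluator's minimization over the calibration parameters only:
    grid search over an affine recalibration llr -> a*llr + b (a sketch;
    a real evaluator would use a proper optimizer or the PAV algorithm)."""
    best = cllr(llrs, labels)            # start from the actual cost
    for i in range(1, 41):
        a = i / 20.0
        for j in range(-40, 41):
            b = j / 10.0
            best = min(best, cllr([a * l + b for l in llrs], labels))
    return best

random.seed(0)
# hypothetical overconfident scores: well-calibrated LLRs scaled up by 4
llrs = [4 * random.gauss(1, 1) for _ in range(150)] + \
       [4 * random.gauss(-1, 1) for _ in range(150)]
labels = [True] * 150 + [False] * 150
actual = cllr(llrs, labels)
minimum = min_cllr_affine(llrs, labels)
calibration_loss = actual - minimum      # close to zero means good calibration
```

If actual and minimum are very close, the calibration was good; a large gap sends you back to look at what went wrong in the calibration stage, while the minimum itself reflects the underlying discrimination.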
0:23:32So, in the last part of the talk we're going to play around with proper
0:23:36scoring rules a bit.
0:23:39I proposed this Cllr for use in speaker recognition in two thousand and four, but
0:23:49what I want to show here is that it's not the only option:
0:23:53you can adjust the proper scoring rule to target your
0:24:01application. So,
0:24:04I'll show how to do that.
0:24:06So, the mechanism
0:24:09for, let's call it, binary proper scoring rules, is the fact that you can combine
0:24:16proper scoring rules:
0:24:18just take a weighted summation of proper scoring rules, and once you do that it's
0:24:24still a proper scoring rule. So, you might have multiple different proper scoring rules representing
0:24:28slightly different applications, applications that work at different operating points.
0:24:35If you do this kind of combination of those proper scoring rules, you get
0:24:40a new proper scoring rule that represents a mixture of applications. So, a real application probably
0:24:46is not going to focus, it'll be used just at a single operating point. If
0:24:50it's a probabilistic output, you can hope to apply it to a range of different
0:24:54operating points. So, this type of
0:24:58combination of proper scoring rules is then a nice way to
0:25:02evaluate that kind of more generally applicable recognizer.
0:25:08So, NIST is also going to do that this year in SRE twelve: they will
0:25:13use a combination of two discrete operating points in a proper scoring rule. You can
0:25:20do discrete combinations, or continuous combinations also. The interesting thing is that all
0:25:29binary, two-class proper scoring rules can be described in this way; I'll show how
0:25:35that is done. So this DCF turns out to be the fundamental building block for
0:25:42two-class proper scoring rules.
0:25:45This is the same picture I had before. I've just normalized the cost function:
0:25:52having both a cost of miss and a cost of false alarm is redundant. You don't really need those
0:25:56two costs; we can reduce it to one parameter,
0:25:59because the magnitude of the proper scoring rule doesn't really tell us anything. So if
0:26:04you normalize it like this, then the expected cost at the decision threshold is always
0:26:09going to be one, no matter what the... what the parameters, no matter what the
0:26:13operating point. So, the parameter that we're using
0:26:18is the Bayes decision threshold.
0:26:20The posterior probability for the target,
0:26:23we compare that to this parameter t, which is the threshold.
0:26:27The cost of a miss is one over t, and the cost of a false
0:26:31alarm is one over one minus t. You see, if t is close to
0:26:34zero, the one cost goes to infinity; if it's close to one, the other cost
0:26:37goes to infinity. So, you're covering the whole range of cost ratios just by varying this
0:26:44parameter t. So, we'll call this the normalized DCF scoring rule; I've got the
0:26:51c-star notation for it, and the operating point is t.
0:26:59So, what does it look like?
0:27:02It's a very simple step function.
0:27:04If your posterior probability for the target is too low, you're going to miss the
0:27:09target: if you're below the threshold t, you get hit with the miss cost.
0:27:14If p is high enough, as it should be if it really is the target, if p
0:27:18is high enough and we pass t, the cost is zero. If it's not the target, it's
0:27:23the other way round in the step function; the red line goes up.
0:27:29Here you have four different values of t; if you adjust the parameter, then the
0:27:35cost of miss is high and the cost of false alarm is low. If you adjust it,
0:27:38they... they change.
0:27:43In comparison, I've got the logarithmic scoring rule,
0:27:47and you'll see it looks very similar. It tends to follow the way that the
0:27:54miss and false alarm costs change. So you'll find, indeed, if you integrate over all
0:27:59values of t,
0:28:01then you will get the logarithmic scoring rule.
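This integral can be verified numerically: averaging the normalized DCF step rules over all operating points t with a flat weighting reproduces the logarithmic scoring rule (midpoint-rule integration below is just an illustrative sketch).

```python
import math

def cstar(q, is_target, t):
    """Normalized DCF step scoring rule at operating point t:
    miss cost 1/t, false-alarm cost 1/(1 - t), Bayes threshold at q = t."""
    if is_target:
        return 1.0 / t if q < t else 0.0
    return 1.0 / (1.0 - t) if q >= t else 0.0

def flat_integral(q, is_target, n=100000):
    """Midpoint-rule integral of cstar over t in (0, 1), flat weighting."""
    return sum(cstar(q, is_target, (i + 0.5) / n) for i in range(n)) / n

q = 0.3
# flat_integral(q, True) approximates -log(q),
# flat_integral(q, False) approximates -log(1 - q): the logarithmic rule
```

Analytically, the target-side integral is the integral of 1/t from q to 1, which is exactly -log(q), and the non-target side works out to -log(1 - q).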
0:28:08All binary proper scoring rules can be expressed as a... as an expectation over operating
0:28:13points. So, the integrand here is the step function, the c-star guy, as well
0:28:18as some weighting distribution.
0:28:25The weighting distribution is really a distribution: it's non-negative, and it integrates to one.
0:28:31And it determines
0:28:34the nature of your proper scoring rule. Several properties depend on this weighting distribution, and
0:28:39it also tells you what relative importance you place on different operating points.
0:28:48There is a rich variety of things you can do. You can make the weighting function
0:28:53an impulse. I shouldn't say function, it's a distribution;
0:28:57in the mathematics it's not really a function.
0:29:00In any case, if it's an impulse, we're looking at a single operating point. If it's
0:29:05a sum of impulses, we're looking at multiple operating points, discrete operating points. Or, if
0:29:12it's a smooth probability distribution, then we're looking at a continuous range of operating points.
0:29:20So, examples of the discrete ones: that could be the SRE ten operating point, which
0:29:25is a step function that... I mean, an impulse at point nine, or in SRE
0:29:33twelve we'll have two impulses:
0:29:37you're looking at two operating points, a mixture of two points.
0:29:42If you do smooth weighting, this quadratic form over here gives the Brier score,
0:29:48and the logarithmic score just uses a very simple constant weighting. So, weighting matters a
0:29:54lot. The Brier score, if you use it for discriminative training, forms a non-
0:30:01convex optimization objective, which also tends not to generalize that well: if you train on
0:30:09this data and then use the recognizer on that data, it doesn't generalize that well, whereas
0:30:17the logarithmic one
0:30:19has a little bit of natural regularisation built in, so you can expect to do better on
0:30:24new data.
0:30:30You can work through this in your own time; this is just an example of how the integral works out.
0:30:35The step function causes the probability that you submit to the proper scoring rule to
0:30:40appear in the
0:30:43boundary of the integral; it's very simple, and you get this logarithmic form.
0:30:49So now, let's do a case study
0:30:52and let's design a proper scoring rule to target the low false alarm region for...
0:30:59of course, for
0:31:00speaker recognition.
0:31:04Consider a range of thresholds;
0:31:07that's the threshold you place on the posterior probability,
0:31:10and that corresponds to
0:31:12an operating point on the DET curve.
0:31:15So we can use this weighting function to tailor the proper scoring rule to target
0:31:23only a part of the DET curve if we want. So, George Doddington recently proposed
0:31:31another way to achieve the same thing. He called it Cllr-M10;
0:31:37it's mentioned in the new NIST evaluation plan. There's also an
0:31:41upcoming Interspeech paper. So,
0:31:45he used the standard logarithmic scoring rule, which is essentially just the same as the Cllr
0:31:50that I proposed. And then, he just inverted some scores above some threshold, so that
0:31:59you can target that low false alarm region,
0:32:03and he omitted the scores below the threshold.
0:32:07So unfortunately, Cllr-M10 does not quite fit into this framework of a proper scoring
0:32:13rule, because it's got a threshold that's dependent on the miss rate of every system, so
0:32:17the threshold is slightly different for different systems. To
0:32:21make it a proper scoring rule, I'm just saying: let's use a fixed threshold, and
0:32:26then let's just call it the truncated Cllr.
0:32:29And then you can also express
0:32:31the truncated Cllr just with the weighting function.
0:32:35So, the original Cllr, the logarithmic score, has a flat weighting distribution. Truncated Cllr uses a
0:32:43unit step,
0:32:44which steps up at wherever you want to
0:32:47threshold the scores.
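Restricting the c-star integral to thresholds above the step and renormalizing gives a closed form for this truncated rule. A sketch follows; the threshold value t0 = 0.9 is an arbitrary illustrative choice, not a recommended setting.

```python
import math

def truncated_log_rule(q, is_target, t0=0.9):
    """Scoring rule from a unit-step weighting: the c* step rules are
    integrated over operating points t in [t0, 1] only, with the weighting
    normalized to integrate to one."""
    z = 1.0 - t0
    if is_target:
        # integral of 1/t from max(q, t0) to 1: targets far below the
        # threshold all share the same capped cost
        return -math.log(max(q, t0)) / z
    if q <= t0:
        return 0.0   # false alarms below the threshold are ignored
    # integral of 1/(1 - t) from t0 to q
    return (math.log(1.0 - t0) - math.log(1.0 - q)) / z
```

Note the two effects that make this target the low-false-alarm region: non-targets scored below the threshold cost nothing, and targets below it pay one flat penalty rather than an ever-growing one.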
0:32:53There are several different things you can do. Let's call them varieties of...
0:33:00varieties of Cllr, variations of Cllr. The original one is just the logarithmic
0:33:08proper scoring rule, which you need to apply to a probability. To
0:33:13go from log-likelihood ratio to a probability we need to have some prior and then
0:33:18apply Bayes' rule, so the prior that defines Cllr is just one half.
0:33:25You can shift Cllr by using some other prior, and I'll show you in what
0:33:29sense it's shifted in a graph just after this.
0:33:34That mechanism has been in the FoCal toolkit, and most of us have probably
0:33:38used it to do calibration and fusion,
0:33:44but I never explicitly recommended this as an evaluation criterion.
0:33:51And then there's this truncated Cllr, which is very close to what George proposed; it
0:33:57uses a unit-step weighting.
0:34:00So, there's this transformation between log-likelihood ratio and posterior. I'm going to show
0:34:08a plot where the threshold is a log-likelihood-ratio threshold, so there's this
0:34:15transformation, and the
0:34:16prior is also involved; the prior just shifts you along the x-axis. And you have
0:34:22to remember, this transformation has a Jacobian associated with it; on the right you have the
0:34:28posterior threshold domain.
0:34:35So, this is what the graph looks like:
0:34:39the x-axis is the log-likelihood-ratio threshold,
0:34:43the y-axis is the relative weighting that the proper scoring rule assigns to different operating
0:34:49points. So, in this view, the weighting function is a probability distribution. It looks almost
0:34:54like a Gaussian; it's not quite a Gaussian.
0:35:01Now, what we do is just change the prior; then you get the shifted
0:35:05Cllr, which is the dashed green curve, shifted to the right. So, that's
0:35:11shifted towards the low false alarm region. So, I've labelled the regions here... the middle
0:35:18one we can call the equal error rate region, close to log-likelihood-ratio zero;
0:35:23this is the low miss rate region: if your threshold is low, you're not going to
0:35:29miss so many
0:35:31targets; and that's the low false alarm region.
0:35:35And the blue curve is truncated Cllr, so
0:35:42you basically ignore all scores on this side of the threshold and, of course, you
0:35:47have to scale it by a factor of ten so that it integrates to one.
0:35:53So now, let's look at a different option, one final option.
0:35:56There's this beta family of proper scoring rules, which was proposed by Buja.
0:36:04It uses the beta distribution as the weighting distribution. It has two adjustable parameters.
0:36:12Why this... why not just a Gaussian? The answer is that the integrals work out if
0:36:16you use the beta.
0:36:18It's also general enough: by adjusting these parameters we can get the Brier, the logarithmic,
0:36:23and also the c-star which we've been using here.
0:36:29So it's a comfortable family to use for this purpose.
0:36:34For this presentation I've chosen the parameters to be equal to ten and one, and
0:36:40that's then going to be
0:36:43very similar to the truncated Cllr.
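A Beta(10, 1) weighting can be plugged into the same c-star construction; the sketch below builds the resulting scoring rule by numerical integration. The parameters ten and one are the ones mentioned in the talk, but the code itself is only an illustration of the construction, not the exact rule on the slide.

```python
import math

def beta_rule(q, is_target, n=100000):
    """Scoring rule from a Beta(10, 1) weighting over operating points:
    the density w(t) = 10 * t**9 puts almost all its mass on high
    thresholds, i.e. on the low-false-alarm region."""
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        w = 10.0 * t ** 9            # Beta(10, 1) density
        if is_target:
            total += w / t if q < t else 0.0
        else:
            total += w / (1.0 - t) if q >= t else 0.0
    return total / n

# for targets the integral has a closed polynomial form: (10/9) * (1 - q**9)
```

The target-side integrand is 10·t⁸, so that side integrates to the polynomial (10/9)(1 − q⁹); the non-target side grows with q, weighting false alarms at high thresholds heavily.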
0:36:46This is what the proper scoring rule looks like.
0:36:49I like this: it's like the logarithmic. If p goes close to one,
0:36:55the polynomial term doesn't do very much anymore; it's more or less constant. So, then,
0:37:03at the very low false alarm region, this just becomes the logarithmic scoring rule again.
0:37:10So that's what this new beta looks like: the red curve over here.
0:37:14It has its peak in the same place as the truncated one or the
0:37:19shifted one. But compared to, for example, the shifted one, it
0:37:27more effectively ignores
0:37:30the one side of the DET curve. So, if you believe this is the way
0:37:36to go forward, you really do want to ignore that side of the DET curve.
0:37:40You can tailor your proper scoring rule to do that. So, I've not tried the
0:37:47blue or the red version here myself numerically,
0:37:51so I cannot promise you that you're going to do well in SRE twelve
0:37:56if you use one of these curves. It's up to you to experiment. I'd
0:38:00just like to point out: Cllr is not the only proper scoring rule.
0:38:06They're very general; you can tailor them,
0:38:10play with them, see what you can get.
0:38:16These guys are saying
0:38:19we have to say something about multiclass,
0:38:21so I have one slide on multiclass.
0:38:25Multiclass turns out to be a lot more difficult to analyze
0:38:31it's amazing, the complexity, if you go from two to three classes, the trouble you
0:38:35can get into
0:38:37But it's useful to know that some of the same rules still apply. You can
0:38:45still
0:38:48choose some cost function and construct a proper scoring rule via the Bayes decision recipe.
0:38:55You can also combine them, so the same rules apply.
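The Bayes decision recipe just mentioned can be sketched as a toy illustration (function names are mine): choose a cost function, make minimum-expected-cost decisions from the evaluated posterior, and score the posterior by the cost those decisions incur against the true class.

```python
def bayes_decision(posterior, cost):
    # Pick the action minimizing expected cost under the posterior.
    # cost[a][j] is the cost of action a when the true class is j.
    def expected(a):
        return sum(cost[a][j] * posterior[j] for j in range(len(posterior)))
    return min(range(len(cost)), key=expected)

def cost_scoring_rule(posterior, true_class, cost):
    # Proper scoring rule induced by the cost function: score the
    # posterior by the cost its Bayes decision incurs on the truth.
    return cost[bayes_decision(posterior, cost)][true_class]
```

With the zero-one cost matrix this reduces to scoring misclassification of the argmax class; other cost matrices give other members of the family.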
0:38:58And then, the logarithmic scoring rule is just very nice, it behaves nicely.
0:39:05It also turns out to be an expectation of weighted misclassification errors, very
0:39:12similar to what I've shown before. The integral is a lot harder to show, but it
0:39:17works like that. And then, the logarithmic scoring rule does form a nice evaluation criterion
0:39:23and a nice discriminative training criterion, and
0:39:28it will be used as such in the Albayzin two thousand and twelve language recognition
0:39:32evaluation. Nicholas here will be telling us more about that later this week.
0:39:40So, in conclusion:
0:39:41in my view, proper scoring rules are essential if you want recognizers
0:39:47with probabilistic output.
0:39:51They do work well for discriminative training.
0:39:54You have to choose the right proper scoring rule for your training, but some of
0:39:58them do work very well. They have a rich structure, they can be tailored; there's
0:40:02not only one.
0:40:04and in future maybe we'll see them used more generally in machine learning, even for
0:40:11generative training.
0:40:15Some selected references. The first one, my Ph.D. dissertation, has a lot more material about
0:40:21proper scoring rules and many more references.
0:40:28a few questions
0:41:10we've had a bit of a discussion
0:41:13in the context of... of
0:41:17recognizer that has to recognize the age of the... of the speaker and then if
0:41:23you see... look at the age as a... as a continuous variable, then the nature
0:41:28of the proper scoring rule changes.
0:41:31there's a... there's a lot of literature on that type of proper scoring rule
0:41:37there are extra issues
0:41:42for example, you... you have to ask
0:41:46even in the multiclass case. In the multiclass case
0:41:51you have to ask: is there some association between the classes, are some of them
0:41:56closer, so that if you make an error, that error is... is
0:42:02well, let's take an example. If the language is really English and you... no, let's
0:42:07say the language is really one of the Chinese languages and your recognizer says it's
0:42:13one of the other Chinese languages, that error is not as bad as saying it's English.
0:42:19So, the logarithmic scoring rule, for example, doesn't do that. Any error is as bad
0:42:27as... as any other error.
0:42:30if you have a continuous range like age
0:42:36if the age is really thirty and you say it's thirty-one, that's
0:42:39not such a bad error. So, there's a logarithmic version of the continuous
0:42:45scoring rule.
0:42:48That one will not tell you that the error is excusable.
0:42:52So, there are ways to design scoring rules to take into account
0:42:58some... some structure in the way you define your classes.
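To make the continuous-variable point concrete, here is a hedged sketch of one such scoring rule, the continuous ranked probability score (CRPS), on a discrete age grid; the grid and the one-hot predictions are illustrative assumptions, not anything from the talk.

```python
def crps(prob, grid, truth):
    # Continuous ranked probability score on a discrete grid:
    # the integral of (F(x) - 1{x >= truth})^2 dx, where F is the
    # predictive CDF. Unlike the logarithmic rule, it penalizes a
    # wrong prediction less when it is merely nearby.
    cdf, score = 0.0, 0.0
    for i, x in enumerate(grid):
        cdf += prob[i]
        step = 1.0 if x >= truth else 0.0
        width = grid[i + 1] - x if i + 1 < len(grid) else 1.0
        score += (cdf - step) ** 2 * width
    return score
```

Guessing age 31 when the truth is 30 scores far better under CRPS than guessing 60, which is exactly the structure the logarithmic rule ignores.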
0:43:28I like to think we've thought more about the problem
0:43:34and I
0:43:38I think one of the reasons for that are the NIST evaluations and specifically
0:43:44the ... the DCF that we've been using in the NIST evaluation.
0:43:50In machine learning they like to just use error rate;
0:43:53so going from error rate to DCF is a simple step, we're just weighting
0:43:59the errors.
0:44:38You never speak about the constraints
0:44:41concerning the datasets:
0:44:45if we are targeting
0:44:49some part of the
0:44:54curve, like the low false alarm region, we will certainly have some constraints on the
0:45:01dataset, to have a balanced dataset. That's my first question.
0:45:05And the second one:
0:45:11maybe we should start now to also speak about the quantity of information we have in the speech file.
0:45:19Is it... I'm coming back to your example in language recognition. Is it the same
0:45:23error if you
0:45:26decide it's one Chinese language when it was a different Chinese language, compared to deciding
0:45:31it's English?
0:45:35And in speaker recognition, is it the same error if you decide
0:45:40it's not a target
0:45:43when you have nothing in the speech file, no information in the speech file,
0:45:48compared to when you decide that with a speech file with adequate information?
0:45:57Here, let me answer the first ... first question, if I understood it correctly
0:46:04you asked about the ... the size of your evaluation database. So, of course, that's...
0:46:11that's very important
0:46:16In my presentation at the SRE analysis workshop in December last year,
0:46:22I addressed that issue. So, if you look at this view of the proper
0:46:29scoring rule as an integral of error rates:
0:46:36if you weight an operating point where the error
0:46:41rate is going to be low, which does happen in the low false alarm region,
0:46:45then you need
0:46:47enough data so that you actually do get
0:46:50errors. If you don't have errors, how can you measure the error rate? So, one
0:46:56has to be very careful
0:47:00not to push your evaluation outside of the range
0:47:03data can cover
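A rough sanity check in the spirit of Doddington's "rule of 30" (roughly 30 observed errors for a usefully tight relative confidence interval): the trial count below is a back-of-the-envelope minimum for the data to cover an operating point, not a formal requirement from the talk.

```python
import math

def min_trials(error_rate, min_errors=30):
    # Smallest trial count whose expected error count reaches
    # min_errors at the given true error rate.
    return math.ceil(min_errors / error_rate)
```

For example, to evaluate at a false alarm rate of 0.1% you would already want on the order of 30,000 non-target trials.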
0:47:07and the second question is
0:47:13the case that I covered is just the basics
0:47:17If you want more complicated cost functions,
0:47:21where you want to assign different costs to different flavours of errors, that does fit
0:47:29into this framework:
0:47:30you can take any cost function,
0:47:33as long as it doesn't do something pathological. In the
0:47:38two-class case the cost function is simple; in multiclass you have to think really carefully how to
0:47:44construct a cost function that doesn't contradict itself.
0:47:49once you've formed a nice cost function
0:47:53you can apply this recipe
0:47:55just plug it into the Bayes decision, and the decisions back into the cost function, and you'll have a
0:48:00proper scoring rule. So, this framework does cover that.
0:48:33are you dealing with people who are real?
0:49:00and going to tell us some more