So, good morning, and thanks very much for inviting me. As was mentioned, I'm not in mainstream speech recognition, so I hope what I chose to talk about will be interesting to you.
Before I go on, just a word about Agnitio. Agnitio is a startup that's been around since about 2004. "Agnitio" is Latin for "I recognise", and Agnitio specialises in automatic speaker recognition. It sells a range of products that make use of this technology in many different countries in the world. It has its main office in Madrid in Spain, and also offices close to Washington and in California, and we have a small research lab in South Africa, which is where I'm based.
Just to make sure we're on the same page about what we're talking about: everybody here knows what speech recognition is. Speaker recognition asks who is speaking; I often find that very difficult to explain to people, because after I've explained it for two minutes they still understand it as speech recognition. And then of course there's automatic language recognition, also called spoken language recognition: given a speech segment, which language is this?
In speaker and also in language recognition we've inherited some things from speech recognition, mostly just the acoustic modelling: the features, the MFCCs, and the GMMs. We do lag a bit behind with neural networks; we've tried them, but so far they don't work as well as the GMMs for us. We take the acoustic modelling, and then we do some relatively simple back-end recognition; it's very simple compared to your language modelling and your decoders. This talk is going to be deep rather than wide: I'm going to concentrate just on the back-end recognition part, and just on a tiny aspect of that, namely calibration. And I hope you'll find something in this talk that you can maybe use.
So, what is calibration? It concerns the goodness of soft decisions: you have a recognizer that outputs some kind of soft decision, a classifier if you want. Calibration can then be understood in two senses. First, calibration is how good the output of my recognizer is; second, it's whatever you do to make the output better. If you make the output of your recognizer better, you are calibrating it. We'll talk about both.
I'm not expecting everybody to understand this diagram; it's just a road map of what we're going to talk about, and I'll come back to it. I'm going to motivate that if you want your recognizer to output a soft decision, likelihoods rather than posteriors are what you want, and then show how to evaluate the goodness of those outputs via cross-entropy. The cross-entropy gives you a calibration-sensitive loss function to measure the goodness of the output of the recognizer. Then we can take the raw output and somehow calibrate it, so I'll talk about some simple calibrators. And then you can have a kind of feedback loop, to essentially optimize away the effect of the calibrator, and that gives you, in a sense, a calibration-insensitive loss, which tells you how well you could have done if your system had been optimally calibrated. And then you can compare the two, and that will tell you how good your calibration was.
So let's start at the beginning. The canonical speaker recognition problem we usually view as a two-class classification problem. The input is a pair of speech segments, often just called the enrollment segment and the test segment, and the output: it's class one if the segments have the same speaker, or class two if they have different speakers. As an example of a multiclass classifier we take language recognition, where we can define a number of language classes. And if the French in the audience are wondering why they're not there among all their neighbours: that's "some other language".
Let's look at the form of the output of a classifier, a recognizer. The simplest form would be to just output a hard class label. If you want a soft output, that might be a posterior distribution, or we can go to the other side of Bayes' rule and output a likelihood distribution. I'm going to motivate that the last one is preferable.

Hard decisions: some people like them, but it's a bad idea unless the error rate is really low, and it cannot make use of context, it cannot make use of independent prior information. The posterior idea: to end users it's still intuitive to understand what the posterior is telling them, and it conveys confidence.
You can recover from an error because you see it coming; you know you can make errors. And you can make optimal, minimum-expected-cost Bayes decisions if you have a posterior, so that's a much more useful output. The problem with the posterior is that the prior is implicit and hard-coded inside the posterior. You can remove it by dividing it out, but then you also need to know what the prior was.
A cleaner type of output is just the likelihood; then you can afterwards supply any prior. The only downside is that it's somewhat harder to understand, especially for end users, but we're not end users, so let's go with the likelihood. In the end, for the applications we're looking at, there really isn't that much difference between the two: if you have the posterior and the prior you can get back to the likelihood, or you can go the other way. If there's a small number of discrete classes, you can always normalize the likelihood and you've got the posterior.
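As a concrete sketch of that normalization (my code and notation, not from the talk), Bayes' rule for a handful of discrete classes is just a product and a sum:

```python
import numpy as np

def posterior_from_likelihood(likelihood, prior):
    """Bayes' rule for a small set of discrete classes:
    posterior is proportional to likelihood times prior."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

# Two classes: likelihoods favour class 1, and a flat prior keeps it that way.
print(posterior_from_likelihood(np.array([0.8, 0.2]), np.array([0.5, 0.5])))
# -> [0.8 0.2]
```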
So let's look at some examples of how we might use the likelihood. A language recognizer would output a likelihood distribution across a number of languages. But then we know that, here today, we're more likely to hear Czech being spoken on the street, and very unlikely to hear my home language, Afrikaans. You combine these two sources of information via Bayes' rule, and the posterior then gives you the complete picture. Or you could have a phone recognizer; the same sort of reasoning applies. It outputs a likelihood distribution, the prior is the context in which you're trying to recognise that phone, and then the decoder combines everything and essentially forms a kind of posterior.
Let's go to speaker recognition. This is an idealised view of what a forensic speaker recognizer might do. There was someone who was careless enough to get himself recorded while he was committing a crime, so people have this speech sample. There is a suspect, and the suspect was also nice enough to provide another speech sample, and then you want to compare the two: is this the same person or not? Because there are just two classes, you can conveniently form the likelihood ratio between those two possibilities. And then, if you have a very nice Bayesian courtroom, the court would weigh the total effect of all the other evidence as a kind of prior; and if you have a very clever judge and jury, they might act like Bayes' rule and combine these two sources of evidence. That's probably never really going to happen, but it's still a useful model for thinking: what should my output look like? What should I be thinking about if I want this likelihood ratio that I output to do its job as well as possible?
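In odds form, the combination this ideal courtroom would perform is just Bayes' rule (notation mine; S is the pair of speech samples, E the other evidence):

$$
\frac{P(\text{same}\mid S, E)}{P(\text{diff}\mid S, E)}
\;=\;
\underbrace{\frac{P(S\mid \text{same})}{P(S\mid \text{diff})}}_{\text{likelihood ratio from the recognizer}}
\times
\underbrace{\frac{P(\text{same}\mid E)}{P(\text{diff}\mid E)}}_{\text{prior odds from the other evidence}}
$$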
Now, this may sound subjective, but in practice recognizers are often badly calibrated. In my experience, if you build a speaker or a language recognizer, it's always badly calibrated; you can redesign the thing however you want, and it's going to be badly calibrated. It might be very accurate, but it's still badly calibrated. So you need to adjust its output to get the full benefit of the recognizer. The tools we need to make this happen: first of all, you need to measure the quality of the calibration, and then you need something to adjust it.
First, let's talk about the measurement. Calibration applies to both posteriors and likelihoods. It's easier to explain this whole thing in terms of posteriors, and later we'll go back to the likelihoods. I'll use two classes as a running example, again because it's easier to explain, and then we'll go to the multiclass case. Here is a recognizer, which we represent by the symbol R; all the posteriors are conditioned on R because they're output by the recognizer.

The posterior tells you two things. First, which class it favours: if the one element is greater than the other, it wants to recognise class one, in this case. But it also tells us a degree of confidence: how much greater is the one element than the other? We can form the ratio, we can take the log of the ratio, or we could look at the entropy of the distribution; anything like that will give you a measurement of the degree of confidence.
The question I'm trying to answer with this presentation is: the recognizer outputs a posterior distribution, and we also know, for this particular case, which of the two classes was really true. Was this a good posterior or not? Another example would be a weather predictor: it says there's an 80 percent chance of rain tomorrow. Tomorrow comes, and it doesn't rain. How good was that prediction?

First of all, if it says the one is greater than the other, and that was in the right direction, then we know it favours the correct class, so at least that aspect of the posterior was good. But how do we judge the degree of confidence? What can we do? We don't have a reference posterior; we're not given that in practice. We're just given the true class. What can we say about the posterior?
We'll assign some penalty function. What this graph is telling us: the recognizer outputs the posterior distribution, a posterior for each of the two classes, and we know the true class. On the bottom axis we plot the posterior for the true class, so a single case is just a single point on the x-axis. If the posterior for the true class was one, that's good: the recognizer was very certain of the thing that really happened. But if it said that something that really happened was impossible, that's very bad, so we give it a high penalty, maybe even an infinite penalty. And an infinite penalty might be a good idea: in practice, if you make a wrong decision, it can have arbitrarily bad consequences. If you like playing Russian roulette, you've got the gun in your hand, you form some posterior as to whether there's a round in the chamber at the moment, and then you pull the trigger. The consequences of a bad posterior can be arbitrarily bad, so I like this idea of the penalty going up to infinity.
I've plotted two candidate functions here. It's easy to see that this should be a monotonic function, but what should the shape be? What principle should we use to design this penalty function? We'll take an engineering approach: we'll ask what we want to use the output for, and then how well it does that.

What can we do with the posterior? Make minimum-expected-cost Bayes decisions. In a speech recognizer you might be sending these posteriors into the decoder, but in the end it's still going to make some decision at some stage; it's going to output a transcription. In the end you're always making decisions. So we just ask: how well does it make these decisions? And that very same cost that you're optimising with the minimum-expected-cost Bayes decision is going to tell you how well you did. Good posteriors can be used to make cost-effective decisions, but badly calibrated posteriors may be underconfident, or overconfident in the wrong hypothesis, and that will eventually lead to a series of unnecessarily costly errors.
So let's look at decision cost functions. Decision cost functions model the consequences of applying recognition technology in the real world. The real world is always more complex; engineers like simple models that we can optimize. This should be a very familiar idea. We first look at the case of a hard decision: the recognizer says it's class one or class two, we know whether the true class was class one or class two, and then we assign some cost coefficients. In this example I made the cost coefficients for the correct decisions zero, and for errors there's a non-zero cost. You might instead want to work in terms of rewards, or you can even have a mixture of rewards and penalties; a celebrated example is the term-weighted value of keyword spotting, which is a mixture of a reward and a penalty. In the end all of those are equivalent, and you can play around with these cost functions. For what we're going to do, it's convenient to put zeros on the correct decisions and just put the costs on the errors.
Now we apply this to a soft decision. We let the recognizer output the posterior distribution, and then, when we evaluate its goodness, we make a minimum-expected-cost Bayes decision; the Bayes decision is made without knowing the true class. Then we treat that as a hard decision and evaluate it with the cost matrix as before. What we have now is a measure of the goodness of the posterior that was output. It's a very simple thing.
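A minimal sketch of that evaluation step (the cost matrix here is a made-up example, not from the talk):

```python
import numpy as np

# cost[i, j] = cost of deciding class j when class i is true:
# zeros on the diagonal (correct), non-zero costs on the errors.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

def evaluate_posterior(posterior, true_class):
    """Make the minimum-expected-cost Bayes decision without seeing
    the true class, then score that hard decision with the cost matrix."""
    expected = posterior @ cost        # expected cost of each candidate decision
    decision = int(np.argmin(expected))
    return cost[true_class, decision]

# The Bayes decision picks class 1 even though class 0 is more probable,
# because errors on class 1 are ten times as expensive.
print(evaluate_posterior(np.array([0.7, 0.3]), true_class=0))  # -> 1.0
```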
So what have we achieved? A couple of slides ago I tried to convince you that this kind of penalty function on the left is what we want. What we've achieved is this step function: there's a threshold on the posterior, which is a function of the cost coefficients, and the cost is either some non-zero cost or zero. At least the step function has the right sense: it's big where it needs to be big and small where it needs to be small. But it's very crude, and in effect it's only evaluating the goodness of your posterior at a single operating point; it doesn't say anything about making decisions at any other operating point. So we need to find a smoother solution.
In order to smooth, let's simplify just a little bit. The Bayes decision threshold is a simple ratio of the costs, so we might think of it as: given the costs, compute the threshold. But let's turn that the other way around. Let's choose the threshold at which we're going to evaluate (we're free to choose any threshold), and then make the costs a function of the threshold. If we choose these simple reciprocal functions, we're still satisfying the above equation.
Let's look at this graphically. The recognizer outputs q1, the posterior for class one; if you just flip the axis you get q2. The penalty when class one is true would be the red curve, and the penalty when class two is true would be the blue curve. The cost coefficients are a function of the threshold, which we can adjust at will, so let's do that. We can move the threshold, and with it the cost coefficients with which we're going to be penalised will also change. If you press the threshold right against zero or one, the penalty will be infinite, but that's good, because that's where you can shoot yourself in the head. By moving the threshold while we're evaluating the goodness of the posterior, we are in fact exercising the decision-making ability of the posterior over its full range.
We're almost done; let's just look at another view of the same thing. We have the recognizer output the posterior; the posterior is compared against the threshold; the threshold is a parameter chosen by the evaluator. You also need to know the true class, and out comes the cost. Note that this is a function of three variables: the recognizer output, the true value, and this threshold parameter t. So now let's integrate out t. The integrand here is the step cost function which I plotted a few slides back. On the left-hand side we get a cost function which is now independent of the threshold, because we've integrated over the full range of the threshold, and that turns out to be just this logarithmic cost function: you take minus the logarithm of the posterior for the true class, and that is the goodness of the posterior of the recognizer. And it has this nice smooth shape which we were looking for.
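Written out for the two-class case (my reconstruction of the slide's algebra): with threshold t, the reciprocal cost coefficient C1(t) = 1/t, and q1 the posterior for the true class, integrating the step cost over all thresholds gives

$$
\int_0^1 \frac{1}{t}\,\mathbb{1}[\,q_1 < t\,]\,dt \;=\; \int_{q_1}^{1}\frac{dt}{t} \;=\; -\ln q_1 ,
$$

which is exactly the logarithmic penalty.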
That was two classes; now we're going to generalise to multiclass. Multiclass is a lot trickier, but the same general principles apply. We're still going to work with minimum-expected-cost Bayes decisions, but in this case we'll use a generalized threshold, which I'll plot for you on the next slide, and again we're going to integrate out the threshold and get a similar result.
On this slide we show the output of a three-class recognizer; I chose three classes because I can plot it here on this nice flat screen. On the horizontal axis, q1 is the posterior for class one; on the vertical axis, q2 is the posterior for class two; and q3 we don't see, but it's just the complement of the other two, so everything has to live inside the simplex. Then comes the tricky part: we now define a kind of generalized threshold. This threshold has three components, t1, t2 and t3, and we constrain them to sum to one, so the threshold is defined by this point where the lines meet, and it also lives inside the same simplex. Now, again, once we've chosen the threshold, we choose the cost function. The cost function is this little equation at the bottom; again it's just the reciprocal of the threshold coefficients. And again we can play around: we can move the threshold around the interior of the simplex, and exercise the decision-making ability of the recognizer. I should have told you: these lines, the structure of the threshold, are just the consequence of making the minimum-expected-cost Bayes decision. Once you've assigned those cost functions, that's what the threshold is going to look like: if q1 is large, you're in region R1 and you choose class one; in region R2 you choose class two; and in region R3, where the other two are small, you choose class three.
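For concreteness, here is my reconstruction of why the regions look like that: if deciding class j when class i is true costs 1/t_i, and correct decisions cost nothing, then the expected cost of deciding class j under posterior q is

$$
E_j \;=\; \sum_{i \neq j} \frac{q_i}{t_i} \;=\; \sum_{i} \frac{q_i}{t_i} \;-\; \frac{q_j}{t_j},
$$

so the minimum-expected-cost decision is $\arg\max_j q_j/t_j$. All three decisions tie exactly at q = t, which is why the region boundaries all meet at the threshold point.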
So we've seen that we can move the threshold around; now we can integrate it out. The integral would cover several slides, which I'm not going to show you, but the same kind of thing applies: we just integrate the step cost function over the threshold, and lo and behold, we get the logarithmic function again.
The whole recipe can be summarized like this. The recognizer outputs a posterior distribution, in other words an element of the posterior for each of the classes. Since we know what the true class is, we select that component and just apply the logarithm to it. If the recognizer says the probability of the true class is one, that's very good: the penalty is zero. If it says the probability of the true class is zero, that's very bad: the penalty is infinite.

All of the preceding was for just one example: one input, one output. If you have a whole database of supervised data, you can apply this to the whole database and just average the logarithmic cost, and that is cross-entropy, which I'm sure you know very well.
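The whole recipe fits in a few lines (my code, not from the talk):

```python
import numpy as np

def cross_entropy(posteriors, labels):
    """Average logarithmic cost: minus the log of the posterior assigned
    to the true class, averaged over a supervised database (in nats)."""
    rows = np.arange(len(labels))
    return -np.mean(np.log(posteriors[rows, labels]))

# Toy database: three trials, two classes.
q = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
y = np.array([0, 1, 0])
print(cross_entropy(q, y))  # ~ 0.34 nats
```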
Cross-entropy is perhaps the most well-known discriminative training objective, not just in speech recognition but in all of machine learning, and it forms the basis for all kinds of other things with other names, like MMI and logistic regression. It's perhaps not so well known that it is a way of measuring calibration. You see that appearing from time to time; for example, the book on Gaussian processes uses cross-entropy to essentially measure calibration. And in the statistics literature this thing is referred to as the logarithmic proper scoring rule. There's a whole bunch of other proper scoring rules which can be derived in a similar way, you just need to vary that integral, but the logarithmic one is very simple and generally just a good idea to use.
So let's get back to the likelihoods. This is going to be very short and simple. We start with the recipe for the posterior which I showed just now, and we just flip to the other side of Bayes' rule. Now we ask the recognizer to give us a likelihood distribution instead of a posterior distribution, and when evaluating its goodness, we send it through the softmax, or Bayes' rule if you will, and then apply the logarithm. We also have to be provided with the prior. Notice that we now need to supply a prior distribution as a parameter to this evaluation recipe.
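In code, flipping to the likelihood side only changes the front of the recipe: the evaluator supplies the prior and applies Bayes' rule (a softmax over log-likelihood plus log-prior) before taking the logarithm. A sketch under the same assumptions as before:

```python
import numpy as np
from scipy.special import logsumexp

def cross_entropy_from_loglh(log_lh, labels, prior):
    """Evaluate log-likelihoods: apply Bayes' rule with an
    evaluator-chosen prior, then average the logarithmic cost."""
    log_post = log_lh + np.log(prior)                        # unnormalized
    log_post -= logsumexp(log_post, axis=1, keepdims=True)   # normalize (softmax)
    return -np.mean(log_post[np.arange(len(labels)), labels])
```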
You're free to choose whatever prior you like; the prior does not have to reflect the proportions of the classes in your data. So if you want to emphasise one class rather than another, for example, as was mentioned this morning, to emphasise some classes with rare data, you can do that. Of course, if you have little data for one class, multiplying it by some number isn't going to make data appear magically, but the prior does give you some control over what you want to emphasise.
So let's get back to the diagram that we showed earlier. I've motivated that we want the recognizer to output likelihoods, and that cross-entropy forms a nice calibration-sensitive cost function to tell you how well it's doing. Now we can also send the output of the recognizer into a simple calibrator. A calibrator can be anything, but in general it's a good idea to make it very simple: you spend a whole lot of energy on building a strong recognizer; the calibrator should be simple and easy to do, but you can gain a lot from it. What this stage does, as I explained before, is train, or optimize, the calibrator, to tell you how well you could have done if the calibration had been good in the first place. Then you can compare the two, and the difference you can call the calibration loss. If you build a recognizer and the output is well calibrated, the calibration loss will be small and you can be very happy; otherwise you have to go and apply some calibrator before you deploy the recognizer.
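In symbols, the comparison that this feedback loop gives you is (my notation for what the diagram shows):

$$
\text{calibration loss}
\;=\;
\underbrace{C_{\text{xe}}(\text{raw outputs})}_{\text{calibration-sensitive}}
\;-\;
\underbrace{\min_{\text{calibrator}} C_{\text{xe}}(\text{calibrated outputs})}_{\text{calibration-insensitive}},
$$

where C_xe is the cross-entropy from before.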
Let me be brief about how to calibrate. The theory is very basic. We have some basic recognizer which outputs class likelihoods. We gather all the likelihoods into one vector and call it a likelihood distribution. Then we say: well, we've measured that these likelihoods are not well calibrated, they don't make good Bayes decisions, so let's put another probabilistic modelling step on top of them. To that next stage, this likelihood vector is now just a feature, or a score if you want. The original recognizer might even have been an SVM; the SVM doesn't even pretend to produce calibrated likelihoods, its output is just a score. That's fine: we can use that as the input to the next modelling stage. You have complete freedom in what you use for that next modelling stage: it can be parametric or non-parametric, it can be more or less Bayesian, it can be discriminative or generative, as long as it works.
I've tried and tested various calibration strategies. The one I'm showing you is still my favourite; it's very simple. You take the log-likelihoods, you scale them with a class-independent scale factor, and you shift them with a class-dependent offset, and that gives you a recalibrated log-likelihood. We train the coefficients, the scale and the offsets, discriminatively, typically using, again, cross-entropy, the average logarithmic cost. And because the cross-entropy optimizes calibration, this recipe optimizes calibration. This is the discriminative recipe; I've worked with generative ones as well, but this one does well.
generative ones as well as i would do
so
i might just mention that
for example if you're doing automatic
language recognition you might
extract what we call an i-vector
so the i-vector represents the whole
input segments of speech
and then you can just go and do a large multi class logistic regression
and that will outputs likelihoods
so
that already uses cross entropy as an objective function why would you need to calibrate
the to get
so the problem is
to make the labs logistic regression
well you need to regularize the regularization
we'll typically skew the calibration course now we not
optimising
the
minimum expected cost bayes decisions anymore so regularization is necessary but it's cues calibration so
in practice
it's a good idea to
how about some data
to use for calibration so
part of your data you train your original recognizer
the held out set you use for training your calibrated
so in practice we found in speaker and in language recognition
this kind of recipe
works very well
and then the stress is just another form of logistic regression
just the very much constraint
you can of course
you can multiply this the simple fact it with the full matrix if you one
that would be than unconstrained logistic regression
that also works but you have to be a bit more carefully need enough data
the general this the simple recipes is very safe and very effective usually
I'll give you one real-world example. Not quite real-world, it's the NIST evaluation, so almost real-world. We look at an example from the 2007 NIST Language Recognition Evaluation: the original accuracy of the systems that were competing in this evaluation, and then the improvement after recalibration with the recipe I've just shown. On the vertical axis is the evaluation criterion, which was defined specifically for the language recognition evaluation. It's a little bit too complicated to explain here, but it's enough to know that it is a calibration-sensitive criterion (you do better if your calibration is better) and lower is better: it's a cost function. The blue bars are the original submissions, and after being recalibrated you get an improvement in all the systems. I must mention that the recalibration was done not on the evaluation data but on some independent calibration data, so this is not a cheating recalibration.
With that done, time to summarize. The job of posteriors or likelihoods is, in the end, to make cost-effective decisions. If we're going to use recognizers for anything, in the end they make decisions: they output something hard, or they perform some action, and those are all decisions. And that's where cost comes in: if we want them to minimize cost, that cost tells us how good they are. Cross-entropy is just a representation of that same cost, smoothed over a range of operating points. And calibration can be measured and improved. I've put this presentation at the first URL here if you want to find it; I have some of my publications on calibration, and some code as well (there are some MATLAB toolkits) at the next URL; and there's also the URL of Agnitio.
So, as you see on the screen here: if anybody goes away and wonders whether they've got their recognizers well calibrated, please go and try this recipe; it can tell you how good your calibration is. That's my take-home message. We probably have time for some questions.
Question: How general is this in terms of the number of classes, all these techniques? What if you went to a thousand classes?

I honestly don't know. In language recognition we've used this with fewer than thirty languages. Of course, I think if you have lots of data, like you guys have, the technique ought to work for very many classes, but I think if you don't have enough data per class you'll be in trouble.
Question: In the language ID area we typically focus a lot on found data, and quite often across languages you'll have a mismatch in the amount of data, so differences in training. There was a lot of discussion yesterday on low-resource languages, and I expect you probably also see varying amounts of data. Could you comment on how some of the folks in the language ID area might try to address varying amounts of data for improving language ID?

Right. On this slide, what I showed is that before the likelihoods there is the prior, which you can choose, so you can use that prior to essentially weight the data: you can upweight the classes which are not well represented, so that to the cross-entropy it looks as if there's more of that class than there really is. Of course, that doesn't magically make more data. In the end, cross-entropy really just measures error rate: as I showed you, cross-entropy is constructed from a step function, so at heart it's counting errors. If there are very few errors, it's going to have a bad estimate of the error rate, and by multiplying that error rate, which is inaccurate, by a large number, you're going to multiply that inaccuracy. So you should use those kinds of reweightings with care.
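One plausible implementation of that reweighting (my sketch): average the logarithmic cost within each class and recombine with evaluator-chosen weights, so a rare class can count as much as a common one.

```python
import numpy as np

def prior_weighted_cross_entropy(log_posteriors, labels, weights):
    """Per-class average of -log posterior(true class), recombined with
    chosen weights instead of the raw class proportions in the data."""
    per_class = [np.mean(-log_posteriors[labels == k, k])
                 for k in range(len(weights))]
    return float(np.dot(weights, per_class))
```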
Question: Here in ASR, life is a bit more difficult than in speaker ID or language ID, because there you basically need the prior to produce one decision per file or per utterance, whereas people here also play with segmenting the output, recognizing chunks, computing posteriors, generating lattices, and that kind of thing. Imagine an ASR PhD student coming to you and asking what you find odd in what all these folks are doing. What would be the first thing you would point out from a calibration perspective; what would you advise?

I would need to go and study speech recognition more carefully before I would be able to answer that question.
Question: I thought the more obvious application would be score normalisation for keyword spotting. People are using discriminative techniques to, sort of, estimate the probability of error and weight them, using normalized scores. Any thoughts about how this might apply to that application?

Well, what about the additional complications? The term-weighted cost function has this nasty little thing that keeps jumping around: in what I showed you here, we assumed the cost function is known. You can make minimum-expected-cost Bayes decisions if you know what the cost is, but in the term-weighted value the cost depends on how many times the keyword is in the test data, and I don't know that, so that complicates matters considerably. You would still do very well by calibrating the output of your recognizer, but once you've got that likelihood, what are you going to do then to produce the final output that you send to the evaluator? That gets complicated, and there are all kinds of normalizations and things involved.
Question: To ask about a real-world example: I noticed that you were operating with matched data, which makes things less complicated. But I think in a lot of real-world applications you won't have that match: maybe the context is something else, the data is something else. Is this a problem in the case where you want the calibration to be good with respect to mismatch?

Right, this question about mismatch is interesting indeed, and in both cases, yes: if you use this logarithmic cost function, it tends to make the output of your recognizer good over a very wide range of operating points. Especially if you just have two classes to recognise: I showed the axis of the posterior between zero and one, but if you take the logarithm of the posterior, that axis becomes infinite, and you can move the threshold all the way from minus infinity to plus infinity. If you move it too far, you go into regions where there's no more data, no more errors, and it doesn't make any sense anymore. There's a limited range on that axis where you can do useful stuff, and the logarithmic cost function typically evaluates nicely and widely over that useful range.
If, for example, instead of taking the logarithm you take (1 - q) squared, the squared loss, sometimes called the Brier loss, you get narrower coverage, so it doesn't cover applications as widely as the logarithmic cost does.
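For reference, the Brier penalty drops out of the same threshold integral, just with a different cost weighting (my reconstruction, not shown in the talk):

$$
(1-q)^2 \;=\; \int_{q}^{1} 2\,(1-t)\,dt ,
\qquad\text{versus}\qquad
-\ln q \;=\; \int_{q}^{1} \frac{dt}{t} .
$$

The 1/t weighting blows up near the extreme thresholds, which is what spreads the logarithmic cost's evaluation over the wider range of operating points.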
You can go even further if you want; then you get a kind of exponential loss function, which is associated with boosting in machine learning. I have an Interspeech paper that explores that kind of thing in detail: what happens if you have other cost functions, not just cross-entropy. You should find a link to it on the web page. So the answer is basically: it's a very good idea to use cross-entropy. If you optimise your recognizer to have good cross-entropy, it's generally going to work for whatever you want to use it for.

Right, let's thank the speaker.