So, good morning, and thanks very much for inviting me. As was mentioned, I'm not in mainstream speech recognition, so I hope what I chose to talk about will be interesting to you.
Before I go on, just a word about Agnitio. Agnitio is a startup that's been around since about 2004. "Agnitio" is Latin for "I recognise", and Agnitio specialises in automatic speaker recognition. It sells a range of products that make use of this technology in many different countries in the world. It has its main office in Madrid in Spain, and also offices close to Washington and in California, and we have a small research lab in South Africa, which is where I'm based.
Just to make sure we're on the same page about what we're talking about: everybody here knows what speech recognition is. Speaker recognition asks who is speaking; I often find that very difficult to explain to people, because after I've explained it for two minutes they still understand it as speech recognition. And then of course there's automatic language recognition, also called spoken language recognition: given a speech segment, which language is this?
In speaker and also in language recognition we've inherited some things from speech recognition, mostly just the acoustic modelling: the features, the MFCCs, and the GMMs. We do lag a bit behind with neural networks; we've tried them, but so far they don't work as well as the GMMs for us. We take the acoustic modelling, and then we do some relatively simple back-end recognition; it's very simple compared to your language modelling and your decoders. This talk is going to be deep rather than wide: I'm going to concentrate just on the back-end recognition part, and just on a tiny aspect of that, namely calibration. And I hope you'll find something in this talk that you can maybe use.
So, what is calibration? It concerns the goodness of soft decisions: you have a recognizer that outputs some kind of soft decision, a classifier if you want. Calibration can then be understood in two senses. First, calibration is how good the output of my recognizer is; second, it's whatever you do to make the output better. If you make the output of your recognizer better, you are calibrating it. We'll talk about both.
I'm not expecting everybody to understand this diagram; it's just a road map of what we're going to talk about, and I'll come back to it. I'm going to motivate that if you want your recognizer to output a soft decision, likelihoods rather than posteriors are what you want, and then show how to evaluate the goodness of those outputs via cross-entropy. The cross-entropy gives you a calibration-sensitive loss function to measure the goodness of the output of the recognizer. Then we can take the raw output and somehow calibrate it, so I'll talk about some simple calibrators. And then you can have a kind of feedback loop, to essentially optimize away the effect of the calibrator, and that gives you, in a sense, a calibration-insensitive loss, which tells you how well you could have done if your system had been optimally calibrated. And then you can compare the two, and that will tell you how good your calibration was.
So let's start at the beginning. The canonical speaker recognition problem we usually view as a two-class classification problem. The input is a pair of speech segments, often just called the enrollment segment and the test segment, and the output: it's class one if the segments have the same speaker, or class two if they have different speakers. As an example of a multiclass classifier we take language recognition, where we can define a number of language classes. And if the French in the audience are wondering why they're not there among all their neighbours: that's "some other language".
Let's look at the form of the output of a classifier, a recognizer. The simplest form would be to just output a hard class label. If you want a soft output, that might be a posterior distribution, or we can go to the other side of Bayes' rule and output a likelihood distribution. I'm going to motivate that the last one is preferable.

Hard decisions: some people like them, but it's a bad idea unless the error rate is really low, and it cannot make use of context, it cannot make use of independent prior information. The posterior idea: to end users it's still intuitive to understand what the posterior is telling them, and it conveys confidence.
You can recover from an error because you see it coming; you know you can make errors. And you can make optimal, minimum-expected-cost Bayes decisions if you have a posterior, so that's a much more useful output. The problem with the posterior is that the prior is implicit and hard-coded inside the posterior. You can remove it by dividing it out, but then you also need to know what the prior was.
A cleaner type of output is just the likelihood; then you can afterwards supply any prior. The only downside is that it's somewhat harder to understand, especially for end users, but we're not end users, so let's go with the likelihood. In the end, for the applications we're looking at, there really isn't that much difference between the two: if you have the posterior and the prior you can get back to the likelihood, or you can go the other way. If there's a small number of discrete classes, you can always normalize the likelihood and you've got the posterior.
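As a concrete sketch of that normalization (my code and notation, not from the talk), Bayes' rule for a handful of discrete classes is just a product and a sum:

```python
import numpy as np

def posterior_from_likelihood(likelihood, prior):
    """Bayes' rule for a small set of discrete classes:
    posterior is proportional to likelihood times prior."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

# Two classes: likelihoods favour class 1, and a flat prior keeps it that way.
print(posterior_from_likelihood(np.array([0.8, 0.2]), np.array([0.5, 0.5])))
# -> [0.8 0.2]
```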
So let's look at some examples of how we might use the likelihood. A language recognizer would output a likelihood distribution across a number of languages. But then we know that, here today, we're more likely to hear Czech being spoken on the street, and very unlikely to hear my home language, Afrikaans. You combine these two sources of information via Bayes' rule, and the posterior then gives you the complete picture. Or you could have a phone recognizer; the same sort of reasoning applies. It outputs a likelihood distribution, the prior is the context in which you're trying to recognise that phone, and then the decoder combines everything and essentially forms a kind of posterior.
Let's go to speaker recognition. This is an idealised view of what a forensic speaker recognizer might do. There was someone who was careless enough to get himself recorded while he was committing a crime, so people have this speech sample. There is a suspect, and the suspect was also nice enough to provide another speech sample, and then you want to compare the two: is this the same person or not? Because there are just two classes, you can conveniently form the likelihood ratio between those two possibilities. And then, if you have a very nice Bayesian courtroom, the court would weigh the total effect of all the other evidence as a kind of prior; and if you have a very clever judge and jury, they might act like Bayes' rule and combine these two sources of evidence. That's probably never really going to happen, but it's still a useful model for thinking: what should my output look like? What should I be thinking about if I want this likelihood ratio that I output to do its job as well as possible?
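In odds form, the combination this ideal courtroom would perform is just Bayes' rule (notation mine; S is the pair of speech samples, E the other evidence):

$$
\frac{P(\text{same}\mid S, E)}{P(\text{diff}\mid S, E)}
\;=\;
\underbrace{\frac{P(S\mid \text{same})}{P(S\mid \text{diff})}}_{\text{likelihood ratio from the recognizer}}
\times
\underbrace{\frac{P(\text{same}\mid E)}{P(\text{diff}\mid E)}}_{\text{prior odds from the other evidence}}
$$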
Now, this may sound subjective, but in practice recognizers are often badly calibrated. In my experience, if you build a speaker or a language recognizer, it's always badly calibrated; you can redesign the thing however you want, and it's going to be badly calibrated. It might be very accurate, but it's still badly calibrated. So you need to adjust its output to get the full benefit of the recognizer. The tools we need to make this happen: first of all, you need to measure the quality of the calibration, and then you need something to adjust it.
First, let's talk about the measurement. Calibration applies to both posteriors and likelihoods. It's easier to explain this whole thing in terms of posteriors, and later we'll go back to the likelihoods. I'll use two classes as a running example, again because it's easier to explain, and then we'll go to the multiclass case. Here is a recognizer, which we represent by the symbol R; all the posteriors are conditioned on R because they're output by the recognizer.

The posterior tells you two things. First, which class it favours: if the one element is greater than the other, it wants to recognise class one, in this case. But it also tells us a degree of confidence: how much greater is the one element than the other? We can form the ratio, we can take the log of the ratio, or we could look at the entropy of the distribution; anything like that will give you a measurement of the degree of confidence.
The question I'm trying to answer with this presentation is: the recognizer outputs a posterior distribution, and we also know, for this particular case, which of the two classes was really true. Was this a good posterior or not? Another example would be a weather predictor: it says there's an 80 percent chance of rain tomorrow. Tomorrow comes, and it doesn't rain. How good was that prediction?

First of all, if it says the one is greater than the other, and that was in the right direction, then we know it favours the correct class, so at least that aspect of the posterior was good. But how do we judge the degree of confidence? What can we do? We don't have a reference posterior; we're not given that in practice. We're just given the true class. What can we say about the posterior?
We'll assign some penalty function. What this graph is telling us: the recognizer outputs the posterior distribution, a posterior for each of the two classes, and we know the true class. On the bottom axis we plot the posterior for the true class, so a single case is just a single point on the x-axis. If the posterior for the true class was one, that's good: the recognizer was very certain of the thing that really happened. But if it said that something that really happened was impossible, that's very bad, so we give it a high penalty, maybe even an infinite penalty. And an infinite penalty might be a good idea: in practice, if you make a wrong decision, it can have arbitrarily bad consequences. If you like playing Russian roulette, you've got the gun in your hand, you form some posterior as to whether there's a round in the chamber at the moment, and then you pull the trigger. The consequences of a bad posterior can be arbitrarily bad, so I like this idea of the penalty going up to infinity.
I've plotted two candidate functions here. It's easy to see that this should be a monotonic function, but what should the shape be? What principle should we use to design this penalty function? We'll take an engineering approach: we'll ask what we want to use the output for, and then how well it does that.

What can we do with the posterior? Make minimum-expected-cost Bayes decisions. In a speech recognizer you might be sending these posteriors into the decoder, but in the end it's still going to make some decision at some stage; it's going to output a transcription. In the end you're always making decisions. So we just ask: how well does it make these decisions? And that very same cost that you're optimising with the minimum-expected-cost Bayes decision is going to tell you how well you did. Good posteriors can be used to make cost-effective decisions, but badly calibrated posteriors may be underconfident, or overconfident in the wrong hypothesis, and that will eventually lead to a series of unnecessarily costly errors.
So let's look at decision cost functions. Decision cost functions model the consequences of applying recognition technology in the real world. The real world is always more complex; engineers like simple models that we can optimize. This should be a very familiar idea. We first look at the case of a hard decision: the recognizer says it's class one or class two, we know whether the true class was class one or class two, and then we assign some cost coefficients. In this example I made the cost coefficients for the correct decisions zero, and for errors there's a non-zero cost. You might instead want to work in terms of rewards, or you can even have a mixture of rewards and penalties; a celebrated example is the term-weighted value of keyword spotting, which is a mixture of a reward and a penalty. In the end all of those are equivalent, and you can play around with these cost functions. For what we're going to do, it's convenient to put zeros on the correct decisions and just put the costs on the errors.
Now we apply this to a soft decision. We let the recognizer output the posterior distribution, and then, when we evaluate its goodness, we make a minimum-expected-cost Bayes decision; the Bayes decision is made without knowing the true class. Then we treat that as a hard decision and evaluate it with the cost matrix as before. What we have now is a measure of the goodness of the posterior that was output. It's a very simple thing.
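A minimal sketch of that evaluation step (the cost matrix here is a made-up example, not from the talk):

```python
import numpy as np

# cost[i, j] = cost of deciding class j when class i is true:
# zeros on the diagonal (correct), non-zero costs on the errors.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

def evaluate_posterior(posterior, true_class):
    """Make the minimum-expected-cost Bayes decision without seeing
    the true class, then score that hard decision with the cost matrix."""
    expected = posterior @ cost        # expected cost of each candidate decision
    decision = int(np.argmin(expected))
    return cost[true_class, decision]

# The Bayes decision picks class 1 even though class 0 is more probable,
# because errors on class 1 are ten times as expensive.
print(evaluate_posterior(np.array([0.7, 0.3]), true_class=0))  # -> 1.0
```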
So what have we achieved? A couple of slides ago I tried to convince you that this kind of penalty function on the left is what we want. What we've achieved is this step function: there's a threshold on the posterior, which is a function of the cost coefficients, and the cost is either some non-zero cost or zero. At least the step function has the right sense: it's big where it needs to be big and small where it needs to be small. But it's very crude, and in effect it's only evaluating the goodness of your posterior at a single operating point; it doesn't say anything about making decisions at any other operating point. So we need to find a smoother solution.
In order to smooth, let's simplify just a little bit. The Bayes decision threshold is a simple ratio of the costs, so we might think of it as: given the costs, compute the threshold. But let's turn that the other way around. Let's choose the threshold at which we're going to evaluate (we're free to choose any threshold), and then make the costs a function of the threshold. If we choose these simple reciprocal functions, we're still satisfying the above equation.
Let's look at this graphically. The recognizer outputs q1, the posterior for class one; if you just flip the axis you get q2. The penalty when class one is true would be the red curve, and the penalty when class two is true would be the blue curve. The cost coefficients are a function of the threshold, which we can adjust at will, so let's do that. We can move the threshold, and with it the cost coefficients with which we're going to be penalised will also change. If you press the threshold right against zero or one, the penalty will be infinite, but that's good, because that's where you can shoot yourself in the head. By moving the threshold while we're evaluating the goodness of the posterior, we are in fact exercising the decision-making ability of the posterior over its full range.
We're almost done; let's just look at another view of the same thing. We have the recognizer output the posterior; the posterior is compared against the threshold; the threshold is a parameter chosen by the evaluator. You also need to know the true class, and out comes the cost. Note that this is a function of three variables: the recognizer output, the true value, and this threshold parameter t. So now let's integrate out t. The integrand here is the step cost function which I plotted a few slides back. On the left-hand side we get a cost function which is now independent of the threshold, because we've integrated over the full range of the threshold, and that turns out to be just this logarithmic cost function: you take minus the logarithm of the posterior for the true class, and that is the goodness of the posterior of the recognizer. And it has this nice smooth shape which we were looking for.
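Written out for the two-class case (my reconstruction of the slide's algebra): with threshold t, the reciprocal cost coefficient C1(t) = 1/t, and q1 the posterior for the true class, integrating the step cost over all thresholds gives

$$
\int_0^1 \frac{1}{t}\,\mathbb{1}[\,q_1 < t\,]\,dt \;=\; \int_{q_1}^{1}\frac{dt}{t} \;=\; -\ln q_1 ,
$$

which is exactly the logarithmic penalty.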
That was two classes; now we're going to generalise to multiclass. Multiclass is a lot trickier, but the same general principles apply. We're still going to work with minimum-expected-cost Bayes decisions, but in this case we'll use a generalized threshold, which I'll plot for you on the next slide, and again we're going to integrate out the threshold and get a similar result.
On this slide we show the output of a three-class recognizer; I chose three classes because I can plot it here on this nice flat screen. On the horizontal axis, q1 is the posterior for class one; on the vertical axis, q2 is the posterior for class two; and q3 we don't see, but it's just the complement of the other two, so everything has to live inside the simplex. Then comes the tricky part: we now define a kind of generalized threshold. This threshold has three components, t1, t2 and t3, and we constrain them to sum to one, so the threshold is defined by this point where the lines meet, and it also lives inside the same simplex. Now, again, once we've chosen the threshold, we choose the cost function. The cost function is this little equation at the bottom; again it's just the reciprocal of the threshold coefficients. And again we can play around: we can move the threshold around the interior of the simplex, and exercise the decision-making ability of the recognizer. I should have told you: these lines, the structure of the threshold, are just the consequence of making the minimum-expected-cost Bayes decision. Once you've assigned those cost functions, that's what the threshold is going to look like: if q1 is large, you're in region R1 and you choose class one; in region R2 you choose class two; and in region R3, where the other two are small, you choose class three.
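For concreteness, here is my reconstruction of why the regions look like that: if deciding class j when class i is true costs 1/t_i, and correct decisions cost nothing, then the expected cost of deciding class j under posterior q is

$$
E_j \;=\; \sum_{i \neq j} \frac{q_i}{t_i} \;=\; \sum_{i} \frac{q_i}{t_i} \;-\; \frac{q_j}{t_j},
$$

so the minimum-expected-cost decision is $\arg\max_j q_j/t_j$. All three decisions tie exactly at q = t, which is why the region boundaries all meet at the threshold point.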
So we've seen that we can move the threshold around; now we can integrate it out. The integral would cover several slides, which I'm not going to show you, but the same kind of thing applies: we just integrate the step cost function over the threshold, and lo and behold, we get the logarithmic function again.
The whole recipe can be summarized like this. The recognizer outputs a posterior distribution, in other words an element of the posterior for each of the classes. Since we know what the true class is, we select that component and just apply the logarithm to it. If the recognizer says the probability of the true class is one, that's very good: the penalty is zero. If it says the probability of the true class is zero, that's very bad: the penalty is infinite.

All of the preceding was for just one example: one input, one output. If you have a whole database of supervised data, you can apply this to the whole database and just average the logarithmic cost, and that is cross-entropy, which I'm sure you know very well.
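The whole recipe fits in a few lines (my code, not from the talk):

```python
import numpy as np

def cross_entropy(posteriors, labels):
    """Average logarithmic cost: minus the log of the posterior assigned
    to the true class, averaged over a supervised database (in nats)."""
    rows = np.arange(len(labels))
    return -np.mean(np.log(posteriors[rows, labels]))

# Toy database: three trials, two classes.
q = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
y = np.array([0, 1, 0])
print(cross_entropy(q, y))  # ~ 0.34 nats
```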
Cross-entropy is perhaps the most well-known discriminative training objective, not just in speech recognition but in all of machine learning, and it forms the basis for all kinds of other things with other names, like MMI and logistic regression. It's perhaps not so well known that it is a way of measuring calibration. You see that appearing from time to time; for example, the book on Gaussian processes uses cross-entropy to essentially measure calibration. And in the statistics literature this thing is referred to as the logarithmic proper scoring rule. There's a whole bunch of other proper scoring rules which can be derived in a similar way, you just need to vary that integral, but the logarithmic one is very simple and generally just a good idea to use.
So let's get back to the likelihoods. This is going to be very short and simple. We start with the recipe for the posterior which I showed just now, and we just flip to the other side of Bayes' rule. Now we ask the recognizer to give us a likelihood distribution instead of a posterior distribution, and when evaluating its goodness, we send it through the softmax, or Bayes' rule if you will, and then apply the logarithm. We also have to be provided with the prior. Notice that we now need to supply a prior distribution as a parameter to this evaluation recipe.
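In code, flipping to the likelihood side only changes the front of the recipe: the evaluator supplies the prior and applies Bayes' rule (a softmax over log-likelihood plus log-prior) before taking the logarithm. A sketch under the same assumptions as before:

```python
import numpy as np
from scipy.special import logsumexp

def cross_entropy_from_loglh(log_lh, labels, prior):
    """Evaluate log-likelihoods: apply Bayes' rule with an
    evaluator-chosen prior, then average the logarithmic cost."""
    log_post = log_lh + np.log(prior)                        # unnormalized
    log_post -= logsumexp(log_post, axis=1, keepdims=True)   # normalize (softmax)
    return -np.mean(log_post[np.arange(len(labels)), labels])
```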
You're free to choose whatever prior you like; the prior does not have to reflect the proportions of the classes in your data. So if you want to emphasise one class rather than another, for example, as was mentioned this morning, to emphasise some classes with rare data, you can do that. Of course, if you have little data for one class, multiplying it by some number isn't going to make data appear magically, but the prior does give you some control over what you want to emphasise.
So let's get back to the diagram that we showed earlier. I've motivated that we want the recognizer to output likelihoods, and that cross-entropy forms a nice calibration-sensitive cost function to tell you how well it's doing. Now we can also send the output of the recognizer into a simple calibrator. A calibrator can be anything, but in general it's a good idea to make it very simple: you spend a whole lot of energy on building a strong recognizer; the calibrator should be simple and easy to do, but you can gain a lot from it. What this stage does, as I explained before, is train, or optimize, the calibrator, to tell you how well you could have done if the calibration had been good in the first place. Then you can compare the two, and the difference you can call the calibration loss. If you build a recognizer and the output is well calibrated, the calibration loss will be small and you can be very happy; otherwise you have to go and apply some calibrator before you deploy the recognizer.
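In symbols, the comparison that this feedback loop gives you is (my notation for what the diagram shows):

$$
\text{calibration loss}
\;=\;
\underbrace{C_{\text{xe}}(\text{raw outputs})}_{\text{calibration-sensitive}}
\;-\;
\underbrace{\min_{\text{calibrator}} C_{\text{xe}}(\text{calibrated outputs})}_{\text{calibration-insensitive}},
$$

where C_xe is the cross-entropy from before.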
Let me be brief about how to calibrate. The theory is very basic. We have some basic recognizer which outputs class likelihoods. We gather all the likelihoods into one vector and call it a likelihood distribution. Then we say: well, we've measured that these likelihoods are not well calibrated, they don't make good Bayes decisions, so let's put another probabilistic modelling step on top of them. To that next stage, this likelihood vector is now just a feature, or a score if you want. The original recognizer might even have been an SVM; the SVM doesn't even pretend to produce calibrated likelihoods, its output is just a score. That's fine: we can use that as the input to the next modelling stage. You have complete freedom in what you use for that next modelling stage: it can be parametric or non-parametric, it can be more or less Bayesian, it can be discriminative or generative, as long as it works.
I've tried and tested various calibration strategies. The one I'm showing you is still my favourite; it's very simple. You take the log-likelihoods, you scale them with a class-independent scale factor, and you shift them with a class-dependent offset, and that gives you a recalibrated log-likelihood. We train the coefficients, the scale and the offsets, discriminatively, typically using, again, cross-entropy, the average logarithmic cost. And because the cross-entropy optimizes calibration, this recipe optimizes calibration. This is the discriminative recipe; I've worked with generative ones as well, but this one does well.
generative ones as well as i would do
so
i might just mention that
for example if you're doing automatic
language recognition you might
extract what we call an i-vector
so the i-vector represents the whole
input segments of speech
and then you can just go and do a large multi class logistic regression
and that will outputs likelihoods
so
that already uses cross entropy as an objective function why would you need to calibrate
the to get
so the problem is
to make the labs logistic regression
well you need to regularize the regularization
we'll typically skew the calibration course now we not
optimising
the
minimum expected cost bayes decisions anymore so regularization is necessary but it's cues calibration so
in practice
it's a good idea to
how about some data
to use for calibration so
part of your data you train your original recognizer
the held out set you use for training your calibrated
so in practice we found in speaker and in language recognition
this kind of recipe
works very well
and then the stress is just another form of logistic regression
just the very much constraint
you can of course
you can multiply this the simple fact it with the full matrix if you one
that would be than unconstrained logistic regression
that also works but you have to be a bit more carefully need enough data
the general this the simple recipes is very safe and very effective usually
I'll give you one real-world example. Not quite real-world, it's the NIST evaluation, so almost real-world. We look at an example from the 2007 NIST Language Recognition Evaluation: the original accuracy of the systems that were competing in this evaluation, and then the improvement after recalibration with the recipe I've just shown. On the vertical axis is the evaluation criterion, which was defined specifically for the language recognition evaluation. It's a little bit too complicated to explain here, but it's enough to know that it is a calibration-sensitive criterion (you do better if your calibration is better) and lower is better: it's a cost function. The blue bars are the original submissions, and after being recalibrated you get an improvement in all the systems. I must mention that the recalibration was done not on the evaluation data but on some independent calibration data, so this is not a cheating recalibration.
With that done, time to summarize. The job of posteriors or likelihoods is, in the end, to make cost-effective decisions. If we're going to use recognizers for anything, in the end they make decisions: they output something hard, or they perform some action, and those are all decisions. And that's where cost comes in: if we want them to minimize cost, that cost tells us how good they are. Cross-entropy is just a representation of that same cost, smoothed over a range of operating points. And calibration can be measured and improved. I've put this presentation at the first URL here if you want to find it; I have some of my publications on calibration, and some code as well (there are some MATLAB toolkits) at the next URL; and there's also the URL of Agnitio.
So, as you see on the screen here: if anybody goes away and wonders whether they've got their recognizers well calibrated, please go and try this recipe; it can tell you how good your calibration is. That's my take-home message. We probably have time for some questions.
Question: How general is this in terms of the number of classes, all these techniques? What if you went to a thousand classes?

I honestly don't know. In language recognition we've used this with fewer than thirty languages. Of course, I think if you have lots of data, like you guys have, the technique ought to work for very many classes, but I think if you don't have enough data per class you'll be in trouble.
Question: In the language ID area we typically focus a lot on found data, and quite often across languages you'll have a mismatch in the amount of data, so differences in training. There was a lot of discussion yesterday on low-resource languages, and I expect you probably also see varying amounts of data. Could you comment on how some of the folks in the language ID area might try to address varying amounts of data for improving language ID?

Right. On this slide, what I showed is that before the likelihoods there is the prior, which you can choose, so you can use that prior to essentially weight the data: you can upweight the classes which are not well represented, so that to the cross-entropy it looks as if there's more of that class than there really is. Of course, that doesn't magically make more data. In the end, cross-entropy really just measures error rate: as I showed you, cross-entropy is constructed from a step function, so at heart it's counting errors. If there are very few errors, it's going to have a bad estimate of the error rate, and by multiplying that error rate, which is inaccurate, by a large number, you're going to multiply that inaccuracy. So you should use those kinds of reweightings with care.
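One plausible implementation of that reweighting (my sketch): average the logarithmic cost within each class and recombine with evaluator-chosen weights, so a rare class can count as much as a common one.

```python
import numpy as np

def prior_weighted_cross_entropy(log_posteriors, labels, weights):
    """Per-class average of -log posterior(true class), recombined with
    chosen weights instead of the raw class proportions in the data."""
    per_class = [np.mean(-log_posteriors[labels == k, k])
                 for k in range(len(weights))]
    return float(np.dot(weights, per_class))
```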
Question: Here in ASR, life is a bit more difficult than in speaker ID or language ID, because there you basically need the prior to produce one decision per file or per utterance, whereas people here also play with segmenting the output, recognizing chunks, computing posteriors, generating lattices, and that kind of thing. Imagine an ASR PhD student coming to you and asking what you find odd in what all these folks are doing. What would be the first thing you would point out from a calibration perspective; what would you advise?

I would need to go and study speech recognition more carefully before I would be able to answer that question.
Question: I thought the more obvious application would be score normalisation for keyword spotting. People are using discriminative techniques to, sort of, estimate the probability of error and weight them, using normalized scores. Any thoughts about how this might apply to that application?

Well, what about the additional complications? The term-weighted cost function has this nasty little thing that keeps jumping around: in what I showed you here, we assumed the cost function is known. You can make minimum-expected-cost Bayes decisions if you know what the cost is, but in the term-weighted value the cost depends on how many times the keyword is in the test data, and I don't know that, so that complicates matters considerably. You would still do very well by calibrating the output of your recognizer, but once you've got that likelihood, what are you going to do then to produce the final output that you send to the evaluator? That gets complicated, and there are all kinds of normalizations and things involved.
Question: To ask about a real-world example: I noticed that you were operating with matched data, which makes things less complicated. But I think in a lot of real-world applications you won't have that match: maybe the context is something else, the data is something else. Is this a problem in the case where you want the calibration to be good with respect to mismatch?

Right, this question about mismatch is interesting indeed, and in both cases, yes: if you use this logarithmic cost function, it tends to make the output of your recognizer good over a very wide range of operating points. Especially if you just have two classes to recognise: I showed the axis of the posterior between zero and one, but if you take the logarithm of the posterior, that axis becomes infinite, and you can move the threshold all the way from minus infinity to plus infinity. If you move it too far, you go into regions where there's no more data, no more errors, and it doesn't make any sense anymore. There's a limited range on that axis where you can do useful stuff, and the logarithmic cost function typically evaluates nicely and widely over that useful range.
If, for example, instead of taking the logarithm you take (1 - q) squared, the squared loss, sometimes called the Brier loss, you get narrower coverage, so it doesn't cover applications as widely as the logarithmic cost does.
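For reference, the Brier penalty drops out of the same threshold integral, just with a different cost weighting (my reconstruction, not shown in the talk):

$$
(1-q)^2 \;=\; \int_{q}^{1} 2\,(1-t)\,dt ,
\qquad\text{versus}\qquad
-\ln q \;=\; \int_{q}^{1} \frac{dt}{t} .
$$

The 1/t weighting blows up near the extreme thresholds, which is what spreads the logarithmic cost's evaluation over the wider range of operating points.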
You can go even further if you want; then you get a kind of exponential loss function, which is associated with boosting in machine learning. I have an Interspeech paper that explores that kind of thing in detail: what happens if you have other cost functions, not just cross-entropy. You should find a link to it on the web page. So the answer is basically: it's a very good idea to use cross-entropy. If you optimise your recognizer to have good cross-entropy, it's generally going to work for whatever you want to use it for.

Right, let's thank the speaker.