0:00:15 So, good morning. Thanks very much for inviting me. As was mentioned, I'm not in mainstream speech recognition, but I hope what I chose to talk about will be interesting to you.
0:00:30 Before I go on, just a few words about Agnitio. Agnitio is a startup that's been around since about 2004. Agnitio is Latin for "I recognise", so Agnitio specialises in automatic speaker recognition, and it sells a range of products that make use of this technology in many different countries in the world. It has its main office in Madrid in Spain, and also offices close to Washington and in California, and we have a small research lab in South Africa, which is where I'm based.
0:01:15 So, just to make sure we're on the same page about what we're talking about: everybody knows speech recognition is about what's being said; speaker recognition is about who's speaking. I've found this very difficult to explain to people; after I've explained it for two minutes, they will still understand "speech recognition". And then of course there's automatic language recognition, also called spoken language recognition, which just has to tell, given a speech segment, which language this was.
0:01:51 In speaker and also in language recognition we've inherited some stuff from speech recognition, mostly just the acoustic modelling: the features, the MFCCs, and GMMs. We do lag behind with neural networks; we've tried, but they don't work as well as the GMMs for us. We take the acoustic modelling and then we do some relatively simple back-end recognition. It's very simple compared to your language modelling and your decoders.
0:02:28 This talk is going to be deep rather than wide. I'm going to concentrate just on the back-end recognition part, and just on a tiny aspect of that, namely calibration. I hope you'll find something in this talk that you can use.
0:02:52 What is calibration? It concerns the goodness of soft decisions. So you have a recognizer that outputs some kind of a soft decision, a classifier if you want. Then calibration can be understood in two senses. First of all, calibration is how good the output of my recognizer is; secondly, it's whatever you do to make it better. So if you make the output of your recognizer better, you are calibrating it. We'll talk about both.
0:03:24 I'm not at this point expecting everybody to understand this diagram; it's just a road map of what we're going to talk about, and I'll come back to it.
0:03:38 I'm going to motivate that if you want your recognizer to output a soft decision, likelihoods rather than posteriors are what you want, and show how to evaluate the goodness of these outputs via cross-entropy. The cross-entropy gives you a calibration-sensitive loss function to measure the goodness of the output of the recognizer. Then we can take the raw output and somehow calibrate it, so I'll talk about some simple calibrators. And then you can have this kind of feedback loop to essentially optimize away the effect of the calibrator, and that gives you, in a sense, a calibration-insensitive loss, which tells you how well I could have done if my system had been optimally calibrated. Then you can compare the two, and that will tell you how good my calibration was.
0:04:38 So let's start at the beginning: the canonical speaker recognition problem. We usually view that as a two-class classification problem. The input is a pair of speech segments, often just called the enrollment segment and the test segment, and the output: it's class one if the segments have the same speaker, or class two if they have different speakers.
0:05:04 As an example of a multiclass classifier we take language recognition, where we can define a number of language classes. If the French in the audience are wondering why their language is not there while all their neighbours' are: that counts as "some other language".
0:05:27 So let's look at the form of the output of a classifier, a recognizer. The simplest form would be to just output a hard decision: a class. If you want a soft output, that might be a posterior distribution, or we can go to the other side of Bayes' rule and output a likelihood distribution. I'm going to motivate that the last one is preferable.
0:05:58 Hard decisions: some people say they're a bad idea unless the error rate is really low, and they cannot make use of context; they cannot make use of independent prior information.
0:06:10 The posterior is, to end users, still intuitive; they understand what the posterior is telling them, and it conveys confidence. You can recover from an error because you see it coming; you know where you can make errors. You can make optimal minimum-expected-cost Bayes decisions if you have a posterior, so that's a much more useful output.
0:06:35 The problem with the posterior is that the prior is implicit and hardcoded inside the posterior. You can remove it by dividing it out, but then you also need to know what the prior was. A cleaner type of output is just the likelihood, and then you can afterwards supply any prior. The only downside is that it's somewhat harder to understand, especially for end users; but we're not end users, so let's go with the likelihood.
0:07:10 In the end, for the applications that we're looking at, there really isn't that much difference between the two. If you have the posterior and the prior you can go back to the likelihood, or you can go the other way; with a small number of discrete classes you can always normalize the likelihood. But let's look at some examples of how this might work.
0:07:35 If we use the likelihood, a language recognizer would output the likelihood distribution across the number of languages. But then, say we know we're in Brno today: we're more likely to hear Czech being spoken on the street, and very unlikely to hear my home language, Afrikaans. You combine these two sources of information via Bayes' rule, and the posterior then gives you the complete picture.
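A minimal sketch of that combination; the language set, likelihoods and prior probabilities below are all invented for illustration:

```python
# Bayes' rule: combine recognizer likelihoods with a context prior.
# All numbers are made up: the acoustics slightly favour Afrikaans,
# but the street-in-Brno prior overwhelmingly favours Czech.
langs = ["czech", "english", "afrikaans"]
likelihood = {"czech": 0.20, "english": 0.10, "afrikaans": 0.30}  # P(speech | language)
prior      = {"czech": 0.90, "english": 0.09, "afrikaans": 0.01}  # P(language | context)

joint = {lang: likelihood[lang] * prior[lang] for lang in langs}
total = sum(joint.values())
posterior = {lang: joint[lang] / total for lang in langs}  # the complete picture

best = max(posterior, key=posterior.get)  # -> "czech"
```

Note that the likelihoods alone would have picked Afrikaans; it is the prior, supplied separately, that flips the decision.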
0:08:09 Or you could have a phone recognizer; the same sort of recipe applies. Output a likelihood distribution; the prior is the context in which you try to recognise that phone; and then the decoder combines everything and essentially forms a kind of posterior.
0:08:28 Let's go to speaker recognition. This is an idealisation of what a forensic speaker recognizer might do. There was someone who was careless enough to get himself recorded while he was committing a crime, and the police got hold of this speech sample. Then there is a suspect, and if you ask the suspect nicely to provide another speech sample, you want to compare the two: is this the same person or not? And because there are just two classes, you can conveniently form the likelihood ratio between those two possibilities.
0:09:09 And then, if you have a very nice Bayesian courtroom, inside the court they would treat the total effect of all the other evidence as a kind of prior, and if you have a very clever judge, then the judge and the jury might act like Bayes' rule and combine these two sources of evidence. That's probably never really going to happen, but still, this is a useful model for thinking about what my output should look like: what should I be thinking about if I want this likelihood ratio that I output to do its job as well as possible?
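For the two-class case this combination is just a product of odds; a toy sketch, with invented numbers:

```python
# The recognizer reports a likelihood ratio for same-speaker versus
# different-speaker; the rest of the evidence acts as prior odds.
# Both numbers are invented for illustration.
likelihood_ratio = 100.0     # P(speech evidence | same) / P(speech evidence | different)
prior_odds = 1.0 / 1000.0    # all the other evidence, expressed as odds

posterior_odds = likelihood_ratio * prior_odds           # Bayes' rule in odds form
posterior_same = posterior_odds / (1.0 + posterior_odds)
```

Here strong speech evidence (a likelihood ratio of 100) is still outweighed by strong prior odds against, leaving the posterior probability of "same speaker" below ten percent.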
0:09:56 Calibration is an objective of the field, but in practice recognizers are often badly calibrated. In my experience, if you build a speaker or a language recognizer, it's always badly calibrated. You can redesign the thing however you want; it's going to be badly calibrated. It might be very accurate and still badly calibrated. So you need to adjust its output to get the full benefit of the recognizer. The tools we need to make this happen: first of all, you need to measure the quality of the output, and then you need some way to adjust it.
0:10:38 First let's talk about the measurement. Calibration applies to both posteriors and likelihoods; it's easier to explain this whole thing in terms of posteriors, and later we'll go back to the likelihoods. I'll use two classes as a running example, again because it's easier to explain, and later we'll go to the multiclass case. So here is a recognizer; we represent it by the symbol R, and the posteriors are all conditioned on R because they are output by the recognizer.
0:11:17 The posterior tells you two things. First is which class is favoured: if the one element is greater than the other one, it wants to recognise class one, in this case. But it also tells us the degree of confidence: how much greater is the one element than the other? We can take the ratio of the two, or you could look at the entropy of the distribution; anything like that will give you a measurement of the degree of confidence.
0:11:50 The question I'm trying to answer with this presentation is: the recognizer outputs a posterior distribution, and we also know, for this particular case, which of the two classes was really true; was this a good posterior or not? Another example would be a weather predictor: it says there is some percentage chance of rain tomorrow; tomorrow arrives and it doesn't rain. How good was that prediction?
0:12:24 First of all, if it says the one is greater than the other and that was in the right direction, we know it favours the correct class, so at least that aspect of the posterior was good. But how do we judge the degree of confidence? What can we do? We don't have a reference posterior; we're not given that in practice. We're just given the true class. What can we say about the posterior?
0:12:56 Well, we can assign some penalty function. What this graph is telling us: the recognizer outputs the posterior distribution, a posterior for each of the two classes, and we know the true class. On the bottom axis we plot the posterior for the true class, so a single case is just a single point on the x-axis. If the posterior for the true class was one, that's good: the recognizer was very certain of the thing that really happened. But if it says that the thing that really happened is impossible, that's very bad, so we give it a high penalty, maybe even an infinite penalty.
0:13:47 An infinite penalty might be a good idea: in practice, if you make a wrong decision, it can have arbitrarily bad consequences. It's like playing Russian roulette: you've got the gun in your hand, you form some posterior as to whether there's a round in the chamber at the moment, and you pull the trigger. The consequences of a bad posterior can be arbitrarily bad, so I like this idea of the penalty going up to infinity.
0:14:22 I've plotted two candidate functions here. It's easy to see this should be a monotonic function, but what should the shape be? What principle should we use to design this penalty function? We'll take an engineering approach: we'll ask what we want to use the output for, and how well it does that.
0:14:50 What can we do with the posterior? Make minimum-expected-cost Bayes decisions. A speech recognizer might be sending its phone posteriors into the decoder, but in the end it's still going to make some decision at some stage; it's going to output a transcription. So in the end you're always making decisions. And then we just ask: how well does it make these decisions? That very same cost that you're optimising with the minimum-expected-cost Bayes decision is going to tell you how well you did.
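A minimal sketch of a minimum-expected-cost Bayes decision (the posterior and the cost values are assumptions): cost[i][j] is the cost of deciding class i when class j is true, with zero cost for correct decisions as in the talk.

```python
posterior = [0.7, 0.3]       # the recognizer's soft output for two classes
cost = [[0.0, 10.0],         # decide class 1: free if right, 10 if class 2 was true
        [1.0, 0.0]]          # decide class 2: 1 if class 1 was true, free if right

# Expected cost of each decision under the posterior; pick the cheapest.
expected = [sum(cost[i][j] * posterior[j] for j in range(2)) for i in range(2)]
decision = min(range(2), key=lambda i: expected[i])  # -> 1, i.e. class 2
```

The posterior favours class 1, yet the Bayes decision is class 2, because mistaking class 2 for class 1 is ten times as costly; this is exactly why the soft output is more useful than a hard decision.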
0:15:26 Good posteriors can be used to make cost-effective decisions, but badly calibrated posteriors may be underconfident, or overconfident in the wrong hypothesis, and that will eventually lead to a series of unnecessarily costly decisions.
0:15:47 So let's look at decision cost functions. Decision cost functions model the consequences of applying recognition technology in the real world. The real world is always more complex; engineers like simple models that we can optimize.
0:16:10 This should be a very familiar idea. We first look at the case of a hard decision: the recognizer says it's class one or class two, we know whether the true class is class one or class two, and then we assign some cost coefficient to each combination. In this example I made the cost coefficients so that for the correct decisions the cost is zero, and for errors there's a non-zero cost.
0:16:39 You might want to work in terms of rewards instead, or you can even have a mixture of rewards and penalties; this is exemplified by what's called the term-weighted value in keyword spotting, which is a mixture of a reward and a penalty. In the end all of those are equivalent; you can play around with these cost coefficients. For what we're going to do, it's convenient to just put the cost on the errors.
0:17:17 Now we apply it to a soft decision. We let the recognizer output the posterior distribution, and when we evaluate its goodness, we make the minimum-expected-cost Bayes decision; the Bayes decision is made without knowing the true class. Then we treat that as a hard decision and evaluate it with this cost matrix as before. What we have now is the goodness of the posterior that was output. A very simple thing.
0:17:53 What have we achieved? A couple of slides ago I tried to convince you that this kind of penalty function on the left is what we want. What we've achieved is this step function: there's a threshold on the posterior, which is a function of the cost coefficients, and the cost is either some non-zero cost or else zero. So at least the step function has the right shape: it's bigger where it needs to be and smaller where it needs to be. But it's very crude, and in effect it's only evaluating the goodness of your posterior at a single operating point. It doesn't say anything about making decisions at any other operating point. We need to find a smoother solution.
0:18:47 In order to smooth, let's simplify just a little bit. The Bayes decision threshold is a simple ratio of the costs, so we might think of it as: given the costs, compute the threshold. But let's do it the other way round. Let's choose the threshold at which we're going to evaluate (we're free to choose any threshold) and then make the costs a function of the threshold. If you choose these simple reciprocal functions, we're still satisfying the above equation; the above equation still holds.
0:19:31 Let's look at this graphically. The recognizer outputs Q1, the posterior for class one; if you just flip the axis you get Q2. The penalty when class one is true would be the red curve, and the penalty when class two is true would be the blue curve. The cost coefficients are a function of the threshold, which we can adjust at will. So let's do that: we can move the threshold, and with it the cost coefficients with which we're going to be penalised will also change. If you press the threshold right against zero or one, the penalty will be infinite, but that's good, because then you're about to shoot yourself in the head. By moving the threshold while we're evaluating the goodness of the posterior, we are in fact exercising the decision-making ability of the posterior over its full range. So we're almost done.
0:20:40 Let's just look at another view of the same thing. We have the recognizer output the posterior; the posterior is compared against the threshold; the threshold is a parameter chosen by the evaluator. You also need to know the true class, and out comes the cost. So it's a function of three variables: the recognizer output, the true class, and this parameter, the threshold t.
0:21:11 So now let's integrate out t. The integrand here is the step cost function which I plotted a few slides back. On the left-hand side we get a cost function which is now independent of the threshold, because we've integrated over the full range of the threshold. That turns out to be just this logarithmic cost function: minus the logarithm of the posterior for the true class. That is the goodness of the posterior of the recognizer, and there it is: this nice smooth shape which we were looking for.
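In my own notation (not the slides'), the two-class version of this integral can be written compactly: with threshold $t$ and error cost $1/t$, a posterior $q$ for the true class is penalised whenever it falls at or below the threshold, so

```latex
\int_{0}^{1} \frac{1}{t}\,\mathbb{1}\left[\, q \le t \,\right]\, dt
  \;=\; \int_{q}^{1} \frac{dt}{t}
  \;=\; -\log q .
```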
0:22:01 That was two classes; now we're going to generalise to multiclass. Multiclass is a lot trickier, but the same general principles apply. We're still going to work with minimum-expected-cost Bayes decisions, but in this case we'll use a generalized threshold, which I'll plot for you on the next slide. And again we're going to integrate out the threshold and get a similar result.
0:22:34 In this plot we show the output of a three-class recognizer. I chose three classes because I can plot them here on this nice flat screen. Q1 is the posterior for class one; the vertical axis, Q2, is the posterior for class two; and Q3 we don't see, but it's just the complement of the other two. So everything needs to live inside the simplex.
0:23:01 Then the tricky part: we now define a kind of generalized threshold. This threshold has three components, t1, t2 and t3, and we constrain them to sum to one, so the threshold is defined by this point where the lines meet, and that also lives inside the same simplex. Now, again, once we've chosen the threshold, we choose the cost function. The cost function, again, is this little equation at the bottom: again just the reciprocal of the threshold. And again we can play around: we can move the threshold around the interior of the simplex, and we can exercise the decision-making ability of the recognizer.
0:23:57 I should have told you: these lines, the structure of the threshold, are just the consequence of making the minimum-expected-cost Bayes decision. Once you've assigned those cost functions, that's what the threshold is going to look like. So again, if Q1 is large you're going to be in region one and choose class one; in region two you choose class two; and in region three, if the other two are small, you choose class three.
0:24:32 We've seen that we can move the threshold around; now we can integrate it out. The integral would cover several slides, which I'm not going to show you, but the same kind of thing applies: we just integrate out the threshold of this step cost function, and lo and behold, we get the logarithmic function again.
0:25:04 The whole recipe can be summarized like this. Again, the recognizer outputs a posterior distribution, in other words an element of the posterior for each of the classes. When we know what the true class is, we select that component and we just apply the logarithm to it. If the recognizer says the probability of the true class is one, that's very good: the penalty is zero. If it says the probability of the true class is zero, that's very bad: the penalty is infinite.
0:25:45 All of the preceding was for just one example: one input, one output. If you have a whole database of data which is supervised, you can apply this to the whole database, and you just average the logarithmic cost. That is cross-entropy, which I trust you know very well. It's perhaps the most well-known discriminative training objective, not just in speech recognition but in all of machine learning, and it forms the basis for all kinds of other things with other names, like MMI and logistic regression.
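A minimal sketch of this average (the three trials are invented): each trial is a posterior distribution plus the index of the true class, and the cross-entropy is the mean of minus the log of the posterior assigned to the true class.

```python
import math

# (posterior distribution, index of the true class) for a tiny toy database
database = [
    ([0.8, 0.1, 0.1], 0),  # confident and correct: small penalty
    ([0.3, 0.4, 0.3], 1),  # uncertain but correct: moderate penalty
    ([0.6, 0.3, 0.1], 2),  # confident in the wrong class: large penalty
]

cross_entropy = sum(-math.log(post[true]) for post, true in database) / len(database)
```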
0:26:24 It's perhaps not so well known that this is a way of measuring calibration. You see that appearing from time to time: for example, in this book on Gaussian processes they use cross-entropy essentially to measure calibration. And in the statistics literature, this thing is referred to as the logarithmic proper scoring rule. You get a whole bunch of other proper scoring rules, which can be derived in a similar way; you just need to vary that integral. But the logarithmic one is very simple and generally just a good idea to use.
0:27:08 So let's get back to the likelihoods. This is going to be very short and simple. We start with the recipe for the posterior, which I showed just now, and now we just flip to the other side of Bayes' rule. So now we ask the recognizer: give me a likelihood distribution instead of a posterior distribution. When evaluating its goodness, we just send it through the softmax, or Bayes' rule if you will, and then apply the logarithm. The difference is that we now need to supply a prior distribution as a parameter to this evaluation recipe.
0:27:52 You're free to choose whatever prior you want; the prior does not have to reflect the proportions of the classes in your data. So if you want to emphasise one class rather than another, for example to emphasise some classes with very little data, as was spoken about this morning, you can do that. Of course, if you have data of one class, multiplying it by some number isn't going to make data appear magically, but the prior does give you some control over what you want to emphasise.
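A sketch of this evaluation recipe in code (the log-likelihoods and priors are invented): the evaluator supplies the prior, Bayes' rule turns the likelihoods into a posterior, and the penalty is minus the log of the true class's posterior.

```python
import math

def penalty(log_likelihoods, prior, true_class):
    # Bayes' rule in the log domain: add log prior, normalize (a softmax),
    # then apply the logarithmic cost to the true class's posterior.
    scores = [ll + math.log(p) for ll, p in zip(log_likelihoods, prior)]
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return -(scores[true_class] - log_norm)

trial = [math.log(0.5), math.log(0.3), math.log(0.2)]     # log-likelihoods, class 1 true

flat     = penalty(trial, [1/3, 1/3, 1/3], true_class=0)  # uniform prior
emphatic = penalty(trial, [0.1, 0.1, 0.8], true_class=0)  # emphasise class 3
```

With the flat prior the penalty is just minus log 0.5; emphasising the third class makes this class-1 trial look worse, which is how the prior reweights what the evaluation cares about.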
0:28:34 So let's get back to the graph that we showed earlier. I've motivated that we want the recognizer to output likelihoods, and that cross-entropy forms a nice calibration-sensitive cost function to tell you how well it's doing. Now we can also send the output of the recognizer into a simple calibrator. A calibrator can be anything; in general it's a good idea to make it very simple. You spend a whole lot of energy on building a strong recognizer; the calibrator should be simple and easy to do, but you can gain a lot out of it. What this strategy does, as I explained before: it trains, that is optimizes, the calibrator to tell you how well you could have done if your calibration had originally been good. Then you can compare the two, and the difference you can call the calibration loss. If you build a recognizer and the output is well calibrated, then the calibration loss will be small and you can be very happy; otherwise you have to go and apply some calibrator before you can apply the recognizer.
0:30:07 Now, to be brief about how to calibrate: the theory is very basic. We've got a basic recognizer that outputs class likelihoods, so we just roll all the likelihoods into one vector and call it a likelihood distribution. And then we say: well, we've mentioned that these likelihoods are not well calibrated; they don't make good Bayes decisions. So let's put another probabilistic modelling step on top of that. It's going to demote the status of this likelihood vector to just that of a feature, or a score if you want. Or your original recognizer might have been an SVM; the SVM doesn't even pretend to produce calibrated likelihoods, and the output is just a score. That's fine, we can just use that as the input to the next modelling stage. So you have complete freedom in what you're going to use for the next modelling stage. It can be parametric or non-parametric, it could be more or less Bayesian, it can be discriminative or generative, as long as it works.
0:31:41 Of the calibration strategies I've tried and tested, the one I'm showing you is still my favourite, and it's very simple: take the log-likelihoods, scale them with a class-independent scale factor, and shift them with class-dependent offsets. That gives you a recalibrated log-likelihood. We train the coefficients, the scale and the offsets, discriminatively, typically using, again, cross-entropy: the average logarithmic cost. And because the cross-entropy optimizes calibration, that is why this recipe optimizes calibration. Besides these discriminative recipes, I've worked with generative ones as well, and those also do the job.
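A toy sketch of this recipe (the six scored trials, the learning rate, and the finite-difference training loop are all my own assumptions, not the speaker's code): one shared scale and per-class offsets, trained to minimize the multiclass cross-entropy of the recalibrated log-likelihoods under a flat prior.

```python
import math

# (log-likelihood vector, true class) for a tiny invented held-out set
trials = [
    ([2.0, 0.0], 0), ([1.5, 0.5], 0), ([3.0, 0.1], 0),
    ([0.2, 1.8], 1), ([0.0, 2.5], 1), ([0.4, 0.9], 1),
]

def cross_entropy(alpha, beta):
    """Average -log posterior of the true class after the affine recalibration."""
    total = 0.0
    for ll, true in trials:
        z = [alpha * x + b for x, b in zip(ll, beta)]     # scale + offsets
        log_norm = math.log(sum(math.exp(v) for v in z))  # flat-prior softmax
        total += log_norm - z[true]
    return total / len(trials)

# Crude discriminative training: gradient descent via finite differences.
alpha, beta, lr, eps = 1.0, [0.0, 0.0], 0.2, 1e-5
for _ in range(300):
    g_a = (cross_entropy(alpha + eps, beta) - cross_entropy(alpha - eps, beta)) / (2 * eps)
    g_b = []
    for k in range(len(beta)):
        up, dn = beta[:], beta[:]
        up[k] += eps
        dn[k] -= eps
        g_b.append((cross_entropy(alpha, up) - cross_entropy(alpha, dn)) / (2 * eps))
    alpha -= lr * g_a
    beta = [b - lr * g for b, g in zip(beta, g_b)]

raw_ce = cross_entropy(1.0, [0.0, 0.0])     # before calibration
calibrated_ce = cross_entropy(alpha, beta)  # after training scale and offsets
```

In any real system one would use a proper optimizer, but the shape of the recipe is the same: a handful of parameters, trained on held-out data, with cross-entropy as the objective.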
0:32:39 I might just mention that, for example, if you're doing automatic language recognition, you might extract what we call an i-vector. The i-vector represents the whole input segment of speech, and then you can just go and do a large multiclass logistic regression, and that will output likelihoods. That already uses cross-entropy as an objective function, so why would you need to calibrate it afterwards?
0:33:10 The problem is, to make the large logistic regression work well you need to regularize, and the regularization will typically skew the calibration, because now we're not making minimum-expected-cost Bayes decisions anymore. So regularization is necessary, but it skews calibration. In practice it's a good idea to hold out some data to use for calibration: on part of your data you train your original recognizer, and the held-out set you use for training your calibrator. In practice we've found, in speaker and in language recognition, that this kind of recipe works very well.
0:33:59 This calibrator is itself just another form of logistic regression, just a very much constrained one. You can of course replace the simple scale factor with a full matrix if you want; that would then be unconstrained logistic regression. That also works, but you have to be a bit more careful and you need enough data. In general, this simple recipe is very safe and usually very effective.
0:34:29 I'll just give you one real-world example. Not quite real-world: it's a NIST evaluation, so almost real-world. We look at an example from the 2007 NIST Language Recognition Evaluation: the original accuracy of systems that were competing in this evaluation, and then the improvement after recalibration with the recipe I've just shown. On the vertical axis is the evaluation criterion, which was defined specifically for the language recognition evaluation. It's a little bit too complicated to explain here, but it's enough to know that this is a calibration-sensitive criterion: you do better if your calibration is better, and lower is better, because it's a cost function. The blue bars are the original submissions, and next to them the same systems after being recalibrated; you get an improvement in all the systems. I must mention that the recalibration was done not on the evaluation data but on some independent calibration data, so this is not a cheating recalibration.
0:35:54time to summarize
0:36:00the job of
0:36:02posteriors or likelihoods
0:36:04is in the end to mike cost effective decisions
0:36:07if we're gonna user recognizers for anything in the end
0:36:10it makes decisions it outputs some something heart or it does some action that was
0:36:16all decisions
0:36:20that's very cost
0:36:21tells us how good the or
0:36:23if we want them to minimize cost that cost tells us how good they all
0:36:29cross entropy is just the representation
0:36:32of that same cost it's just it's movie over a range of operating points
0:36:38and calibration can be measured and improved
0:36:44 I've put this presentation at this first URL if you want to find it. I have some of my publications on calibration, and some code as well: there are some MATLAB toolkits at the next URL. There's also the URL of Agnitio. If, after seeing this on the screen here, somebody goes away and wonders whether he's got his recognizers well calibrated, please go and try this recipe. It can tell you how good your calibration is. So that's my take-home message. We probably have time for some questions.
0:37:40 Any questions?

0:37:44 How generically usable is this in terms of the number of classes? All these techniques, what if you have many more classes?

0:37:54 I honestly don't know. In language recognition we usually deal with fewer than thirty languages. Of course, I think if you have lots of data, like you guys have, it has the potential to work for very many classes, but I think if you don't have enough data per class you'll be in trouble.
0:38:32 In the language ID area we typically focus a lot on found data, and quite often across languages you'll have a mismatch in the amount of data, so differences in training. There was a lot of discussion yesterday on low-resource languages, and I'm expecting that you probably also see varying amounts of data. Could you comment on how some of the folks in the language ID area might try to address varying amounts of data for improving language ID?
0:39:11right, in this slide, before the likelihoods,
0:39:16i had the prior, which you can choose, so you can use that prior to
0:39:20essentially weight
0:39:26the data, so that you could up-weight the classes which are not well represented.
0:39:30you can reweight them so that to the cross-entropy it looks as if there's
0:39:35more of that class than there really is.
0:39:38of course, that doesn't magically give you more data.
0:39:46in the end,
0:39:49cross-entropy just really measures error rate.
0:39:52as i showed you, cross-entropy is constructed with
0:39:57a step function, so it's really counting errors. cross-entropy is counting errors,
0:40:04and if there are very few errors it's going to have a bad estimate of the
0:40:07error rate.
0:40:09by multiplying
0:40:14that error rate,
0:40:16which is inaccurate, with a large number, you're going to multiply that inaccuracy. so
0:40:23you should use those kinds of reweightings with care.
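the prior reweighting described in this answer can be sketched as follows. this is a minimal illustration, not the speaker's actual toolkit code: the function name is made up, and it assumes each class appears at least once in the labels. the chosen prior both forms the posteriors and replaces the empirical class proportions in the average, so an under-represented class counts as if there were more of it.

```python
import numpy as np

def weighted_cross_entropy(log_likelihoods, labels, prior):
    """Prior-weighted multiclass cross-entropy (illustrative sketch).

    log_likelihoods: (N, K) per-class log-likelihoods from the recognizer.
    labels:          (N,)   true class indices (each class present at least once).
    prior:           (K,)   chosen prior, used both to form the posteriors
                            and to weight the per-class averages.
    """
    log_post = log_likelihoods + np.log(prior)  # Bayes' rule, unnormalized
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)  # normalize
    ce = 0.0
    for k in range(len(prior)):
        mask = labels == k
        # average log-posterior of the true class, weighted by the chosen
        # prior rather than by the empirical class proportions
        ce -= prior[k] * log_post[mask, k].mean()
    return ce
```

with flat likelihoods the posterior equals the prior, so a uniform two-class prior gives cross-entropy log 2, the usual "no information" reference point.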
0:40:32well, here at asr,
0:40:35life is a bit more difficult than in speaker id or language id,
0:40:38because there, basically, you need the prior
0:40:40to produce one decision per file or per utterance.
0:40:44so people play with actually also segmenting the output, recognizing the
0:40:50chunks, computing posteriors, generating lattices and this kind of stuff.
0:40:55imagine that there is an asr phd student coming to you
0:41:00asking you what you find wrong with
0:41:03what all these folks are doing. what would be the first thing that you would
0:41:07critique from a calibration perspective? what would you advise?
0:41:13i would need to go and study
0:41:16speech recognition more carefully
0:41:21before i would be able to answer that question.
0:41:32i thought the more obvious application would be in score normalisation for keyword spotting.
0:41:37i mean, people are using discriminative techniques to sort of estimate the probability
0:41:43of error and weight them, using normalized scores. any thought about
0:41:48how this could be applied to that application?
0:41:53well, what about the additional complications?
0:41:56so this
0:41:59term-weighted cost function
0:42:02has this nasty little thing that keeps on jumping around.
0:42:10in what i showed you here,
0:42:13we assumed the cost function is known.
0:42:17you can make minimum-expected-cost bayes decisions if you know what the cost is.
0:42:21in the term-weighted thing,
0:42:25the cost depends on how many times the keyword is in the test data.
0:42:31i don't know that, so that complicates matters considerably.
0:42:44you would still do very well by going to calibrate the output of your recognizer.
0:42:53but once you've got that likelihood,
0:42:55what are you going to do then
0:42:57to produce the final output that you're going to send to
0:43:01the evaluator?
0:43:02that gets complicated,
0:43:04and there's all kinds of normalizations and things involved.
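the minimum-expected-cost bayes decision mentioned in this answer can be sketched as below, assuming the standard two-class detection setup with a known target prior and known miss/false-alarm costs. the function names are illustrative; the point of the answer is precisely that in keyword spotting these cost parameters depend on the unknown keyword frequency, so the threshold cannot be fixed in advance.

```python
import math

def bayes_threshold(p_target, c_miss, c_fa):
    """Threshold on the log-likelihood-ratio axis for a minimum-expected-cost
    Bayes decision: accept when llr > log(C_fa*(1-P_tgt)) - log(C_miss*P_tgt)."""
    return math.log(c_fa * (1.0 - p_target)) - math.log(c_miss * p_target)

def decide(llr, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Two-class Bayes decision from a calibrated log-likelihood-ratio."""
    return llr > bayes_threshold(p_target, c_miss, c_fa)
```

with equal priors and equal costs the threshold sits at zero; making false alarms ten times more expensive pushes it up by log 10, so only more confident scores are accepted.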
0:43:20and this is a question about applications; i'm asking about, like, a real-world example.
0:43:27i noticed that you were evaluating at a specific operating point, so i won't
0:43:33go into the details,
0:43:38but i think in a lot of real-world applications you don't know
0:43:43that operating point in advance: it may depend on the context, or something else.
0:43:49so the problem is, in the case where
0:43:53you don't know the operating point, would the calibration still be good?
0:43:59right, this question makes sense; it's interesting.
0:44:04and in both cases, yes:
0:44:08if you use this logarithmic cost function,
0:44:14it tends to make the output of your recognizer
0:44:18good over a very wide range of operating points.
0:44:22so especially if you just have two classes that you want to recognise,
0:44:26you can...
0:44:30i showed that axis of the posterior between zero and one,
0:44:33but if you take the logit of the posterior, then that axis becomes infinite.
0:44:39then you can move that...
0:44:40that threshold
0:44:42all the way from minus infinity to plus infinity. but if you move it too far,
0:44:47you're going to be
0:44:48going into regions where there's no more data, no more errors, and it doesn't make any
0:44:52sense anymore.
0:44:54there's a limited range
0:44:57on that axis
0:44:59where you can do useful stuff,
0:45:01and the logarithmic cost function typically
0:45:06evaluates nicely and widely over that useful range.
0:45:12if, for example,
0:45:13instead of taking the logarithm,
0:45:17you take
0:45:19one minus p, squared,
0:45:22squared loss, sometimes called brier loss,
0:45:27you get a narrower coverage:
0:45:31that doesn't cover applications as widely as this case does.
0:45:37you can go even wider if you want;
0:45:40then you get a kind of exponential loss function, which is associated with boosting
0:45:44in machine learning.
0:45:48i have an interspeech paper
0:45:52that
0:45:54explores that kind of thing in detail: what if you have other cost functions, not
0:46:00cross-entropy.
0:46:03you should find a link to that on the
0:46:06web page.
0:46:09so the answer is basically: it's a very good idea to use cross-entropy.
0:46:13if you optimise your recognizer
0:46:17to have good cross-entropy, it's generally going to work for whatever you want to
0:46:21use it for.
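the three losses compared in this answer can be put side by side in a small sketch. this is illustrative only: the exponential loss is written in the common two-class form exp(-logit(p)/2), which is an assumption here, not necessarily the exact form in the speaker's paper.

```python
import math

# Losses incurred when the recognizer assigns posterior p to the true class
# of a two-class problem.
def log_loss(p):          # logarithmic loss (cross-entropy)
    return -math.log(p)

def brier_loss(p):        # squared loss, sometimes called Brier loss
    return (1.0 - p) ** 2

def boosting_loss(p):     # exponential loss exp(-logit(p)/2), as in boosting
    return math.sqrt((1.0 - p) / p)

# Along the logit axis, the Brier loss saturates (it is bounded by 1), the
# log loss keeps growing linearly in the logit, and the exponential loss
# grows fastest of all -- i.e. they penalize confident errors over
# narrower or wider ranges of operating points.
for logit in (-6.0, -2.0, 0.0, 2.0, 6.0):
    p = 1.0 / (1.0 + math.exp(-logit))
    print(f"{logit:+.0f}  log={log_loss(p):7.3f}  "
          f"brier={brier_loss(p):5.3f}  exp={boosting_loss(p):7.3f}")
```

at p = 0.5 all three agree on "no information" values (log 2, 0.25, 1.0); for a badly wrong posterior like p = 0.01 the ordering boosting > log > brier shows the widening coverage described above.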
0:46:25alright, let's thank the speaker again.