Speech Transcript - Calibration of binary and multiclass probabilistic classifiers in automatic speaker and language recognition

0:00:15	so good morning thanks very much for inviting me
0:00:19	as better mention i'm not in mainstream speech recognition
0:00:24	so
0:00:25	i hope what i chose to talk about will be interesting to you
0:00:29	so
0:00:30	before i go on we just
0:00:32	okay just
0:00:36	just about agnitio i probably
0:00:39	agnitio is a startup that's been around since about two thousand and four
0:00:44	agnitio is latin for i recognise so agnitio specialises in automatic
0:00:50	speaker recognition just that
0:00:52	and it sells
0:00:54	a range of products
0:00:56	that make use of this technology in many different countries in the world has its
0:01:01	main office in madrid in spain
0:01:03	and also offices close to washington and california
0:01:07	and we have a small research lab in south africa so that's where i'm based
0:01:15	so just to make sure we're on the same page
0:01:18	that we know what we're talking about
0:01:21	everybody knows
0:01:22	what
0:01:23	speech recognition is about
0:01:25	speaker recognition is who's
0:01:29	i often find it very difficult to explain to people after i've explained for two
0:01:34	minutes they will still understand speech recognition
0:01:38	and then of course
0:01:39	there's automatic language recognition or called spoken language recognition
0:01:43	just to tell
0:01:44	given a speech segment
0:01:46	which language was this
0:01:49	so
0:01:51	in speaker and also language recognition
0:01:54	we've inherited some stuff from speech recognition
0:01:58	mostly just the acoustic modeling so
0:02:02	the features mfccs and gmms
0:02:05	we do but slide back with neural networks we haven't we've tried but they
0:02:10	don't work as well as the gmms do
0:02:15	so
0:02:16	we take the acoustic modeling and then we do some relatively simple back-end recognition
0:02:22	it's very simple compared to your language modeling and or decoders
0:02:27	so
0:02:28	this talk is going to be deep rather than wide i'm going to concentrate just
0:02:33	on the back-end recognition part and just on a tiny aspect of that
0:02:39	namely calibration
0:02:43	and i hope maybe
0:02:46	you guys find something in this talk useful that you can maybe use
0:02:50	so
0:02:52	what is calibration
0:02:53	it concerns the goodness of soft decisions so you have a recognizer that outputs
0:02:58	some kind of a soft decision a classifier if you want
0:03:03	and
0:03:03	then it can be understood in two senses
0:03:07	first of all calibration is just how good
0:03:10	is the output of my recognizer
0:03:12	or
0:03:13	it's whatever you do to make it better so if you make the output
0:03:17	of your recognizer better you're calibrating it
0:03:20	so we'll talk about both
0:03:22	so
0:03:24	i'm not expecting everybody to understand
0:03:28	this diagram this is just
0:03:30	a road map of what we're going to talk about i'll come back to this
0:03:34	diagram
0:03:35	so
0:03:38	i'm going to motivate that if you want your recognizer to output
0:03:42	a soft decision
0:03:45	likelihoods rather than posteriors
0:03:47	is what you want and
0:03:50	how to evaluate the goodness
0:03:52	of these outputs via cross entropy
0:03:55	so the cross entropy gives you a calibration sensitive loss function to measure the goodness
0:04:00	of the output of the recognizer
0:04:02	then we can take the raw output and we can somehow calibrate it so
0:04:07	i'll talk about some simple calibrators
0:04:10	and then
0:04:11	you can have this kind of a feedback loop
0:04:13	to essentially optimize away the effect of the calibrator and then
0:04:19	that gives you in a sense a calibration insensitive loss
0:04:23	which tells you
0:04:24	how well could i have done if my system had been optimally
0:04:29	calibrated
0:04:30	and then you can compare the two
0:04:32	and that will tell you
0:04:34	how good was my calibration
0:04:38	so let's start at the beginning
0:04:40	so
0:04:41	the canonical speaker recognition problem
0:04:46	we usually view that as a two class classification problem so the input is a
0:04:51	pair of speech segments often just call the enrollment segment and the test segment
0:04:56	and then
0:04:57	the output is either class one the segments have the same speaker or class
0:05:02	two they're different
0:05:04	so as an example of a multiclass classifier
0:05:09	we take language recognition
0:05:12	so there we can define a number of language classes
0:05:16	and
0:05:18	if the french and the audience are wondering why they're not there but all their
0:05:22	neighbors that's some other language
0:05:27	so let's look at the
0:05:31	output
0:05:32	the form of the output of
0:05:35	a classifier recognizer so the simplest form would just be to output
0:05:40	a hard class okay
0:05:43	if you want to soft output that might be
0:05:46	posterior distribution
0:05:47	or we can go to the other side of bayes rule and output a likelihood
0:05:51	distribution i'm going to motivate the last one is preferable
0:05:58	hard decisions
0:05:59	there's some people
0:06:00	but it's a bad idea unless error rate is really low
0:06:05	and it cannot make use of context it cannot make use of independent prior information
0:06:10	the posterior is to end users still intuitive to understand what the posterior's telling them
0:06:17	and it conveys confidence
0:06:20	so
0:06:20	you can you can
0:06:24	recover from an error because you see it coming you know you can make errors
0:06:27	you could make optimal
0:06:29	minimum expected cost bayes decisions if you have a posterior so that's a much more
0:06:33	useful output
0:06:35	the problem with the posterior is
0:06:39	the prior is implicit and hardcoded inside the posterior you can remove it by dividing it
0:06:45	out but then you also need to know what was the prior
0:06:48	so
0:06:49	a cleaner type of output
0:06:51	is just the likelihood
0:06:52	and then you can afterwards supply any prior
0:06:59	the only downside is it's somewhat harder to understand especially for end users but we're
0:07:04	not end users so
0:07:05	let's go with the likelihood
0:07:10	in the end
0:07:12	for the applications that we're looking at there really isn't that much difference between the
0:07:16	two if you have the posterior and the prior you can get back to the likelihood
0:07:20	or you go the other way
0:07:23	if
0:07:24	there's a small number of discrete classes you can always normalize the likelihood and you've got the
0:07:27	posterior
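
A minimal sketch of the step just described, going from a likelihood distribution plus a prior to a posterior via Bayes' rule. The function name, the three-class setup and all numbers are illustrative assumptions, not taken from the talk.

```python
import math

def posterior_from_loglik(loglik, prior):
    """Bayes' rule in the log domain: posterior_i is proportional to prior_i * likelihood_i."""
    scores = [ll + math.log(p) for ll, p in zip(loglik, prior)]
    m = max(scores)  # subtract the maximum before exponentiating, for numerical stability
    unnorm = [math.exp(s - m) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical log-likelihoods for three classes (illustrative numbers only).
loglik = [-10.0, -11.0, -12.0]
flat = posterior_from_loglik(loglik, [1 / 3, 1 / 3, 1 / 3])
skewed = posterior_from_loglik(loglik, [0.01, 0.01, 0.98])
print(flat, skewed)
```

With the flat prior the posterior simply follows the likelihoods; with the skewed prior the third class wins even though its likelihood is the lowest, which is the sense in which the same likelihood output can be reused under any prior.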
0:07:30	so
0:07:31	let's look at some examples of how we might in fact
0:07:35	use the likelihood
0:07:37	so
0:07:38	a language recognizer would output the likelihood distribution across the number of languages
0:07:44	but then
0:07:44	let's say we know we're in olomouc today
0:07:48	we're more likely to hear
0:07:51	czech being spoken on the street very unlikely to hear my home language afrikaans
0:07:57	and
0:07:57	you combine these two sources of information
0:08:00	via bayes rule and the and the
0:08:03	the posterior then gives you the complete picture
0:08:09	maybe you could have a phone recognizer the same sort of recipe applies
0:08:15	output a likelihood distribution
0:08:17	and
0:08:18	the prior is the context in which you try to recognise that phone
0:08:22	and then the decoder combines everything
0:08:24	essentially forms a formal a kind of posterior
0:08:28	let's go to speaker recognition
0:08:31	this is an idealised
0:08:34	view
0:08:35	of what a forensic speaker recognizer might do
0:08:38	so
0:08:41	there was someone who
0:08:43	was careless enough to get himself recorded while he was committing a crime
0:08:47	the police have got hold of this speech sample
0:08:51	there is a suspect
0:08:53	the police ask the suspect nicely to provide another speech sample then you want to
0:08:58	compare the two
0:08:59	is this the same person or not
0:09:02	because there are just two classes you can conveniently form the likelihood ratio between
0:09:06	those two possibilities
0:09:09	and then if you have a very nice
0:09:12	bayesian courtroom inside the court
0:09:14	they would
0:09:16	this
0:09:17	the total effect of all the other evidence
0:09:20	as a kind of the prior
0:09:22	and then
0:09:24	if you have a very clever judge and jury they might act like bayes
0:09:28	rule and they can combine these two sources of evidence so
0:09:32	that's probably never really going to happen
0:09:35	but
0:09:35	still for me this is a useful model
0:09:39	to think what should my output look like
0:09:42	what should i be thinking about if i want this likelihood ratio that i output
0:09:47	to be
0:09:49	to do its job as well as possible
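
A small sketch of the two-class combination described here: the likelihood ratio from the recognizer and the prior from all other evidence are combined by Bayes' rule in log-odds form. The function name and the numbers are hypothetical.

```python
import math

def posterior_same_speaker(log_lr, prior_same):
    """Posterior log-odds = log likelihood ratio + prior log-odds (two-class Bayes' rule)."""
    prior_log_odds = math.log(prior_same / (1.0 - prior_same))
    posterior_log_odds = log_lr + prior_log_odds
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))

# The same likelihood ratio of 9 gives very different posteriors under different priors.
print(posterior_same_speaker(math.log(9.0), 0.5))
print(posterior_same_speaker(math.log(9.0), 0.01))
```

The recognizer only has to supply `log_lr`; whoever holds the other evidence supplies `prior_same`.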
0:09:54	so
0:09:56	this is an idealistic view but in practice recognizers are often badly calibrated in my
0:10:01	experience if you build a speaker or a language recognizer it's always badly calibrated
0:10:07	you can redesign the thing and do what you want it's going to be badly
0:10:11	calibrated it might be very accurate
0:10:14	but
0:10:15	still badly calibrated so
0:10:18	you need to
0:10:20	adjust its output
0:10:21	to get the full benefit of the output of the recognizer so
0:10:26	the tools we need
0:10:28	to make all this happen first of all you need to measure the quality of the
0:10:31	calibration
0:10:32	and then
0:10:33	you need some way to adjust it
0:10:37	so
0:10:38	first let's talk about the measurement
0:10:42	so
0:10:43	calibration applies to both posteriors and likelihoods
0:10:46	it's easier to explain this whole thing in terms of posteriors and then later we'll
0:10:50	go back to the likelihoods
0:10:53	so i'll use
0:10:56	two classes
0:10:57	as a running example again because it's easier to
0:11:02	explain and then later we'll go to the multiclass case so
0:11:07	there is a recogniser we represent it by the symbol R so all the posteriors are
0:11:12	conditioned on R because it's output by the by the recognizer
0:11:16	so
0:11:17	the posterior tells you two things
0:11:20	first is which class does it favour
0:11:23	if the one element is greater than other one it wants to be recognising class
0:11:27	one in this case
0:11:29	but then it also tells us the degree of confidence how much more is the
0:11:33	one element greater than the other one
0:11:35	so we can form the ratio we can take the log of the ratio
0:11:38	you could look at the entropy of the distribution
0:11:42	any anything like that will give you a measurement of the of the degree of
0:11:46	confidence
0:11:49	so
0:11:50	the question i'm trying to answer with this presentation is
0:11:54	the recognizer outputs a posterior distribution
0:11:58	we also know for this particular case which of the two classes was really true
0:12:05	was this a good posterior or not
0:12:08	another example would be
0:12:10	a weather predictor says there's an eighty percent chance of rain tomorrow
0:12:15	tomorrow arrives
0:12:17	it doesn't rain
0:12:19	how good was that
0:12:21	how good was that prediction
0:12:24	so first of all
0:12:26	if it's is the one is greater than other one
0:12:29	and that was in the right direction we know it favours the correct class so
0:12:34	at least that aspect of the
0:12:36	posterior was good
0:12:38	how do we judge this degree of confidence
0:12:42	what can we do
0:12:44	we don't have a reference posterior we're not given that in practice
0:12:48	we're just given the true class what can we say about the posterior
0:12:54	so
0:12:56	we'll design some penalty function
0:12:59	so
0:13:01	what this graph is telling us
0:13:02	the
0:13:04	recognizer output the posterior distribution
0:13:07	posterior for each of the two classes
0:13:10	and we know the true class so then on the bottom axis we plot
0:13:15	that
0:13:15	we plot that posterior distribution
0:13:19	for a single case it'll just be a single point on the x-axis
0:13:23	and then
0:13:24	if the posterior for the true class was one that's good
0:13:27	it was very certain of the thing that really happened
0:13:31	but if it says something that had really happened
0:13:35	is impossible according to the recognizer
0:13:39	that's very bad so we give it a high penalty maybe even an infinite penalty
0:13:45	an infinite penalty
0:13:47	might be a good idea
0:13:49	in practice
0:13:50	if you make a wrong decision it can have arbitrarily bad consequences
0:13:55	if
0:13:56	you like playing russian roulette
0:13:59	you've got the gun in your hand
0:14:01	you form some posterior in your head that says there's no bullet in the chamber
0:14:05	at the moment
0:14:07	and you put it to your head
0:14:09	the consequences of a bad posterior
0:14:14	can be arbitrarily bad so i like this idea of the penalty going up to
0:14:19	infinity
0:14:21	so
0:14:22	i've plotted two candidate functions here
0:14:28	it's easy to see
0:14:29	this should be a monotonic function
0:14:32	what should the shape be what principle should we use to design this penalty function
0:14:38	so
0:14:40	we'll take an engineering approach we'll say what do we want to use the
0:14:44	output for and how well does it do that
0:14:48	so
0:14:50	what can we do with the posterior
0:14:52	make minimum expected cost bayes decisions
0:14:55	so
0:14:56	in a speech recognizer it might be sending the phone posteriors into the decoder
0:15:01	but in the end it's still gonna make some decision at some stage it's gonna
0:15:05	output a transcription so in the end you're always making decisions
0:15:11	so
0:15:11	and then we just ask
0:15:13	how well
0:15:15	does it make these decisions and then
0:15:17	that very same cost that you're optimising
0:15:20	with the minimum expected cost bayes decision is gonna tell you how well you did
0:15:25	so
0:15:26	good posteriors can be used to make cost effective decisions
0:15:30	but badly calibrated posteriors
0:15:34	may be underconfident or overconfident in the wrong hypothesis and that will eventually lead
0:15:41	to a series of unnecessarily costly
0:15:44	errors
0:15:47	so let's look at decision cost function
0:15:51	so
0:15:52	the decision cost functions model the consequences of applying recognition technology in the real world
0:15:58	the real world is always more complex
0:16:02	engineers like
0:16:03	simple models that we can optimize
0:16:07	so
0:16:10	this should be a very familiar idea
0:16:12	we
0:16:13	we first look at the case of a hard decision
0:16:16	so
0:16:16	the recognizer says it's class one or class two
0:16:20	and we know that the true class is class one or class two and then
0:16:23	we assign some cost coefficient
0:16:26	so
0:16:27	in
0:16:28	this example
0:16:30	i've made the cost coefficients for the correct decisions
0:16:34	zero
0:16:35	and for errors there's a non-zero cost
0:16:39	so you might
0:16:40	want to work in terms of rewards
0:16:44	or you can even have a mixture of rewards and penalties there's the
0:16:50	what's called the term weighted value for keyword spotting
0:16:54	that's a mixture of a of a of a reward and a penalty so
0:16:58	in the end all those are equivalent you can you can play around with these cost
0:17:02	functions
0:17:03	and
0:17:04	for what we're gonna do
0:17:06	it's
0:17:07	it's convenient
0:17:08	to
0:17:09	not use rewards and just to
0:17:13	put the cost on the errors
0:17:16	so
0:17:17	now we apply it to a soft decision
0:17:21	so we let the recognizer output the posterior distribution
0:17:25	and then
0:17:26	when we evaluate its goodness
0:17:28	we make a minimum expected cost bayes decision so the bayes decision is made without
0:17:33	having the true class
0:17:36	and then
0:17:37	we treat that as a hard decision and evaluate it with this cost matrix as before
0:17:43	so
0:17:44	what we have now
0:17:45	is the goodness of the posterior
0:17:48	that we've output
0:17:50	very simple thing
0:17:53	it she
0:17:54	what we've achieved so
0:17:57	a couple of slides ago
0:18:00	i tried to convince you that
0:18:02	this kind of penalty function on the left is what we want
0:18:06	what we've achieved
0:18:07	is this step function
0:18:09	so
0:18:10	there's a threshold on the posterior
0:18:13	which is a function of the cost coefficients
0:18:16	but the cost is either some non-zero cost or else zero
0:18:20	so at least the step function has the right
0:18:24	sense
0:18:25	it's bigger where it needs to be and smaller where it needs to be
0:18:28	but
0:18:30	it's very crude and in effect it's only evaluating the goodness of your posterior at
0:18:35	a single point
0:18:37	it doesn't say anything about making decisions at any other operating point
0:18:42	so
0:18:42	we need to find a smoother solution
0:18:46	so
0:18:47	in order to smooth let's simplify it just a little bit
0:18:51	so
0:18:52	the bayes decision threshold
0:18:54	is a simple ratio of the costs
0:18:58	so we might think of it in terms of given the costs
0:19:02	compute the threshold
0:19:04	but let's do it the other way round
0:19:06	so let's say
0:19:08	let's choose the threshold at which we're going to evaluate we're free
0:19:12	to choose any threshold
0:19:14	and then
0:19:15	we make the
0:19:17	cost a function of the threshold and if you choose these simple reciprocal functions
0:19:24	we're still
0:19:25	applying the above equation
0:19:27	so
0:19:28	the above equation is able to
0:19:31	so let's look at let's look at this graphically
0:19:34	so
0:19:35	what we've achieved
0:19:37	the recognizer outputs Q one the posterior for class one Q two is just
0:19:42	just flip the axis you'll get Q two
0:19:45	so
0:19:46	the penalty when class one is true would be the red curve and the penalty
0:19:50	when class two is true
0:19:51	would be the blue curve and
0:19:54	the cost coefficients are a function of the threshold which we can
0:19:59	adjust at will so let's do that
0:20:02	we can move the threshold and with it
0:20:05	the cost coefficients
0:20:07	with which errors are gonna be penalised
0:20:09	will
0:20:11	will also change if you press
0:20:14	the threshold right against zero or one the penalty will be infinite but
0:20:18	that's good because then you won't shoot yourself in the head
0:20:24	so
0:20:25	by moving the threshold while we're evaluating the goodness of the posterior we're in fact
0:20:29	exercising the decision making ability of the posterior over its full range
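
A sketch of the thresholded step cost discussed above, under one assumed convention since the slide itself is not in the transcript: the threshold T is applied to Q one, a missed class one costs 1/T and a missed class two costs 1/(1-T). With those reciprocal costs, deciding class one has expected cost (1-Q1)/(1-T) and deciding class two has expected cost Q1/T, so the minimum expected cost decision is class one exactly when Q1 exceeds T. The function name is hypothetical.

```python
def step_cost(q1, true_class, t):
    """Cost of the Bayes decision made from posterior q1 for class 1 at threshold t (0 < t < 1).

    Decide class 1 when q1 >= t. A miss of class 1 costs 1/t, a miss of class 2
    costs 1/(1-t), and correct decisions cost nothing.
    """
    decide_class_1 = q1 >= t
    if true_class == 1:
        return 0.0 if decide_class_1 else 1.0 / t
    return 1.0 / (1.0 - t) if decide_class_1 else 0.0

# Sweep the threshold for one posterior, q1 = 0.7, when class 1 is true.
for t in (0.1, 0.5, 0.9):
    print(t, step_cost(0.7, 1, t))
```

Pushing `t` toward zero or one makes the corresponding error cost grow without bound, which matches the remark about the penalty becoming infinite at the edges.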
0:20:35	so we're almost done
0:20:40	let's just look at another view
0:20:42	this is the same thing
0:20:45	just another view we have the recognizer output the posterior
0:20:48	the posterior is compared against
0:20:51	the threshold
0:20:52	the threshold is a parameter chosen by the evaluators
0:20:57	and then it
0:20:58	also needs to know the true class and outputs the cost so
0:21:01	note the cost
0:21:03	is a function of three variables
0:21:06	the recognizer output the true class and this parameter T
0:21:11	so now let's integrate out T
0:21:15	so
0:21:16	the integrand here is the step cost function which i plotted a few
0:21:20	slides back
0:21:22	on the left hand side we get
0:21:25	a cost function which is now independent of the threshold because we've integrated
0:21:30	over the full range of the threshold
0:21:32	and
0:21:33	that turns out to be just this logarithmic cost function
0:21:37	so
0:21:38	you
0:21:39	take the logarithm of
0:21:42	the posterior for the true class
0:21:45	and
0:21:47	that
0:21:47	is the goodness of the posterior of the recognizer so
0:21:52	that now has
0:21:53	this nice smooth shape
0:21:56	which we were looking for
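
The integration step can be checked numerically. The sketch below assumes the reciprocal step costs 1/T and 1/(1-T) on a threshold applied to Q one (an assumed convention, as the slide is not in the transcript) and integrates over T with a simple midpoint rule. When class one is true the integral is the integral of 1/T from Q1 to 1, which is minus the log of Q1; when class two is true it is minus the log of 1-Q1. In both cases that is minus the logarithm of the posterior for the true class.

```python
import math

def integrated_cost(q1, true_class, steps=100_000):
    """Midpoint-rule integral over t in (0, 1) of the step cost at threshold t."""
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        decide_class_1 = q1 >= t
        if true_class == 1 and not decide_class_1:
            total += 1.0 / t          # class 1 missed
        elif true_class == 2 and decide_class_1:
            total += 1.0 / (1.0 - t)  # class 2 missed
    return total / steps

q1 = 0.25
print(integrated_cost(q1, 1), -math.log(q1))        # numerical integral vs. closed form
print(integrated_cost(q1, 2), -math.log(1.0 - q1))  # numerical integral vs. closed form
```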
0:22:01	so
0:22:01	that's two classes
0:22:04	now we're going to generalise to multiclass
0:22:07	so multi class
0:22:09	is a lot trickier
0:22:11	but the same general principles apply
0:22:14	so
0:22:15	we're still gonna work with minimum expected cost bayes decisions
0:22:20	but in this case we'll use a generalized threshold which i'll plot for you on the
0:22:24	next slide
0:22:25	and we again we're going to integrate out the threshold and get the similar results
0:22:31	so
0:22:34	in this plot we show the
0:22:36	output of a three class
0:22:39	recognizer
0:22:40	i chose three classes because i can plot it here on this nice flat screen
0:22:45	so Q one
0:22:46	is the posterior for class one
0:22:49	the vertical axis Q two the posterior for class two
0:22:53	and Q three we don't see but it's just the complement of the other two
0:22:57	so everything needs to live inside the simplex
0:23:01	then the
0:23:03	the tricky part
0:23:05	is
0:23:05	we now define a kind of a generalized threshold so this threshold
0:23:11	has three components T one T two and T three
0:23:14	and we constrain them to sum to one so this threshold
0:23:18	is defined
0:23:19	by this point where the lines meet
0:23:22	and that also lives inside the same simplex
0:23:25	and now again
0:23:27	we've chosen the threshold
0:23:28	then we choose the cost function so the cost function again
0:23:32	is this little equation at the bottom again it's just the reciprocal of the threshold
0:23:36	coefficients
0:23:39	and again
0:23:40	we can play around
0:23:42	we can move the
0:23:43	the
0:23:44	threshold
0:23:46	all around
0:23:47	the interior of the simplex
0:23:49	we can exercise the decision making ability
0:23:52	of the
0:23:54	of the recognizer
0:23:57	i should have told you the these
0:24:01	the
0:24:03	these lines that the structure of the threshold that is just the consequence of making
0:24:07	the minimum expected cost bayes decision
0:24:10	so
0:24:11	once you assigned those cost functions
0:24:14	that's what the threshold is gonna look like so
0:24:16	again if Q one is large you're gonna be in region R one choose
0:24:20	class one region R two you choose class two
0:24:24	and R three if the other two are small we're gonna choose class three
0:24:28	so
0:24:31	again
0:24:32	we've seen that we can move the threshold around now we can integrate it
0:24:37	so
0:24:38	the integral will cover
0:24:40	several slides
0:24:42	which i'm not going to show you
0:24:44	but the same kind of thing applies we just integrate out the
0:24:51	threshold
0:24:53	of this step cost function
0:24:55	and lo and behold
0:24:57	we get
0:24:58	the logarithmic function again
0:25:03	so
0:25:04	the whole recipe can be summarized like this
0:25:08	again the recognizer outputs a posterior distribution
0:25:12	in other words an element of the posterior for each of the classes
0:25:17	when we know what the true class is
0:25:19	we select that component and we just apply logarithm to it
0:25:24	so
0:25:24	if the recognizer says the true
0:25:27	the probability of the true class is one that's very good the penalty is zero
0:25:31	if it says the
0:25:34	probability of the true class is zero that's very bad the penalty is infinite
0:25:42	so
0:25:45	all of the preceding was for just one example
0:25:48	one input one output
0:25:50	if you have a whole database of data which is supervised
0:25:54	you can apply this
0:25:56	to
0:25:56	the whole database and you just average the logarithmic cost
0:26:01	and that is cross entropy which i trust you know very well
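
The recipe as summarized can be sketched in a few lines: pick out the posterior component of the true class, take minus its logarithm, and average over the supervised database. Function names and the toy data are illustrative assumptions.

```python
import math

def log_cost(posterior, true_class):
    """Logarithmic cost of one trial: minus the log of the posterior given to the true class."""
    p = posterior[true_class]
    return math.inf if p == 0.0 else -math.log(p)

def cross_entropy(posteriors, true_classes):
    """Average logarithmic cost over a supervised set of trials."""
    costs = [log_cost(q, c) for q, c in zip(posteriors, true_classes)]
    return sum(costs) / len(costs)

# Hypothetical three-class posteriors with their true class indices.
posteriors = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
true_classes = [0, 1, 0]
print(cross_entropy(posteriors, true_classes))
```

A posterior of one for the true class costs nothing and a posterior of zero costs infinity, so a single confidently wrong trial can dominate the average.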
0:26:05	so
0:26:07	that's perhaps the most well-known discriminative training objective not just in speech recognition in
0:26:13	all of machine learning
0:26:15	and it forms the basis for all kinds of other things with other names like
0:26:19	mmi logistic regression
0:26:24	it's perhaps not so well known
0:26:26	that it is a way of measuring calibration
0:26:29	you see that appearing from time to time for example
0:26:34	this book on gaussian processes they use they use cross entropy to do essentially
0:26:40	to measure calibration
0:26:44	and in the statistics literature
0:26:48	this thing is referred to as the logarithmic proper scoring rule
0:26:52	you get a whole bunch of other proper scoring rules which
0:26:55	can be derived in a similar way you just need to weight that integral but
0:26:59	the
0:27:00	the logarithmic one is very simple and generally just a good idea to use
0:27:08	so let's get back to the likelihoods
0:27:11	this is going to be very short and simple
0:27:16	we start with the recipe for the posterior which i showed just now
0:27:19	and now we just flip to the other side of bayes rule
0:27:23	so now we ask the recognizer give me a likelihood a likelihood distribution instead of a
0:27:28	posterior distribution
0:27:29	and when evaluating its goodness
0:27:32	we just send it through softmax or bayes rule if you will
0:27:36	and then apply the logarithm
0:27:38	and then
0:27:40	we also provide it with the prior
0:27:42	so
0:27:43	notice
0:27:44	that we need to now supply prior distribution
0:27:48	as a parameter to this evaluation recipe
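
A minimal sketch of the likelihood version of the recipe: the log-likelihood vector and a prior chosen by the evaluator go through softmax (Bayes' rule), and the logarithmic cost is read off for the true class. The function name and the numbers are assumptions for illustration.

```python
import math

def log_cost_from_loglik(loglik, prior, true_class):
    """Logarithmic cost of a log-likelihood vector, evaluated under a chosen prior."""
    scores = [ll + math.log(p) for ll, p in zip(loglik, prior)]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp
    return log_norm - scores[true_class]  # equals -log posterior[true_class]

loglik = [-3.0, -1.0]  # hypothetical two-class log-likelihoods
print(log_cost_from_loglik(loglik, [0.5, 0.5], 0))
print(log_cost_from_loglik(loglik, [0.9, 0.1], 0))
```

The same recognizer output receives a different cost under each prior, which is why the prior has to be stated as a parameter of the evaluation.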
0:27:51	so
0:27:52	you're free to choose whatever prior
0:27:56	the prior does not have to reflect the proportions of
0:27:59	the classes in your data so if you want to emphasise one class rather than
0:28:04	the other
0:28:06	for example better spoke about that this morning
0:28:09	to emphasise some classes
0:28:11	some rare data
0:28:12	you can do that of course
0:28:14	if you have no data of one class multiplying that by some
0:28:18	number isn't gonna make
0:28:20	data appear magically
0:28:23	but
0:28:24	the prior does give you some control over
0:28:27	of
0:28:29	where you want to emphasise
0:28:34	so that's
0:28:36	let's get back to
0:28:37	the graph that we showed earlier
0:28:40	so
0:28:41	i've motivated that
0:28:43	we want the recognizer to output likelihoods
0:28:47	that cross entropy forms a nice calibration sensitive
0:28:52	cost function to tell you how well it's doing
0:28:55	now we can also send the output of the recognizer into a simple calibrator so
0:29:01	a calibrator
0:29:02	can be anything
0:29:05	in general it's a good idea to make it very simple
0:29:08	you spend a whole lot of
0:29:10	energy on building a strong recognizer the calibrator should be
0:29:14	simple and easy to do
0:29:17	but you can gain a lot out of it
0:29:20	so
0:29:21	what the rest of it does as i've explained before
0:29:24	it trains and optimizes the calibrator to tell you how well could i
0:29:29	have done
0:29:30	if calibration originally had been
0:29:34	better
0:29:34	and then
0:29:35	you can compare the two
0:29:37	and then
0:29:39	you can
0:29:41	the difference you can call the calibration loss
0:29:44	if you build a recognizer
0:29:47	and the output is
0:29:49	well calibrated then the calibration loss will be small and you can be very happy
0:29:53	otherwise you have to go and
0:29:56	apply some calibrator before you want to apply the recognizer right
0:30:04	so
0:30:07	this will be brief how to
0:30:10	calibrate
0:30:12	so
0:30:13	the theory
0:30:14	is very basic
0:30:17	it's a
0:30:19	we don't
0:30:20	some
0:30:21	basic recognizer
0:30:22	which
0:30:24	outputs class likelihoods so
0:30:28	then we just
0:30:30	put all the likelihoods into one vector call it the likelihood distribution
0:30:35	and then we say well
0:30:38	we know
0:30:39	we've mentioned that
0:30:41	these likelihoods are not well calibrated they don't make good bayes decisions
0:30:46	so let's put another probabilistic modeling step on top of that
0:30:51	let's just demote
0:30:52	the status of this likelihood vector it's just another feature or a
0:30:56	score if you want
0:30:59	or your original recogniser might have been an svm the svm doesn't even pretend
0:31:05	to produce calibrated likelihoods
0:31:08	the output is just the score that's fine we can just
0:31:12	use that as the input to the next modelling stage so
0:31:18	you have complete freedom
0:31:19	of
0:31:20	what you're going to use for the next modelling stage
0:31:23	it can be parametric could be non parametric could be more or less bayesian
0:31:30	it can be discriminative it can be generative
0:31:34	as long as
0:31:35	as long as it works
0:31:38	so
0:31:41	i've tried and tested
0:31:43	various
0:31:44	calibration strategies
0:31:47	the one i'm showing you
0:31:49	is still my favourite it's very simple
0:31:52	so
0:31:53	you
0:31:54	take the log likelihoods you scale them with a class independent scale factor and
0:31:59	you shift them with a class dependent
0:32:01	offset
0:32:03	and that gives you a recalibrated
0:32:08	likelihood recalibrated log likelihood
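
Written out, the scale-and-offset calibrator described here might look as follows. The symbols are notation chosen for this note, not taken from the slides: \ell_i is the raw log-likelihood for class i, a is the single class-independent scale, b_i is the class-dependent offset, \pi_i is the prior supplied at evaluation time, and q_i is the resulting posterior.

```latex
\ell'_i = a\,\ell_i + b_i, \qquad i = 1, \dots, N
\qquad\Longrightarrow\qquad
q_i = \frac{\pi_i \exp(\ell'_i)}{\sum_{j=1}^{N} \pi_j \exp(\ell'_j)}
```

For N classes this has only N + 1 free parameters, which is why it can be trained safely on a modest amount of calibration data.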
0:32:12	so we train the
0:32:15	coefficients the scale and the offsets we train them discriminatively
0:32:18	and typically using again
0:32:21	cross entropy average logarithmic cost
0:32:23	and because
0:32:25	the cross entropy
0:32:28	optimizes calibration
0:32:30	this is why this recipe optimizes calibration so that's the discriminative recipe i've worked with
0:32:35	generative ones as well as i would do
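
A rough sketch of discriminative training for the scale-and-offset calibrator, minimizing cross-entropy. Everything concrete here is an assumption made for illustration: the plain gradient descent optimizer, the learning rate, the function names, and the synthetic two-class data whose log-likelihoods are deliberately made overconfident by a factor of three. The talk does not say which optimizer was used.

```python
import math
import random

def posterior(loglik, scale, offsets, prior):
    """Softmax of the recalibrated log-likelihoods plus log-prior."""
    scores = [scale * ll + b + math.log(p) for ll, b, p in zip(loglik, offsets, prior)]
    m = max(scores)
    unnorm = [math.exp(s - m) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def cross_entropy(data, scale, offsets, prior):
    """Average logarithmic cost of the recalibrated outputs."""
    return sum(-math.log(posterior(ll, scale, offsets, prior)[c]) for ll, c in data) / len(data)

def train_calibrator(data, prior, steps=500, lr=0.01):
    """Plain gradient descent on cross-entropy over one scale and one offset per class."""
    n_classes = len(prior)
    scale, offsets = 1.0, [0.0] * n_classes
    for _ in range(steps):
        g_scale, g_off = 0.0, [0.0] * n_classes
        for loglik, c in data:
            q = posterior(loglik, scale, offsets, prior)
            for i in range(n_classes):
                err = q[i] - (1.0 if i == c else 0.0)  # gradient w.r.t. score i
                g_scale += err * loglik[i]
                g_off[i] += err
        scale -= lr * g_scale / len(data)
        offsets = [b - lr * g / len(data) for b, g in zip(offsets, g_off)]
    return scale, offsets

# Synthetic calibration set: two Gaussian classes, log-likelihoods made overconfident on purpose.
random.seed(0)
data = []
for _ in range(200):
    c = random.randrange(2)
    x = random.gauss(1.0 if c == 0 else -1.0, 1.0)
    data.append(([3.0 * -0.5 * (x - 1.0) ** 2, 3.0 * -0.5 * (x + 1.0) ** 2], c))

prior = [0.5, 0.5]
before = cross_entropy(data, 1.0, [0.0, 0.0], prior)
scale, offsets = train_calibrator(data, prior)
after = cross_entropy(data, scale, offsets, prior)
print(before, after, scale, offsets)
```

The gap between `before` and `after` corresponds to what the talk calls the calibration loss. In this sketch it is measured on the same data the calibrator was trained on, whereas in practice the calibrator would be trained on held-out data and evaluated elsewhere.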
0:32:38	so
0:32:39	i might just mention that
0:32:42	for example if you're doing automatic
0:32:45	language recognition you might
0:32:47	extract what we call an i-vector
0:32:50	so the i-vector represents the whole
0:32:53	input segments of speech
0:32:55	and then you can just go and do a large multi class logistic regression
0:33:00	and that will output likelihoods
0:33:02	so
0:33:03	that already uses cross entropy as an objective function why would you need to calibrate
0:33:08	it again
0:33:10	so the problem is
0:33:11	to make the large logistic regression
0:33:14	work you need to regularize the regularization
0:33:18	will typically skew the calibration because now we're not
0:33:22	optimising
0:33:23	the
0:33:26	minimum expected cost bayes decisions anymore so regularization is necessary but it skews calibration so
0:33:33	in practice
0:33:35	it's a good idea to
0:33:37	hold out some data
0:33:40	to use for calibration so
0:33:42	on part of your data you train your original recognizer
0:33:46	the held out set you use for training your calibrator
0:33:49	so in practice we found in speaker and in language recognition
0:33:54	this kind of recipe
0:33:56	works very well
0:33:59	and then the rest is just another form of logistic regression
0:34:04	just very much constrained
0:34:06	you can of course
0:34:07	you can multiply the vector with a full matrix if you want
0:34:10	that would then be unconstrained logistic regression
0:34:14	that also works but you have to be a bit more careful you need enough data
0:34:19	in general this simple recipe is very safe and very effective usually
0:34:28	so
0:34:29	i'll just give you one real world example not real world it's the
0:34:34	nist evaluation
0:34:38	almost real world
0:34:41	so
0:34:42	we look at an example of the two thousand and seven nist language recognition evaluation
0:34:49	we look at the original accuracy of for one
0:34:52	systems that were competing in this evaluation
0:34:55	and then we look at the improvement after recalibration with the recipe which i've just
0:34:59	shown
0:35:01	so
0:35:01	on the vertical axis is the evaluation criterion which was defined specifically for the language
0:35:08	recognition evaluation it's a little bit too complicated to explain here but it's enough to know
0:35:14	this is a calibration sensitive criterion you do better if your calibration is better
0:35:20	and
0:35:22	lower is better it's a cost function
0:35:25	so
0:35:26	the blue ones are the original submissions and after being recalibrated
0:35:31	you get you get an improvement in all the systems so
0:35:35	i must mention that the recalibration was done on
0:35:39	not on the evaluation data but on some independent calibration data so this is not
0:35:45	a cheating recalibration
0:35:50	so
0:35:52	we're done
0:35:54	time to summarize
0:35:56	so
0:36:00	the job of
0:36:02	posteriors or likelihoods
0:36:04	is in the end to make cost effective decisions
0:36:07	if we're gonna use our recognizers for anything in the end
0:36:10	it makes decisions it outputs some something hard or it does some action those are
0:36:16	all decisions
0:36:17	so
0:36:20	that very cost
0:36:21	tells us how good they are
0:36:23	if we want them to minimize cost that cost tells us how good they are
0:36:28	and
0:36:29	cross entropy is just the representation
0:36:32	of that same cost it's just it's smoothed over a range of operating points
0:36:38	and calibration can be measured and improved
0:36:44	i've put this presentation at this url if you want to find
0:36:48	it
0:36:49	i have some of my
0:36:51	publications on calibration and some code as well there are some matlab toolkits
0:36:57	at the next url
0:36:58	and
0:37:00	there's also the url of
0:37:02	agnitio
0:37:03	so
0:37:04	although
0:37:07	you this
0:37:08	on the screen here
0:37:11	if somebody goes home and he wonders whether
0:37:14	he's got his recognizers well calibrated
0:37:19	please go and try this recipe
0:37:22	this can tell you how good your calibration is so that's my take home message
0:37:29	we probably have time for some questions
0:37:40	any questions
0:37:44	how generally usable is this
0:37:47	in terms of the number of classes
0:37:49	all these techniques
0:37:51	what if you want a thousand classes
0:37:54	i honestly don't know
0:37:56	in language recognition
0:37:59	we've
0:38:01	used typically less than thirty languages
0:38:03	so
0:38:07	of course i think if you have lots of data like you guys have then
0:38:11	it should work for very many classes but
0:38:16	i think
0:38:17	if you don't have enough data per class you'd be in trouble
0:38:32	nice talk in the language id area we typically focus a lot on
0:38:39	found data
0:38:41	and quite often across languages you'll have a mismatch in the amount of
0:38:45	data
0:38:46	so differences in training
0:38:49	i think there's been a lot of discussion yesterday on low resource languages i'm expecting
0:38:54	that you probably also see varying amounts of data there could you comment on
0:38:58	how some of the folks in the language id area might try to address
0:39:04	varying amounts of data for improving language id
0:39:11	right in this slide so there before the likelihoods
0:39:16	i had the prior which you can choose so you can use that prior to
0:39:20	essentially weight
0:39:22	the
0:39:26	the data so that you could reweight the classes which are not well represented
0:39:30	you can reweight them so that to the cross entropy it looks as if there's
0:39:34	more than
0:39:35	more of the class than there really is so
0:39:38	of course that doesn't magically make the data more
0:39:42	so
0:39:45	if you
0:39:46	in the end you just
0:39:47	what
0:39:49	cross entropy just really measures error rate
0:39:52	as i showed you cross entropy is constructed with that
0:39:57	step function so it's basically counting errors cross entropy is counting errors
0:40:03	so
0:40:04	if there are very few errors then it's gonna have a bad estimate of the
0:40:07	error rate
0:40:09	so
0:40:09	by multiplying
0:40:13	that error rate
0:40:14	which has
0:40:16	which is inaccurate with a large number you're gonna multiply that inaccuracy so
0:40:23	you should use that kind of reweighting with care
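The reweighting idea in this answer can be sketched as a prior-weighted cross entropy: each class's average loss is weighted by a prior you choose, instead of by how often the class happens to occur in the data. This is a minimal sketch under that assumption; the function name and the toy inputs are hypothetical, and it assumes every class has at least one trial.

```python
import numpy as np

def prior_weighted_cross_entropy(loglik, labels, prior):
    """loglik: (trials, classes) log-likelihoods; labels: true class per trial;
    prior: chosen class prior, used both to form posteriors and to weight classes."""
    loglik = np.asarray(loglik, dtype=float)
    labels = np.asarray(labels)
    prior = np.asarray(prior, dtype=float)
    z = loglik + np.log(prior)
    z = z - z.max(axis=1, keepdims=True)           # numerical stability
    logpost = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    total = 0.0
    for k, p in enumerate(prior):
        rows = labels == k                         # trials whose true class is k
        # Per-class average loss, weighted by the chosen prior, so a rare class
        # counts as if it occurred with probability p.
        total += p * -logpost[rows, k].mean()
    return total

# Toy data: 90 trials of class 0, 10 of class 1, scores always favouring class 0.
toy_loglik = np.tile([2.0, 0.0], (100, 1))
toy_labels = np.array([0] * 90 + [1] * 10)
wce = prior_weighted_cross_entropy(toy_loglik, toy_labels, [0.5, 0.5])
```

With only 10 trials behind it, the class 1 term is a noisy estimate, and giving it weight 0.5 scales that noise up along with the estimate, which is the caution raised in the answer.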
0:40:32	well here in asr actually
0:40:35	life is a bit more difficult than in speaker id or language id
0:40:38	because there basically you need
0:40:40	to produce one decision per file or per utterance
0:40:44	so here people have to play with actually also segmenting the output recognizing the
0:40:50	chunks computing posteriors generating lattices and this kind of stuff
0:40:55	imagine that there is an asr phd student coming to you
0:41:00	asking you what you find from with what you
0:41:03	what all these folks are doing what would be the first thing that you would
0:41:07	take from your calibration perspective what would you advise
0:41:13	i would need to go and study
0:41:16	speech recognition more carefully
0:41:19	before
0:41:21	before i would be able to answer that question
0:41:32	i thought the more obvious application would be in score normalisation for keywords
0:41:37	but i mean people are using discriminative techniques to sort of estimate the probability
0:41:43	of error and weighting them and using unnormalized scores any thought about
0:41:48	how this would apply to that application
0:41:53	but what about the additional complications
0:41:56	so this
0:41:59	term weighted cost function
0:42:02	has this nasty little thing that keeps on jumping around
0:42:08	in
0:42:10	in what i showed you here
0:42:13	we assume the cost function is known
0:42:15	so
0:42:17	you can make minimum expected cost bayes decisions if you know what the cost is
0:42:21	in the term weighted thing
0:42:25	the cost depends on how many times the keyword is in the test data
0:42:30	and
0:42:31	i don't know that so that complicates matters considerably
0:42:44	you would still do very well by going to calibrate the output of your
0:42:50	recognizer
0:42:52	but
0:42:53	once you've got that likelihood
0:42:55	what are you gonna do then
0:42:57	to produce your final output that you're going to send
0:43:01	to the evaluator
0:43:02	that gets complicated
0:43:04	and there's all kinds of normalizations and things involved
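The minimum expected cost Bayes decision mentioned in this answer, for the case where the cost function is known, can be sketched as follows. This is an illustrative sketch only: the function name, the cost matrix layout and the numbers are hypothetical, not taken from the talk.

```python
import numpy as np

def bayes_decision(loglik, prior, cost):
    """Pick the decision with the lowest expected cost.

    loglik: calibrated log-likelihood per class for one trial
    prior:  class prior for the application
    cost:   cost[i][j] = cost of taking decision j when the true class is i
    """
    z = np.asarray(loglik, dtype=float) + np.log(prior)
    post = np.exp(z - z.max())
    post /= post.sum()                    # posterior over classes
    expected = post @ np.asarray(cost)    # expected cost of each decision
    return int(np.argmin(expected)), expected

# Class 1 is somewhat more likely, so with equal costs we decide 1.
equal_cost_decision, _ = bayes_decision([0.0, 1.0], [0.5, 0.5], [[0, 1], [1, 0]])
# If wrongly deciding 1 when the truth is 0 costs ten times more, we decide 0.
skewed_cost_decision, _ = bayes_decision([0.0, 1.0], [0.5, 0.5], [[0, 10], [1, 0]])
```

The sketch relies on `cost` being fixed in advance. When the cost depends on something unknown at decision time, such as how often the keyword occurs in the test data, this simple rule no longer applies directly.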
0:43:20	and distances applications when asking a question of like a real world example
0:43:27	and so i noticed that you are a rating with the rest so why not
0:43:33	check and you decide matching less complicated going to the details
0:43:38	that's but i think in a lot of real world applications where i am not
0:43:43	upon that check so that maybe use a context i something else by something else
0:43:49	this is a way have a problem is in the case where you want you
0:43:53	why consistency be good with respect to not checks
0:43:59	right this question makes a semantic is it's interesting handed
0:44:04	and in both cases yes
0:44:07	so
0:44:08	if you use this logarithmic cost function
0:44:12	then
0:44:14	it tends to make the output of your recognizer
0:44:18	good over a very wide range of operating points
0:44:22	so especially if you just have two classes that you want to recognise
0:44:26	you can
0:44:27	you can
0:44:30	i showed that axis of the posterior between zero and one
0:44:33	but if you take the log odds of the posterior then that axis becomes infinite
0:44:38	so
0:44:39	then you can move that
0:44:40	that threshold
0:44:42	all the way from minus infinity to plus infinity so if you move it too
0:44:46	far
0:44:47	you're gonna
0:44:48	go into regions where there's no more data no more errors and then it doesn't make any
0:44:52	sense anymore
0:44:54	there's a limited range
0:44:57	on that axis
0:44:59	where you can do useful stuff
0:45:01	and the logarithmic cost function typically
0:45:06	evaluates nice and widely over that useful range
0:45:11	so
0:45:12	if for example
0:45:13	you would instead of taking the logarithm
0:45:17	you take
0:45:19	p squared and one minus p squared
0:45:22	squared loss sometimes called brier loss
0:45:27	you get a narrower coverage
0:45:29	so
0:45:31	that doesn't cover applications as widely as this case does
0:45:37	you can go even wider if you want
0:45:40	then you get a kind of an exponential loss function which is associated with boosting
0:45:44	in machine learning
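The three cost functions compared here can be written side by side for the binary case. This is a sketch using standard textbook forms, not formulas taken from the slides: `p` is the probability the recognizer assigned to the class that turned out to be true, and the exponential loss is expressed through the half log odds link commonly used for boosting.

```python
import numpy as np

def log_loss(p):
    # Logarithmic cost, the per-trial term of cross entropy.
    return -np.log(p)

def brier_loss(p):
    # Squared loss: bounded, so confident errors are penalized only mildly.
    return (1.0 - p) ** 2

def boosting_loss(p):
    # Exponential loss exp(-f) with f = 0.5 * log(p / (1 - p)).
    return np.sqrt((1.0 - p) / p)

# Penalty for a very confident wrong answer (true class given p = 1e-6).
confident_error = 1e-6
penalties = {
    "brier": brier_loss(confident_error),
    "log": log_loss(confident_error),
    "boosting": boosting_loss(confident_error),
}
```

The squared loss saturates near one for a confident error, the logarithmic loss grows without bound but slowly, and the exponential loss grows much faster, which is one way to see the narrower and wider coverage described in the answer.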
0:45:48	i have my interspeech paper
0:45:52	of this
0:45:54	explores that kind of thing in detail what if you have other cost functions not
0:45:59	just
0:46:00	cross entropy
0:46:03	so you should find a link to that on the
0:46:06	web page
0:46:09	so the answer is basically it's a very good idea to use cross entropy
0:46:13	if you if you optimise your recognizer
0:46:17	to have good cross entropy it's generally going to work for whatever you want to
0:46:21	use it for
0:46:25	right based on speaker

Calibration of binary and multiclass probabilistic classifiers in automatic speaker and language recognition

Applications Day

Niko Brummer (Agnitio)