0:00:15 | so good morning, thanks very much for inviting me

0:00:19 | as was just mentioned, i'm not in mainstream speech recognition

0:00:24 | so |

0:00:25 | i hope what i chose to talk about will be interesting to you

0:00:29 | so |

0:00:30 | before i go on, just

0:00:36 | a bit about agnitio first

0:00:39 | agnitio is a startup that's been around since about two thousand and four

0:00:44 | agnitio is latin for "i recognise", so agnitio specialises in automatic

0:00:50 | speaker recognition, just that

0:00:52 | and it sells |

0:00:54 | a range of products |

0:00:56 | that make use of this technology in many different countries in the world; it has its

0:01:01 | main office in madrid in spain |

0:01:03 | and also offices close to washington and california |

0:01:07 | and we have a small research lab in south africa so that's where i'm based |

0:01:15 | so just to make sure we're on the same page

0:01:18 | that we know what we're talking about |

0:01:21 | everybody knows |

0:01:22 | what |

0:01:23 | speech recognition is about |

0:01:25 | speaker recognition is who is speaking

0:01:29 | i often find it very difficult to explain to people: after i've explained for two

0:01:34 | minutes, they will still understand "speech recognition"

0:01:38 | and then of course |

0:01:39 | there's automatic language recognition, also called spoken language recognition

0:01:43 | just to tell |

0:01:44 | given a speech segment |

0:01:46 | which language was this |

0:01:49 | so |

0:01:51 | in speaker and also language recognition |

0:01:54 | we've inherited some stuff from speech recognition |

0:01:58 | mostly just the acoustic modeling so |

0:02:02 | the features mfccs and gmms |

0:02:05 | we do lag a bit behind with neural networks; we've tried, but they

0:02:10 | don't work as well as the gmms yet

0:02:15 | so |

0:02:16 | we take the acoustic modeling and then we do some relatively simple back-end recognition

0:02:22 | it's very simple compared to your language modeling and your decoders

0:02:27 | so |

0:02:28 | this talk is going to be deep rather than wide; i'm going to concentrate just

0:02:33 | on the back-end recognition part and just on a tiny aspect of that

0:02:39 | namely calibration

0:02:43 | and i hope maybe |

0:02:46 | you guys find something in this talk useful that you can maybe use

0:02:50 | so |

0:02:52 | what is calibration |

0:02:53 | it concerns the goodness of soft decisions; so you have a recognizer that outputs

0:02:58 | some kind of a soft decision a classifier if you want |

0:03:03 | and |

0:03:03 | then it can be understood in two senses |

0:03:07 | first of all, calibration is just: how good

0:03:10 | is the output of my recognizer |

0:03:12 | or |

0:03:13 | it's whatever you do to make it better; so if you make the output

0:03:17 | of your recognizer better, you're calibrating it

0:03:20 | so we'll talk about both

0:03:22 | so |

0:03:24 | i'm not expecting everybody to understand

0:03:28 | this diagram this is just |

0:03:30 | a road map of what we're going to talk about i'll come back to this |

0:03:34 | diagram |

0:03:35 | so |

0:03:38 | i'm going to motivate that if you want your recognizer output |

0:03:42 | a soft decision |

0:03:45 | likelihoods rather than posteriors |

0:03:47 | is what you want and |

0:03:50 | how to evaluate the goodness |

0:03:52 | of those outputs via cross entropy

0:03:55 | so the cross entropy gives you a calibration sensitive loss function to measure the goodness |

0:04:00 | of the output of the recognizer |

0:04:02 | then we can take the raw output and we can somehow calibrate it so

0:04:07 | i'll talk about some simple calibrators

0:04:10 | and then |

0:04:11 | you can have this kind of a feedback loop |

0:04:13 | to essentially optimize away the effect of the calibrator and then

0:04:19 | that gives you, in a sense, a calibration-insensitive loss

0:04:23 | which tells you |

0:04:24 | how well could i have done if my calibration, if my system had been optimally

0:04:29 | calibrated |

0:04:30 | and then you can compare the two |

0:04:32 | and that will tell you |

0:04:34 | how good was my calibration

0:04:38 | so let's start at the beginning

0:04:40 | so |

0:04:41 | the canonical speaker recognition problem |

0:04:46 | we usually view that as a two class classification problem so the input is a |

0:04:51 | pair of speech segments often just call the enrollment segment and the test segment |

0:04:56 | and then |

0:04:57 | the output: it's class one if the segments have the same speaker, or class two

0:05:02 | if they're different

0:05:04 | so as an example of a multiclass classifier |

0:05:09 | we take language recognition |

0:05:12 | so there we can define a number of language classes |

0:05:16 | and |

0:05:16 | if the french in the audience are wondering why they're not there but all their

0:05:22 | neighbours are: that's "some other language"

0:05:27 | so let's look at the

0:05:31 | output,

0:05:32 | the form of the output of

0:05:35 | a classifier or recognizer; the simplest form would be to just output

0:05:40 | a hard class label

0:05:43 | if you want to soft output that might be |

0:05:46 | posterior distribution |

0:05:47 | or we can go to the other side of bayes rule and output a likelihood

0:05:51 | distribution; i'm going to motivate that the last one is preferable

0:05:58 | hard decisions

0:05:59 | there are some people who want those

0:06:00 | but it's a bad idea unless error rate is really low |

0:06:05 | and a hard decision cannot make use of context, it cannot make use of independent prior information

0:06:10 | the posterior is, to end users, still intuitive: they understand what the posterior is telling them

0:06:17 | and it conveys confidence |

0:06:20 | so |

0:06:20 | you can

0:06:24 | recover from an error because you see it coming; you know when you might make errors

0:06:27 | you can make optimal

0:06:29 | minimum expected cost bayes decisions if you have a posterior, so that's a much more

0:06:33 | useful output

0:06:35 | the problem with the posterior is

0:06:39 | the prior is implicit and hardcoded inside the posterior; you can remove it by dividing it

0:06:45 | out, but then you also need to know what the prior was

0:06:48 | so |

0:06:49 | a cleaner type of output

0:06:51 | is just the likelihood

0:06:52 | and then you can afterwards supply any prior

0:06:59 | the only downside is it's somewhat harder to understand, especially for end users, but we're

0:07:04 | not end users so

0:07:05 | let's go with the likelihood

0:07:10 | in the end |

0:07:12 | for the applications that we're looking at there really isn't that much difference between the

0:07:16 | two: if you have the posterior and the prior you can go back to the likelihood

0:07:20 | or you can go the other way

0:07:23 | if |

0:07:24 | there's a small number of discrete classes you can always normalize the likelihood and you've got a

0:07:27 | posterior |
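that normalization can be sketched in a few lines (my own illustration, not from the slides): with a small discrete set of classes, a likelihood vector and a prior combine into a posterior by multiplying elementwise and renormalising.

```python
def posterior_from_likelihood(likelihood, prior):
    """Return P(class | data) from P(data | class) and P(class) via Bayes' rule."""
    joint = [lh * p for lh, p in zip(likelihood, prior)]
    total = sum(joint)
    return [j / total for j in joint]

# example: three classes, a flat prior versus a prior favouring class 0
lh = [0.8, 0.15, 0.05]                                 # likelihoods P(data | class)
flat = posterior_from_likelihood(lh, [1/3, 1/3, 1/3])  # flat prior
skew = posterior_from_likelihood(lh, [0.9, 0.05, 0.05])
```

with a flat prior the posterior is just the normalized likelihood; a skewed prior pulls the posterior toward the favoured class.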

0:07:30 | so |

0:07:31 | let's look at some examples to motivate why

0:07:35 | we use the likelihood

0:07:37 | so |

0:07:38 | a language recognizer would output a likelihood distribution across the number of languages

0:07:44 | but then |

0:07:44 | let's say we know we're in brno at the moment

0:07:48 | we're more likely to hear

0:07:51 | czech being spoken on the street, very unlikely to hear my home language afrikaans

0:07:57 | and |

0:07:57 | you combine these two sources of information |

0:08:00 | via bayes rule, and the

0:08:03 | the posterior then gives you the complete picture |

0:08:09 | maybe you could have a phone recognizer; the same sort of recipe applies

0:08:15 | it outputs a likelihood distribution

0:08:17 | and |

0:08:18 | the prior is the context in which you try to recognise that phone |

0:08:22 | and then the decoder combines everything |

0:08:24 | essentially forms a kind of posterior

0:08:28 | let's go to speaker recognition

0:08:31 | this is an idealised

0:08:34 | view

0:08:35 | of what a forensic speaker recognizer might do

0:08:38 | so |

0:08:41 | there was someone who

0:08:43 | was careless enough to get himself recorded while he was committing a crime

0:08:47 | the police have got hold of this speech sample

0:08:51 | there is the suspect |

0:08:53 | the police ask the suspect nicely to provide another speech sample; then you want to

0:08:58 | compare the two |

0:08:59 | is this the same person or not |

0:09:02 | and because there are just two classes you can conveniently form the likelihood ratio between

0:09:06 | those two possibilities |
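the odds-form arithmetic behind this can be made concrete (numbers invented purely for illustration, not from any real case): the recognizer reports how much more probable the evidence is under "same speaker" than "different speakers", and the prior odds come from all the other evidence.

```python
def posterior_odds(likelihood_ratio, prior_odds):
    # Bayes' rule in odds form: posterior odds = likelihood ratio x prior odds
    return likelihood_ratio * prior_odds

lr = 100.0            # evidence assumed 100x more likely under "same speaker"
prior = 1.0 / 1000.0  # the rest of the evidence makes "same speaker" a long shot
odds = posterior_odds(lr, prior)
p_same = odds / (1.0 + odds)   # convert odds back to a probability
```

even a strong likelihood ratio can leave a small posterior probability when the prior odds are very low, which is why the two roles are kept separate.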

0:09:09 | and then if you have a very nice

0:09:12 | bayesian courtroom, inside the court

0:09:14 | they would

0:09:16 | use

0:09:17 | the total effect of all the other evidence

0:09:20 | as a kind of prior

0:09:22 | and then |

0:09:24 | if you have a very clever judge, then the jury might act like bayes

0:09:28 | rule and they can combine these two sources of evidence so

0:09:32 | that's probably never really going to happen |

0:09:35 | but |

0:09:35 | still, for an engineer this is a useful parable

0:09:39 | to think: what should my output look like

0:09:42 | what should i be thinking about if i want this likelihood ratio that i output

0:09:47 | to be |

0:09:49 | to do its job as well as possible |

0:09:54 | so |

0:09:56 | this may be a subjective view, but in practice recognizers are often badly calibrated; in my

0:10:01 | experience, if you build a speaker or a language recognizer it's always badly calibrated

0:10:07 | you can redesign the thing and do what you want, it's going to be badly

0:10:11 | calibrated it might be very accurate |

0:10:14 | but |

0:10:15 | it's still badly calibrated so

0:10:18 | you need to |

0:10:20 | adjust its output |

0:10:21 | to get the full benefit of the output of the recognizer so |

0:10:26 | the tools we need |

0:10:28 | to make all this happen: first of all you need to measure the quality of the

0:10:31 | calibration |

0:10:32 | and then |

0:10:33 | you need some way to adjust it

0:10:37 | so |

0:10:38 | first let's talk about the measurement |

0:10:42 | so |

0:10:43 | calibration applies to both posteriors and likelihoods |

0:10:46 | it's easier to explain this whole thing in terms of posteriors and then later we'll |

0:10:50 | go back to the likelihoods |

0:10:53 | so i'll use

0:10:56 | two classes |

0:10:57 | as a running example, again because it's easier to

0:11:02 | explain, and then later we'll go to the multiclass case so

0:11:07 | there is a recognizer, represented by the symbol R, so the posteriors are

0:11:12 | conditioned on R because they're output by the recognizer

0:11:16 | so |

0:11:17 | the posterior tells you two things

0:11:20 | first is which class is favoured

0:11:23 | if the one element is greater than the other one, it wants to recognise class

0:11:27 | one in this case

0:11:29 | but then it also tells us the degree of confidence: how much more is the

0:11:33 | one element greater than the other one

0:11:35 | so we can form the ratio, or we can take the log of the ratio, or

0:11:38 | you could look at the entropy of the distribution

0:11:42 | anything like that will give you a measurement of the degree of

0:11:46 | confidence

0:11:49 | so |

0:11:50 | the question i'm trying to answer with this presentation is

0:11:54 | the recognizer outputs a posterior distribution |

0:11:58 | we also know, for this particular case, which of the two classes was really true

0:12:05 | was this a good posterior or not

0:12:08 | another example would be |

0:12:10 | a weather predictor says there's an eighty percent chance of rain tomorrow

0:12:15 | tomorrow arrives

0:12:17 | it doesn't rain

0:12:19 | how good was that

0:12:21 | how good was that prediction

0:12:24 | so first of all |

0:12:26 | if it says the one is greater than the other one

0:12:29 | and that was in the right direction, we know it favours the correct class so

0:12:34 | at least that aspect of the

0:12:36 | posterior was good

0:12:38 | how do we judge this degree of confidence

0:12:42 | what can we do |

0:12:44 | we don't have a reference posterior; we're not given that in practice

0:12:48 | we're just given the true class; what can we say about the posterior

0:12:54 | so |

0:12:56 | we'll design some penalty function

0:12:59 | so |

0:13:01 | what this graph is telling us:

0:13:02 | the

0:13:04 | recognizer outputs the posterior distribution,

0:13:07 | a posterior for each of the two classes

0:13:10 | and we know the true class, so then on the bottom axis we plot

0:13:15 | that,

0:13:15 | we plot that posterior distribution

0:13:19 | for a single case it would just be a single point on the x-axis

0:13:23 | and then |

0:13:24 | if the posterior for the true class was one that's good |

0:13:27 | it was very certain of the thing that really happened |

0:13:31 | but if it says something that had really happened

0:13:35 | is impossible according to the recognizer

0:13:39 | that's very bad, so we give it a high penalty, maybe even an infinite penalty

0:13:45 | and an infinite penalty

0:13:47 | might be a good idea

0:13:49 | in practice |

0:13:50 | if you make a wrong decision it can have arbitrarily bad consequences |

0:13:55 | if |

0:13:56 | you like playing russian roulette |

0:13:59 | you've got the gun in your hand |

0:14:01 | you form some posterior as to whether there's a round in the chamber

0:14:05 | at the moment

0:14:07 | and you pull the trigger

0:14:09 | the consequences of a bad posterior |

0:14:14 | can be arbitrarily bad, so i like this idea of the penalty going up to

0:14:19 | infinity |

0:14:21 | so |

0:14:22 | i've plotted two candidate functions here

0:14:28 | it's easy to see |

0:14:29 | this should be a monotonic function

0:14:32 | what should the shape be what principle should we used to design this penalty function |

0:14:38 | so |

0:14:40 | we'll take an engineering approach: we'll say, what do we want to use the

0:14:44 | output for, and how well does it do that

0:14:48 | so |

0:14:50 | what can we do with the posterior

0:14:52 | make minimum expected cost bayes decisions |

0:14:55 | so |

0:14:56 | in a speech recognizer it might be sending the phone posteriors into the decoder

0:15:01 | but in the end it's still going to make some decision; at some stage it's going to

0:15:05 | output a transcription, so in the end you're always making decisions

0:15:11 | so |

0:15:11 | and then we just ask |

0:15:13 | how well |

0:15:15 | does it make these decisions and then |

0:15:17 | that very same cost that you're optimising

0:15:20 | with the minimum expected cost bayes decision is gonna tell you how well you did

0:15:25 | so |

0:15:26 | good posteriors can be used to make cost effective decisions |

0:15:30 | but badly calibrated posteriors |

0:15:34 | may be underconfident, or overconfident in the wrong hypothesis, and that will eventually lead

0:15:41 | to a series of unnecessarily costly |

0:15:44 | errors |

0:15:47 | so let's look at decision cost function |

0:15:51 | so |

0:15:52 | the decision cost functions model the consequences of applying recognition technology in the real world |

0:15:58 | the real world is always more complex

0:16:02 | engineers like

0:16:03 | simple models that we can optimize

0:16:07 | so |

0:16:10 | this should be a very familiar idea

0:16:12 | we |

0:16:13 | we first look at the case of a hard decision

0:16:16 | so |

0:16:16 | the recognizer says it's class one or class two

0:16:20 | and we know that the true class is class one or class two, and then

0:16:23 | we assign some cost coefficient |

0:16:26 | so |

0:16:27 | in |

0:16:28 | this example

0:16:30 | i made the cost coefficients so that for the correct decisions

0:16:34 | there's zero

0:16:35 | and for errors there's a non-zero cost |

0:16:39 | so you might |

0:16:40 | want to work in terms of rewards |

0:16:44 | or you can even have a mixture of rewards and penalties; this is for example

0:16:50 | what's called the term-weighted value in keyword spotting

0:16:54 | that's a mixture of a reward and a penalty so

0:16:58 | in the end all those are equivalent; you can play around with these cost

0:17:02 | functions |

0:17:03 | and |

0:17:04 | for what we're gonna do |

0:17:06 | it's

0:17:07 | convenient

0:17:08 | to

0:17:09 | not use rewards and just to

0:17:13 | put the cost on the errors |

0:17:16 | so |

0:17:17 | now we apply it to a soft decision |

0:17:21 | so we let the recognizer output the posterior distribution |

0:17:25 | and then |

0:17:26 | when we evaluate its goodness |

0:17:28 | we make a minimum expected cost bayes decision; so the bayes decision is made without

0:17:33 | having the true class

0:17:36 | and then |

0:17:37 | we treat that as a hard decision and evaluate it with this cost matrix as before

0:17:43 | so |

0:17:44 | what we have now |

0:17:45 | is the goodness of the posterior |

0:17:48 | that we've output |

0:17:50 | very simple thing |
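a minimal sketch in code of this evaluation recipe (my own toy version, with made-up cost numbers): make the minimum expected cost bayes decision from the posterior alone, then score that hard decision against the true class with the cost coefficients.

```python
def bayes_decision(q1, c12, c21):
    """Decide class 1 or 2 from the posterior q1 = P(class 1).
    c12 = cost of deciding 1 when 2 is true; c21 = cost of deciding 2 when 1 is true."""
    q2 = 1.0 - q1
    cost_decide_1 = c12 * q2   # expected cost of deciding class 1
    cost_decide_2 = c21 * q1   # expected cost of deciding class 2
    return 1 if cost_decide_1 <= cost_decide_2 else 2

def decision_cost(q1, true_class, c12, c21):
    """Score the Bayes decision after the fact, with the true class revealed."""
    d = bayes_decision(q1, c12, c21)
    if d == true_class:
        return 0.0              # correct decisions cost nothing
    return c12 if d == 1 else c21

# with equal error costs the Bayes threshold sits at q1 = 0.5
d = bayes_decision(0.7, 1.0, 1.0)   # decides class 1
```

note the decision is taken without the truth; the cost matrix only enters at scoring time, exactly as in the recipe above.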

0:17:53 | let's see

0:17:54 | what we've achieved so

0:17:57 | a couple of slides ago |

0:18:00 | i tried to convince you that

0:18:02 | this kind of penalty function on the left is what we want |

0:18:06 | what we've achieved |

0:18:07 | is this step function |

0:18:09 | so |

0:18:10 | there's a threshold on the posterior |

0:18:13 | which is a function of the cost coefficients |

0:18:16 | but the cost is either some non-zero cost or else zero

0:18:20 | so at least the step function has the right |

0:18:24 | sense |

0:18:25 | it's bigger where it needs to be and smaller where it needs to be

0:18:28 | but |

0:18:30 | it's very crude, and in effect it's only evaluating the goodness of your posterior

0:18:35 | at a single point

0:18:37 | it doesn't say anything about making decisions at any other operating point

0:18:42 | so |

0:18:42 | we need to find a smoother solution |

0:18:46 | so |

0:18:47 | in order to smooth it, let's simplify just a little bit

0:18:51 | so |

0:18:52 | the bayes decision threshold |

0:18:54 | is a simple ratio of the costs

0:18:58 | so we might think of it in terms of: given the costs,

0:19:02 | compute the threshold

0:19:04 | but let's do it the other way round

0:19:06 | so let's say |

0:19:08 | let's choose the threshold at which we're going to evaluate; we're free

0:19:12 | to choose any threshold

0:19:14 | and then |

0:19:15 | we make

0:19:17 | the costs a function of the threshold, and if you choose these simple reciprocal functions

0:19:24 | we're still |

0:19:25 | applying the above equation |

0:19:27 | so |

0:19:28 | the above equation is satisfied

0:19:31 | so let's look at this graphically

0:19:34 | so |

0:19:35 | what we've achieved |

0:19:37 | the recognizer outputs q1, the posterior for class one; for q2 you

0:19:42 | just flip the axis and you'll get q2

0:19:45 | so |

0:19:46 | the penalty when class one is true would be the red curve, and the penalty

0:19:50 | when class two is true

0:19:51 | would be the blue curve, and

0:19:54 | the cost coefficients are a function of the threshold which we can |

0:19:59 | adjust at will, so let's do that

0:20:02 | we can move the threshold and with it |

0:20:05 | the cost coefficients |

0:20:07 | with which we're gonna be penalised

0:20:09 | will

0:20:11 | also change; if you press

0:20:14 | the threshold right against zero or one the penalty will be infinite, but

0:20:18 | that's good, because then you've shot yourself in the head

0:20:24 | so |

0:20:25 | by moving the threshold while we're evaluating the goodness of the posterior, we're in fact

0:20:29 | exercising the decision-making ability of the posterior over its full range

0:20:35 | so we're almost done |

0:20:40 | let's just look at another view

0:20:42 | this is the same thing |

0:20:45 | just another view we have the recognizer output the posterior |

0:20:48 | the posterior is compared against |

0:20:51 | the threshold |

0:20:52 | the threshold is a parameter chosen by the evaluators |

0:20:57 | and then

0:20:58 | you also need to know the true class, and it outputs the cost so

0:21:01 | note

0:21:03 | it is a function of three variables

0:21:06 | the recognizer output, the true class, and this parameter theta

0:21:11 | so now let's integrate out theta

0:21:15 | so |

0:21:16 | the integrand here is the step cost function which i plotted a few

0:21:20 | slides back

0:21:22 | on the left hand side we get |

0:21:25 | a cost function which is now independent of the threshold, because we've integrated

0:21:30 | over the full range of the threshold

0:21:32 | and |

0:21:33 | that turns out to be just this logarithmic cost function |

0:21:37 | so |

0:21:38 | you |

0:21:39 | take the logarithm of

0:21:42 | the posterior for the true class

0:21:45 | and |

0:21:47 | that |

0:21:47 | is the goodness of the posterior of the recognizer so |

0:21:52 | and it has

0:21:53 | this nice smooth shape |

0:21:56 | which we were looking for |

0:22:01 | so |

0:22:01 | that's two classes |

0:22:04 | now we're going to generalise to multiclass |

0:22:07 | so multi class |

0:22:09 | is a lot trickier |

0:22:11 | but the same general principles apply

0:22:14 | so |

0:22:15 | we're still gonna work with minimum expected cost bayes decisions

0:22:20 | but in this case we'll use a generalized threshold, which i'll plot for you on the

0:22:24 | next slide |

0:22:25 | and again we're going to integrate out the threshold and get a similar result

0:22:31 | so |

0:22:34 | in this graph we show the

0:22:36 | output of a three-class

0:22:39 | recognizer

0:22:40 | i chose three classes because i can plot it here on this nice flat screen

0:22:45 | so Q one |

0:22:46 | is the posterior for class one |

0:22:49 | on the vertical axis, q2, the posterior for class two

0:22:53 | and q3 we don't see, but it's just the complement of the other two

0:22:57 | so everything needs to live inside the simplex |

0:23:01 | then the |

0:23:03 | the tricky part |

0:23:05 | is |

0:23:05 | we now define a kind of a generalized threshold; so this threshold

0:23:11 | has three components, theta one, theta two and theta three

0:23:14 | and we constrain them to sum to one so this threshold |

0:23:18 | is defined |

0:23:19 | by this point where the lines meet |

0:23:22 | and that also lives inside the same simplex

0:23:25 | and now again |

0:23:27 | we've chosen the threshold |

0:23:28 | then we choose the cost function so the cost function again |

0:23:32 | is this little equation at the bottom; again it's just the reciprocal of the threshold

0:23:36 | coefficients |

0:23:39 | and again |

0:23:40 | we can play around |

0:23:42 | we can move the |

0:23:43 | the |

0:23:44 | threshold |

0:23:46 | around

0:23:47 | the interior of the simplex |

0:23:49 | we can exercise the decision making ability |

0:23:52 | of the |

0:23:54 | of the recognizer |

0:23:57 | i should have told you:

0:24:01 | these

0:24:03 | lines, the structure of the threshold, that is just the consequence of making

0:24:07 | the minimum expected cost bayes decision |

0:24:10 | so |

0:24:11 | once you assigned those cost functions |

0:24:14 | that's what the threshold is gonna look like so |

0:24:16 | again, if q1 is large you're gonna be in region R1 and choose

0:24:20 | class one; in region R2 you choose class two

0:24:24 | and R3, if the other two are small, we're gonna choose class three

0:24:28 | so |

0:24:31 | again |

0:24:32 | we've seen that we can move the threshold around; now we can integrate it out

0:24:37 | so |

0:24:38 | the integral will cover |

0:24:40 | several slides |

0:24:42 | which i'm not going to show you |

0:24:44 | but the same kind of thing applies we just integrate out the |

0:24:51 | threshold |

0:24:53 | out of the step cost function

0:24:55 | and lo and behold |

0:24:57 | we get |

0:24:58 | the logarithmic function again |

0:25:03 | so |

0:25:04 | the whole recipe can be summarized like this |

0:25:08 | again the recognizer output of posterior distribution |

0:25:12 | in other words an element of the posterior for each of the classes |

0:25:17 | when we know what the true class is |

0:25:19 | we select that component and we just apply logarithm to it |

0:25:24 | so |

0:25:24 | if the recognizer says the

0:25:27 | probability of the true class is one, that's very good; the penalty is zero

0:25:31 | if it says the

0:25:34 | probability of the true class is zero, that's very bad; the penalty is infinite

0:25:42 | so |

0:25:45 | all of the preceding was for just one example |

0:25:48 | one input one output |

0:25:50 | if you have a whole database of data which is supervised |

0:25:54 | you can apply this

0:25:56 | to

0:25:56 | the whole database, and you just average the logarithmic cost

0:26:01 | and that is cross entropy, which i trust you know very well
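the whole recipe fits in a few lines (a sketch of the general idea, not any particular toolkit): for each supervised example, pick out the posterior the recognizer assigned to the class that was actually true, take minus its logarithm, and average over the database.

```python
import math

def cross_entropy(posteriors, true_classes):
    """posteriors: list of posterior distributions (one per example);
    true_classes: list of true class indices. Returns average logarithmic cost."""
    penalties = [-math.log(q[k]) for q, k in zip(posteriors, true_classes)]
    return sum(penalties) / len(penalties)

# confident and right -> small cost; confident and wrong -> large cost
good = cross_entropy([[0.9, 0.1], [0.8, 0.2]], [0, 0])
bad  = cross_entropy([[0.9, 0.1], [0.8, 0.2]], [1, 1])
```

the infinite penalty discussed earlier falls out for free: a posterior of exactly zero on the true class makes the logarithm blow up.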

0:26:05 | so |

0:26:07 | that's perhaps the most well-known discriminative training objective, not just in speech recognition but in

0:26:13 | all of machine learning |

0:26:15 | and it forms the basis for all kinds of other things with other names like |

0:26:19 | mmi logistic regression |

0:26:24 | it's perhaps not so well known |

0:26:26 | that it is a way of measuring calibration

0:26:29 | you see that appearing from time to time for example |

0:26:34 | in this book on gaussian processes they use cross entropy essentially

0:26:40 | to measure calibration

0:26:44 | and in the statistics literature

0:26:48 | this thing is referred to as the logarithmic proper scoring rule |

0:26:52 | you get a whole bunch of other proper scoring rules which |

0:26:55 | can be derived in a similar way; you just need to change that integral, but

0:26:59 | the |

0:27:00 | the logarithmic one is very simple and generally just a good idea to use |

0:27:08 | so let's get back to the likelihoods |

0:27:11 | this is going to be very short and simple |

0:27:16 | we start with the recipe for the posterior which i showed just now

0:27:19 | and now we just flip to the other side of bayes rule |

0:27:23 | so now we ask the recognizer: give me a likelihood distribution instead of a

0:27:28 | posterior distribution

0:27:29 | and when evaluating its goodness |

0:27:32 | we just send them through the softmax, or bayes rule if you will

0:27:36 | and then apply the logarithm |

0:27:38 | and then |

0:27:40 | we've also provided it with the prior

0:27:42 | so |

0:27:43 | notice |

0:27:44 | that we now need to supply a prior distribution

0:27:48 | as a parameter to this evaluation recipe |

0:27:51 | so |

0:27:52 | you're free to choose whatever prior

0:27:56 | the prior does not have to reflect the proportions of

0:27:59 | the classes in your data so if you want to emphasise one class rather than |

0:28:04 | the other |

0:28:06 | for example, that was spoken about this morning:

0:28:09 | emphasising some classes,

0:28:11 | some rare data

0:28:12 | you can do that of course |

0:28:14 | if you have data of one class, multiplying that by some

0:28:18 | or other number isn't gonna make

0:28:20 | data appear magically

0:28:23 | but |

0:28:24 | the prior does give you some control over |

0:28:27 | what

0:28:29 | you want to emphasise

0:28:34 | so that's |

0:28:36 | let's get back to |

0:28:37 | the graph that we showed earlier |

0:28:40 | so |

0:28:41 | i've motivated that

0:28:43 | we want the recognizer to output likelihoods

0:28:47 | that cross entropy forms a nice calibration sensitive |

0:28:52 | cost function to tell you how well it's doing |

0:28:55 | now we can also send the output of the recognizer into a simple calibrator so

0:29:01 | a calibrator

0:29:02 | can be anything |

0:29:05 | in general it's a good idea to make it very simple |

0:29:08 | you spend a whole lot of |

0:29:10 | energy on building a strong recognizer; the calibrator should be

0:29:14 | simple and easy to do

0:29:17 | but you can gain a lot out of it

0:29:20 | so |

0:29:21 | what this feedback loop does, as i explained before:

0:29:24 | it trains, or optimizes, the calibrator to tell you how well could i

0:29:29 | have done

0:29:30 | if my calibration originally had been

0:29:34 | good

0:29:34 | and then

0:29:35 | you can compare the two

0:29:37 | and then |

0:29:39 | you can |

0:29:41 | the difference you can call the calibration loss |

0:29:44 | if you build a recognizer |

0:29:47 | and the output is |

0:29:49 | well calibrated, then the calibration loss will be small and you can be very happy

0:29:53 | otherwise you have to go and |

0:29:56 | apply some calibrator before you deploy the recognizer, right

0:30:04 | so |

0:30:07 | let's look, briefly, at how to

0:30:10 | calibrate

0:30:12 | so |

0:30:13 | the theory |

0:30:14 | is very basic |

0:30:17 | let's say

0:30:19 | we've got

0:30:20 | some |

0:30:21 | basic recognizer |

0:30:22 | which |

0:30:24 | outputs class likelihoods so |

0:30:28 | then we just |

0:30:30 | we roll all the likelihoods into one vector; call it a likelihood distribution

0:30:35 | and then we say well |

0:30:38 | we now |

0:30:39 | we've mentioned that |

0:30:41 | these likelihoods are not well calibrated; they don't make good bayes decisions

0:30:46 | so let's put another probabilistic modeling step on top of that |

0:30:51 | so now

0:30:52 | the status of this likelihood vector is just that of a feature, or a

0:30:56 | score if you want

0:30:59 | or the original recognizer might have been an svm; the svm doesn't even pretend

0:31:05 | to produce calibrated likelihoods |

0:31:08 | the output is just the score that's fine we can just |

0:31:12 | use that as the input to the next modelling stage so |

0:31:18 | you have complete freedom |

0:31:19 | of |

0:31:20 | what you going to use for the next modelling stage |

0:31:23 | it can be parametric could be non parametric could be more or less bayesian |

0:31:30 | it can be discriminative, it can be generative

0:31:34 | as long as |

0:31:35 | as long as it works |

0:31:38 | so |

0:31:41 | i've tried and tested |

0:31:43 | various |

0:31:44 | calibration strategies |

0:31:47 | the one i'm showing you |

0:31:49 | is still my favourite; it's very simple

0:31:52 | so |

0:31:53 | you |

0:31:54 | take the log likelihoods, you scale them with a class-independent scale factor and

0:31:59 | you shift them with a class-dependent

0:32:01 | offset |

0:32:03 | and that gives you a recalibrated |

0:32:08 | likelihood recalibrated log likelihood |

0:32:12 | so we train the |

0:32:15 | coefficients, the scale and the offsets; we train them discriminatively

0:32:18 | and typically using again |

0:32:21 | cross entropy average logarithmic cost |

0:32:23 | and because |

0:32:25 | the cross entropy |

0:32:28 | optimizes calibration |

0:32:30 | that is why this recipe optimizes calibration; this recipe is discriminative, but i've worked with

0:32:35 | generative ones as well and they also work
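a toy version of this calibrator (my own sketch, not agnitio's code): a shared scale and per-class offsets applied to the raw log likelihoods, trained by plain gradient descent on the average logarithmic cost (cross entropy), assuming a flat prior.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train_calibrator(loglik_vectors, labels, steps=1000, lr=0.05):
    """Fit calibrated score_j = scale * loglik_j + offset_j by minimising
    cross entropy (flat prior). Returns (scale, offsets)."""
    n_classes = len(loglik_vectors[0])
    scale, offsets = 1.0, [0.0] * n_classes
    n = len(labels)
    for _ in range(steps):
        g_scale, g_off = 0.0, [0.0] * n_classes
        for llks, k in zip(loglik_vectors, labels):
            post = softmax([scale * l + o for l, o in zip(llks, offsets)])
            for j in range(n_classes):
                err = post[j] - (1.0 if j == k else 0.0)  # dCE/dscore_j
                g_off[j] += err
                g_scale += err * llks[j]
        scale -= lr * g_scale / n
        offsets = [o - lr * g / n for o, g in zip(offsets, g_off)]
    return scale, offsets
```

an overconfident recognizer typically ends up with a trained scale below one, which is exactly the "cooling" effect a calibrator is supposed to provide.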

0:32:38 | so |

0:32:39 | i might just mention that |

0:32:42 | for example if you're doing automatic |

0:32:45 | language recognition you might |

0:32:47 | extract what we call an i-vector |

0:32:50 | so the i-vector represents the whole |

0:32:53 | input segment of speech

0:32:55 | and then you can just go and do a large multi class logistic regression |

0:33:00 | and that will output likelihoods
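As a rough illustration of that multiclass logistic regression step (random 2-D points stand in for real i-vectors, which are typically a few hundred dimensional; the data and learning rate here are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins for i-vectors: one fixed-length vector per utterance,
# 50 utterances for each of three "languages" with separated class means
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)

W = np.zeros((2, 3))
b = np.zeros(3)
for _ in range(500):                       # gradient descent on multiclass cross-entropy
    z = X @ W + b
    z -= z.max(axis=1, keepdims=True)      # numerical stability for softmax
    P = np.exp(z)
    P /= P.sum(axis=1, keepdims=True)      # row-wise softmax: class posteriors
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0         # gradient of cross-entropy w.r.t. z
    W -= 0.1 * (X.T @ G) / len(y)
    b -= 0.1 * G.mean(axis=0)

accuracy = (np.argmax(X @ W + b, axis=1) == y).mean()
```

The rows of `P` are the likelihood-like outputs the talk refers to; on this easy synthetic data the classifier separates the classes almost perfectly.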

0:33:02 | so |

0:33:03 | that already uses cross entropy as an objective function so why would you need to recalibrate

0:33:08 | the output

0:33:10 | so the problem is |

0:33:11 | to make the large logistic regression work

0:33:14 | well you need to regularize and the regularization

0:33:18 | will typically skew the calibration because now we're not

0:33:22 | optimising

0:33:23 | for

0:33:26 | minimum expected cost bayes decisions anymore so regularization is necessary but it skews calibration so

0:33:33 | in practice |

0:33:35 | it's a good idea to

0:33:37 | hold out some data

0:33:40 | to use for calibration so |

0:33:42 | on part of your data you train your original recognizer

0:33:46 | and the held-out set you use for training your calibrator

0:33:49 | so in practice we found in speaker and in language recognition |

0:33:54 | this kind of recipe |

0:33:56 | works very well |
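The held-out recipe above can be sketched as follows (all three function arguments are hypothetical stand-ins for whatever recognizer and calibrator you actually use):

```python
def train_with_calibration(data, labels, train_recognizer, score, train_calibrator,
                           held_out_fraction=0.2):
    # split: the first part trains the recognizer,
    # the held-out tail trains the calibrator on the recognizer's raw scores
    n_train = int(len(data) * (1.0 - held_out_fraction))
    recognizer = train_recognizer(data[:n_train], labels[:n_train])
    held_out_scores = [score(recognizer, x) for x in data[n_train:]]
    calibrator = train_calibrator(held_out_scores, labels[n_train:])
    return recognizer, calibrator
```

At test time you run the recognizer to get raw scores and then pass them through the calibrator to get calibrated log likelihoods.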

0:33:59 | and then this recipe is just another form of logistic regression

0:34:04 | just a very much constrained one

0:34:06 | you can of course |

0:34:07 | you can multiply with a full matrix instead of the simple scale factor if you want

0:34:10 | that would then be unconstrained logistic regression

0:34:14 | that also works but you have to be a bit more careful and you need enough data

0:34:19 | in general this simple recipe is very safe and usually very effective

0:34:28 | so |

0:34:29 | i'll just give you one real-world example well not quite real world it's the

0:34:34 | nist evaluation

0:34:38 | so almost real world

0:34:41 | so |

0:34:42 | we look at an example of the two thousand and seven nist language recognition evaluation

0:34:49 | we look at the original accuracy of the

0:34:52 | systems that were competing in this evaluation |

0:34:55 | and then we look at the improvement after recalibration with the recipe i've just

0:34:59 | shown |

0:35:01 | so |

0:35:01 | on the vertical axis is the evaluation criterion which was defined specifically for the language |

0:35:08 | recognition evaluation it's a little bit too complicated to explain here but it's enough to know

0:35:14 | this is a calibration-sensitive criterion you do better if your calibration is better

0:35:20 | and |

0:35:22 | lower is better it's a cost function

0:35:25 | so |

0:35:26 | the blue ones are the original submissions and after being recalibrated

0:35:31 | you get an improvement in all the systems so

0:35:35 | i must mention that the recalibration was done on |

0:35:39 | not on the evaluation data but on some independent calibration data so this is not

0:35:45 | a cheating recalibration

0:35:50 | so |

0:35:52 | we're done

0:35:54 | time to summarize |

0:35:56 | so |

0:36:00 | the job of |

0:36:02 | posteriors or likelihoods |

0:36:04 | is in the end to make cost-effective decisions

0:36:07 | if we're going to use recognizers for anything in the end

0:36:10 | they make decisions they output something hard or they do some action and those are

0:36:16 | all decisions

0:36:17 | so |

0:36:20 | the bayes cost

0:36:21 | tells us how good they are

0:36:23 | if we want them to minimize cost that cost tells us how good they are

0:36:28 | and |

0:36:29 | cross entropy is just the representation |

0:36:32 | of that same cost it's just averaged over a range of operating points

0:36:38 | and calibration can be measured and improved |

0:36:44 | i've put this presentation at the following url if you want to find

0:36:48 | it |

0:36:49 | i have some of my |

0:36:51 | publications on calibration and some code as well there are some matlab toolkits

0:36:57 | at the next url |

0:36:58 | and |

0:37:00 | there's also the url

0:37:02 | of agnitio

0:37:03 | so |

0:37:04 | with

0:37:07 | this

0:37:08 | on the screen here

0:37:11 | if somebody goes away and wonders whether

0:37:14 | he's got his recognizers well calibrated

0:37:19 | please go and try this recipe

0:37:22 | it can tell you how good your calibration is so that's my take-home message

0:37:29 | we probably have time for some questions

0:37:40 | any questions

0:37:44 | how generically can this be done

0:37:47 | in terms of the number of classes

0:37:49 | do all these techniques

0:37:51 | work if you want thousands of classes

0:37:54 | i honestly don't know |

0:37:56 | in language recognition |

0:37:59 | we've |

0:38:01 | used it with less than thirty languages

0:38:03 | so |

0:38:07 | of course i think if you have lots of data like you guys have it has the

0:38:11 | potential to work for very many classes but

0:38:16 | i think |

0:38:17 | if you don't have enough data per class you'll be in trouble

0:38:32 | thanks for the talk in the language id area we typically focus a lot on

0:38:39 | found data

0:38:41 | and quite often across languages you'll have a mismatch in the amount of

0:38:45 | data

0:38:46 | so differences in training

0:38:49 | i think there was a lot of discussion yesterday on low-resource languages i'm expecting

0:38:54 | that you probably also see varying amounts of data there could you comment on

0:38:58 | how some of the folks in the language id area might try to address

0:39:04 | varying amounts of data for improving language id

0:39:11 | right in this slide before the likelihoods

0:39:16 | i had the prior which you can choose so you can use that prior to

0:39:20 | essentially weight

0:39:22 | the

0:39:26 | the data so that you can weight the classes which are not well represented

0:39:30 | you can reweight them so that to the cross entropy it looks as if there's

0:39:34 | more of

0:39:35 | that class than there really is so

0:39:38 | of course that doesn't magically give you more data

0:39:42 | so |

0:39:45 | if you

0:39:46 | look at it

0:39:47 | in the end

0:39:49 | cross entropy really just measures error rate

0:39:52 | as i showed you cross entropy is constructed with

0:39:57 | a step function so it's counting errors cross entropy is counting errors

0:40:03 | so |

0:40:04 | if there are very few errors then it's going to have a bad estimate of the

0:40:07 | error rate |

0:40:09 | so |

0:40:09 | by multiplying |

0:40:13 | that error rate |

0:40:14 | which has |

0:40:16 | which is inaccurate with a large number you're going to multiply that inaccuracy so

0:40:23 | you should use that kind of reweighting with care
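One way to realize the reweighting described in this answer: weight each trial so that, as far as the objective is concerned, every class occurs with a chosen target prior regardless of its count in the data (a sketch; the function name is invented):

```python
import math

def reweighted_cross_entropy(posteriors, labels, target_prior):
    # weight for a trial of class k: target_prior[k] / count[k],
    # so under-represented classes count for more in the average
    counts = [labels.count(k) for k in range(len(target_prior))]
    total, norm = 0.0, 0.0
    for post, k in zip(posteriors, labels):
        w = target_prior[k] / counts[k]
        total += -w * math.log(post[k])
        norm += w
    return total / norm
```

As the speaker warns, a minority class gets a large weight, so its noisy error estimate is amplified along with it; the reweighting changes the emphasis but does not create data.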

0:40:32 | well here in asr

0:40:35 | life is a bit more difficult than in speaker id or language id

0:40:38 | because there basically you need

0:40:40 | to produce one decision per frame or per utterance

0:40:44 | so normally people would play with actually also segmenting the output by recognizing the

0:40:50 | chunks computing posteriors generating lattices and this kind of stuff

0:40:55 | imagine that there is an asr phd student coming to you

0:41:00 | asking you what you find wrong with

0:41:03 | what all these folks are doing what would be the first thing that you would

0:41:07 | take from your area like a calibration perspective what would you advise

0:41:13 | i would need to go and study

0:41:16 | speech recognition more carefully |

0:41:19 | before |

0:41:21 | before i would be able to answer that question |

0:41:32 | i thought the more obvious application would be in score normalisation for keyword search

0:41:37 | but i mean people are using discriminative techniques to sort of estimate the probability

0:41:43 | of error and weighting them in using normalized scores any thought about

0:41:48 | how this could be applied to that application

0:41:53 | well there are additional complications

0:41:56 | so the

0:41:59 | term-weighted cost function

0:42:02 | has this nasty little thing that keeps on jumping around

0:42:08 | in |

0:42:10 | in what i showed you here |

0:42:13 | we assumed the cost function is known

0:42:15 | so |

0:42:17 | you can make minimum expected cost bayes decisions if you know what the cost is |

0:42:21 | in the term-weighted thing

0:42:25 | the cost depends on how many times the keyword is in the test data |

0:42:30 | and |

0:42:31 | i don't know that so that complicates matters considerably |

0:42:44 | you would still do very well by going to calibrate the output of your

0:42:50 | recognizer |

0:42:52 | but |

0:42:53 | once you've got that likelihood |

0:42:55 | what are you gonna do then |

0:42:57 | to produce your final output that you're going to send

0:43:01 | to the evaluator |

0:43:02 | that gets complicated |

0:43:04 | and there's all kinds of normalizations and things involved |

0:43:20 | this is a question about applications like a real-world example

0:43:27 | i noticed that you evaluate with a fixed cost so why not

0:43:33 | just pick a cost matching the application without going into the details

0:43:38 | but i think in a lot of real-world applications you don't know the operating

0:43:43 | point in advance maybe the context or something else changes it

0:43:49 | so the question is in the case where you don't know the operating point

0:43:53 | will the calibration still be good with respect to other costs

0:43:59 | right this question makes sense and it's interesting indeed

0:44:04 | and in both cases yes

0:44:07 | so |

0:44:08 | if you use this logarithmic cost function |

0:44:12 | then |

0:44:14 | it tends to make the output of your recognizer

0:44:18 | good over a very wide range of operating points |

0:44:22 | so especially if you just have two classes that you want to recognise |

0:44:26 | you can |

0:44:27 | you can |

0:44:30 | i showed that axis of the posterior between zero and one

0:44:33 | but if you take the log odds of the posterior then that axis becomes infinite

0:44:38 | so |

0:44:39 | then you can move that |

0:44:40 | that threshold |

0:44:42 | all the way from minus infinity to plus infinity so if you move it too

0:44:46 | far

0:44:47 | you're

0:44:48 | going to regions where there's no more data no more errors and it doesn't make any

0:44:52 | sense anymore

0:44:54 | there's a limited range |

0:44:57 | on that axis |

0:44:59 | where you can do useful stuff |

0:45:01 | and the logarithmic cost function typically |

0:45:06 | evaluates nicely and widely over that useful range

0:45:11 | so |

0:45:12 | if for example |

0:45:13 | instead of taking the logarithm

0:45:17 | you take

0:45:19 | one minus the posterior squared

0:45:22 | the squared loss sometimes called the brier loss

0:45:27 | you get a narrower coverage

0:45:29 | so |

0:45:31 | that doesn't cover applications as widely as this one does

0:45:37 | you can go even wider if you want

0:45:40 | then you get a kind of an exponential loss function which is associated with boosting |

0:45:44 | in machine learning |
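The three losses mentioned here, written as functions of p, the posterior assigned to the true class (the exponential form is one common parameterization of the boosting loss in terms of p; treat the exact formulas as illustrative):

```python
import math

# proper-scoring-rule losses of p = posterior of the true class;
# they differ in how heavily they weight extreme operating points

def brier_loss(p):      # squared (Brier) loss: narrower coverage of operating points
    return (1.0 - p) ** 2

def log_loss(p):        # logarithmic cost: wide coverage of operating points
    return -math.log(p)

def exp_loss(p):        # exponential loss, as in boosting: even wider coverage
    return math.sqrt((1.0 - p) / p)
```

Going from Brier to logarithmic to exponential, confident mistakes (small p) are penalized more and more heavily, which is what covering a wider range of operating points amounts to.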

0:45:48 | i have an interspeech paper

0:45:52 | that

0:45:54 | explores that kind of thing in detail what if you have other cost functions not

0:45:59 | just

0:46:00 | cross entropy |

0:46:03 | so you should find a link to that on the

0:46:06 | web page

0:46:09 | so the answer is basically it's a very good idea to use cross entropy |

0:46:13 | if you optimise your recognizer

0:46:17 | to have good cross entropy it's generally going to work for whatever you want to |

0:46:21 | use it for |

0:46:25 | right let's thank the speaker