0:00:00 | thank you all for attending this talk |
---|---|

0:00:03 | i am a researcher at the computer science institute

0:00:07 | which is a joint unit of the university of buenos aires and conicet in argentina

0:00:13 | today i'll be talking about the issue of calibration in speaker verification

0:00:17 | and hopefully by the end of the talk i will have convinced you that

0:00:20 | calibration is an important issue if you were not already convinced

0:00:26 | so the talk will be organised this way first i'm gonna define calibration

0:00:31 | and give an intuition

0:00:35 | then

0:00:36 | talk about why we should care about it

0:00:39 | which is related also to how to measure it

0:00:43 | and if we find out that calibration is bad in a certain system then

0:00:47 | how to fix it

0:00:49 | and then finally i'll talk about issues of robustness of calibration for speaker verification |

0:00:55 | the main task

0:00:58 | on which i will be giving examples is speaker verification

0:01:02 | and i assume that the audience

0:01:03 | here

0:01:05 | knows

0:01:06 | this task well but just in case

0:01:10 | it's a binary classification task |

0:01:12 | where the samples

0:01:13 | are given by

0:01:16 | two waveforms or two sets of waveforms

0:01:18 | that we need to compare to decide whether

0:01:21 | they come from the same speaker or from different speakers

0:01:26 | so the task is binary classification so much of what i'm gonna say

0:01:31 | applies to any binary classification task and not just

0:01:34 | speaker verification

0:01:36 | okay so what is calibration |

0:01:39 | let's say we want to build a system that predicts the probability that it will

0:01:43 | rain within the next hour

0:01:45 | based only on a picture of the sky

0:01:47 | so these are the inputs

0:01:49 | if we see this picture then we would expect the system to output a

0:01:53 | low probability say point one

0:01:55 | while if it saw this picture then we would expect it to output a much

0:01:58 | higher probability of rain

0:02:00 | say closer to one

0:02:03 | now we will say that the system is well calibrated if

0:02:06 | the values that are output by the system coincide

0:02:10 | with what we see

0:02:11 | in the data

0:02:15 | so |

0:02:16 | a well calibrated score

0:02:18 | should reflect the uncertainty of the system

0:02:21 | for example to be concrete

0:02:24 | for all the samples

0:02:25 | that get a score

0:02:26 | of point eight

0:02:27 | from the system

0:02:29 | we would expect eighty percent of them to be labeled correctly

0:02:32 | that's what a score of point eight means

0:02:35 | if that happens

0:02:37 | then we will say that the system is well calibrated

0:02:40 | and here we can see an example of a diagram that is used in many tasks

0:02:44 | not much in speaker verification actually but

0:02:48 | i think it's very intuitive for

0:02:50 | understanding calibration

0:02:53 | it's called the reliability diagram

0:02:55 | and basically what it shows is the posteriors

0:02:59 | from a system that was run on certain data

0:03:02 | the posteriors that the system gave

0:03:04 | for the class

0:03:05 | that it predicted

0:03:07 | so for example for

0:03:08 | this bin

0:03:11 | we have all the samples for which the system gave a posterior between point eight

0:03:15 | and the next bin edge

0:03:17 | and what the

0:03:18 | diagram shows is the accuracy

0:03:21 | on those samples

0:03:23 | so if the

0:03:24 | system was calibrated then we would expect these to follow the

0:03:28 | diagonal

0:03:29 | because

0:03:30 | what the system predicted

0:03:31 | would coincide with the accuracy that we see
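the binning just described is easy to sketch in code; this is an illustrative example (function and variable names are my own, not from the talk's slides):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    # for each confidence bin, the accuracy of the predictions whose
    # top-class posterior fell in that bin; NaN marks empty bins
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    accs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        accs.append(float(np.mean(correct[mask])) if mask.any() else float("nan"))
    return accs
```

for a calibrated system each returned accuracy would be close to the center of its bin, i.e. the bars of the diagram would follow the diagonal.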

0:03:35 | in this specific case what we actually see is that the system was correct more times

0:03:41 | than it thought it would be

0:03:43 | which corresponds to a system that underestimates its confidence

0:03:49 | now i took this diagram

0:03:52 | from a paper from twenty seventeen

0:03:56 | which actually studies the issue of calibration in

0:03:59 | different architectures

0:04:01 | so it compares on a task

0:04:03 | called cifar-100 which is image classification with a hundred different classes

0:04:08 | and it compares

0:04:09 | in this plot that i already showed

0:04:12 | a cnn from nineteen ninety eight

0:04:15 | with a resnet

0:04:17 | from twenty sixteen

0:04:19 | and they show that actually the new network is

0:04:23 | much worse calibrated

0:04:25 | than the old network

0:04:26 | so for this same bin as before

0:04:29 | the new network actually has an accuracy much lower than what it should have

0:04:36 | which is around point five five

0:04:38 | so this system is overconfident

0:04:41 | the network

0:04:42 | thinks

0:04:43 | it will do much better than it actually does

0:04:46 | on the other hand the error

0:04:48 | from the new network is lower

0:04:50 | so if you put this network to make decisions the decisions will be better

0:04:54 | than the old one's

0:04:55 | but the scores that it outputs

0:04:58 | cannot be interpreted as posteriors at all

0:05:00 | they cannot be interpreted as

0:05:02 | the certainty that the system has when it makes a decision

0:05:08 | so |

0:05:09 | this is actually a phenomenon that we see a lot in speaker recognition basically you

0:05:15 | can have a badly calibrated model that is still

0:05:18 | quite discriminative

0:05:20 | the problem is that such a model |

0:05:22 | might be useless in practice depending on the scenario in which we plan to use it

0:05:27 | so |

0:05:28 | as i already said

0:05:29 | the scores

0:05:30 | from a miscalibrated system cannot be interpreted as the certainty

0:05:35 | that the system has in its decisions

0:05:39 | also |

0:05:40 | the scores cannot be

0:05:42 | they cannot be used to make optimal decisions

0:05:44 | without

0:05:46 | having data with which to learn

0:05:48 | how to make decisions so that's what i'm gonna talk about in the next

0:05:51 | two sections

0:05:54 | so how do we make optimal decisions in general for binary classification

0:05:59 | we usually define a cost function

0:06:02 | and this is a very

0:06:03 | common cost function which has very nice properties

0:06:07 | it's a combination of two terms

0:06:09 | one for each class

0:06:11 | where

0:06:12 | the

0:06:13 | main part here is the probability of making an error for that class that is

0:06:18 | it is

0:06:18 | the probability of

0:06:19 | deciding

0:06:21 | class

0:06:22 | zero

0:06:22 | when the true class

0:06:24 | was one

0:06:26 | we multiply this probability of error by the prior

0:06:29 | for that class one

0:06:31 | and then we further multiply it by a cost which is what we think

0:06:36 | it is gonna cost us if we make this error

0:06:40 | this is very specific to the application in which we're gonna use the system

0:06:44 | and for the other class it's the same symmetric
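the two-term expected cost just described can be written as a one-line function; this is a sketch, and the names c_miss, c_fa and p_target anticipate the speaker-verification naming used later in the talk:

```python
def expected_cost(p_miss, p_fa, p_target=0.5, c_miss=1.0, c_fa=1.0):
    # one term per class: the error probability for that class, times
    # that class's prior, times the application-specific cost of the error
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
```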

0:06:47 | so |

0:06:49 | this is an expected cost

0:06:51 | the way to minimize this expected cost is to choose the following

0:06:57 | decisions

0:06:58 | so |

0:06:59 | for a certain sample x

0:07:01 | the decided class should be one

0:07:03 | if this factor

0:07:05 | is larger than this factor and zero otherwise

0:07:09 | and this factor is composed of the cost

0:07:12 | the prior

0:07:14 | and the likelihood

0:07:16 | for class one

0:07:19 | and this is the same for class zero
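the comparison of the two cost-times-prior-times-likelihood factors can be sketched like this (an illustrative implementation, with made-up argument names):

```python
def bayes_decision(lik1, lik0, prior1=0.5, c_err1=1.0, c_err0=1.0):
    # decide class 1 when the cost * prior * likelihood factor for
    # class 1 exceeds the corresponding factor for class 0; this is
    # the decision that minimizes the expected cost
    return 1 if c_err1 * prior1 * lik1 > c_err0 * (1.0 - prior1) * lik0 else 0
```

note that raising the cost of errors on class 0 pushes the decision toward class 0 even when the likelihood favors class 1.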

0:07:23 | so |

0:07:23 | we see here that all we need to make optimal decisions is this likelihood

0:07:28 | p of x

0:07:29 | given c

0:07:32 | now |

0:07:34 | one we have |

0:07:36 | is the likelihood then we learned |

0:07:38 | without formal |

0:07:39 | is the likelihood when they're |

0:07:42 | on the training data |

0:07:43 | that's why i'm using here the tilde to indicate that the ones in the cost

0:07:48 | these probabilities are the ones we expect to see in testing

0:07:51 | not the ones we actually see in training

0:07:55 | while we don't have that

0:07:57 | what we have is what we saw in training

0:08:00 | so let's say that we train a generative model then our generative model is gonna

0:08:05 | output directly this likelihood

0:08:07 | but it will be the likelihood we learned in training

0:08:10 | and that's fine we usually just assume

0:08:13 | in order to do anything at all in machine learning

0:08:15 | we assume that this will generalize

0:08:18 | to testing

0:08:20 | now we may not have the likelihood if we train a discriminative system

0:08:25 | in that case we may have the posterior |

0:08:28 | discriminative systems

0:08:29 | trained for example with cross entropy end up outputting posteriors

0:08:34 | in that case what we need to do is convert those posteriors into likelihoods

0:08:38 | and for that we use bayes rule

0:08:41 | basically we multiply the posterior by

0:08:43 | this p of x and divide by the prior

0:08:46 | and note here that again this is the prior in training

0:08:50 | it is not the prior

0:08:52 | the p with the tilde that i put here in the cost which is the one we

0:08:55 | expect to see in testing

0:08:58 | and that's the whole

0:08:59 | point of why we use likelihoods and not posteriors

0:09:03 | to make these optimal decisions

0:09:06 | because it gives us the flexibility

0:09:08 | to separate

0:09:09 | the prior from training from the prior in testing

0:09:14 | okay so |

0:09:16 | going back to the |

0:09:17 | to the optimal decisions |

0:09:19 | we have this expression |

0:09:21 | we can simplify this expression by defining the log-likelihood ratio

0:09:26 | which i'm sure everybody knows if

0:09:28 | you're working in speaker verification

0:09:30 | it's basically the ratio between

0:09:33 | the likelihood for class one and the likelihood for class zero

0:09:37 | and we take the logarithm because it's

0:09:39 | nicer

0:09:40 | and we can do a similar thing with the costs

0:09:43 | the factors that multiply these likelihoods here

0:09:47 | so we define this theta

0:09:49 | and with those definitions we can

0:09:53 | simplify the optimal decisions to look like this basically you decide class one

0:09:57 | if the llr is larger than

0:09:59 | theta

0:10:00 | otherwise class zero

0:10:02 | and the llr can be computed from the system posteriors

0:10:06 | with this expression which is just

0:10:07 | bayes rule

0:10:08 | after taking the logarithm

0:10:11 | the p of x cancels out

0:10:13 | because it was

0:10:15 | in both

0:10:18 | factors it was in both likelihoods

0:10:21 | and

0:10:21 | and this is basically the log odds of the posterior minus the log odds of the

0:10:26 | prior

0:10:26 | which can be written this way using the logit function
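the log-odds identity just stated fits in a couple of lines; this is a sketch with illustrative names:

```python
import math

def logit(p):
    # log odds of a probability
    return math.log(p / (1.0 - p))

def posterior_to_llr(posterior1, train_prior1):
    # llr = logit(posterior) - logit(prior): this is bayes rule in
    # log-odds form, with the p(x) term cancelling in the ratio
    return logit(posterior1) - logit(train_prior1)
```

with equal training priors the llr is just the log odds of the posterior, e.g. a posterior of 0.8 under a 0.5 prior gives an llr of log 4.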

0:10:34 | okay so in speaker verification the feature x

0:10:39 | is actually a pair of features or even

0:10:41 | a pair of sets of features

0:10:43 | one for enrollment and one for test

0:10:46 | then class one is the class for target or same-speaker

0:10:51 | trials

0:10:52 | and class zero is the class for impostor or different-speaker trials

0:10:57 | and we define the cost function or dcf as we usually call it in speaker verification

0:11:03 | using these

0:11:05 | names

0:11:06 | for the costs and priors

0:11:08 | and

0:11:09 | we call the errors p miss

0:11:12 | and p false alarm

0:11:14 | a miss would be

0:11:15 | labeling a target trial wrongly namely as an impostor

0:11:19 | and a false alarm would be

0:11:21 | labeling an impostor as a target

0:11:26 | and the threshold

0:11:27 | looks like this using these names

0:11:30 | and note that all you

0:11:31 | really care about

0:11:32 | to make optimal decisions is actually this theta you don't care about the whole

0:11:38 | combination of

0:11:40 | values of costs and priors all you care about is this theta

0:11:44 | so you can in fact simplify

0:11:47 | the cost functions the families of cost functions to consider by just using a

0:11:52 | single parameter the effective p target that is equivalent to having this

0:11:58 | triplet of parameters because

0:12:02 | p tilde is a function of them

0:12:06 | so we will be using that in the rest of the talk because it's much

0:12:10 | simpler

0:12:11 | and it helps a lot in the analysis

0:12:14 | basically we simplify all possible cost functions

0:12:18 | all combinations of

0:12:20 | costs and priors to a single

0:12:22 | effective p target
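the collapse of the (prior, cost, cost) triplet into one effective parameter, and the resulting threshold, can be sketched as follows (illustrative names, not from the slides):

```python
import math

def effective_p_target(p_target, c_miss, c_fa):
    # collapse the (p_target, c_miss, c_fa) triplet into one number
    num = p_target * c_miss
    return num / (num + (1.0 - p_target) * c_fa)

def bayes_threshold(p_tilde):
    # the bayes-optimal llr threshold theta is minus the log odds of
    # the effective p target
    return -math.log(p_tilde / (1.0 - p_tilde))
```

any triplet with the same effective p target yields the same threshold, which is why the single parameter is enough for the analysis.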

0:12:27 | so let's see some examples of applications that use different costs and

0:12:32 | priors

0:12:33 | so the default

0:12:36 | the simplest cost function would be to have equal priors and equal costs

0:12:40 | and that would give you a threshold of zero

0:12:43 | that would be the bayes optimal threshold for this cost function

0:12:49 | now if you have an application like in this example

0:12:53 | speaker authentication where

0:12:55 | your goal

0:12:56 | is to

0:12:59 | verify whether somebody

0:13:01 | is who they say they are

0:13:03 | through their voice

0:13:05 | for example to enter a

0:13:08 | system

0:13:09 | then you would expect that most of your cases are gonna be

0:13:13 | target trials

0:13:14 | because you don't have many impostors trying to get into your system

0:13:18 | on the other hand the cost of making a mistake

0:13:22 | is very high

0:13:23 | if you make a false alarm

0:13:25 | so you don't want any or

0:13:27 | as few as possible impostors getting into the system

0:13:31 | that means you need to

0:13:32 | set a very high cost

0:13:34 | a cost of false alarm

0:13:36 | compared to the cost of miss

0:13:38 | and that corresponds with a threshold of

0:13:40 | two point three

0:13:41 | so basically what you're doing is moving the threshold to the right so that

0:13:44 | this area here under the solid curve

0:13:49 | which is the distribution of scores

0:13:52 | for the impostor samples

0:13:55 | is minimized everything above the threshold of two point three will be a false alarm

0:14:00 | by moving the threshold to the right we are minimizing this area

0:14:05 | another application that actually is

0:14:07 | the opposite in terms of costs and

0:14:10 | priors is speaker search

0:14:11 | in that case you're looking for a certain specific speaker within

0:14:16 | audio from many other speakers

0:14:19 | so in that case the probability of finding your speaker is actually low

0:14:23 | say one percent

0:14:26 | but the costs that you care about

0:14:29 | the errors that you want to avoid are the misses because

0:14:33 | you're looking for one specific speaker that is important to you for

0:14:37 | some reason so you don't want to miss it

0:14:40 | so in that case the optimal threshold is

0:14:43 | symmetric to the previous one minus two point three

0:14:46 | and in that case what you're trying to minimize is the area under the dashed curve

0:14:51 | to the left

0:14:53 | of the threshold

0:14:54 | which is the probability of miss
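as a sanity check on the thresholds quoted above, here is a small sketch; the cost values are illustrative choices that happen to reproduce thresholds of roughly plus and minus 2.3, they are not the values from the talk's slides:

```python
import math

def theta(p_target, c_miss=1.0, c_fa=1.0):
    # bayes-optimal llr threshold for a given prior and pair of costs
    return math.log((c_fa * (1.0 - p_target)) / (c_miss * p_target))

# default application: equal priors and equal costs -> threshold 0
t_default = theta(0.5)
# authentication-like: false alarms ten times more costly -> approx +2.3
t_auth = theta(0.5, c_miss=1.0, c_fa=10.0)
# search-like: misses ten times more costly -> approx -2.3
t_search = theta(0.5, c_miss=10.0, c_fa=1.0)
```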

0:14:59 | okay |

0:15:02 | so to recap before moving on to

0:15:05 | evaluation

0:15:07 | if we have an llr then

0:15:09 | i showed that we can trivially make optimal decisions for any possible cost function that you

0:15:14 | can imagine

0:15:16 | with the formula that i gave

0:15:19 | but of course these decisions will only be actually optimal if the system outputs are |

0:15:25 | well calibrated |

0:15:26 | otherwise they will not be

0:15:29 | so how do we figure out |

0:15:31 | if we have a |

0:15:32 | well calibrated system |

0:15:34 | the

0:15:35 | idea is if you're gonna make your system make decisions using these thresholds that i

0:15:41 | showed before the thetas

0:15:43 | then that's how you should evaluate have your system make those decisions using those thetas

0:15:49 | and |

0:15:50 | see how well it does

0:15:52 | and then the further question is

0:15:54 | could we have made better decisions if we had calibrated the scores before making them

0:16:01 | that will give us a sense of

0:16:03 | how well calibrated the system is

0:16:05 | to begin with

0:16:08 | so |

0:16:09 | the way we usually evaluate performance on binary classification tasks

0:16:14 | is

0:16:15 | by using the cost

0:16:17 | at a given threshold

0:16:19 | so we prefix the threshold

0:16:22 | using bayes

0:16:23 | decision theory or not

0:16:26 | we just

0:16:26 | set the threshold and then compute the p miss and p false alarm which are these areas

0:16:31 | under the two distributions

0:16:34 | and then compute the cost

0:16:37 | now we can also |

0:16:40 | and |

0:16:40 | define metrics that depend on the whole distribution of scores

0:16:45 | so for example the equal error rate

0:16:48 | is defined

0:16:49 | by finding the threshold that makes these two areas the same

0:16:54 | so basically to compute it you need the whole test distribution

0:16:59 | and a similar thing is the minimum dcf

0:17:01 | so what you do in that case is

0:17:04 | sweep the threshold

0:17:07 | across the whole range of scores

0:17:10 | compute the cost

0:17:11 | for all possible thresholds

0:17:13 | and then

0:17:15 | choose the threshold that gives the minimum cost
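the threshold sweep just described can be sketched in a few lines; this is an illustrative implementation, not an official scoring tool:

```python
import numpy as np

def min_dcf(tar, non, p_target=0.5, c_miss=1.0, c_fa=1.0):
    # sweep candidate thresholds over the whole range of scores and keep
    # the lowest cost; note this needs the full score distributions
    cands = np.concatenate([tar, non, [-np.inf, np.inf]])
    best = np.inf
    for t in cands:
        p_miss = np.mean(tar < t)
        p_fa = np.mean(non >= t)
        best = min(best, c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa)
    return best
```

only the scores themselves need to be tried as thresholds, since the error rates change only when the threshold crosses a score.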

0:17:20 | now that minimum cost is actually bounded

0:17:22 | and

0:17:23 | it is bounded by

0:17:25 | basically dummy decisions

0:17:27 | by a system that makes dummy decisions

0:17:31 | if you put

0:17:32 | for example your threshold all the way to the right

0:17:35 | then you will only make

0:17:37 | mistakes that are misses

0:17:40 | everything will be a miss

0:17:42 | so you'll have a p miss of one and a p false alarm of zero

0:17:46 | in that case the cost that you will incur is this factor here

0:17:50 | on the other hand if you put the threshold all the way to the left

0:17:54 | then you will only make false alarms and then the cost for that

0:17:58 | system

0:17:59 | will be this factor here

0:18:02 | so basically the bound for the minimum dcf is

0:18:05 | the best of those

0:18:06 | two cases

0:18:08 | they're both dummy systems but one will be better than the other

0:18:12 | and we usually use this bound to normalize

0:18:16 | the dcf so in nist evaluations for example

0:18:19 | the

0:18:20 | cost that is reported is the normalized dcf

0:18:23 | usually
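the dummy-system bound and the normalization just mentioned look like this in code (a sketch with illustrative names):

```python
def dcf_dummy_bound(p_target, c_miss=1.0, c_fa=1.0):
    # best of the two dummy systems: always reject (every target is a
    # miss) versus always accept (every impostor is a false alarm)
    return min(c_miss * p_target, c_fa * (1.0 - p_target))

def normalized_dcf(dcf, p_target, c_miss=1.0, c_fa=1.0):
    # dcf divided by the best dummy cost; values above 1.0 mean the
    # system is doing worse than a system that ignores its input
    return dcf / dcf_dummy_bound(p_target, c_miss, c_fa)
```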

0:18:26 | and then finally another thing we can do is sweep the threshold

0:18:30 | compute the p miss and p false alarm for every possible value of the threshold

0:18:35 | and that gives us curves like these

0:18:37 | and if we transform the axes appropriately then we get the

0:18:43 | standard det curves we use for speaker verification

0:19:02 | so the cost that i've been talking about can be decomposed |

0:19:05 | into discrimination and calibration components

0:19:10 | so let's see how

0:19:12 | let's say we assume a cost with equal priors and equal costs

0:19:18 | in that case

0:19:19 | the optimal threshold will be zero

0:19:21 | the bayes optimal threshold would be zero

0:19:24 | so |

0:19:25 | we compute the cost using that

0:19:27 | threshold

0:19:28 | and we get this

0:19:29 | value

0:19:30 | given that the priors and costs are the same then the cost will be given

0:19:34 | by the average of these two areas

0:19:36 | as shown here

0:19:38 | now we can also compute the minimum cost as i mentioned before

0:19:42 | basically we sweep the threshold

0:19:44 | and choose the threshold that gives

0:19:46 | the minimum cost

0:19:47 | which again is the average between these two areas which you see is much smaller than

0:19:52 | the average between these two areas in this case

0:19:55 | and the difference between

0:19:57 | those two costs

0:19:59 | can be seen

0:20:00 | as the additional cost that you are incurring because your system was miscalibrated

0:20:06 | so this orange area here which is the difference between

0:20:09 | the sum

0:20:11 | of the areas here and the sum of the areas here

0:20:14 | is the cost due to miscalibration and that's one way of measuring

0:20:19 | how

0:20:20 | miscalibrated the system is

0:20:25 | so there's discrimination which is how well the scores

0:20:28 | separate the classes

0:20:30 | and there's calibration which is whether the scores can be interpreted probabilistically

0:20:34 | which implies that you can make optimal bayes decisions

0:20:37 | if they are calibrated

0:20:40 | and the key here is that discrimination is the part

0:20:43 | of the

0:20:45 | performance that cannot be changed

0:20:47 | if we transform the scores with an invertible transformation

0:20:52 | so here's a simple example let's say you have this distribution of scores

0:20:58 | and you have a threshold t that you chose for some reason

0:21:01 | could be the optimal one or not

0:21:03 | and you transform the scores with

0:21:06 | any monotonic transformation

0:21:09 | which in this example is just an affine transformation

0:21:13 | you transform them

0:21:15 | and you can also transform the threshold t

0:21:18 | with the same exact

0:21:20 | function

0:21:22 | then the transformed threshold will correspond to exactly the same cost

0:21:27 | as the threshold t in the original domain

0:21:31 | so basically

0:21:33 | by doing a monotonic transformation to your scores you cannot change their discrimination

0:21:39 | the minimum cost

0:21:41 | that you will be able to find in both cases will be the same
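as a quick numeric check of this invariance, here is a small sketch; the scores and the helper function are made up for illustration:

```python
import numpy as np

def min_error(tar, non):
    # smallest average of miss and false-alarm rates over all thresholds
    cands = np.concatenate([tar, non, [-np.inf, np.inf]])
    return min(0.5 * np.mean(tar < t) + 0.5 * np.mean(non >= t) for t in cands)

tar = np.array([-0.5, 1.0, 2.0, 3.0])
non = np.array([-2.0, -1.0, 0.0, 1.5])
# an affine map with positive slope is monotonic, so discrimination
# (the best achievable error) is unchanged by it
assert min_error(tar, non) == min_error(3.0 * tar + 7.0, 3.0 * non + 7.0)
```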

0:21:52 | so |

0:21:52 | the cost i've been talking about measures the performance at a single operating point

0:21:57 | it evaluates the quality of the hard decisions for a certain

0:22:01 | theta

0:22:03 | now

0:22:04 | a more comprehensive measure

0:22:06 | is the cross entropy which is given by this expression and you probably all know it

0:22:11 | the empirical cross entropy is the average

0:22:15 | of the logarithm of the posterior that the system gives

0:22:19 | to the correct class for each sample

0:22:22 | so you want this posterior to be as high as possible close to one

0:22:25 | if possible

0:22:27 | so you

0:22:28 | get a logarithm of zero and if that happens for every sample then you get a

0:22:33 | cross entropy of zero which is what you want

0:22:37 | now there's a weighted version of this cross entropy

0:22:40 | which is

0:22:41 | basically the same

0:22:43 | but

0:22:43 | you split your samples into two terms

0:22:46 | the ones for

0:22:49 | class zero and the ones for class one

0:22:51 | and you reweight

0:22:53 | these averages

0:22:55 | by a prior

0:22:56 | that is the effective prior that i talked about before

0:23:01 | so basically you make yourself independent of the priors that you're seeing in the test

0:23:06 | data

0:23:07 | you can evaluate for any

0:23:09 | prior you want

0:23:13 | these posteriors are computed from the llrs

0:23:16 | and the priors

0:23:18 | using bayes rule

0:23:19 | and note that these are the priors that you applied here

0:23:23 | the ones that you used to compute the llr

0:23:27 | okay and the famous c llr that we use in

0:23:30 | nist evaluations and many papers

0:23:32 | is defined as this weighted cross entropy when the priors are point five

0:23:37 | and it's normalized by the logarithm of two as explained in the next slide
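the weighted cross entropy at a prior of one half, normalized by log 2, can be sketched like this; an illustrative implementation, not an official scoring script:

```python
import numpy as np

def cllr(tar_llrs, non_llrs, prior=0.5):
    # prior-weighted cross entropy of the posteriors obtained from the
    # llrs via bayes rule, normalized by log(2) so that a system that
    # always outputs llr = 0 scores exactly 1.0
    logit_prior = np.log(prior / (1.0 - prior))
    # -log posterior of the correct class, written in log1p form
    ce_tar = np.mean(np.log1p(np.exp(-(tar_llrs + logit_prior))))
    ce_non = np.mean(np.log1p(np.exp(non_llrs + logit_prior)))
    return (prior * ce_tar + (1.0 - prior) * ce_non) / np.log(2.0)
```

confident correct llrs drive the value toward zero, while llrs of zero everywhere give exactly one.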

0:23:44 | so the weighted cross entropy can be decomposed also

0:23:47 | like the cost

0:23:49 | into discrimination and calibration terms

0:23:52 | basically you compute the actual weighted cross entropy

0:23:56 | and you subtract

0:23:58 | the

0:23:59 | minimum

0:23:59 | weighted cross entropy

0:24:01 | now this minimum one is not as trivial to obtain as for the cost you

0:24:05 | can't just choose the threshold because here we are evaluating the scores themselves not just

0:24:11 | the decisions

0:24:12 | so we need to actually warp the scores to get

0:24:16 | the best possible weighted cross entropy

0:24:19 | without changing the discrimination

0:24:21 | of the scores

0:24:22 | and that means

0:24:23 | using a monotonic transformation

0:24:26 | and there's an algorithm called pool adjacent violators

0:24:29 | or pav

0:24:30 | which

0:24:31 | does exactly that so

0:24:34 | without changing the rank of the scores the order of the scores

0:24:38 | it does the best it can to minimize the weighted cross entropy

0:24:42 | and so that's what we use to compute

0:24:45 | this delta

0:24:46 | which

0:24:47 | measures how miscalibrated your system is

0:24:50 | in terms of weighted cross entropy
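the core of the pav algorithm fits in a short function; this is a sketch of the unweighted least-squares case (computing the minimum weighted cross entropy also requires the prior weighting, which is omitted here for clarity):

```python
def pav(y):
    # pool adjacent violators: least-squares non-decreasing fit to y.
    # applied to the 0/1 labels of trials sorted by score, it yields the
    # monotone posterior estimates behind the minimum cross entropy
    blocks = []  # each block holds [value, weight]
    for v in y:
        blocks.append([float(v), 1.0])
        # merge backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for v, w in blocks:
        out.extend([v] * int(round(w)))
    return out
```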

0:24:53 | and this weighted cross entropy is bounded the same as

0:24:56 | the cost

0:24:57 | by a dummy system that in this case is the system that outputs

0:25:02 | instead of

0:25:02 | the posteriors it outputs directly the prior so it's a system that doesn't know

0:25:06 | anything about its input

0:25:08 | but

0:25:09 | still does the

0:25:10 | best it can

0:25:11 | knowing only the priors

0:25:14 | and

0:25:16 | that means that the worst

0:25:18 | min c llr

0:25:20 | is one point zero because we normalized by

0:25:23 | log of two which is exactly this term when you evaluate it at point five

0:25:29 | so this means that the

0:25:31 | minimum c llr

0:25:33 | will never be

0:25:34 | worse than one

0:25:36 | and if the actual c llr is worse than one then you know for sure

0:25:39 | that you're gonna have a difference here

0:25:41 | because this is never

0:25:43 | larger than one so if this is larger than one then it means you have

0:25:46 | a calibration problem

0:25:50 | okay |

0:25:51 | finally in terms of evaluation i wanted to mention these |

0:25:55 | curves the applied probability of error or ape curves

0:25:59 | and the c llr is a single summary number

0:26:02 | but you might want to actually see

0:26:04 | the performance across

0:26:06 | a range of operating points and that's what these curves do

0:26:10 | they basically show the cost |

0:26:12 | of |

0:26:14 | as a function of the effective p target

0:26:18 | which

0:26:19 | also defines the theta

0:26:21 | so |

0:26:22 | what we see here is

0:26:23 | the

0:26:24 | cost

0:26:25 | for prior decisions |

0:26:27 | and the prior decisions are what i mentioned before |

0:26:30 | basically just a dummy system that always outputs |

0:26:34 | the priors instead of posteriors |

0:26:38 | and the red is our system whatever it is

0:26:43 | calibrated or not

0:26:46 | and the dashed curve is the very best you can do if you were to

0:26:50 | warp your scores using the pav algorithm

0:26:54 | so basically the difference for each theta the difference between the dashed and the red

0:27:00 | is the miscalibration at that

0:27:02 | operating point

0:27:05 | and the nice property of these curves is that the c llr

0:27:08 | is proportional to the area under the curves

0:27:12 | so the actual c llr is proportional to the area under the red curve

0:27:17 | and the min c llr is proportional to the area under the dashed curve

0:27:23 | and furthermore the equal error rate is the maximum

0:27:26 | of this

0:27:28 | red curve

0:27:30 | and there are variants of these curves

0:27:32 | which you can find in these papers

0:27:35 | that change the way the axes are defined

0:27:39 | okay |

0:27:40 | so let's say now that we

0:27:43 | already know that our system has a calibration problem

0:27:45 | should we worry about it should we try to fix it

0:27:50 | there are some scenarios where

0:27:52 | there's

0:27:52 | no problem if you have a miscalibrated system there is no need to fix

0:27:56 | it for example

0:27:58 | if

0:27:59 | you know what the cost function is ahead of time |

0:28:02 | and there's development data available |

0:28:04 | then all you need to do is run the system on the development data

0:28:08 | and find the

0:28:10 | empirically best

0:28:11 | threshold

0:28:13 | for

0:28:14 | the development data for that system and that cost function

0:28:17 | and you're done

0:28:19 | and |

0:28:20 | you also don't need to worry about calibration if

0:28:23 | you only care about ranking

0:28:25 | the samples so you want to order them from most to

0:28:29 | least likely target

0:28:30 | and nothing else

0:28:33 | on the other hand it may be very necessary to fix calibration in many other

0:28:37 | scenarios

0:28:39 | one of them is for example if you don't know ahead of time what the

0:28:43 | system will be used for exactly what is the application

0:28:45 | that means

0:28:47 | you don't know the cost function and if you don't know the cost function

0:28:50 | you cannot optimize the threshold

0:28:52 | ahead of time

0:28:53 | so if you want to give the user of the system the knob

0:28:57 | that defines this effective p target

0:29:01 | then the system has to be calibrated for the

0:29:05 | bayes optimal threshold to be

0:29:08 | really optimal

0:29:09 | to work well

0:29:11 | another case where you need to fix calibration is if you want to

0:29:15 | get a probabilistic value

0:29:18 | from your system

0:29:19 | some measure of the uncertainty that the system has

0:29:23 | when it makes

0:29:24 | its decisions

0:29:25 | and you can use that uncertainty for example |

0:29:28 | to reject samples when the system is uncertain |

0:29:31 | so |

0:29:33 | if your llr is too close to the threshold that you were planning

0:29:36 | to use to make your decisions

0:29:38 | then perhaps

0:29:39 | you want the system not to make a decision and tell the user i don't know

0:29:44 | you're on your own

0:29:47 | and another case is when you actually don't want to make hard decisions when

0:29:51 | you want to report a value

0:29:53 | that is interpretable

0:29:55 | as for example in the forensic voice comparison field

0:30:02 | okay so

0:30:03 | let's say we do want to fix

0:30:05 | calibration we are in one of those scenarios where it matters

0:30:10 | one very common approach to do this is to use linear logistic regression |

0:30:15 | so this assumes that the llr

0:30:18 | the calibrated score

0:30:20 | is an affine transformation of whatever your system outputs

0:30:25 | and the parameters of this model are the w and b

0:30:30 | and it uses the weighted cross entropy as the loss function

0:30:35 | now |

0:30:37 | now to compute the weighted cross entropy we need posteriors not llrs so

0:30:41 | we need to convert those llrs into posteriors and we use this expression

0:30:44 | that i showed before

0:30:46 | which is

0:30:47 | the llr is the log odds of the posterior minus the log odds of the prior

0:30:53 | and

0:30:55 | basically inverting this expression we get the logistic function which is the inverse

0:31:00 | of the logit

0:31:01 | and |

0:31:04 | and finally after doing |

0:31:09 | trivial computations we can these expression which is that bystander |

0:31:13 | mean and logistic expression |
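The llr-to-posterior conversion being described can be written out directly. A small sketch (the function and variable names are mine, not from the talk):

```python
import math

def logit(p):
    # log odds of a probability
    return math.log(p / (1.0 - p))

def posterior_from_llr(llr, p_target):
    # Inverting llr = logit(posterior) - logit(prior) gives the
    # logistic (inverse-logit) function applied to llr + logit(prior):
    #   P(target | x) = sigmoid(llr + logit(p_target))
    return 1.0 / (1.0 + math.exp(-(llr + logit(p_target))))
```

With an llr of zero the posterior falls back to the prior, as it should, and applying `logit` to the posterior and subtracting `logit(p_target)` recovers the llr exactly.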

0:31:15 | we then plug this posterior into the expression of the weighted cross

0:31:20 | entropy to get the loss

0:31:21 | that we can then optimize as we wish

0:31:24 | and finally once we optimize this on

0:31:28 | some dev data

0:31:30 | we get the w and b

0:31:32 | that are optimal for that

0:31:34 | data set

0:31:37 | so |

0:31:39 | this is an affine transformation so it doesn't change the shapes

0:31:42 | of the distributions at all |

0:31:44 | basically |

0:31:45 | it looks like it did nothing

0:31:47 | but what it did is

0:31:49 | move

0:31:50 | shift and shrink

0:31:53 | the axes so that the resulting

0:31:56 | scores

0:31:57 | are calibrated

0:32:01 | and in terms of the cllr, you can see that the raw scores

0:32:05 | which are these ones

0:32:07 | had a very high cllr, actually higher than one, so they were

0:32:10 | worse than one

0:32:12 | and after you calibrate them

0:32:14 | which all you did was really scale and shift

0:32:17 | they fit the data much better, as seen in the lower cllr

0:32:20 | this minimum here is basically

0:32:25 | the very best you can do

0:32:27 | so with the affine transformation we are actually doing almost as

0:32:32 | good as the very best

0:32:36 | which means that the affine assumption was, in this case, actually quite good

0:32:41 | this is a real case, this is voxceleb data processed with a

0:32:46 | plda system

0:32:50 | there are many other approaches to do calibration; i'm not gonna cover them because it

0:32:54 | would take another |

0:32:55 | another whole keynote |

0:32:57 | and |

0:32:58 | there are nonlinear approaches |

0:33:02 | which |

0:33:02 | in some

0:33:04 | cases do better than linear

0:33:07 | when the affine assumption is not a good one

0:33:12 | then there are regularized and bayesian approaches that actually do quite well when you have very

0:33:16 | little data |

0:33:18 | to train the calibration model |

0:33:19 | and then there are approaches that go all the way

0:33:22 | down

0:33:23 | to no labeled data

0:33:26 | so there's data, but

0:33:28 | you have it and you don't know the labels

0:33:31 | and those work surprisingly well

0:33:35 | so |

0:33:36 | if we have a calibrated score then

0:33:39 | we know we can trust it as a log-likelihood ratio, which means

0:33:43 | that we can use it to make optimal decisions

0:33:46 | and we can also convert it to a posterior if we wanted to and if we

0:33:50 | had the prior

0:33:53 | and a very nice property of the llr is that

0:33:57 | if you were to compute

0:33:58 | the likelihood ratio

0:34:00 | of your

0:34:01 | already calibrated score then you would get

0:34:04 | the same thing

0:34:05 | so you can treat

0:34:07 | this score, the llr, as a feature

0:34:10 | and if you

0:34:11 | were to compute this ratio you would get the same value back

0:34:16 | and this leads to some nice properties, like for example

0:34:21 | in a calibrated |

0:34:22 | score |

0:34:24 | the two distributions have to cross exactly at zero |

0:34:28 | because when the nn are is zero |

0:34:30 | this ratio is one

0:34:32 | which means that these two

0:34:34 | have to be the same

0:34:36 | and these two are exactly what we're seeing here: the densities

0:34:40 | the probability density function of the score for each of the two classes

0:34:44 | they have to cross at zero

0:34:46 | and further if we assume that one of these two distributions is gaussian |

0:34:51 | then the other distributions forced to be gaussian |

0:34:54 | with the same |

0:34:55 | standard deviation and with symmetric means

0:34:59 | and this, as i said, is a real example, and it's actually quite

0:35:03 | close to that assumption

0:35:05 | in this voxceleb data
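These symmetric-Gaussian properties can be checked numerically. A small sketch: target scores drawn from N(+mu, sigma) and nontarget scores from N(-mu, sigma) imply an llr that is linear in the score, llr(s) = 2*mu*s/sigma^2, so the scores are themselves calibrated exactly when sigma^2 = 2*mu. The particular mu and sigma below are arbitrary values of mine chosen to satisfy that condition:

```python
import math

def gauss_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# sigma**2 = 4 = 2 * mu, so llr(s) = 2*mu*s / sigma**2 = s: calibrated
MU, SIGMA = 2.0, 2.0

def llr(s):
    # log-likelihood ratio implied by the two symmetric Gaussians
    return math.log(gauss_pdf(s, MU, SIGMA) / gauss_pdf(s, -MU, SIGMA))
```

Evaluating `llr` at any score returns that same score, and the two densities are equal at zero, which is the crossing-at-zero property mentioned above.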

0:35:09 | okay so to recap what we have covered so far

0:35:13 | what i've been saying is that the curves, equal error rate, and min dcf

0:35:18 | measure only discrimination performance

0:35:21 | basically this means that they ignore the usual issue of threshold selection, they ignore

0:35:26 | how to get

0:35:28 | to the actual decisions |

0:35:29 | from the score |

0:35:31 | on the other hand the weighted cross entropy, the actual dcf and the ape

0:35:36 | curves measure total performance

0:35:39 | and

0:35:40 | that includes the issue of how to

0:35:43 | make the decisions |

0:35:45 | and we can further use these metrics |

0:35:48 | to compute the |

0:35:49 | calibration loss |

0:35:51 | so to see whether the system is well calibrated or not |

0:35:57 | and if you find that calibration is actually not good, then fixing these calibration issues

0:36:02 | is |

0:36:03 | usually easy in ideal conditions: you can train an invertible transformation

0:36:09 | using

0:36:10 | usually a small representative dev set

0:36:13 | which is enough because

0:36:15 | in many of the approaches the number of parameters is very small, so

0:36:19 | you don't need a lot of data

0:36:23 | the key here though |

0:36:25 | is that you need a representative dev set

0:36:27 | and that's

0:36:29 | what i'm gonna discuss in the next slides

0:36:33 | so |

0:36:33 | basically what we have observed repeatedly is that calibration of our speaker verification

0:36:40 | systems |

0:36:42 | it is |

0:36:43 | extremely fragile |

0:36:45 | it is now, for our current systems, and it has always been

0:36:49 | since i've been working

0:36:51 | on speaker verification for

0:36:53 | almost twenty years now

0:36:57 | anything, like language, noise, distortions, duration: they all affect

0:37:02 | the calibration parameters

0:37:05 | and that means that a model trained on one condition

0:37:08 | is very unlikely to generalize to another condition

0:37:11 | on the other hand the discrimination performance is usually still reasonable |

0:37:16 | on unseen conditions |

0:37:17 | so if you train a system on telephone data and you try to use it |

0:37:21 | on microphone data |

0:37:22 | the result may not be the best you can do

0:37:25 | but it still will be reasonable

0:37:28 | on the other hand if you train your calibration model on telephone data and try

0:37:32 | to use it on microphone data, it may

0:37:34 | perform horribly

0:37:37 | and this is one example |

0:37:39 | so |

0:37:40 | i'm training the calibration model

0:37:45 | on two different sets

0:37:47 | speakers in the wild and the sre sixteen dev

0:37:50 | sets

0:37:51 | and applying those models |

0:37:53 | on voxceleb two

0:37:55 | the

0:37:58 | raw scores are identical; all i'm doing is changing the w and

0:38:02 | the b based on the calibration set

0:38:05 | what we see here |

0:38:07 | is that the model that was trained with speakers in the wild

0:38:12 | is extremely good |

0:38:13 | it's basically almost |

0:38:15 | perfect |

0:38:17 | while the model that was trained on sre sixteen is

0:38:20 | quite bad

0:38:21 | it is better than the raw scores but it is still quite bad

0:38:24 | compared to the best you can do

0:38:27 | and this is not surprising because

0:38:29 | voxceleb is actually quite close to speakers in the wild in terms of conditions

0:38:33 | but sre sixteen is not

0:38:36 | now |

0:38:36 | you may think maybe sre sixteen is just a bad set for doing calibration

0:38:41 | but that's not the case, because if you evaluate on sre sixteen

0:38:45 | evaluation data |

0:38:47 | and |

0:38:48 | then the opposite happens |

0:38:50 | so the |

0:38:51 | calibration model that is good in that case is the one that was trained on

0:38:54 | sre sixteen; these

0:38:57 | scores

0:38:58 | give a much lower

0:39:00 | cllr than the ones that were

0:39:02 | calibrated with speakers in the wild

0:39:06 | in this case again you're almost reaching the minimum

0:39:10 | so basically this tells us that the conditions on which the calibration model is trained |

0:39:14 | are determinant of

0:39:17 | where it's gonna be

0:39:19 | good

0:39:20 | you have to match the conditions of your evaluation

0:39:26 | now |

0:39:27 | this goes even deeper |

0:39:29 | if you |

0:39:30 | zoom into a data set, you can actually find miscalibration issues within the dataset itself

0:39:36 | so |

0:39:38 | i'm showing |

0:39:39 | results on sre sixteen evaluation set |

0:39:42 | when i train the calibration parameters on exactly the same evaluation set, so this is

0:39:47 | a cheating calibration experiment |

0:39:50 | here |

0:39:51 | i'm showing the |

0:39:53 | cllr, which is the solid bar, and the min cllr, which

0:39:57 | here are the same by construction, and here is the

0:40:01 | relative difference between those two

0:40:04 | so for the full set i have no calibration loss

0:40:06 | by construction, as i said

0:40:08 | on the other hand if i start to subset |

0:40:11 | this full set

0:40:13 | randomly, or by gender

0:40:16 | or by condition

0:40:17 | i start to see

0:40:19 | more and more calibration loss

0:40:21 | so the random subset is

0:40:23 | fine |

0:40:24 | it is well calibrated, and females and males are reasonably well calibrated

0:40:28 | but for these specific conditions

0:40:31 | that are defined by the language, the gender

0:40:34 | and whether the two waveforms in the trial come from the same telephone number

0:40:39 | or not |

0:40:41 | then we start to see calibration loss |

0:40:43 | up to almost twenty percent in this case

0:40:47 | so |

0:40:49 | looking at the distributions for the

0:40:52 | target and non-target trials of the

0:40:53 | female, same telephone number subset

0:40:57 | we see that the distributions are shifted to the right

0:41:00 | they should be aligned with zero; remember that these two distributions, if

0:41:04 | they were calibrated, should cross at zero

0:41:08 | but they don't |

0:41:09 | so they are shifted to the right, and that is reasonable, because since they

0:41:13 | share the same telephone number for both

0:41:16 | sides of the trial

0:41:18 | then it means that

0:41:19 | they look very much the same

0:41:23 | more than if the channels were different

0:41:26 | so |

0:41:27 | every trial looks more target-like

0:41:30 | than it should

0:41:33 | or than they do in the overall distribution |

0:41:36 | the opposite happens for the different telephone number

0:41:39 | scores

0:41:40 | they shift to the left

0:41:42 | and the final comment here is that this miscalibration within the dataset

0:41:49 | is also causing a discrimination problem

0:41:52 | because if you pool these

0:41:54 | trials as they are, as if they were calibrated

0:41:56 | you will get worse discrimination than if you were to first calibrate them

0:42:00 | and then pool them together

0:42:03 | so |

0:42:05 | there's an interplay here between calibration and discrimination |

0:42:09 | because |

0:42:11 | the miscalibration is happening

0:42:14 | for different sub conditions within the set |
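This interplay can be illustrated with toy numbers: pooling two sub-conditions whose scores carry opposite offsets hurts the best achievable cost, while aligning them first does not. All the numbers below are invented for illustration; they are not data from the talk:

```python
def min_dcf(llrs, labels, p_target=0.5):
    # Best normalized detection cost over a sweep of thresholds
    tar = [l for l, y in zip(llrs, labels) if y]
    non = [l for l, y in zip(llrs, labels) if not y]
    def cost(t):
        p_miss = sum(l <= t for l in tar) / len(tar)
        p_fa = sum(l > t for l in non) / len(non)
        return (p_target * p_miss + (1 - p_target) * p_fa) / min(p_target, 1 - p_target)
    return min(cost(t) for t in list(llrs) + [float("-inf")])

# Condition A scores sit 3 units too high, condition B 3 units too low
tar_a, non_a = [4.0, 5.0], [1.0, 2.0]
tar_b, non_b = [-2.0, -1.0], [-5.0, -4.0]
pooled = tar_a + non_a + tar_b + non_b
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# Removing each condition's offset before pooling restores separability
aligned = [s - 3.0 for s in tar_a + non_a] + [s + 3.0 for s in tar_b + non_b]
```

No single threshold separates the raw pool, but after per-condition alignment the pooled scores are perfectly separable: condition-dependent miscalibration shows up as a discrimination loss of the pooled set.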

0:42:21 | okay so there have been several approaches in the literature

0:42:25 | over the last decades at least

0:42:28 | that try to

0:42:31 | solve this problem of

0:42:33 | condition-dependent miscalibration

0:42:38 | where the |

0:42:39 | assumption of having a global |

0:42:42 | calibration model |

0:42:43 | that has a single w and a single b

0:42:46 | for all trials, is actually not a good one

0:42:50 | so most of these approaches assume that there's an external class |

0:42:54 | or vector representation |

0:42:57 | either given by the metadata

0:42:59 | or estimated

0:43:02 | that represents the condition of the samples

0:43:04 | the enrollment and the test samples

0:43:07 | and these vectors |

0:43:10 | are fed into the calibration stage and they are used to condition the parameters of |

0:43:14 | this calibration stage

0:43:17 | here are some approaches if you are interested in taking a look
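As a rough sketch of this family of ideas, the affine calibration parameters can be made functions of a condition vector. The class below and its names are hypothetical, invented for illustration, and do not correspond to any specific published model:

```python
class ConditionAwareCalibrator:
    """Affine calibrator whose scale and offset depend on a condition
    vector c (e.g. estimated channel or language posteriors):

        llr = (w0 + wv . c) * score + (b0 + bv . c)

    The parameters would be trained with weighted cross entropy,
    exactly as in the global affine case; only w and b now vary
    with the condition of each trial."""

    def __init__(self, dim):
        self.w0, self.wv = 1.0, [0.0] * dim  # global + condition scale
        self.b0, self.bv = 0.0, [0.0] * dim  # global + condition offset

    def llr(self, score, cond):
        w = self.w0 + sum(wi * ci for wi, ci in zip(self.wv, cond))
        b = self.b0 + sum(bi * ci for bi, ci in zip(self.bv, cond))
        return w * score + b
```

With all condition weights at zero this reduces to the identity, i.e. the global affine model; nonzero `bv` entries let each condition get its own offset, which is what aligns the shifted sub-condition distributions discussed above.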

0:43:24 | overall these approaches are quite successful at

0:43:27 | making the

0:43:28 | final system better, actually more discriminative

0:43:32 | because they align the distributions of the different sub conditions before putting them together |

0:43:40 | and there's another family of approaches

0:43:43 | where they put the condition awareness |

0:43:46 | in the back end itself rather than in the calibration stage |

0:43:49 | so |

0:43:50 | there's again a condition extractor of some kind |

0:43:54 | that affects the parameters of the backend

0:43:58 | the thing is that this approach doesn't necessarily fix calibration |

0:44:02 | it improves discrimination in general |

0:44:04 | but you may still need to do calibration; if the backend is still, for

0:44:08 | example, plda, like in these cases

0:44:12 | what comes out of it is still just a score

0:44:15 | so you still need

0:44:18 | perhaps a normal

0:44:19 | calibration model at the output

0:44:23 | okay, and recently we proposed an approach that jointly trains the backend

0:44:30 | and a condition-dependent

0:44:32 | calibrator

0:44:33 | where here we assume that the condition is extracted automatically as a function of the

0:44:38 | embeddings themselves

0:44:40 | and the whole thing |

0:44:42 | is trained jointly to optimize weighted cross entropy |

0:44:46 | so |

0:44:47 | this model actually gives |

0:44:49 | excellent calibration performance across so wide range of conditions |

0:44:53 | you can actually find the paper |

0:44:55 | and |

0:44:56 | in these proceedings if you're interested

0:44:59 | and there's a very related paper, also in these proceedings

0:45:04 | by daniel garcia-romero

0:45:06 | which i suggest you take a look at if you're interested in these topics

0:45:13 | okay so |

0:45:15 | to finish up |

0:45:17 | i've been talking about two

0:45:20 | wide application scenarios for speaker verification technology |

0:45:24 | one of them |

0:45:25 | is where you assume that there's development data available for the evaluation conditions |

0:45:32 | in that case |

0:45:33 | as i said, you can either calibrate the system on that data, which

0:45:38 | is matched |

0:45:39 | or just |

0:45:40 | find the best threshold

0:45:43 | so really, calibration in that

0:45:45 | scenario is not a big issue

0:45:49 | in fact most speaker verification papers |

0:45:52 | historically |

0:45:53 | operate under this scenario |

0:45:55 | it's also the scenario of the nist evaluations where we usually get development data which |

0:46:00 | is maybe not perfectly matching but |

0:46:02 | pretty well matched to what we will see you in the evaluation |

0:46:07 | in these proceedings i found thirty three speaker recognition papers

0:46:13 | of which twenty eight fall

0:46:15 | in this category |

0:46:18 | so |

0:46:19 | they mostly report just equal error rate and min dcf; some report actual values

0:46:25 | some don't |

0:46:28 | and i think it's fine to just report min dcf in those cases because you are

0:46:32 | basically assuming that the

0:46:35 | calibration issue is

0:46:37 | easy to solve

0:46:39 | so that

0:46:39 | if you were to have

0:46:41 | development data

0:46:43 | you could train a calibration model and you would reach very close

0:46:47 | to the minimum

0:46:48 | dcf

0:46:49 | the actual performance would get very close to the minimum

0:46:54 | now there's still a caveat there

0:46:57 | you may still have miscalibration problems within sub-conditions, and if you don't report

0:47:02 | actual and min dcf on those sub-conditions

0:47:05 | then that stays

0:47:07 | hidden

0:47:07 | behind the overall performance

0:47:11 | the other big scenario is |

0:47:15 | the one where we don't have development data

0:47:18 | for the evaluation conditions

0:47:21 | in that case we cannot calibrate or choose a threshold

0:47:25 | on matched conditions; we can only hope

0:47:28 | that our system will |

0:47:30 | work well out of the box |

0:47:35 | from these

0:47:36 | proceedings i found only five

0:47:39 | papers that operate under this scenario, where they

0:47:42 | actually test

0:47:43 | a system that was trained on some conditions

0:47:47 | on test data that is on different conditions

0:47:50 | and they do not assume that they have

0:47:52 | development data for that

0:47:54 | condition

0:47:58 | so basically we as a community |

0:48:00 | are very heavily focused on the first scenario, and have always been

0:48:05 | historically

0:48:09 | and i think this may be why our current speaker verification technology

0:48:14 | cannot be used out of the box |

0:48:16 | we are just |

0:48:18 | used to |

0:48:19 | always |

0:48:21 | asking for development data |

0:48:22 | in order to tune at least the calibration stage of our system |

0:48:28 | we know the calibration stage has to be tuned, otherwise the system won't work

0:48:33 | or may work very poorly

0:48:36 | so my question is, and maybe we can discuss it in the question and answer session

0:48:43 | wouldn't it be worth it for us as a community to pay more attention to this

0:48:47 | scenario |

0:48:48 | no development data available |

0:48:53 | i believe that the new end-to-end approaches have the

0:48:57 | potential to be quite good

0:48:59 | at generalizing

0:49:00 | and this is basically based on the |

0:49:03 | paper that i mentioned that actually |

0:49:05 | is not really end-to-end

0:49:07 | but |

0:49:07 | almost |

0:49:09 | and it works |

0:49:11 | quite well |

0:49:12 | surprisingly well in terms of calibration across conditions on unseen conditions |

0:49:18 | so i think it's doable |

0:49:22 | maybe if we were to push in this direction as a community, then maybe we could reduce or even

0:49:27 | eliminate it

0:49:28 | if we're very optimistic

0:49:30 | the performance difference between the two scenarios, so maybe we can end up with systems

0:49:35 | that

0:49:36 | are not so dependent on having development data

0:49:40 | and perhaps even having development data won't help much, i don't know, over

0:49:47 | the out of the box system

0:49:51 | so what would it entail to develop for this no-development-data scenario

0:49:56 | first of all

0:49:57 | we have to assume that we will need heterogeneous data for training of course because |

0:50:02 | if you train a system on telephone data, it is

0:50:05 | quite unlikely that it will generalize to

0:50:07 | any other condition

0:50:11 | the second thing is one has to hold out

0:50:15 | some sets

0:50:16 | at least

0:50:18 | during development that are not used for

0:50:20 | hyperparameter tuning

0:50:22 | because otherwise they would not be completely unseen

0:50:26 | so these sets have to be really

0:50:30 | held out until the very end, until you just evaluate the system out of the

0:50:34 | box as in this scenario that we are imagining |

0:50:38 | and of course one needs to report actual metrics and not just minimum ones, because in this

0:50:42 | case you cannot assume that you're gonna be able to do calibration well

0:50:46 | you need to test whether the model

0:50:49 | as it stands

0:50:51 | is actually giving you

0:50:53 | good calibration out of the box

0:50:56 | and finally it's probably a good idea to also report metrics

0:51:01 | on sub-conditions of the set

0:51:03 | because |

0:51:04 | the miscalibration issues within the sub-conditions may be hidden

0:51:09 | within the full distribution of the whole set; they compensate each other sometimes

0:51:15 | and reporting

0:51:19 | metrics on sub-conditions

0:51:21 | both actual and minimum, is something that can actually tell you

0:51:24 | if there's a calibration problem

0:51:27 | okay |

0:51:28 | thank you very much for listening and i'm looking forward to your questions in the |

0:51:32 | next session |