| 0:00:15 | okay so |
|---|
| 0:00:17 | in this paper i'm going to compare some linear and some non |
|---|
| 0:00:22 | linear calibration functions |
|---|
| 0:00:25 | so |
|---|
| 0:00:25 | there's a list of previous papers |
|---|
| 0:00:28 | all of them used linear calibration and we did various interesting things |
|---|
| 0:00:33 | but |
|---|
| 0:00:34 | when i was doing that work it became evident every time |
|---|
| 0:00:38 | that linear calibration has limits, so let's explore what we can do with non linear |
|---|
| 0:00:43 | calibration |
|---|
| 0:00:45 | so |
|---|
| 0:00:46 | in this paper |
|---|
| 0:00:47 | we're going to use lots of data for training the calibration, and in that case a |
|---|
| 0:00:52 | plug-in solution |
|---|
| 0:00:55 | works well |
|---|
| 0:00:57 | we have an upcoming interspeech paper where we use tiny amounts of training |
|---|
| 0:01:01 | data, and there a bayesian solution |
|---|
| 0:01:06 | is an interesting thing to do |
|---|
| 0:01:09 | so |
|---|
| 0:01:10 | just a reminder of why we calibrate |
|---|
| 0:01:13 | a speaker recognizer is not useful unless it actually does something |
|---|
| 0:01:19 | so we want to make decisions, and it's nice if you can make those decisions |
|---|
| 0:01:24 | be |
|---|
| 0:01:25 | minimum expected cost bayes decisions |
|---|
| 0:01:28 | if you take the raw scores out of the recognizer |
|---|
| 0:01:30 | they don't make good decisions, so we like to calibrate them |
|---|
| 0:01:34 | so that we can make good cost effective decisions |
|---|
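to make the decision rule concrete, here is a standard formulation (the notation for the target prior and costs is assumed here, not quoted from the slides): the minimum-expected-cost bayes decision accepts the target hypothesis whenever the calibrated log-likelihood ratio clears a threshold set by the prior and the costs,

$$\ell(s) \;\ge\; \log\frac{C_{\mathrm{fa}}\,(1-\pi_{\mathrm{tar}})}{C_{\mathrm{miss}}\,\pi_{\mathrm{tar}}}$$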
| 0:01:39 | so |
|---|
| 0:01:40 | for years we've done pretty well with |
|---|
| 0:01:42 | linear calibration |
|---|
| 0:01:44 | why complicate things? |
|---|
| 0:01:48 | well |
|---|
| 0:01:49 | the good thing about linear calibration is simplicity |
|---|
| 0:01:53 | it's easy to do |
|---|
| 0:01:56 | and |
|---|
| 0:01:57 | then there's the problem of overfitting |
|---|
| 0:02:00 | linear calibration has very few parameters so it doesn't overfit that easily |
|---|
| 0:02:05 | that's not a problem that should be underestimated: even if you have lots of data, if |
|---|
| 0:02:09 | you work at an extreme operating point |
|---|
| 0:02:11 | the error rates become very low, so your effective data is really the errors, not |
|---|
| 0:02:15 | the speech samples or trials, and with only a handful of errors you're going to have over |
|---|
| 0:02:21 | fitting problems |
|---|
| 0:02:23 | and another thing: |
|---|
| 0:02:24 | the linear calibration is monotonic |
|---|
| 0:02:27 | as the score increases, the log-likelihood ratio increases |
|---|
| 0:02:30 | if you do something nonlinear |
|---|
| 0:02:32 | you might be in a situation |
|---|
| 0:02:34 | where |
|---|
| 0:02:36 | the score increases but the log likelihood ratio decreases |
|---|
| 0:02:40 | and i don't think even all the saunas in finland can help us be |
|---|
| 0:02:43 | certain whether we want that kind of thing or not |
|---|
| 0:02:48 | so |
|---|
| 0:02:50 | the limitation of linear methods |
|---|
| 0:02:52 | is |
|---|
| 0:02:53 | that |
|---|
| 0:02:55 | they don't look at all operating points at the same time |
|---|
| 0:02:59 | you have to choose |
|---|
| 0:03:02 | where |
|---|
| 0:03:03 | do we want to operate, what cost ratio, what prior for the target do we |
|---|
| 0:03:08 | want to work at |
|---|
| 0:03:09 | and then you have to tailor your training objective function to your operating point |
|---|
| 0:03:17 | to make that work |
|---|
| 0:03:21 | so |
|---|
| 0:03:21 | why is this a problem |
|---|
| 0:03:23 | you cannot always know in advance where you're gonna want your system to work |
|---|
| 0:03:27 | and especially if you're dealing with unsupervised data |
|---|
| 0:03:32 | so |
|---|
| 0:03:33 | the non linear methods |
|---|
| 0:03:36 | can be accurate |
|---|
| 0:03:37 | over a wider range of operating points |
|---|
| 0:03:41 | then |
|---|
| 0:03:41 | you don't need to do so much gymnastics with your |
|---|
| 0:03:47 | training objective function |
|---|
| 0:03:51 | so |
|---|
| 0:03:52 | the non linear methods are considerably more complex to train: i had to go and |
|---|
| 0:03:56 | find out a lot of things about bessel functions and how to compute the |
|---|
| 0:04:00 | derivatives |
|---|
| 0:04:03 | and they are more vulnerable to overfitting: |
|---|
| 0:04:06 | more complex functions more things can go wrong |
|---|
| 0:04:12 | so we'll compare |
|---|
| 0:04:15 | various flavours |
|---|
| 0:04:17 | discriminative and generative linear calibrations |
|---|
| 0:04:19 | and the same for non linear ones |
|---|
| 0:04:22 | and our conclusion is going to be that there is some benefit to the nonlinear ones |
|---|
| 0:04:27 | if you want to cover a wide range of operating points |
|---|
| 0:04:34 | let's |
|---|
| 0:04:35 | first describe the linear ones |
|---|
| 0:04:38 | so |
|---|
| 0:04:39 | it's linear because we take the score that comes out of the system |
|---|
| 0:04:43 | we scale it by some factor a and we shift it by some constant b |
|---|
| 0:04:48 | and then |
|---|
| 0:04:50 | if we use |
|---|
| 0:04:52 | gaussian score distributions, you have two distributions, one for targets and one for non-targets |
|---|
| 0:04:59 | and |
|---|
| 0:05:01 | you have a target mean and a non-target mean, and then you share the |
|---|
| 0:05:06 | variance between the two distributions |
|---|
| 0:05:09 | if you have separate variances you get a quadratic function |
|---|
| 0:05:14 | so in the linear case we are sharing that sigma |
|---|
| 0:05:19 | so that |
|---|
| 0:05:20 | gives a linear generative calibration |
|---|
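spelled out, under the assumptions just described (shared variance, a target mean and a non-target mean; this is the standard derivation, with the notation my own): the gaussian log-likelihood ratio is affine in the score,

$$\ell(s)=\log\frac{\mathcal{N}(s;\mu_{\mathrm{tar}},\sigma^{2})}{\mathcal{N}(s;\mu_{\mathrm{non}},\sigma^{2})}=\underbrace{\frac{\mu_{\mathrm{tar}}-\mu_{\mathrm{non}}}{\sigma^{2}}}_{a}\,s+\underbrace{\frac{\mu_{\mathrm{non}}^{2}-\mu_{\mathrm{tar}}^{2}}{2\sigma^{2}}}_{b}$$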
| 0:05:23 | or you can be discriminative, in which case your probabilistic model is just the |
|---|
| 0:05:28 | formula at the top of the slide and you directly train those parameters by minimizing |
|---|
| 0:05:32 | cross entropy |
|---|
| 0:05:36 | so |
|---|
| 0:05:37 | i said |
|---|
| 0:05:39 | we have to tailor the objective function to make it work at a specific |
|---|
| 0:05:42 | operating point |
|---|
| 0:05:44 | so |
|---|
| 0:05:44 | what we basically do |
|---|
| 0:05:46 | is we weight |
|---|
| 0:05:47 | the target trials and the non-target trials, or if you want, the miss errors and |
|---|
| 0:05:52 | the false alarm errors |
|---|
| 0:05:53 | by factors of alpha and one minus alpha |
|---|
| 0:05:57 | alpha is also one of the training parameters, so when you train, first you have to |
|---|
| 0:06:01 | select your operating point |
|---|
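a minimal sketch of such an alpha-weighted discriminative linear calibration, assuming you have arrays of target and non-target scores; this is my own illustration with scipy, not the authors' code, and it omits the prior-offset bookkeeping a full implementation would include:

```python
import numpy as np
from scipy.optimize import minimize

def train_linear_calibration(tar_scores, non_scores, alpha=0.5):
    """Fit llr = a*s + b by minimizing an alpha-weighted cross-entropy:
    miss errors are weighted by alpha, false alarms by (1 - alpha)."""
    def objective(params):
        a, b = params
        llr_tar = a * tar_scores + b                    # calibrated target trials
        llr_non = a * non_scores + b                    # calibrated non-target trials
        c_miss = np.mean(np.logaddexp(0.0, -llr_tar))   # -log sigmoid(llr), softplus form
        c_fa = np.mean(np.logaddexp(0.0, llr_non))      # -log sigmoid(-llr)
        return alpha * c_miss + (1.0 - alpha) * c_fa
    res = minimize(objective, x0=np.array([1.0, 0.0]), method="BFGS")
    return res.x  # (a, b)
```

setting alpha close to zero or one reproduces the operating-point weighting described above.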
| 0:06:05 | so |
|---|
| 0:06:07 | let's see how that all works out; i'll present |
|---|
| 0:06:11 | experimental results, first the linear stuff, then we'll look at the nonlinear |
|---|
| 0:06:14 | so |
|---|
| 0:06:16 | simple experimental setup, an i-vector system |
|---|
| 0:06:20 | and |
|---|
| 0:06:20 | we trained |
|---|
| 0:06:23 | the calibrations on a huge number of scores, over forty million scores, and we |
|---|
| 0:06:28 | tested them |
|---|
| 0:06:29 | on the sre twelve which was described earlier today |
|---|
| 0:06:34 | that's about nine million scores |
|---|
| 0:06:36 | and our evaluation criterion |
|---|
| 0:06:39 | is the same one that was used in the nist evaluation |
|---|
| 0:06:43 | the very well known dcf, or if you want, the bayes error rate |
|---|
| 0:06:49 | and it's normalized |
|---|
| 0:06:51 | by |
|---|
| 0:06:53 | the performance of |
|---|
| 0:06:55 | a default system that doesn't look at the scores and just makes decisions by the prior |
|---|
| 0:06:59 | alone |
|---|
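for reference, a standard way of writing that metric (notation assumed, not quoted from the talk): at a target prior π the bayes error rate, or dcf, and its normalized version are

$$\mathrm{DCF}(\pi)=\pi\,P_{\mathrm{miss}}(\pi)+(1-\pi)\,P_{\mathrm{fa}}(\pi),\qquad \mathrm{DCF}_{\mathrm{norm}}(\pi)=\frac{\mathrm{DCF}(\pi)}{\min(\pi,\,1-\pi)}$$

where the denominator is the cost of the default system that ignores the scores and decides from the prior alone.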
| 0:07:02 | so |
|---|
| 0:07:05 | this is the |
|---|
| 0:07:06 | result of the |
|---|
| 0:07:09 | gaussian calibration |
|---|
| 0:07:13 | what we're looking at: on the vertical axis is the dcf or the error |
|---|
| 0:07:17 | rate lower is better |
|---|
| 0:07:19 | on the |
|---|
| 0:07:20 | horizontal axis is your operating point, or your |
|---|
| 0:07:26 | target prior, on the log-odds scale |
|---|
| 0:07:28 | so zero would be a prior of a half |
|---|
| 0:07:31 | negative |
|---|
| 0:07:33 | means small priors, positive means large priors |
|---|
| 0:07:36 | and |
|---|
| 0:07:38 | the |
|---|
| 0:07:38 | dashed line |
|---|
| 0:07:39 | is what you would know as minimum dcf |
|---|
| 0:07:43 | what is the best you can do |
|---|
| 0:07:45 | if the evaluator sets the threshold at every single operating point |
|---|
| 0:07:50 | so |
|---|
| 0:07:52 | we trained |
|---|
| 0:07:54 | the system using three different |
|---|
| 0:07:57 | values |
|---|
| 0:07:58 | for |
|---|
| 0:07:59 | the training weighting parameter alpha |
|---|
| 0:08:03 | so alpha much smaller than one means |
|---|
| 0:08:06 | we're in the george doddington region |
|---|
| 0:08:08 | of low false alarm rates |
|---|
| 0:08:11 | false alarms are more important so you weight them more |
|---|
| 0:08:14 | if you do that |
|---|
| 0:08:16 | you do well in the region where you want to do well, but on the |
|---|
| 0:08:19 | other side you will see the red curve suffers |
|---|
| 0:08:22 | if |
|---|
| 0:08:23 | we set the parameter to a half |
|---|
| 0:08:26 | it does badly almost everywhere |
|---|
| 0:08:29 | if you set the parameter to the other side, almost one |
|---|
| 0:08:33 | you get the reverse |
|---|
| 0:08:35 | on the one side it's bad, on the other side |
|---|
| 0:08:37 | it's good, that's the blue curve |
|---|
| 0:08:40 | so |
|---|
| 0:08:41 | this was generative, let's move to discriminative |
|---|
| 0:08:45 | so |
|---|
| 0:08:46 | the picture is slightly better |
|---|
| 0:08:50 | this is the usual pattern: if you have lots of data, discriminative |
|---|
| 0:08:55 | outperforms the generative a bit |
|---|
| 0:08:57 | but still |
|---|
| 0:08:58 | we don't do as well as we might like to over all operating points |
|---|
| 0:09:03 | so let's see what the non linear methods will do |
|---|
| 0:09:07 | so |
|---|
| 0:09:10 | the pav algorithm, also sometimes called isotonic regression |
|---|
| 0:09:15 | is a very interesting algorithm |
|---|
| 0:09:18 | we allow |
|---|
| 0:09:19 | as |
|---|
| 0:09:21 | the calibration function any monotonically rising function |
|---|
| 0:09:26 | and |
|---|
| 0:09:27 | then |
|---|
| 0:09:29 | there's |
|---|
| 0:09:31 | an optimisation procedure |
|---|
| 0:09:33 | which essentially selects for every single score |
|---|
| 0:09:36 | what the function is going to map it to, so it's nonparametric |
|---|
| 0:09:42 | and the very interesting thing is |
|---|
| 0:09:44 | we don't have to choose |
|---|
| 0:09:47 | which objective |
|---|
| 0:09:49 | we actually want to optimize: this function class is rich enough that it just |
|---|
| 0:09:55 | optimizes all of them so |
|---|
| 0:09:57 | you get that automatically |
|---|
| 0:10:00 | all your objective functions are optimized at all operating points on the training data |
|---|
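a minimal sketch of this kind of monotonic (pav / isotonic) calibration, using scikit-learn's IsotonicRegression as a stand-in; the function names, the library choice and the prior used in the posterior-to-llr conversion are my assumptions, not details from the paper:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def pav_calibration(tar_scores, non_scores):
    """Fit a monotone non-decreasing map from raw score to P(target | score)
    via isotonic regression (the pool-adjacent-violators algorithm)."""
    scores = np.concatenate([tar_scores, non_scores])
    labels = np.concatenate([np.ones_like(tar_scores), np.zeros_like(non_scores)])
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, labels)
    return iso

def posterior_to_llr(posterior, p_tar=0.5, eps=1e-6):
    """Convert the isotonic posterior back to a log-likelihood ratio, removing
    the (assumed) training prior p_tar; clipping avoids infinite llrs at 0/1."""
    p = np.clip(posterior, eps, 1.0 - eps)
    return np.log(p / (1.0 - p)) - np.log(p_tar / (1.0 - p_tar))
```

the clipping is one crude guard against the infinite llrs that appear where the training data runs out of errors.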
| 0:10:07 | if you go to the test data |
|---|
| 0:10:10 | you see |
|---|
| 0:10:11 | over a wide range of operating points it does work pretty well |
|---|
| 0:10:15 | but at the extreme negative end we do have a slight problem |
|---|
| 0:10:22 | i attribute that to overfitting |
|---|
| 0:10:24 | so |
|---|
| 0:10:25 | this thing has forty-two million parameters; it's non-parametric, the parameters grow with the |
|---|
| 0:10:30 | data |
|---|
| 0:10:32 | but there are also forty-two million inequality constraints |
|---|
| 0:10:35 | that makes it behave, mostly |
|---|
| 0:10:38 | except where we run out of errors, and then it stops behaving |
|---|
| 0:10:44 | so |
|---|
| 0:10:45 | now we go to the generative |
|---|
| 0:10:48 | version of nonlinear |
|---|
| 0:10:51 | and |
|---|
| 0:10:52 | as i mentioned before |
|---|
| 0:10:53 | if you just allow |
|---|
| 0:10:55 | the target distribution and the non-target distribution |
|---|
| 0:10:58 | to have separate variances, we get a nonlinear, quadratic |
|---|
| 0:11:03 | calibration function |
|---|
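spelled out (again a standard derivation, with the notation my own): with separate variances the log-likelihood ratio becomes

$$\ell(s)=\log\frac{\mathcal{N}(s;\mu_{\mathrm{tar}},\sigma_{\mathrm{tar}}^{2})}{\mathcal{N}(s;\mu_{\mathrm{non}},\sigma_{\mathrm{non}}^{2})}=\log\frac{\sigma_{\mathrm{non}}}{\sigma_{\mathrm{tar}}}-\frac{(s-\mu_{\mathrm{tar}})^{2}}{2\sigma_{\mathrm{tar}}^{2}}+\frac{(s-\mu_{\mathrm{non}})^{2}}{2\sigma_{\mathrm{non}}^{2}}$$

which is quadratic in the score and collapses to the affine form shown earlier when the two variances are equal.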
| 0:11:05 | and then |
|---|
| 0:11:06 | we also applied a student's t-distribution |
|---|
| 0:11:09 | and an even more general distribution, the normal inverse gaussian |
|---|
| 0:11:14 | but the important thing is |
|---|
| 0:11:17 | as we go from one to the next: the gaussian just has a mean and a variance, a |
|---|
| 0:11:21 | location and a scale |
|---|
| 0:11:22 | the t distribution can also control the tail thickness |
|---|
| 0:11:26 | and then the final one also has skewness, so we will see what |
|---|
| 0:11:31 | these extra parameters give us, what their effect is |
|---|
| 0:11:36 | so |
|---|
| 0:11:37 | this picture |
|---|
| 0:11:40 | much better than the previous ones |
|---|
| 0:11:43 | all of them are better |
|---|
| 0:11:45 | if we have to choose between them |
|---|
| 0:11:47 | the blue one the most complex one |
|---|
| 0:11:50 | does the best |
|---|
| 0:11:52 | but the gaussian one |
|---|
| 0:11:54 | does pretty well, and the gaussian one is a lot faster and a lot easier |
|---|
| 0:11:58 | to use |
|---|
| 0:12:00 | so |
|---|
| 0:12:02 | by the way, if you don't want to bother with bessel functions |
|---|
| 0:12:06 | and complex optimization algorithms |
|---|
| 0:12:09 | you can read in the bible of pattern recognition how to optimize |
|---|
| 0:12:12 | the gaussian one |
|---|
| 0:12:16 | what is interesting is |
|---|
| 0:12:18 | the t distribution |
|---|
| 0:12:20 | is in complexity between the other two, so why is it |
|---|
| 0:12:24 | the worst |
|---|
| 0:12:27 | you would not expect it to be like that: the green one we would expect to |
|---|
| 0:12:30 | be between the red and the blue |
|---|
| 0:12:33 | so my explanation is |
|---|
| 0:12:37 | it's sort of abusing its |
|---|
| 0:12:39 | ability to adjust the tail thickness |
|---|
| 0:12:42 | but it's symmetric |
|---|
| 0:12:43 | so what it's seeing at the one tail it's trying to apply to the other tail |
|---|
| 0:12:47 | so |
|---|
| 0:12:48 | i think it's sort of a complex mixture of overfitting and underfitting that you're seeing here |
|---|
| 0:12:56 | so |
|---|
| 0:12:57 | let's just quickly summarise the results |
|---|
| 0:12:59 | this table gives |
|---|
| 0:13:01 | all the calibration solutions |
|---|
| 0:13:03 | the red ones are the linear ones |
|---|
| 0:13:05 | two or three parameters |
|---|
| 0:13:06 | they underfit |
|---|
| 0:13:08 | but the pav |
|---|
| 0:13:10 | has forty two million parameters |
|---|
| 0:13:12 | and |
|---|
| 0:13:13 | there's some overfitting |
|---|
| 0:13:15 | and then the blue ones, the nonlinear ones |
|---|
| 0:13:17 | they do a lot better |
|---|
| 0:13:19 | and |
|---|
| 0:13:20 | the most complex one |
|---|
| 0:13:22 | works the best |
|---|
| 0:13:26 | i'll just show these plots again |
|---|
| 0:13:27 | so you can see how we improve |
|---|
| 0:13:30 | from |
|---|
| 0:13:31 | the generative linear one |
|---|
| 0:13:33 | to the discriminative |
|---|
| 0:13:35 | to the nonlinear |
|---|
| 0:13:36 | and the nonlinear |
|---|
| 0:13:38 | parametric |
|---|
| 0:13:43 | in conclusion |
|---|
| 0:13:45 | the linear |
|---|
| 0:13:48 | calibration |
|---|
| 0:13:50 | suffers from underfitting |
|---|
| 0:13:51 | but |
|---|
| 0:13:52 | we can manage that |
|---|
| 0:13:54 | by focusing on a specific operating point |
|---|
| 0:13:58 | non linear calibrations |
|---|
| 0:14:01 | don't have |
|---|
| 0:14:02 | the underfitting problem but you have to |
|---|
| 0:14:05 | watch out for overfitting |
|---|
| 0:14:07 | but again that can be managed: you can regularize as you would do with |
|---|
| 0:14:11 | machine learning techniques or |
|---|
| 0:14:13 | you can |
|---|
| 0:14:14 | you can use bayesian methods |
|---|
| 0:14:19 | so that's my story |
|---|
| 0:14:21 | any questions |
|---|
| 0:14:39 | can i ask a double question |
|---|
| 0:14:43 | do you think that these conclusions hold for other kinds of systems and other kinds |
|---|
| 0:14:49 | of data, do you have any experience |
|---|
| 0:14:53 | or was this only a plda i-vector system |
|---|
| 0:14:57 | yes, correct, i indeed only did it on that system, on |
|---|
| 0:15:02 | the one database |
|---|
| 0:15:08 | i wouldn't like to speculate |
|---|
| 0:15:13 | until it has been tested on other data as well |
|---|