0:00:15 | okay so

0:00:17 | in this paper i'm going to compare some linear and some non

0:00:22 | linear calibration functions

0:00:25 | so |

0:00:25 | there's a list of previous papers |

0:00:28 | all of them used linear calibration; we did various interesting things

0:00:33 | but |

0:00:34 | when i was doing that work, every time it became evident

0:00:38 | that linear calibration has limits, so let's explore what we can do with nonlinear

0:00:43 | calibration

0:00:45 | so |

0:00:46 | in this paper |

0:00:47 | we're going to use lots of data for training the calibration; in that case a

0:00:52 | plug-in solution

0:00:55 | works well

0:00:57 | we have an upcoming interspeech paper where we use tiny amounts of training

0:01:01 | data, and there a bayesian solution

0:01:06 | is an interesting thing to do |

0:01:09 | so |

0:01:10 | just a reminder of why calibration

0:01:13 | a speaker recognizer is not useful unless it actually does something

0:01:19 | so we want to make decisions and it's nice if you can make those decisions

0:01:24 | be |

0:01:25 | minimum expected cost bayes decisions |

0:01:28 | if you take the raw scores out of the recognizer

0:01:30 | they don't make good decisions, so we like to calibrate them

0:01:34 | so that we can make good cost-effective decisions

0:01:39 | so |

0:01:40 | for years we've done pretty well with

0:01:42 | linear calibration

0:01:44 | so why complicate things

0:01:48 | well

0:01:49 | the good thing about linear calibration is simplicity

0:01:53 | it's easy to do |

0:01:56 | and |

0:01:57 | then there's the problem of overfitting |

0:02:00 | linear calibration has very few parameters, so it doesn't overfit that easily

0:02:05 | overfitting is not a problem that should be underestimated: even if you have lots of data, if

0:02:09 | you work at an extreme operating point

0:02:11 | the error rates become very low, so your effective data is actually the errors, not

0:02:15 | the speech samples or trials; if you don't have enough errors, you're going to have

0:02:21 | overfitting problems

0:02:23 | and another thing

0:02:24 | linear calibration is monotonic

0:02:27 | as the score increases, the log-likelihood ratio increases

0:02:30 | if you do something nonlinear |

0:02:32 | you might be in a situation |

0:02:34 | where

0:02:36 | the score increases but the log-likelihood ratio decreases

0:02:40 | and i don't think even all the saunas in finland can help us ascertain

0:02:43 | whether we want that kind of thing or not

0:02:48 | so |

0:02:50 | the limitation of linear methods |

0:02:52 | is |

0:02:53 | that

0:02:55 | they don't look at all operating points at the same time |

0:02:59 | you have to choose |

0:03:02 | where |

0:03:03 | do we want to operate: what cost ratio, what prior for the target do we

0:03:08 | want to work at

0:03:09 | and then you have to tailor your training objective function to your operating point

0:03:17 | to make that work

0:03:21 | so that |

0:03:21 | why is this a problem |

0:03:23 | you cannot always know in advance where you're gonna want your system to work |

0:03:27 | and especially if you're dealing with unsupervised data

0:03:32 | so |

0:03:33 | the nonlinear methods

0:03:36 | can be

0:03:37 | accurate over a wider range of operating points

0:03:41 | then |

0:03:41 | you don't need to do this so much gymnastics with your |

0:03:47 | training objective function |

0:03:51 | so |

0:03:52 | the nonlinear methods are considerably more complex to train; i had to go and

0:03:56 | find out a lot of things about bessel functions and how to compute the

0:04:00 | derivatives |

0:04:03 | and they are more vulnerable to overfitting

0:04:06 | more complex functions more things can go wrong |

0:04:12 | so we'll compare

0:04:15 | various flavours |

0:04:17 | discriminative and generative linear calibrations |

0:04:19 | and the same for non linear ones |

0:04:22 | and the conclusion is going to be that there is some benefit to the nonlinear ones

0:04:27 | when you want to cover a wide range of operating points

0:04:34 | let's

0:04:35 | first describe the linear ones

0:04:38 | so |

0:04:39 | it's linear because we take the score that comes out of the system |

0:04:43 | we scale it by some factor a and we shift it by some constant b

0:04:48 | and then |

0:04:50 | if we use

0:04:52 | gaussian score distributions, you have two distributions, one for targets, one for non-targets

0:04:59 | and

0:05:01 | you have a target mean and a non-target mean, and then you share the

0:05:06 | variance between the two distributions

0:05:09 | if you have separate variances you get a quadratic function

0:05:14 | so in the linear case we're sharing that sigma

0:05:19 | so that |

0:05:20 | gives a linear generative calibration |
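the shared-variance gaussian recipe described above can be sketched in a few lines. this is an illustrative sketch, not the paper's code; the function name and the use of numpy are my own:

```python
import numpy as np

def train_gaussian_linear_calibration(tar_scores, non_scores):
    """Fit Gaussians with a shared variance to target and non-target
    scores; the resulting log-likelihood-ratio is affine: llr = a*s + b."""
    tar = np.asarray(tar_scores, dtype=float)
    non = np.asarray(non_scores, dtype=float)
    mu_t, mu_n = tar.mean(), non.mean()
    # pooled (shared) variance -- sharing sigma is what keeps the LLR linear
    var = (((tar - mu_t) ** 2).sum() + ((non - mu_n) ** 2).sum()) / (tar.size + non.size)
    a = (mu_t - mu_n) / var                   # slope of the LLR
    b = (mu_n ** 2 - mu_t ** 2) / (2.0 * var)  # offset of the LLR
    return a, b
```

the coefficients follow directly from subtracting the two gaussian log-densities; the quadratic terms in the score cancel because sigma is shared.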

0:05:23 | or you can be discriminative: in that case your probabilistic model is just the

0:05:28 | formula at the top of the slide, and you directly train those parameters by minimizing

0:05:32 | cross entropy |
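a minimal sketch of the discriminative variant: plain logistic regression on the score, trained by minimizing cross-entropy. the gradient-descent optimizer and names are illustrative assumptions; a real system would use a proper convex optimizer:

```python
import numpy as np

def train_logistic_calibration(tar_scores, non_scores, n_iter=2000, lr=0.5):
    """Fit (a, b) of llr = a*s + b by minimizing the cross-entropy of
    sigmoid(a*s + b) against trial labels (1 = target, 0 = non-target)."""
    s = np.concatenate([np.asarray(tar_scores, float),
                        np.asarray(non_scores, float)])
    y = np.concatenate([np.ones(len(tar_scores)), np.zeros(len(non_scores))])
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))  # posterior for 'target'
        a -= lr * np.mean((p - y) * s)          # gradient step on the scale
        b -= lr * np.mean(p - y)                # gradient step on the offset
    return a, b
```

with equal numbers of target and non-target trials, the fitted posterior log-odds coincide with the calibrated log-likelihood-ratio.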

0:05:36 | so |

0:05:37 | i said |

0:05:39 | we have to tailor the objective function to make it work at a specific

0:05:42 | operating point |

0:05:44 | so |

0:05:44 | what we basically do

0:05:46 | is we weight

0:05:47 | the target trials and the non-target trials, or if you want, the miss errors and

0:05:52 | the false alarm errors

0:05:53 | by factors of alpha and one minus alpha

0:05:57 | and alpha is also a training parameter, so when you train, you first have to

0:06:01 | select your operating point
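one common form of such an alpha-weighted cross-entropy objective looks roughly like this. it is a sketch under my own assumptions; the exact weighting and normalization used in the paper may differ:

```python
import numpy as np

def weighted_cross_entropy(llr_tar, llr_non, alpha):
    """Cross-entropy calibration objective focused on one operating point:
    miss errors (targets) are weighted by alpha, false alarms by 1 - alpha."""
    # np.logaddexp(0, x) == log(1 + exp(x)), computed stably
    log_odds = np.log(alpha / (1.0 - alpha))  # effective prior log-odds
    miss = np.mean(np.logaddexp(0.0, -(np.asarray(llr_tar, float) + log_odds)))
    fa = np.mean(np.logaddexp(0.0, np.asarray(llr_non, float) + log_odds))
    return alpha * miss + (1.0 - alpha) * fa
```

small alpha pushes the objective toward the low-false-alarm region; alpha near one does the reverse.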

0:06:05 | so |

0:06:07 | let's see how that all works out; i'll present

0:06:11 | experimental results, first the linear stuff, then the nonlinear

0:06:14 | so |

0:06:16 | standard experimental setup, an i-vector system

0:06:20 | and |

0:06:20 | we trained |

0:06:23 | the calibrations on a huge number of scores, about forty million scores, and we

0:06:28 | tested them

0:06:29 | on sre twelve, which was described earlier today

0:06:34 | with about nine million scores

0:06:36 | and our evaluation criterion

0:06:39 | is the same one that was used in the nist evaluation

0:06:43 | the very well known dcf, or if you want, the bayes error rate

0:06:49 | and it's normalized

0:06:51 | by

0:06:53 | the performance of

0:06:55 | a default system that doesn't look at the scores and just makes decisions by the prior

0:06:59 | alone
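a sketch of this normalized bayes error rate / dcf computation. it is illustrative: the nist definition uses explicit costs, which are folded into the effective target prior here, and the function name is my own:

```python
import numpy as np

def normalized_dcf(llr_tar, llr_non, target_prior):
    """Bayes error rate of the LLRs at the given effective target prior,
    normalized by a default system that decides from the prior alone."""
    # Bayes decision threshold on log-likelihood-ratios
    threshold = -np.log(target_prior / (1.0 - target_prior))
    p_miss = np.mean(np.asarray(llr_tar, float) < threshold)
    p_fa = np.mean(np.asarray(llr_non, float) >= threshold)
    dcf = target_prior * p_miss + (1.0 - target_prior) * p_fa
    # the default system always accepts or always rejects, whichever is cheaper
    return dcf / min(target_prior, 1.0 - target_prior)
```

a value of one means the scores are no more useful than the prior alone; well-calibrated scores give values below one.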

0:07:02 | so |

0:07:05 | this is the |

0:07:06 | result of the |

0:07:09 | gaussian calibration |

0:07:13 | what we're looking at: on the vertical axis is the dcf or the error

0:07:17 | rate, lower is better

0:07:19 | on the

0:07:20 | horizontal axis is your operating point, or your

0:07:26 | target prior, on the log-odds scale

0:07:28 | so zero would be a target prior of a half

0:07:31 | negative

0:07:33 | small priors, positive large priors

0:07:36 | and |

0:07:38 | the |

0:07:38 | dashed line |

0:07:39 | is what you would know as minimum dcf

0:07:43 | the best you can do

0:07:45 | if the evaluator sets the threshold at every single operating point

0:07:50 | so |

0:07:52 | we trained |

0:07:54 | the system using three different |

0:07:57 | values |

0:07:58 | for |

0:07:59 | the training weighting parameter alpha

0:08:03 | so alpha much smaller than one means

0:08:06 | we're in the george doddington region

0:08:08 | of low false alarm rates

0:08:11 | false alarms are more important, so you weight them more

0:08:14 | if you do that |

0:08:16 | you do well in the region that you want to do well but on the |

0:08:19 | other side you will see the red curve suffers |

0:08:22 | if |

0:08:23 | we set the parameter to a half

0:08:26 | it does badly almost everywhere

0:08:29 | if you set the parameter to the others other side almost one |

0:08:33 | you get the reverse |

0:08:35 | on that side it's bad, on this side

0:08:37 | it's good; that's the blue curve

0:08:40 | so |

0:08:41 | this was generative; let's move to discriminative

0:08:45 | so |

0:08:46 | the picture is slightly better |

0:08:50 | this is the usual pattern: if you have lots of data, discriminative

0:08:55 | outperforms the generative a bit

0:08:57 | but still |

0:08:58 | we don't do as well as we might like to over all operating points |

0:09:03 | so let's see what the non linear methods will do |

0:09:07 | so |

0:09:10 | the pav algorithm, also sometimes called isotonic regression

0:09:15 | is a very interesting algorithm |

0:09:18 | we allow

0:09:19 | for

0:09:21 | the calibration function any monotonic rising function

0:09:26 | and |

0:09:27 | then |

0:09:29 | there's |

0:09:31 | an optimisation procedure

0:09:33 | which essentially selects, for every single score,

0:09:36 | what the function is going to map it to, so it's nonparametric

0:09:42 | and the very interesting thing is |

0:09:44 | we don't have to choose |

0:09:47 | which objective |

0:09:49 | we actually want to optimize this function class is rich enough that it actually just |

0:09:55 | optimizes all of them so |

0:09:57 | you get that automatically |

0:10:00 | all your objective functions optimized at all operating points on the training data |
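the core of pav can be sketched as follows: sort trials by score, then repeatedly pool adjacent blocks whose means violate the monotonic order. this is an illustrative least-squares pav, not the authors' implementation:

```python
def pav(values):
    """Pool-adjacent-violators: least-squares monotone (non-decreasing) fit.
    Applied to 0/1 trial labels sorted by score, the pooled block means are
    monotone posterior estimates for 'target'."""
    blocks = []  # each block: [sum, count]; block mean = sum / count
    for v in values:
        blocks.append([float(v), 1])
        # merge backwards while block means violate non-decreasing order
        # (compare means via cross-multiplication to avoid rounding)
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s2, n2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
    out = []
    for s, n in blocks:
        out.extend([s / n] * n)
    return out
```

each input score inherits the mean of the block it ends up in, which is what makes the solution nonparametric: the number of free values grows with the data.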

0:10:07 | if you going to the test data |

0:10:10 | you see |

0:10:11 | over a wide range of operating points it does work pretty well |

0:10:15 | but at the extreme negative end we do have a slight problem

0:10:22 | i attribute that to overfitting

0:10:24 | so |

0:10:25 | this thing has forty-two million parameters; it's nonparametric, the parameters grow with the

0:10:30 | data

0:10:32 | but there are also forty-two million inequality constraints

0:10:35 | that makes it behave, mostly

0:10:38 | except where we run out of errors and it stops behaving

0:10:44 | so |

0:10:45 | now we go to the generative |

0:10:48 | version of nonlinear |

0:10:51 | and |

0:10:52 | as i mentioned before |

0:10:53 | if you allow

0:10:55 | the target distribution and the non-target distribution

0:10:58 | to have separate variances, we get a nonlinear, quadratic

0:11:03 | calibration function |
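a sketch of that quadratic calibration, assuming maximum-likelihood gaussian fits with separate class variances; the function name is my own:

```python
import numpy as np

def quadratic_gaussian_llr(s, tar_scores, non_scores):
    """Gaussian calibration with separate class variances: the resulting
    log-likelihood-ratio is quadratic in the score s."""
    s = np.asarray(s, dtype=float)
    mu_t, var_t = np.mean(tar_scores), np.var(tar_scores)
    mu_n, var_n = np.mean(non_scores), np.var(non_scores)
    # difference of the two Gaussian log-densities
    return (0.5 * np.log(var_n / var_t)
            - (s - mu_t) ** 2 / (2.0 * var_t)
            + (s - mu_n) ** 2 / (2.0 * var_n))
```

when the two variances happen to be equal, the quadratic terms cancel and this collapses back to the linear calibration.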

0:11:05 | and then |

0:11:06 | we also applied a student's t-distribution

0:11:09 | and an even more general distribution, the normal inverse gaussian

0:11:14 | but the important thing is

0:11:17 | we go from the gaussian, which just has a mean and a variance, a

0:11:21 | location and a scale

0:11:22 | to distributions that can also control the tail thickness

0:11:26 | and then the final one adds skewness, so we will see what the

0:11:31 | extra parameters do, what their effect is

0:11:36 | so |

0:11:37 | this picture |

0:11:40 | is much better than the previous ones

0:11:43 | all of them are better

0:11:45 | if we have to choose between them |

0:11:47 | the blue one the most complex one |

0:11:50 | does the best |

0:11:52 | but the gaussian one |

0:11:54 | does pretty well, and the gaussian one is a lot faster and a lot easier

0:11:58 | to use |

0:12:00 | so |

0:12:02 | maybe you don't want to bother with bessel functions

0:12:06 | and complex optimization algorithms

0:12:09 | you can read in the paper how to optimize

0:12:12 | the nig one

0:12:16 | what is interesting is |

0:12:18 | the t distribution |

0:12:20 | its complexity is between the other two, so why is it

0:12:24 | worse

0:12:27 | you wouldn't expect it to be like that: the green one we would expect to

0:12:30 | be between the red and the blue |

0:12:33 | so my explanation is |

0:12:37 | it's sort of abusing its

0:12:39 | ability to adjust the tail thickness

0:12:42 | but it's symmetric

0:12:43 | so what it's seeing at the one tail it's trying to apply to the other tail

0:12:47 | so

0:12:48 | i think it's sort of a complex mixture of overfitting and underfitting that we're seeing here

0:12:56 | so |

0:12:57 | let's just quickly summarise the results

0:12:59 | this table gives

0:13:01 | all the calibration solutions |

0:13:03 | the red ones of the linear ones |

0:13:05 | two or three parameters |

0:13:06 | they underfit

0:13:08 | but the pav

0:13:10 | has forty-two million parameters

0:13:12 | and |

0:13:13 | there's some overfitting |

0:13:15 | and then the blue ones, the nonlinear generative ones

0:13:17 | they do a lot better |

0:13:19 | and |

0:13:20 | the most complex one |

0:13:22 | works the best

0:13:26 | i'll just show these plots again |

0:13:27 | so you can see how we improve

0:13:30 | from

0:13:31 | the generative linear one

0:13:33 | to the discriminative

0:13:35 | to the nonlinear

0:13:36 | nonparametric, and the nonlinear

0:13:38 | parametric

0:13:43 | in conclusion |

0:13:45 | the linear |

0:13:48 | calibration |

0:13:50 | suffers from underfitting

0:13:51 | but |

0:13:52 | we can manage that |

0:13:54 | by focusing on a specific operating point |

0:13:58 | non linear calibrations |

0:14:01 | don't have |

0:14:02 | the underfitting problem, but you have to

0:14:05 | watch out for over fitting |

0:14:07 | but again that can be managed: you can regularize as you would do with other

0:14:11 | machine learning techniques, or

0:14:13 | you can |

0:14:14 | you can use bayesian methods

0:14:19 | so that's my story |

0:14:21 | any questions

0:14:39 | can i ask a question

0:14:43 | do you think that these conclusions hold for other kinds of systems and other kind |

0:14:49 | of data have any experience |

0:14:53 | or was this only a plda i-vector system

0:14:57 | yes, you're correct, i only did it on that system, on

0:15:02 | the one database

0:15:08 | i would like to speculate that they do

0:15:13 | but it would have to be tested on other data as well