0:00:15 Okay, so in this paper I'm going to compare some linear and some nonlinear calibration functions.
0:00:25 There's a list of previous papers; all of them used linear calibration, and we did various interesting things. Every time I was doing that work, it became evident that linear calibration has limits, so let's explore what we can do with nonlinear calibration.
0:00:46 In this paper we're going to use lots of data for training the calibration; in that case a plug-in solution works well. We have an upcoming Interspeech paper where we use tiny amounts of training data, and there a Bayesian solution is the interesting thing to do.
0:01:10 Just a reminder of why we calibrate: a speaker recognizer is not useful unless it actually does something, so we want to make decisions, and it's nice if you can make those decisions minimum-expected-cost Bayes decisions. If you take the raw scores out of the recognizer, they don't make good decisions, so we like to calibrate them so that we can make cost-effective decisions.
0:01:40 For years we've done pretty well with linear calibration, so why complicate things? Well, the good thing about linear calibration is simplicity: it's easy to do.
0:01:57 Then there's the problem of overfitting. Linear calibration has very few parameters, so it doesn't overfit that easily. That's not a problem to be underestimated: even if you have lots of data, if you work at an extreme operating point the error rates become very low, so your effective data is the errors, not the speech samples or trials. If you only have a few errors, you're going to have overfitting problems.
0:02:23 And another thing: linear calibration is monotonic; as the score increases, the log-likelihood ratio increases. If you do something nonlinear, you might be in a situation where the score increases but the log-likelihood ratio decreases, and I don't think even all the saunas in Finland can help us decide whether we want that kind of thing or not.
0:02:50 The limitation of linear methods is that they don't look at all operating points at the same time. You have to choose: at what cost ratio, at what target prior, do we want to operate? And then you have to tailor your training objective function to make that work. Why is this a problem? You cannot always know in advance where you're going to want your system to work, especially if you're dealing with unsupervised data.
0:03:33 The nonlinear methods can be accurate over a wider range of operating points; you don't need to do so much gymnastics with your training objective function. But the nonlinear methods are considerably more complex to train: I had to go and find out a lot of things about Bessel functions and how to compute them. And they are more vulnerable to overfitting; with more complex functions, more things can go wrong.
0:04:12 So we'll compare various flavours of discriminative and generative linear calibrations, and the same for nonlinear ones. The conclusion is going to be that there is some benefit to the nonlinear ones when you have to cover a wide range of operating points.
0:04:35 First I'll describe the linear ones. It's linear because we take the score that comes out of the system, we scale it by some factor a and we shift it by some constant b. If we use Gaussian score distributions, you have two distributions, one for targets and one for non-targets; you have a target mean and a non-target mean, and then you share the variance between the two distributions. If you have separate variances, you get a quadratic function, so in the linear case we're sharing that sigma. That gives a linear generative calibration. Or you can be discriminative: in that case your probabilistic model is just the formula at the top of the slide, and you directly train those parameters by minimizing cross-entropy.
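The generative linear recipe just described can be sketched in a few lines of plain Python (the function name is hypothetical, and this is an illustrative sketch rather than the paper's implementation): fit the two class means and a pooled variance, then read off the scale and offset of the resulting log-likelihood-ratio.

```python
def fit_linear_gaussian_calibration(tar_scores, non_scores):
    """Fit target and non-target Gaussians with a shared (pooled) variance.

    Because the variance is shared, the quadratic terms of the two log
    densities cancel and the log-likelihood-ratio is linear in the score:
    llr(s) = a*s + b.
    """
    mu_t = sum(tar_scores) / len(tar_scores)
    mu_n = sum(non_scores) / len(non_scores)
    # pooled variance, shared by both class-conditional distributions
    ss = sum((s - mu_t) ** 2 for s in tar_scores)
    ss += sum((s - mu_n) ** 2 for s in non_scores)
    var = ss / (len(tar_scores) + len(non_scores))
    a = (mu_t - mu_n) / var                    # scaling factor
    b = (mu_n ** 2 - mu_t ** 2) / (2 * var)    # shifting constant
    return a, b

a, b = fit_linear_gaussian_calibration([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
# the llr crosses zero halfway between the two means, as it should
assert abs(a * 1.5 + b) < 1e-9
```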
0:05:37 As I said, we have to tune the objective function to make it work at a specific operating point. What we basically do is weight the target trials and the non-target trials, or if you want, the miss errors and the false-alarm errors, by factors of alpha and one minus alpha. But alpha is also a training parameter, so when you train, you first have to select your operating point.
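This alpha-weighting can be sketched as follows; this is a generic weighted-cross-entropy objective in the usual Cllr style, not necessarily the paper's exact formula, and the function name is hypothetical:

```python
import math

def weighted_cross_entropy(llr_tar, llr_non, alpha):
    """Alpha-weighted logistic objective for calibration (lower is better).

    Miss errors (target trials) are weighted by alpha and false alarms
    (non-target trials) by 1 - alpha; the same alpha acts as the effective
    target prior, shifting each llr by its log odds.
    """
    offset = math.log(alpha / (1.0 - alpha))  # prior log odds
    miss = sum(math.log1p(math.exp(-(l + offset))) for l in llr_tar)
    fa = sum(math.log1p(math.exp(l + offset)) for l in llr_non)
    return alpha * miss / len(llr_tar) + (1.0 - alpha) * fa / len(llr_non)
```

Training the linear parameters a and b then means minimizing this objective over `a*score + b`, with alpha fixed to the chosen operating point.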
0:06:07 Let's see how that all works out. I'll present experimental results: first the linear stuff, then the nonlinear. A simple experimental setup: an i-vector system.
0:06:20 We trained the calibrations on a huge amount of scores, about forty million, and we tested on SRE'12, which was described earlier today, with about nine million scores. Our evaluation criterion is the same one that was used in the NIST evaluation, the very well known DCF, or if you want, the Bayes error rate. And it's normalized by the performance of a default system that doesn't look at the scores and just makes decisions by the prior.
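That normalized error rate can be sketched like this (hypothetical helper name, plain Python): make the Bayes decision at the threshold implied by the target prior, then divide the resulting cost by the cost of the prior-only default system.

```python
import math

def normalized_dcf(llr_tar, llr_non, prior):
    """Bayes error rate at a target prior, normalized by the default system.

    A well-calibrated llr is thresholded at -logit(prior); the default
    system ignores the scores and always takes the cheaper decision,
    which costs min(prior, 1 - prior).
    """
    thr = -math.log(prior / (1.0 - prior))
    p_miss = sum(1 for l in llr_tar if l <= thr) / len(llr_tar)
    p_fa = sum(1 for l in llr_non if l > thr) / len(llr_non)
    return (prior * p_miss + (1.0 - prior) * p_fa) / min(prior, 1.0 - prior)
```

Sweeping `prior` over a range of values gives exactly the kind of curve shown on the slides: normalized DCF as a function of the operating point.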
0:07:05 This is the result of the Gaussian calibration. What we're looking at: on the vertical axis is the DCF, the error rate; lower is better. The horizontal axis is your operating point, or your target prior, on a log-odds scale, so zero would be a prior of a half; negative means small priors, positive means large priors. The dashed line is what you would know as minimum DCF: the best you can do if the evaluator sets the threshold for you at every single operating point.
0:07:52 We trained the system using three different values of the training weighting parameter alpha. Alpha much smaller than one means we're in the George Doddington region, the low false-alarm-rate region: false alarms are more important, so you weight them more. If you do that, you do well in the region where you want to do well, but on the other side, you'll see, the red curve suffers. If we set the parameter to a half, it does badly almost everywhere. If you set the parameter to the other side, almost one, you get the reverse: on that side it's bad, on this side it's good; that's the blue curve.
0:08:41 That was generative; let's move to discriminative. The picture is slightly better. This is the usual pattern: if you have lots of data, discriminative outperforms generative. But still, we don't do as well as we might like over all operating points, so let's see what the nonlinear methods will do.
0:09:10 The PAV algorithm, also sometimes called isotonic regression, is a very interesting algorithm. We allow as calibration function any monotonic rising function, and the optimisation procedure essentially selects, for every single score, what the function is going to map it to, so it's nonparametric. And the very interesting thing is, we don't have to choose which objective we actually want to optimize: this function class is rich enough that it just optimizes all of them, so you get that automatically; all your objective functions are optimized at all operating points on the training data.
0:10:07 If you go to the test data, you see that over a wide range of operating points it does work pretty well, but at the extreme negative end we do have a slight problem, and I attribute that to overfitting. This thing has forty-two million parameters; it's nonparametric, so the parameters grow with the data. But there are also forty-two million inequality constraints, and that makes it behave, mostly, except there where we run out of errors and it stops behaving.
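The core of PAV (pool adjacent violators) can be sketched in a few lines. Given the trial labels (1 for target, 0 for non-target) sorted by score, it returns the closest non-decreasing sequence, which serves as a monotone posterior estimate per score. This is a sketch of the textbook algorithm, not the implementation used for the paper:

```python
def pav(values):
    """Pool adjacent violators: the non-decreasing sequence closest to
    `values` in squared error (the fit isotonic regression computes)."""
    blocks = []  # each block is [sum, count]; block means must be rising
    for v in values:
        blocks.append([v, 1])
        # merge backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

# binary labels sorted by score -> monotone per-score posterior estimates
assert pav([0, 1, 0, 1, 1]) == [0, 0.5, 0.5, 1, 1]
```

Each merged block corresponds to a run of scores that get the same output value, which is where the "one parameter per score, constrained by inequalities" picture comes from.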
0:10:45 Now we go to the generative version of nonlinear. As I mentioned before, if you just allow the target distribution and the non-target distribution to have separate variances, we get a nonlinear, a quadratic, calibration function. And then I also applied a Student's t-distribution, and an even more general distribution, the normal inverse Gaussian, the NIG. The important thing is: we go up from the Gaussian, which just has a mean and a variance, a location and a scale; the t and NIG distributions can also control the tail thickness; and the final one additionally has skewness. So we will see what these extra parameters give us, what their effectiveness is.
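The first of these, the separate-variance case, can be written down directly (hypothetical function name, plain Python): the quadratic terms of the two log densities no longer cancel, so the calibration map is quadratic in the score.

```python
import math

def quadratic_gaussian_llr(s, mu_t, var_t, mu_n, var_n):
    """Log-likelihood-ratio of two Gaussians with separate variances.

    Expanding the log densities leaves a quadratic in s; when
    var_t == var_n, the quadratic term cancels and this reduces to the
    shared-variance linear calibration.
    """
    def log_gauss(x, mu, var):
        # log density of N(mu, var) at x
        return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

    return log_gauss(s, mu_t, var_t) - log_gauss(s, mu_n, var_n)
```

The t and NIG versions follow the same pattern with heavier-tailed (and, for NIG, skewed) log densities in place of `log_gauss`.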
0:11:37 This picture is much better than the previous ones; all of them are better. If we have to choose between them, the blue one, the most complex one, does the best. But the Gaussian one does pretty well too, and the Gaussian one is a lot faster and a lot easier to use. So if you don't want to bother with Bessel functions and complex optimization algorithms, that's a good option; you can read in the paper how to optimize the NIG one.
0:12:16 What is interesting is the t-distribution: its complexity is between the other two, so why doesn't it behave like that? You would expect the green one to be between the red and the blue. My explanation is that it's sort of abusing its ability to adjust the tail thickness; but it's symmetric, so what it's seeing at the one tail it's trying to apply to the other tail. I think there's a complex mixture of overfitting and underfitting at work here.
0:12:57 Let me just quickly summarise the results. This table gives all the calibration solutions. The red ones are the linear ones, with two or three parameters; they underfit. The PAV has forty-two million parameters; there's some overfitting. And then the blue ones, the nonlinear parametric ones, do a lot better, and the most complex one works the best.
0:13:26 I'll just show these plots again so you can see the improvement: the generative linear one, the discriminative linear one, the nonlinear PAV, and the nonlinear generative ones.
0:13:43 In conclusion: linear calibration suffers from underfitting; we can manage that by focusing on a specific operating point. Nonlinear calibrations don't have the underfitting problem, but you have to watch out for overfitting. Again, that can be managed: you can regularize, as you would with other machine learning techniques, or you can use Bayesian methods.
0:14:19 So that's my story. Any questions?
0:14:39 Can I ask a double question? Do you think that these conclusions hold for other kinds of systems and other kinds of data? Do you have any experience, or was this only a PLDA i-vector system?
0:14:57 Yes; regrettably, I only did it on that system, on the one database. 0:15:08 I wouldn't like to speculate until it's been tested on other data as well.