Okay, so in this paper I'm going to compare some linear and some non-linear calibration functions.

There's a list of previous papers. All of them used linear calibration, and we did various interesting things, but every time I was doing that work it became evident that linear calibration has limits, so let's explore what we can do with non-linear calibration.

In this paper we're going to use lots of data for training the calibration; in that case a plug-in solution works well. We have an upcoming Interspeech paper where we use tiny amounts of training data, and there a Bayesian solution is an interesting thing to do.
Just a reminder of why we calibrate: a speaker recognizer is not useful unless it actually does something, so we want to make decisions, and it's nice if you can make those decisions minimum-expected-cost Bayes decisions. If you take the raw scores out of the recognizer, they don't make good decisions, so we like to calibrate them so that we can make cost-effective decisions.
For years we've done pretty well with linear calibration, so why complicate things? Well, the good thing about linear calibration is simplicity: it's easy to do. Then there's the problem of overfitting: linear calibration has very few parameters, so it doesn't overfit that easily. That's not a problem that should be underestimated. Even if you have lots of data, if you work at an extreme operating point the error rates become very low, so your effective data is actually the errors, not the speech samples or trials, and if you don't have enough errors you're going to have overfitting problems.

And one more thing: linear calibration is monotonic. As the score increases, the log-likelihood ratio increases. If you do something non-linear, you might be in a situation where the score increases but the log-likelihood ratio decreases, and I don't think even all the saunas in Finland can help us decide whether we want that kind of thing or not.
The limitation of linear methods is that they don't look at all operating points at the same time. You have to choose where you want to operate: what cost ratio, what prior for the target do you want to work at? And then you have to tailor your training objective function to that operating point. Why is this a problem? You cannot always know in advance where you're going to want your system to work, especially if you're dealing with unsupervised data.
Non-linear methods can be accurate over a wider range of operating points, so you don't need to do as much gymnastics with your training objective function. On the other hand, non-linear methods are considerably more complex to train; I had to go and find out a lot of things about Bessel functions and how to compute their derivatives. And they are more vulnerable to overfitting: more complex functions, more things that can go wrong.

So we'll compare various flavours of discriminative and generative linear calibrations, and the same for non-linear ones. The conclusion is going to be that there is some benefit to the non-linear ones if you want to cover a wide range of operating points.
Let's first describe the linear ones. It's linear because we take the score that comes out of the system, scale it by some factor a and shift it by some constant b. If we use Gaussian score distributions, you have two distributions, one for targets and one for non-targets: a target mean and a non-target mean, and then you share the variance between the two distributions. If you have separate variances instead, you get a quadratic function, so in the linear case we're sharing that sigma. That gives a linear generative calibration. Or you could be discriminative: in that case your probabilistic model is just the formula at the top of the slide, and you directly train those parameters by minimizing cross-entropy.
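As a minimal sketch of the generative linear recipe (my own illustration, not code from the paper): with a shared variance, the log-likelihood ratio of the two Gaussians reduces to an affine function a * s + b of the score.

```python
import numpy as np

def fit_linear_gaussian_calibration(tar_scores, non_scores):
    """Generative linear calibration: Gaussian target and non-target
    score distributions with a pooled (shared) variance.
    Returns (a, b) such that llr(s) = a * s + b."""
    mu_t, mu_n = tar_scores.mean(), non_scores.mean()
    var = 0.5 * (tar_scores.var() + non_scores.var())  # shared sigma^2
    a = (mu_t - mu_n) / var
    b = (mu_n ** 2 - mu_t ** 2) / (2.0 * var)
    return a, b
```

The equal 50/50 pooling of the two class variances is an assumption of this sketch; one could also pool weighted by trial counts.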
As I said, we have to adapt the objective function to make it work at a specific operating point. What we basically do is weight the target trials and the non-target trials, or if you prefer, the miss errors and the false-alarm errors, by factors of alpha and one minus alpha. But alpha is also a training parameter, so before you train you have to select your operating point.
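The weighting idea can be sketched like this (my illustration; the paper's exact normalization may differ): alpha weights the miss side, one minus alpha weights the false-alarm side, and the matching prior log-odds shifts the operating point.

```python
import numpy as np

def weighted_cross_entropy(tar_llrs, non_llrs, alpha):
    """Cross-entropy of calibrated LLRs, with target (miss) errors
    weighted by alpha and non-target (false-alarm) errors by 1 - alpha."""
    logit = np.log(alpha / (1.0 - alpha))  # effective prior log-odds
    miss = np.mean(np.log1p(np.exp(-(tar_llrs + logit))))
    fa = np.mean(np.log1p(np.exp(non_llrs + logit)))
    return alpha * miss + (1.0 - alpha) * fa
```

At alpha = 0.5 this is the usual unweighted cross-entropy: perfectly calibrated but uninformative zero LLRs then score log 2.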
So let's see how that all works out. I'll present experimental results: first the linear stuff, then the non-linear.
The experimental setup: an i-vector system. We trained the calibrations on a huge number of scores, about forty million, and we tested on SRE'12, which was described earlier today, with about nine million scores. Our evaluation criterion is the same one that was used in the NIST evaluation, the very well-known DCF, or if you want, the Bayes error rate. It's normalized by the performance of a trivial system that doesn't look at the scores and just makes decisions by the prior alone.
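That normalization can be sketched as follows (my own illustration, with the miss and false-alarm costs folded into an effective target prior, as is common with the NIST DCF):

```python
import numpy as np

def normalized_dcf(tar_scores, non_scores, prior, threshold):
    """Detection cost at one operating point (effective target prior),
    normalized by a default system that ignores the scores and decides
    from the prior alone."""
    pmiss = np.mean(tar_scores < threshold)   # missed targets
    pfa = np.mean(non_scores >= threshold)    # false alarms
    dcf = prior * pmiss + (1.0 - prior) * pfa
    default = min(prior, 1.0 - prior)         # prior-only decisions
    return dcf / default
```

A normalized DCF of 1 means the scores are no more useful than deciding from the prior alone; lower is better.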
Here is the result of the Gaussian calibration. On the vertical axis is the DCF, the error rate; lower is better. The horizontal axis is your operating point, your target prior, on a log-odds scale: zero would be a prior of a half, negative means small priors, positive means large priors. The dashed line is what you would know as the minimum DCF: the best you can do if the evaluator sets the threshold at every single operating point.

We trained the system using three different values for the training weighting parameter alpha. An alpha much smaller than one means we're in the George Doddington region of low false-alarm rates: false alarms are more important, so you weight them more. If you do that, you do well in the region where you want to do well, but on the other side, as you can see, the red curve suffers. If we set the parameter to a half, it does badly almost everywhere. And if you set the parameter to the other extreme, almost one, you get the reverse: bad on that side, good on this side; that's the blue curve.
This was generative; let's move to discriminative. The picture is slightly better. This is the usual pattern: if you have lots of data, discriminative outperforms generative. But still, we don't do as well as we might like over all operating points, so let's see what the non-linear methods will do.
The PAV algorithm, also sometimes called isotonic regression, is a very interesting algorithm. We allow as calibration function any monotonically rising function, and then there's an optimisation procedure which essentially selects, for every single score, what the function is going to map it to, so it's non-parametric. And the very interesting thing is that we don't have to choose which objective we actually want to optimize: this function class is rich enough that it just optimizes all of them, so you get that automatically; all your objective functions are optimized at all operating points on the training data.

If you go to the test data, you see that over a wide range of operating points it does work pretty well, but at the extreme negative end we do have a slight problem, which I attribute to overfitting. This thing has forty-two million parameters: it's non-parametric, so the parameters grow with the data. But there are also forty-two million inequality constraints, and that makes it behave, mostly, except where we run out of errors and it stops behaving.
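The core of PAV can be sketched in a few lines (a minimal illustration, not the paper's implementation). Sorting the trials by score and running PAV on their 0/1 target labels yields monotone non-decreasing posterior estimates, which can then be converted to log-likelihood ratios.

```python
import numpy as np

def pav(y):
    """Pool Adjacent Violators: least-squares monotone non-decreasing
    fit to the sequence y. Adjacent blocks are merged (pooled to their
    mean) whenever they violate the monotonicity constraint."""
    blocks = []  # each block holds [sum, count]
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    out = []
    for s, n in blocks:
        out.extend([s / n] * n)
    return np.array(out)
```

For example, pav([3, 1, 2]) pools everything to [2, 2, 2], while pav([1, 3, 2]) only pools the violating pair, giving [1, 2.5, 2.5].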
Now we go to the generative version of non-linear. As I mentioned before, if you allow the target distribution and the non-target distribution to have separate variances, you get a non-linear, quadratic calibration function. Then we also applied a Student's t-distribution, and an even more general distribution, the normal inverse Gaussian (NIG). I won't go through the formulas, but the important thing is this: the Gaussian just has a mean and a variance, a location and a scale; the t-distribution can additionally control the tail thickness; and the final one adds skewness on top of that. So we will see what these extra parameters buy us.
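The quadratic case is easy to write down explicitly (a sketch of the idea, with my own parameter names): with separate variances, the squared terms in the two Gaussian log-densities no longer cancel, so the LLR is quadratic in the score.

```python
import numpy as np

def quadratic_gaussian_llr(s, mu_t, var_t, mu_n, var_n):
    """LLR of a score s under separate-variance Gaussian target and
    non-target distributions: quadratic in s unless var_t == var_n."""
    def log_gauss(x, mu, var):
        return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
    return log_gauss(s, mu_t, var_t) - log_gauss(s, mu_n, var_n)
```

With var_t == var_n this reduces to the shared-variance linear calibration from before.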
This picture is much better than the previous ones; all of the curves are better. If we have to choose between them, the blue one, the most complex one, does the best. But the Gaussian one does pretty well too, and it's a lot faster and a lot easier to use. So if you don't want to bother with Bessel functions and complex optimization algorithms, the Gaussian is a good choice; the details of how to optimize the NIG one are in the paper.

What is interesting is the t-distribution: its complexity is between the other two, so why doesn't it land in between? You would expect the green curve to lie between the red and the blue. My explanation is that it's sort of abusing its ability to adjust the tail thickness, but the distribution is symmetric, so what it sees at the one tail it tries to apply to the other tail as well. I think it's a complex mixture of overfitting and underfitting that we're seeing here.
Let me just quickly summarise the results. This table gives all the calibration solutions. The red ones are the linear ones, with two or three parameters; they underfit. The PAV has forty-two million parameters, and there's some overfitting. And then the blue ones, the parametric non-linear ones, do a lot better, and the most complex one works the best.

I'll just show these plots again, so you can see how we improve: from the generative linear one, to the discriminative linear one, to the non-parametric non-linear one, and finally to the parametric non-linear ones.
In conclusion: linear calibration suffers from underfitting, but we can manage that by focusing on a specific operating point. Non-linear calibrations don't have the underfitting problem, but you have to watch out for overfitting. That, again, can be managed: you can regularize, as you would with other machine-learning techniques, or you can use Bayesian methods.

So that's my story. Any questions?
Q: Let me ask the obvious question: do you think these conclusions hold for other kinds of systems and other kinds of data? Do you have any experience, or was this only a PLDA i-vector system?

A: Yes, as I recounted, I only did it on that system, on the one database. I wouldn't like to speculate until it's been tested on other data as well.