Okay, so in this paper I'm going to compare some linear and some non-linear calibration functions.

There's a list of previous papers; all of them used linear calibration, and we did various interesting things with it. But every time I was doing that work, it became evident that linear calibration has limits, so let's explore what we can do with non-linear calibration.

So in this paper we're going to use lots of data for training the calibration; in that case a plug-in solution works well. We also have an upcoming Interspeech paper where we use tiny amounts of training data, and there a Bayesian solution is the interesting thing to do.

Just a reminder of why we calibrate. A speaker recognizer is not useful unless it actually does something, so we want to make decisions, and it's nice if those decisions can be minimum-expected-cost Bayes decisions. If you take the raw scores out of the recognizer, they don't make good decisions, so we like to calibrate them so that we can make cost-effective decisions.
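To make that concrete, here is a minimal sketch (the name `bayes_decision` and the unit default costs are illustrative assumptions, not from the talk): a calibrated score is treated as a log-likelihood ratio and compared against the Bayes threshold set by the prior and the costs.

```python
import numpy as np

def bayes_decision(llr, p_target, c_miss=1.0, c_fa=1.0):
    """Minimum-expected-cost Bayes decision from a calibrated log-likelihood ratio.

    Accept (decide 'target') when llr exceeds the Bayes threshold
    eta = log(c_fa * (1 - p_target)) - log(c_miss * p_target).
    """
    eta = np.log(c_fa * (1.0 - p_target)) - np.log(c_miss * p_target)
    return llr > eta
```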

For years we've done pretty well with linear calibration, so why complicate things? The good thing about linear calibration is its simplicity: it's easy to do.

And then there's the problem of overfitting. Linear calibration has very few parameters, so it doesn't overfit that easily. That's not a problem that should be underestimated: even if you have lots of data, if you work at an extreme operating point the error rates become very low, so your effective data is really the errors, not the speech samples or the trials, and without errors you're going to have overfitting problems.

And another thing: linear calibration is monotonic; as the score increases, the log-likelihood ratio increases. If you do something non-linear, you might be in a situation where the score increases but the log-likelihood ratio decreases, and I don't think even all the saunas in Finland can help us be certain whether we want that kind of thing or not.

So the limitation of linear methods is that they don't look at all operating points at the same time. You have to choose where you want to operate: what cost ratio, what prior for the target you want to work at, and then you have to tailor your training objective function to make that work. Why is this a problem? You cannot always know in advance where you're going to want your system to work, especially if you're dealing with unsupervised data.

The non-linear methods, on the other hand, can be accurate over a wider range of operating points, so you don't need to do so much gymnastics with your training objective function.

The non-linear methods are considerably more complex to train (I had to go and find out a lot of things about Bessel functions and how to compute their derivatives), and they are more vulnerable to overfitting: more complex functions, more things can go wrong.

So we'll compare various flavours: discriminative and generative linear calibrations, and the same for non-linear ones. The conclusion is going to be that there is some benefit to the non-linear ones if you want to cover a wide range of operating points.

Let me first describe the linear ones. The calibration is linear because we take the score that comes out of the system, scale it by some factor a, and shift it by some constant b. If we use Gaussian score distributions, we have two distributions, one for targets and one for non-targets: there is a target mean, a non-target mean, and a variance that is shared between the two distributions. If you allow separate variances you get a quadratic function, so in the linear case we share that sigma. That gives the linear generative calibration.
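A minimal sketch of that generative linear recipe (the function name is illustrative, and the variance pooling is a simple unweighted average for the sketch):

```python
import numpy as np

def train_gaussian_linear_calibration(tar_scores, non_scores):
    """Gaussian score models with a shared variance give an affine LLR:
    llr(s) = a*s + b, with a = (mu_tar - mu_non)/var and
    b = (mu_non**2 - mu_tar**2) / (2*var)."""
    mu_tar, mu_non = np.mean(tar_scores), np.mean(non_scores)
    var = 0.5 * (np.var(tar_scores) + np.var(non_scores))  # shared (pooled) variance
    a = (mu_tar - mu_non) / var
    b = (mu_non ** 2 - mu_tar ** 2) / (2.0 * var)
    return a, b  # calibrated llr = a * score + b
```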

Or you can be discriminative: in that case your probabilistic model is just the formula at the top of the slide, and you train those parameters directly by minimizing cross-entropy.

As I said, we have to adjust the objective function to make it work at a specific operating point. What we basically do is weight the target trials and the non-target trials, or if you want, the miss errors and the false-alarm errors, by factors of alpha and one minus alpha. Alpha is a parameter of the training, so when you train you first have to select your operating point.
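A sketch of that weighted discriminative training, under the assumption that it is implemented as prior-weighted logistic regression with a logit(alpha) offset (a common convention, not necessarily the exact recipe in the paper); the names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def train_weighted_linear_calibration(tar, non, alpha=0.5):
    """Discriminative linear calibration llr(s) = a*s + b, trained by minimizing
    a cross-entropy in which misses are weighted by alpha and false alarms
    by (1 - alpha)."""
    offset = np.log(alpha) - np.log(1.0 - alpha)  # shifts the optimum to the chosen operating point

    def objective(params):
        a, b = params
        c_miss = np.mean(np.logaddexp(0.0, -(a * tar + b + offset)))  # target trials
        c_fa = np.mean(np.logaddexp(0.0, a * non + b + offset))       # non-target trials
        return alpha * c_miss + (1.0 - alpha) * c_fa

    return minimize(objective, x0=np.array([1.0, 0.0]), method="Nelder-Mead").x
```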

Let's see how all of that works out. I'll present experimental results, first for the linear stuff and then for the non-linear.

The experimental setup is standard: an i-vector system. We trained the calibrations on a huge number of scores, about forty million, and we tested on SRE'12, which was described earlier today, with about nine million scores.

Our evaluation criterion is the same one that was used in the NIST evaluation: the very well known DCF, or if you want, the Bayes error rate, normalized by the performance of a default system that doesn't look at the scores and just makes decisions from the prior alone.
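As a sketch, assuming unit costs and the Bayes decision threshold from before (names are illustrative), the normalized DCF can be computed like this:

```python
import numpy as np

def normalized_dcf(tar_llr, non_llr, p_target):
    """Actual DCF at the Bayes threshold, normalized by the cost of a default
    system that ignores the scores and decides from the prior alone."""
    eta = -np.log(p_target / (1.0 - p_target))   # Bayes threshold for unit costs
    p_miss = np.mean(tar_llr <= eta)             # miss rate on target trials
    p_fa = np.mean(non_llr > eta)                # false-alarm rate on non-target trials
    dcf = p_target * p_miss + (1.0 - p_target) * p_fa
    return dcf / min(p_target, 1.0 - p_target)   # default-system normalization
```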

This is the result of the Gaussian calibration. What we're looking at: the vertical axis is the DCF, or the error rate, and lower is better; the horizontal axis is your operating point, or your target prior, on the log-odds scale. Zero would be a prior of a half, negative means small priors, positive means large priors. The dashed line is what you would know as minimum DCF: the best you can do if the evaluator sets the threshold at every single operating point.
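For comparison with the actual DCF above, here is a minimal (deliberately naive, quadratic-time) sketch of minimum DCF, where the threshold is chosen with hindsight at each operating point:

```python
import numpy as np

def min_dcf(tar_llr, non_llr, p_target):
    """Minimum DCF: the best threshold is picked at this operating point."""
    candidates = np.concatenate([tar_llr, non_llr, [-np.inf, np.inf]])
    best = np.inf
    for eta in candidates:
        p_miss = np.mean(tar_llr <= eta)
        p_fa = np.mean(non_llr > eta)
        best = min(best, p_target * p_miss + (1.0 - p_target) * p_fa)
    return best / min(p_target, 1.0 - p_target)
```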

We trained the system using three different values of the training weighting parameter alpha. An alpha much smaller than one means that, in the George Doddington region of low false-alarm rates, false alarms are more important, so you weight them more. If you do that, you do well in the region where you want to do well, but on the other side, as you can see, the red curve suffers. If we set the parameter to a half, it does badly almost everywhere. If you set the parameter to the other side, almost one, you get the reverse: on that side it's bad, on this side it's good; that's the blue curve.

This was generative; now let's move to discriminative. The picture is slightly better. This is the usual pattern: if you have lots of data, discriminative outperforms generative a bit. But still, we don't do as well as we might like over all operating points, so let's see what the non-linear methods will do.

The PAV algorithm, also sometimes called isotonic regression, is a very interesting algorithm. We allow as calibration function any monotonically rising function, and there's an optimization procedure which essentially selects, for every single score, what the function is going to map it to, so it's non-parametric. The very interesting thing is that we don't have to choose which objective we actually want to optimize: this function class is rich enough that it simply optimizes all of them, so you get that automatically; all your objective functions are optimized at all operating points on the training data.
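A minimal textbook sketch of pool-adjacent-violators (an off-the-shelf alternative is sklearn.isotonic.IsotonicRegression): labels are taken in score order, and the monotone fit gives posterior estimates that can then be turned into log-likelihood ratios.

```python
import numpy as np

def pav(y, w=None):
    """Pool adjacent violators: the monotone non-decreasing least-squares fit to y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block: [pooled value, total weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:  # violation: pool
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(n, v) for v, _, n in blocks])

# usage sketch: sort trials by score, fit the labels (1 = target, 0 = non-target),
# then convert the fitted posteriors to LLRs by subtracting the prior log-odds.
```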

If you go to the test data, you see that over a wide range of operating points it does work pretty well, but at the extreme negative end we do have a slight problem, which I attribute to overfitting. This thing has forty-two million parameters; it's non-parametric, so the parameters grow with the data. But there are also forty-two million inequality constraints, and that makes it behave, mostly, except there where we run out of errors and it stops behaving.

Now we go to the generative version of non-linear calibration. As I mentioned before, if you allow the target distribution and the non-target distribution to have separate variances, we get a non-linear, quadratic calibration function.
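A minimal sketch of that quadratic generative calibration (illustrative names; maximum-likelihood Gaussian fits per class with separate variances):

```python
import numpy as np

def quadratic_gaussian_calibration(tar_scores, non_scores):
    """Separate-variance Gaussians per class; the resulting LLR is quadratic in the score."""
    mu_t, var_t = np.mean(tar_scores), np.var(tar_scores)
    mu_n, var_n = np.mean(non_scores), np.var(non_scores)

    def llr(s):
        log_tar = -0.5 * (np.log(2 * np.pi * var_t) + (s - mu_t) ** 2 / var_t)
        log_non = -0.5 * (np.log(2 * np.pi * var_n) + (s - mu_n) ** 2 / var_n)
        return log_tar - log_non

    return llr
```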

We also applied a Student's t-distribution, and an even more general distribution, the normal inverse Gaussian. I won't go into the details, but the important thing is this: we go from the Gaussian, which has just a mean and a variance, a location and a scale, to distributions that can also control the tail thickness, and the final one adds skewness as well. So we will see what these extra parameters do, what their effect is.
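As a sketch of the heavier-tailed variants, here is the same idea with per-class Student's t fits via scipy's generic maximum-likelihood fit (the paper's own optimization, and the normal-inverse-Gaussian case available as scipy.stats.norminvgauss, are not reproduced here):

```python
from scipy import stats

def t_calibration(tar_scores, non_scores):
    """Per-class Student's t fits; the degrees of freedom control the tail thickness."""
    df_t, loc_t, scale_t = stats.t.fit(tar_scores)
    df_n, loc_n, scale_n = stats.t.fit(non_scores)

    def llr(s):
        return (stats.t.logpdf(s, df_t, loc=loc_t, scale=scale_t)
                - stats.t.logpdf(s, df_n, loc=loc_n, scale=scale_n))

    return llr
```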

This picture is much better than the previous ones; all of them are better. If we have to choose between them, the blue one, the most complex one, does the best, but the Gaussian one does pretty well, and the Gaussian one is a lot faster and a lot easier to use.

So if you don't want to work out the Bessel functions and the complex optimization algorithms yourself, you can read in the paper how to optimize the NIG one.

What is interesting is the t-distribution: its complexity sits between the other two, so why is it worse? You would expect it to follow suit; the green curve we would expect to lie between the red and the blue. My explanation is that it's sort of abusing its ability to adjust the tail thickness: it's symmetric, so what it sees at the one tail it tries to apply to the other tail. I think it's a complex mixture of overfitting and underfitting that we're seeing here.

Let me just quickly summarize the results. This table gives all the calibration solutions. The red ones are the linear ones, with two or three parameters; they underfit. The PAV has forty-two million parameters, and there's some overfitting. And then the blue ones, the parametric non-linear ones, do a lot better, and the most complex one works the best.

I'll just show these plots again, so you can see how we improve from the generative linear one, to the discriminative one, to the non-linear non-parametric one, and to the non-linear parametric ones.

In conclusion: linear calibration suffers from underfitting, but we can manage that by focusing on a specific operating point. Non-linear calibrations don't have the underfitting problem, but you have to watch out for overfitting. That, again, can be managed: you can regularize, as you would with other machine learning techniques, or you can use Bayesian methods.

So that's my story. Any questions?

Just a quick question: do you think that these conclusions hold for other kinds of systems and other kinds of data? Do you have any experience, or was this only the PLDA i-vector system?

Yes, as you noted, I only did it on that system, on the one database. I would not like to speculate until we've tested on other data as well.