Thank you all for attending this talk. I'm a researcher at the Computer Science Institute, which is a joint unit of the University of Buenos Aires and CONICET in Argentina. Today I'll be talking about the issue of calibration in speaker verification, and hopefully by the end of the talk I will have convinced you that this is an important issue, if you were not already convinced.

The talk will be organized this way. First, I'm going to define calibration and give some intuition. Then I'll talk about why we should care about it, which is also related to how to measure it, and, if we find out that calibration is bad in a certain system, how to fix it. Finally, I'll talk about issues of robustness of calibration for speaker verification.

The main task on which I will base the examples is speaker verification, and I assume that the audience at this conference knows this task well, but just in case: it's a binary classification task where each sample is given by two waveforms, or two sets of waveforms, that we need to compare to decide whether they come from the same speaker or from different speakers. Since the task is binary classification, much of what I'm going to say applies to any binary classification task, not just speaker verification.

Okay, so what is calibration? Say we want to build a system that predicts the probability that it will rain within the next hour, based only on a picture of the sky. If we see this first picture, a clear sky, we would expect the system to output a low probability of rain, say 0.1, while for this second picture we would expect a much higher probability of rain, closer to one. We will say that the system is calibrated when the values output by the system coincide with what we see in the data. So a well-calibrated score should reflect the uncertainty of the system. To be concrete: if, for all the samples that get a score of 0.8 from the system, we find that 80% of them are labeled correctly, which is what a score of 0.8 means, then we will say that the system is well calibrated.

Here is an example of a diagram that is used in many tasks, not so much in speaker verification, but I think it's very intuitive for understanding calibration. It's called the reliability diagram. Basically, what it shows is the posteriors from a system that was run on certain data, the posteriors that the system gave for the class that it predicted. So, for example, for this bin we have all the samples for which the system gave a posterior between 0.8 and 0.9, and what the diagram shows is the accuracy on those samples. If the system were calibrated, we would expect the bars to lie on the diagonal, because what the system predicted would coincide with the accuracy we observe. In this specific case, what we actually see is that the system was correct more times than it thought it would be, which corresponds to a system that underestimates its confidence.
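
(As an aside, here is a minimal sketch of how such a reliability diagram can be computed; this is my own illustration, not code from the talk, and it assumes you have the winning-class posteriors and a correctness indicator for each sample.)

```python
import numpy as np

def reliability_diagram(confidences, correct, n_bins=10):
    """Bin winning-class posteriors and compare mean confidence with
    empirical accuracy per bin; a calibrated system lies on the diagonal."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each sample to one of n_bins equal-width bins in [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((b / n_bins, (b + 1) / n_bins,
                         confidences[mask].mean(),   # mean predicted confidence
                         correct[mask].mean(),       # empirical accuracy
                         int(mask.sum())))
    return rows
```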

I took this diagram from a paper from 2017, which actually studies the issue of calibration in different architectures. It compares, on CIFAR-100, which is an image classification task with one hundred classes, the plot that I already showed for a CNN (LeNet) from 1998 with a ResNet from 2016, and it shows that the new network is much worse calibrated than the old network. For the same bin I mentioned before, the new network actually has an accuracy much lower than what it predicts, by around 0.5. So this is an overconfident network: the ResNet thinks it will do much better than it actually does. On the other hand, the error rate of the new network is lower, so if you use this network to make decisions, the decisions will be better than with the old one, but the scores it outputs cannot be interpreted as posteriors; they cannot be interpreted as the certainty that the system has when it makes a decision.

So, this is actually a phenomenon that we see a lot in speaker recognition: basically, you can have a badly calibrated model that is still good at discriminating. The problem is that such a model might be useless in practice, depending on the scenario in which we plan to use it. As I already said, the scores from a miscalibrated system cannot be interpreted as the certainty that the system has in its decisions. Also, the scores cannot be used to make optimal decisions without having data on which to tune how to make those decisions. That's what I'm going to talk about in the next two sections.

So how do we make optimal decisions? In general, for binary classification, we define a cost function, and this is a very common cost function with very nice properties. It's a combination of two terms, one for each class, where the main part here is the probability of making an error for that class: this first one is the probability of deciding class 0 when the true class was 1. We multiply this probability of error by the prior for that class, class 1, and then we further multiply it by a cost, which is what we think it is going to cost us if we make this error. This is very specific to the application in which we're going to use the system. For the other class, the term is symmetric. So this is an expected cost, and the way to minimize this expected cost is to make the following decision: for a certain sample x, the decision should be class 1 if this factor is larger than this other factor, and class 0 otherwise. Each factor is composed of the cost, the prior, and the likelihood for the corresponding class. So we see here that what we need to make optimal decisions is the likelihood, p(x | c).
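
(For reference, the expressions on the slide have this standard form; the symbols C_{0|1}, C_{1|0}, P_1, P_0 are my own naming for the two error costs and the two priors.)

```latex
% Expected cost of a decision rule, with P_1, P_0 the class priors and
% C_{0|1}, C_{1|0} the costs of the two kinds of errors:
C \;=\; C_{0|1}\, P_1\, P(\hat{c}=0 \mid c=1) \;+\; C_{1|0}\, P_0\, P(\hat{c}=1 \mid c=0)

% Bayes decision rule that minimizes it: decide class 1 for a sample x iff
C_{0|1}\, P_1\, p(x \mid c=1) \;>\; C_{1|0}\, P_0\, p(x \mid c=0)
```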

Now, what likelihood do we have? What we learn with our model is the likelihood on the training data. That's why I use a separate notation here: in the cost, the probabilities are the ones we expect to see in testing, while what we actually have is what we saw in training. Say we train a generative model: then our model gives us this likelihood directly, but it will be the likelihood learned in training. And that's fine; in order to do anything at all in machine learning, we usually just assume that what we learn in training will generalize to testing. Now, we may not have the likelihood if we train a discriminative system. In that case we may have the posterior: discriminative systems trained, for example, with cross-entropy output posteriors. In that case, what we need to do is convert those posteriors into likelihoods, and for that we use Bayes' rule: basically, we multiply the posterior by p(x) and divide by the prior. Note that this is the prior in training; it is not the prior that appears in the cost, which is the one we expect to see in testing. And that's the whole point of why we use likelihoods and not posteriors to make these optimal decisions: it gives us the flexibility to separate the prior in training from the prior in testing.

Okay, so going back to the optimal decisions, we have this expression, and we can simplify it by defining the log-likelihood ratio, which I'm sure everybody here knows, since you work in speaker verification. It's basically the ratio between the likelihood for class 1 and the likelihood for class 0, and we take the logarithm because it's nicer to work with. We can do a similar thing with the costs: the factors that multiply these likelihoods define this threshold theta. With those definitions we can simplify the optimal decision to look like this: basically, you decide class 1 if the LLR is larger than theta, and class 0 otherwise. And the LLR can be computed from the system posteriors with this expression, which is just Bayes' rule after taking the logarithm. The p(x) term cancels out because it appears in both likelihoods, and what remains is basically the log-odds of the posterior minus the log-odds of the prior, which can be written this way using the logit function.
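
(Here is a minimal sketch of these formulas in code; the function names are my own, it assumes the classifier outputs a class-1 posterior obtained with a known training prior, and it uses the miss/false-alarm cost naming that comes next.)

```python
import numpy as np

def llr_from_posterior(post1, p1_train, eps=1e-15):
    """LLR = logit(posterior) - logit(training prior): removes the prior the
    discriminative system learned so that only the likelihood ratio remains."""
    post1 = np.clip(np.asarray(post1, dtype=float), eps, 1 - eps)
    return np.log(post1 / (1 - post1)) - np.log(p1_train / (1 - p1_train))

def bayes_threshold(p1, c_miss, c_fa):
    """Theta for the priors and costs expected at test time:
    decide class 1 (target) when llr > theta."""
    return np.log((1 - p1) * c_fa / (p1 * c_miss))

def decide(llr, p1, c_miss, c_fa):
    """Bayes decisions from calibrated LLRs."""
    return np.asarray(llr) > bayes_threshold(p1, c_miss, c_fa)
```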

Okay, so in speaker verification the feature x is actually a pair of features, or even a pair of sets of features, one for enrollment and one for test. Class 1 is the class for target, or same-speaker, trials, and class 0 is the class for impostor, or different-speaker, trials. We define the cost function, usually called the DCF in speaker verification, using these names for the costs and priors, and we call the errors P_miss and P_false-alarm: a miss is labeling a target trial as an impostor, and a false alarm is labeling an impostor trial as a target. The threshold looks like this using these names. And note that, to make optimal decisions, you only care about this one quantity; you don't care about the whole combination of values of costs and priors, only about this threshold. So you can in fact simplify the family of cost functions to consider by using a single parameter, the effective target prior, which is equivalent to having the full set of costs and priors (really just three free parameters, since the two priors sum to one). We will be using that for the rest of the talk because it's much simpler and it helps a lot in the analysis: basically, we collapse all possible cost functions, all combinations of costs and priors, into a single effective prior.

So let's see some examples of applications that use different costs and priors. The default, simplest cost function would be to have equal priors and equal costs, and that would give you a threshold of zero; that would be the Bayes-optimal threshold for that cost function. Now, suppose you have an application like speaker authentication, where your goal is to verify whether somebody is who they say they are using their voice, for example to enter some system. Then you would expect that most of your trials will be target trials, because you don't expect many impostors trying to get into your system. On the other hand, the cost of making a false alarm is very high: you don't want any of those few impostors getting into the system. That means you need to set a very high cost of false alarm compared to the cost of miss, and that corresponds to a threshold of 2.3. Basically, what you're doing is moving the threshold to the right, so that this area here under the solid curve, which is the distribution of scores for the impostor samples, everything above the threshold of 2.3, a false alarm, is minimized. By moving the threshold to the right we are minimizing this area.

Another application that is the opposite in terms of costs and priors is speaker search. In that case you're looking for one specific speaker within audio from many other speakers, so the probability of finding your speaker is actually low, say one percent. But the errors that you care about, the ones you want to avoid, are the misses, because you're looking for one specific speaker that is important to you for some reason, so you don't want to miss them. In that case, the optimal threshold is symmetric to the previous one, minus 2.3, and what you're trying to minimize is the area under the dashed curve to the left of the threshold, which is the probability of miss.
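
(To connect these numbers with the effective prior defined earlier: any cost/prior combination whose effective target prior is about 0.09 gives a threshold near +2.3, and one whose effective prior is about 0.91 gives a threshold near -2.3. The specific cost values below are my own illustrative assumptions, not the exact ones behind the slide.)

```python
import numpy as np

def effective_prior(p1, c_miss, c_fa):
    """Collapse a (prior, c_miss, c_fa) triplet into a single effective prior."""
    return p1 * c_miss / (p1 * c_miss + (1 - p1) * c_fa)

def threshold_from_effective_prior(p_eff):
    """Bayes threshold on the LLR: theta = -logit(p_eff)."""
    return -np.log(p_eff / (1 - p_eff))

# Authentication-like point: false alarms much costlier than misses.
print(threshold_from_effective_prior(effective_prior(0.5, 1.0, 10.0)))   # ~ +2.3
# Search-like point: misses effectively much costlier than false alarms.
print(threshold_from_effective_prior(effective_prior(0.5, 10.0, 1.0)))   # ~ -2.3
```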

Okay, so to recap before moving on to evaluation: if we have an LLR, I showed that we can trivially make optimal decisions for any possible cost function that you can imagine, using the formula that I gave. But of course these decisions will only be actually optimal if the system outputs are well calibrated; otherwise they will not be. So how do we figure out whether we have a well-calibrated system? The question is: if you're going to have your system make decisions using these thresholds that I showed before, the thetas, then that's how you should evaluate it, by having your system make those decisions using those thetas and seeing how well it does. And then a further question is: could we have made better decisions if we had calibrated the scores before making the decisions? That will tell us how well calibrated the system is to begin with.

So, the way we usually evaluate performance on a binary classification task is by using the cost at one particular value of the threshold. We fix the threshold, either using Bayes decision theory or not, we make decisions at that threshold, compute the P_miss and P_false-alarm, which are these areas under the two distributions, and then compute the cost.

Now, we can also define metrics that depend on the whole distribution of test scores. For example, the equal error rate is defined by finding the threshold that makes these two areas the same, so to compute it you need the whole test distribution. A similar thing is the minimum DCF: what you do in that case is sweep the threshold across the whole range of scores, compute the cost for every possible threshold, and then choose the threshold that gives the minimum cost.

Now, that minimum cost is actually bounded, and it is bounded by dummy decisions, by a system that makes fixed decisions. If you put the threshold all the way to the right, then the only mistakes you will make are misses, everything will be a miss, so you will have a P_miss of one and a P_false-alarm of zero, and the cost you will incur is this factor here. On the other hand, if you put the threshold all the way to the left, then you will only make false alarms, and the cost for that system will be this other factor. So the bound for the minimum DCF is the best of those two cases; they are both dummy systems, but one will be better than the other. We usually use this bound to normalize the DCF; in NIST evaluations, for example, the cost that is reported is the normalized DCF.

And then, finally, another thing we can do is sweep the threshold, compute the P_miss and P_false-alarm for every possible value of the threshold, and that gives score curves like these; if we transform the axes appropriately, we get the standard DET curves we use for speaker verification.
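
(A rough sketch of these evaluation quantities, assuming target and non-target scores as NumPy arrays; this is illustrative code, not the official NIST scoring tool.)

```python
import numpy as np

def pmiss_pfa(tar, non, threshold):
    """Error rates when accepting every trial whose score exceeds the threshold."""
    p_miss = np.mean(np.asarray(tar) <= threshold)   # targets rejected
    p_fa = np.mean(np.asarray(non) > threshold)      # non-targets accepted
    return p_miss, p_fa

def dcf(tar, non, threshold, p_tar, c_miss, c_fa):
    """Detection cost function at one fixed threshold."""
    p_miss, p_fa = pmiss_pfa(tar, non, threshold)
    return p_tar * c_miss * p_miss + (1 - p_tar) * c_fa * p_fa

def min_dcf(tar, non, p_tar, c_miss, c_fa):
    """Sweep every score (plus the two dummy extremes) as a candidate threshold."""
    candidates = np.concatenate([np.asarray(tar), np.asarray(non),
                                 [-np.inf, np.inf]])
    return min(dcf(tar, non, t, p_tar, c_miss, c_fa) for t in candidates)

def normalized_dcf(value, p_tar, c_miss, c_fa):
    """Normalize by the better of the two dummy systems (always accept/reject)."""
    return value / min(p_tar * c_miss, (1 - p_tar) * c_fa)
```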

So the cost that I've been talking about can be decomposed into discrimination and calibration components. Let's see how. Say we assume a cost with equal priors and equal costs; in that case the Bayes-optimal threshold will be zero. So we compute the cost using that threshold and, given that the priors and costs are the same, the cost will be given by the average of these two areas shown here. Now, we can also compute the minimum cost, as I mentioned before: basically, we sweep the threshold and choose the one that gives the minimum cost, which again is the average between these two areas, and you can see it is much smaller than the average between the two areas in the first case. The difference between those two costs can be seen as the additional cost that you incur because your system was miscalibrated. So this orange area here, which is the difference between the sum of the areas here and the sum of the areas there, is the cost due to miscalibration, and that's one way of measuring how miscalibrated the system is.

So there's discrimination, which is how well the scores separate the classes, and there's calibration, which is whether the scores can be interpreted probabilistically, which implies that you can make optimal Bayes decisions if they are calibrated. And the key here is that discrimination is the part of the performance that cannot be changed if we transform the scores with an invertible transformation. Here's a simple example: say you have these distributions of scores and a threshold t that you chose for some reason; it could be the optimal one or not. You transform the scores with any monotonic transformation, which in this example is just an affine transformation, and you also transform the threshold t with the same exact function. Then the transformed threshold will correspond to exactly the same cost as the threshold t in the original domain. So basically, by applying a monotonic transformation to your scores you cannot change their discrimination: the minimum cost that you will be able to find in both cases will be the same.

So, the cost I've been talking about measures the performance at a single operating point: it evaluates the quality of the hard decisions for a certain theta. Now, a more comprehensive measure is the cross-entropy, which is given by this expression that you probably all know. The empirical cross-entropy is the average of the logarithm of the posterior that the system gives to the correct class for each sample. You want this posterior to be as high as possible, as close to one as possible, which gives a logarithm of zero, and if that happens for every sample then you get a cross-entropy of zero, which is what you want.

Now, there's a weighted version of this cross-entropy, which is basically the same, but you split your samples into two terms, the ones for class 0 and the ones for class 1, and you weight these averages by a prior, which is the effective prior I talked about before. So basically you make yourself independent of the priors you happen to see in the test data, and you can evaluate for any prior you want. These posteriors are computed from the LLRs and the priors using Bayes' rule, and note that these are the priors that you apply in here, the same ones you need to use to convert the LLRs. And the famous Cllr that we use in NIST evaluations and in many papers is defined as this weighted cross-entropy when the priors are 0.5, normalized by the logarithm of two, as explained on the next slide.
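
(A minimal sketch of the weighted cross-entropy and Cllr computed from LLRs; names and structure are my own choices.)

```python
import numpy as np

def weighted_cross_entropy(llr_tar, llr_non, p_tar=0.5):
    """Average -log posterior of the correct class, with the two classes
    weighted by the effective prior p_tar instead of their test counts."""
    llr_tar = np.asarray(llr_tar, dtype=float)
    llr_non = np.asarray(llr_non, dtype=float)
    logit_prior = np.log(p_tar / (1 - p_tar))
    # -log sigmoid(z) = logaddexp(0, -z), computed stably.
    ce_tar = np.mean(np.logaddexp(0.0, -(llr_tar + logit_prior)))
    ce_non = np.mean(np.logaddexp(0.0, (llr_non + logit_prior)))
    return p_tar * ce_tar + (1 - p_tar) * ce_non

def cllr(llr_tar, llr_non):
    """Cllr: the weighted cross-entropy at p_tar = 0.5, normalized by log(2)
    so that the prior-only dummy system (llr = 0 everywhere) scores exactly 1."""
    return weighted_cross_entropy(llr_tar, llr_non, 0.5) / np.log(2)
```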

So the weighted cross-entropy can be decomposed, like the cost, into discrimination and calibration terms: basically, you compute the actual weighted cross-entropy and you subtract from it the minimum weighted cross-entropy. Now, this minimum is not as trivial to obtain as for the cost: you can't just choose the best threshold, because here we are evaluating the scores themselves, not just the decisions. So we need to actually warp the scores to get the best possible weighted cross-entropy without changing the discrimination of the scores, and that means using a monotonic transformation. There's an algorithm called PAV, pool adjacent violators, which does exactly that: without changing the rank of the scores, the order of the scores, it does the best it can to minimize the weighted cross-entropy. That's what we use to compute this delta, which measures how miscalibrated your system is in terms of weighted cross-entropy.

This weighted cross-entropy is bounded, the same as the cost, by a dummy system, which in this case is the system that outputs, instead of the posteriors, directly the priors: a system that doesn't know anything about its input but still does the best it can by outputting the priors. That means that the worst minimum Cllr is 1.0, because we normalize by the weighted cross-entropy of that dummy system, the logarithm of two, which is exactly what you get when you evaluate it at a prior of 0.5. So the minimum Cllr will never be worse than one. If the actual Cllr is worse than one, then you know for sure that you're going to have a difference here, because this one is never larger than one; if that one is larger than one, it means you have a calibration problem.
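
(One common recipe for this minimum, sketched here under my own assumptions rather than as the speaker's exact tooling, is to run PAV, which is available as isotonic regression in scikit-learn, with class weights matching the effective prior, and then score the resulting posteriors with the same weighted cross-entropy.)

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def min_weighted_cross_entropy(scores_tar, scores_non, p_tar=0.5, eps=1e-6):
    """Best weighted cross-entropy achievable with any order-preserving
    recalibration of the scores, obtained via PAV (isotonic regression)."""
    scores_tar = np.asarray(scores_tar, dtype=float)
    scores_non = np.asarray(scores_non, dtype=float)
    scores = np.concatenate([scores_tar, scores_non])
    labels = np.concatenate([np.ones_like(scores_tar), np.zeros_like(scores_non)])
    # Class weights so the pooled data reflects the effective prior p_tar.
    weights = np.where(labels == 1, p_tar / len(scores_tar),
                       (1 - p_tar) / len(scores_non))
    iso = IsotonicRegression(y_min=eps, y_max=1 - eps, out_of_bounds='clip')
    post = iso.fit_transform(scores, labels, sample_weight=weights)
    ce_tar = -np.mean(np.log(post[labels == 1]))
    ce_non = -np.mean(np.log(1 - post[labels == 0]))
    return p_tar * ce_tar + (1 - p_tar) * ce_non

# min Cllr is then min_weighted_cross_entropy(tar, non, 0.5) / np.log(2),
# and the calibration loss is the actual value minus this minimum.
```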

Okay, finally, in terms of evaluation, I wanted to mention these curves, the applied probability of error (APE) curves. The Cllr gives a single summary number, but you might want to actually see the performance across a range of operating points, and that's what these curves show: they plot the cost as a function of the effective prior, which also defines the theta. What we see here is the cost for prior decisions; the prior decisions are what I mentioned before, basically just a dummy system that always outputs the priors instead of the posteriors. The red curve is our system, whatever it is, calibrated or not. And the dashed curve is the very best you could do if you were to warp your scores using the PAV algorithm. So basically, for each theta, the difference between the dashed and the red curves is the miscalibration at that operating point. A nice property of these curves is that the Cllr is proportional to the area under them: the actual Cllr is proportional to the area under the red curve, and the minimum Cllr is proportional to the area under the dashed one. Furthermore, the equal error rate is the maximum of the dashed curve. There are variants of these curves, described in these papers, that change the way the axes are defined.

Okay, so let's say now that we already know our system has a calibration problem. Should we worry about it? Should we try to fix it? There are some scenarios where having a miscalibrated system is not a problem and there is no need to fix it. For example, if you know what the cost function is ahead of time and there is development data available, then all you need to do is run the system on the development data and find the empirically best threshold for that data, for that system, and for that cost function, and you're done. You also don't need to worry about calibration if you only care about ranking the samples, say you just want to output the most likely targets first, and nothing else.

On the other hand, it may be very necessary to fix calibration in many other scenarios. One of them is, for example, if you don't know ahead of time what the system will be used for, what exactly the application is. That means you don't know the cost function, and if you don't know the cost function, you cannot optimize the threshold ahead of time. So if you want to give the user of the system a knob that defines this effective target prior, then the system has to be calibrated for the Bayes-optimal threshold to be really optimal, to work well. Another case where you need good calibration is if you want to get a probabilistic value from your system, some measure of the uncertainty that the system has when it makes its decisions. You can use that uncertainty, for example, to reject samples when the system is uncertain: if your LLR is too close to the threshold that you were planning to use to make hard decisions, then perhaps you want the system not to make a decision and instead tell the user that it doesn't know, leaving the decision to them. And another case is when you actually don't want to make hard decisions but want to report a value that is interpretable, as, for example, in forensic voice comparison.

Okay, so say we do want to fix calibration, because we are in one of those scenarios where it matters. One very common approach is to use linear logistic regression. This assumes that the LLR, the calibrated score, is an affine transformation of whatever score your system outputs. The parameters of this model are the w and the b, and it uses the weighted cross-entropy as the loss function. Now, to compute the weighted cross-entropy we need posteriors, not LLRs, so we need to convert those LLRs into posteriors, and we use the expression I showed before: the LLR is the log-odds of the posterior minus the log-odds of the prior. Basically, inverting that expression we get the logistic function, which is the inverse of the logit, and finally, after some trivial computations, we get this expression, which is the standard binary logistic regression expression. We then plug this posterior into the expression of the weighted cross-entropy to get the loss, which we can optimize however we wish. And once we optimize this loss on some data, we get the w and b that are optimal for that data and that prior.
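
(A minimal sketch of such a calibration stage, fitting w and b by minimizing the weighted cross-entropy with a generic optimizer; the structure and names are mine, not the speaker's implementation.)

```python
import numpy as np
from scipy.optimize import minimize

def train_linear_calibration(scores_tar, scores_non, p_tar=0.5):
    """Fit llr = w * score + b by minimizing the weighted cross-entropy
    (linear logistic regression / affine calibration)."""
    scores_tar = np.asarray(scores_tar, dtype=float)
    scores_non = np.asarray(scores_non, dtype=float)
    logit_prior = np.log(p_tar / (1 - p_tar))

    def loss(params):
        w, b = params
        z_tar = w * scores_tar + b + logit_prior     # logit of target posterior
        z_non = w * scores_non + b + logit_prior
        ce_tar = np.mean(np.logaddexp(0.0, -z_tar))  # -log P(target | x)
        ce_non = np.mean(np.logaddexp(0.0, z_non))   # -log P(non-target | x)
        return p_tar * ce_tar + (1 - p_tar) * ce_non

    w, b = minimize(loss, x0=np.array([1.0, 0.0]), method='L-BFGS-B').x
    return w, b

# Calibrated LLRs for new scores are then simply: llr = w * raw_score + b
```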

So this is an affine transformation, so it doesn't change the shapes of the distributions at all; basically, it looks like it did nothing. But what it did is shift and shrink the axis so that the resulting scores are calibrated. In terms of Cllr, you can see that the raw scores, which are these ones, had a very high Cllr, actually higher than one, so worse than the dummy system, and after you calibrate them, which only changes the global scale and shift, you get a much better Cllr. This minimum here is basically the very best you can do, so with the affine transformation we are actually doing almost as well as the very best, which means that the affine assumption was, in this case, quite good. This is a real case: VoxCeleb2 data processed with a PLDA system.

There are many other approaches to calibration; I'm not going to cover them because it would take another whole keynote. There are nonlinear approaches, which in some cases, when the affine assumption is not a good one, do better than the linear approach. There are regularized and Bayesian approaches that do quite well when you have very little data to train the calibration model. And then there are approaches that go all the way to using unlabeled data, where the labels exist but you don't know them, and those work surprisingly well.

So, if we have calibrated scores, we can trust them to be log-likelihood ratios, which means we can use them to make optimal decisions, and we can also convert them to posteriors if we wanted to and if we had the prior. A very nice property of the LLR is that if you were to compute the likelihood ratio of your already-calibrated score, you would get the same value back: you can treat the score, the LLR, as a feature, and if you compute this ratio of densities you get the same value again. This identity gives some nice properties. For example, for a calibrated score, the two distributions have to cross exactly at zero, because when the LLR is zero this ratio is one, which means that these two densities have to be the same, and these two densities are exactly what we are seeing here: the probability density function of the score for each of the two classes. They have to cross at zero. Furthermore, if we assume that one of these two distributions is Gaussian, then the other distribution is forced to be Gaussian too, with the same standard deviation and with symmetric means. And this, as I said, is a real example, and it's actually quite close to that assumption on this VoxCeleb2 data.
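
(In symbols, with my own notation, these two properties follow directly from the definition of a calibrated LLR:)

```latex
% If s is a calibrated LLR, its own class-conditional densities satisfy
\frac{p(s \mid \mathrm{tar})}{p(s \mid \mathrm{non})} = e^{s}
\quad\Rightarrow\quad p(0 \mid \mathrm{tar}) = p(0 \mid \mathrm{non}),
% i.e. the two score distributions must cross at s = 0.

% If in addition p(s \mid \mathrm{non}) = \mathcal{N}(s;\, \mu_0, \sigma^2), then
p(s \mid \mathrm{tar}) = e^{s}\, \mathcal{N}(s;\, \mu_0, \sigma^2)
                       \propto \mathcal{N}(s;\, \mu_0 + \sigma^2, \sigma^2),
% and normalization forces \mu_0 = -\sigma^2/2: the two densities are Gaussians
% with the same variance and means at \mp\,\sigma^2/2.
```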

Okay, so to recap this part before we move on: what I've been saying is that metrics like equal error rate and minDCF measure only discrimination performance. Basically, they ignore the threshold selection issue, they ignore how to get to the actual decisions from the scores. On the other hand, the weighted cross-entropy, the actual DCF, and the APE curves measure total performance, and that includes the issue of how to make the decisions. We can further use these metrics to compute the calibration loss, to see whether the system is well calibrated or not. And if you find that calibration is actually not good, then fixing these calibration issues is usually easy in ideal conditions: you can train an invertible transformation using a small representative dataset, which is enough because in many of the approaches the number of parameters is very small, so you don't need a lot of data. The key here, though, is that you need a representative dataset, and that's what I'm going to discuss in the next slides.

So, basically, what we have observed repeatedly is that the calibration of our speaker verification systems is extremely fragile. This is true for our current systems and it has always been true, since I've been working on speaker verification for almost twenty years now. Anything like language, noise, distortions, or duration affects the calibration parameters, and that means that a calibration model trained on one condition is very unlikely to generalize to another condition. On the other hand, the discrimination performance is usually still reasonable on unseen conditions. So if you train a system on telephone data and you try to use it on microphone data, that may not be the best you can do, but it will still be reasonable. On the other hand, if you train your calibration model on telephone data and try to use it on microphone data, it may perform horribly.

Here is one example. I'm training the calibration model on two different sets, Speakers in the Wild and the SRE16 development set, and applying those models on VoxCeleb2. The raw scores are identical; all I am doing is changing the w and the b based on the calibration set. What we see here is that the model that was trained on Speakers in the Wild is extremely good, basically almost perfect, while the model that was trained on SRE16 is quite bad: it's better than the raw scores, but still quite bad compared to the best you can do. This is not surprising, because VoxCeleb2 is actually quite close to Speakers in the Wild in terms of conditions, while SRE16 is not.

Now, you may think that maybe SRE16 is just a bad set for training calibration, but that's not the case, because if you evaluate on the SRE16 evaluation data, the opposite happens. The calibration model that is good in that case is the one that was trained on the SRE16 development set: with it you get a much lower Cllr than with the one calibrated on Speakers in the Wild, and in this case, again, you're almost reaching the minimum. So basically, this tells us that the conditions on which the calibration model is trained determine where it is going to be good: you have to match the conditions of your evaluation.

Now, this goes even deeper: if you zoom into a dataset, you can actually find calibration issues within the dataset itself. Here I'm showing results on the SRE16 evaluation set, where I train the calibration parameters on exactly that same evaluation set, so this is a cheating calibration experiment. I'm showing the actual Cllr, which is the solid bar, the minimum Cllr, which for the full set is the same by construction, and the relative difference between the two. So for the full set I have no calibration loss, by construction, as I said. On the other hand, if I start to subset this full set, randomly, by gender, or by condition, I start to see more and more calibration loss. A random subset is fine, it is well calibrated, and females and males are reasonably well calibrated, but for specific conditions, defined by the language, the gender, and whether the two waveforms in the trial come from the same telephone number or not, we start to see calibration losses of up to almost twenty percent in this case.

Looking at the distributions for the female, same-telephone-number subset, we see that the distributions are shifted to the right. They should be aligned with zero; remember that these two distributions, if they were calibrated, should cross at zero, but they don't: they are shifted to the right. That is reasonable, because since both sides of the trial come from the same telephone number, they look much more similar than if the channels were different, so every trial looks more target-like than it should, or than it does in the overall distribution. The opposite happens for the different-telephone-number scores: they shift to the left.

And the final comment here is that this miscalibration within the dataset also causes a discrimination problem, because if you pool these trials as they are, miscalibrated, you will get poorer discrimination than if you were to first calibrate each subset and then pool them together. So there's an interplay here between calibration and discrimination, because the miscalibration is happening for different sub-conditions within the set.

Okay, so there have been several approaches in the literature, over the last two decades at least, that try to solve this problem of condition-dependent miscalibration, where the assumption of a global calibration model, with a single w and a single b for all trials, is actually not a good one. Most of these approaches assume that there's an external class or vector representation, either given by the metadata or estimated, that represents the conditions of the enrollment and test samples, and these vectors are fed into the calibration stage and used to condition its parameters. Here are some of those approaches, if you are interested in taking a look. Overall, these approaches are quite successful at making the final system better, actually more discriminative, because they align the distributions of the different sub-conditions before pooling them together.

There's another family of approaches where the condition awareness is put in the backend itself rather than in the calibration stage. So there is again a condition extractor of some kind that affects the parameters of the backend. The thing is that this approach doesn't necessarily fix calibration: it improves discrimination in general, but you may still need to do calibration if the backend is, for example, a PLDA model, because in those cases what comes out of it is still miscalibrated, so you may still need a normal, global calibration model on top.

And recently we proposed an approach that jointly trains the backend and a condition-dependent calibrator, where we assume that the condition is extracted automatically as a function of the embeddings themselves, and the whole thing is trained jointly to optimize the weighted cross-entropy. This model actually gives excellent calibration performance across a wide range of conditions. You can find the paper in these proceedings if you're interested, and there's a very related paper, also in these proceedings, by Daniel Garcia-Romero, which I suggest you take a look at if you're interested in these topics.

Okay, so to finish up, I have been talking about two wide application scenarios for speaker verification technology. One of them is where you assume that there's development data available for the evaluation conditions. In that case, as I said, you can either calibrate the system on that matched data or just find the best threshold on it, so calibration in that scenario is not a big issue. In fact, most speaker verification papers historically operate under this scenario. It's also the scenario of the NIST evaluations, where we usually get development data which is maybe not perfectly matched, but pretty well matched to what we will see in the evaluation. Now, in these proceedings I found thirty-three speaker recognition papers, of which twenty-eight fall in this category. They mostly report just equal error rate and minimum DCF; some report actual values, some don't. And I think it's fine to just report minimum DCF in those cases, because you're basically assuming that the calibration issue is easy to solve, so that if you were to have development data, you could train a calibration model and the actual performance would get very close to the minimum. Still, there is the caveat that you may have miscalibration problems within sub-conditions, and if you don't report actual DCF or Cllr on sub-conditions, that issue stays hidden behind the overall performance.

The other big scenario is the one where we don't have development data for the evaluation conditions. In that case we cannot calibrate, or choose a threshold, on matched conditions; we can only hope that our system will work well out of the box. From these proceedings I only found five papers that operate under this scenario, where they actually test a system that was trained on some conditions on data from different conditions, and they do not assume that they have development data for those conditions. So basically, we as a community are very heavily focused on the first scenario, and have always been, historically.

And I think this may be why our current speaker verification technology cannot be used out of the box: we are just used to always asking for development data in order to tune at least the calibration stage of our systems. We know the calibration stage has to be tuned, because otherwise the system won't work, or may even be worse than the dummy system. So my question is, and maybe we can discuss it in the question and answer session: wouldn't it be worth it for us as a community to pay more attention to this scenario, the one with no development data available? I believe that the new end-to-end approaches have the potential to be quite good at generalizing, and this is basically based on the paper that I mentioned, which is not really end-to-end, but almost, and it works quite well, surprisingly well, in terms of calibration across conditions, on unseen conditions. So I think it's doable. Maybe if we worked on this as a community, then we could reduce, or even eliminate if we're very optimistic, the performance difference between the two scenarios, so maybe we could end up with systems that are not so dependent on having development data, where perhaps even having development data would not help much over the out-of-the-box system.

So what would it entail to develop for this no-development-data scenario? Well, we have to assume that we will need heterogeneous data for training, of course, because if you train a system on telephone data only, it's quite unlikely that it will generalize to other conditions. The second thing is that one has to hold out some sets, at least during development, that are not used for hyperparameter tuning, because otherwise they would not be completely unseen; these sets have to be truly held out until the very end, until you just evaluate the system out of the box, as in this scenario that we are imagining. And of course you need to report actual metrics and not just minimum ones, because in this case you cannot assume that you're going to be able to do calibration well; you need to test whether the model, exactly as it stands, is actually giving you good calibration and good decisions. Finally, it's probably a good idea to also report metrics on sub-conditions of the set, because the miscalibration issues within the sub-conditions may be hidden within the full distribution of the whole set, where they sometimes compensate for each other; by reporting metrics on sub-conditions, both actual and minimum, you can actually tell whether there's a calibration problem.

Thank you very much for listening, and I'm looking forward to your questions in the next session.