0:00:00hi thanks for attending this talk
0:00:03i am [inaudible] i'm a researcher at the computer science institute
0:00:07which is a unit of the university of buenos aires and conicet in argentina
0:00:13today i'll be talking about the issue of calibration in speaker verification
0:00:17and hopefully by the end of the talk i will have convinced you that
0:00:20this is an important issue if you were not already convinced
0:00:26so the talk will be organised this way first i'm gonna define calibration
0:00:31and give an intuition
0:00:36talk about why we should care about it
0:00:39which is related also to how to measure it
0:00:43and if we find out that calibration is bad in a certain system then
0:00:47how to fix it
0:00:49and then finally i'll talk about issues of robustness of calibration for speaker verification
0:00:55the task the main task
0:00:58on which i will base the examples is speaker verification
0:01:02i assume that the audience
0:01:03knows
0:01:06this task well but just in case
0:01:10it's a binary classification task
0:01:12where the samples
0:01:13are given by
0:01:16two waveforms or two sets of waveforms
0:01:18that we need to compare to decide whether
0:01:21they come from the same speaker or from different speakers
0:01:26so the task is binary classification so much of what i'm gonna say
0:01:31applies to any binary classification task and not just
0:01:34speaker verification
0:01:36okay so what is calibration
0:01:39let's say we want to build a system that predicts the probability that it will
0:01:43rain within the next hour
0:01:45based only on a picture of the sky
0:01:47so this is the scenario
0:01:49if we see this picture then we would expect the system to output a low
0:01:53probability say point one
0:01:55while if it sees this picture then we would expect it to output a much
0:01:58higher probability of rain
0:02:00say closer to one
0:02:03we will say that the system is calibrated if
0:02:06the values that are output by the system coincide
0:02:10with what we see
0:02:11in the data
0:02:16a well calibrated score
0:02:18should reflect the uncertainty of the system
0:02:21for example to be concrete
0:02:24for all the samples
0:02:25that get a score
0:02:26of point eight
0:02:27from the system
0:02:29then we would expect eighty percent of them to be labeled correctly
0:02:32that's what the point eight means
0:02:35if that happens
0:02:37then we will say that the system is well calibrated
0:02:40and here we see an example of a diagram that is used in many tasks
0:02:44not much in speaker verification or not at all but
0:02:48i think it's very intuitive for
0:02:50understanding calibration
0:02:53it's called the reliability diagram
0:02:55and basically what it shows is the posteriors
0:02:59from a system that was run on certain data
0:03:02the posteriors that the system gave
0:03:04for the class
0:03:05that it predicted
0:03:07so for example for
0:03:08this bin
0:03:11we have all the samples for which the system gave a posterior between point eight
0:03:17and what the
0:03:18diagram shows is the accuracy
0:03:21on those samples
0:03:23so if the
0:03:24system was calibrated then we would expect
0:03:30what the system predicted
0:03:31to coincide with the accuracy that we see
0:03:35in this specific case what we actually see is that the system was correct more times
0:03:41than it thought it would be
0:03:43which is interesting it is a system that underestimates its capability
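The numbers behind a reliability diagram of the kind described here can be computed in a few lines. The following is only a minimal sketch, assuming equal-width bins over the posterior of the predicted class; the function name `reliability_bins` and the toy data are hypothetical and do not come from the talk or the paper it cites.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group samples by the posterior given to the predicted class.

    Returns one tuple per non-empty bin:
    (bin_low, bin_high, mean_confidence, accuracy, count).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_conf, accuracy, len(items)))
    return rows

# made-up toy data: posterior of the predicted class and whether it was right
confidences = [0.85] * 5 + [0.65] * 4
correct = [True, True, True, True, False] + [True, True, True, False]

rows = reliability_bins(confidences, correct)
for low, high, mean_conf, accuracy, count in rows:
    print(f"[{low:.1f}, {high:.1f})  confidence={mean_conf:.2f}  accuracy={accuracy:.2f}  n={count}")
```

For a calibrated system the accuracy in each bin would be close to the mean confidence in that bin; accuracy above the confidence corresponds to the underconfident case described above.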
0:03:49now i took this diagram
0:03:52from a paper from twenty seventeen
0:03:56which actually studies the issue of calibration on
0:03:59different architectures
0:04:01so it compares on a task
0:04:03that is called cifar one hundred which is image classification with a hundred different classes
0:04:08and it compares
0:04:09this is the plot that i already showed a
0:04:12cnn from nineteen ninety eight
0:04:15with a resnet
0:04:17from twenty sixteen
0:04:19and they show that actually the new network is
0:04:23much worse calibrated
0:04:25than the old network
0:04:26so for this same bin that i showed before
0:04:29the new network actually has an accuracy much lower than what it should have
0:04:36which is point five or so
0:04:38so this is an overconfident
0:04:41network
0:04:42it thinks
0:04:43it will do much better than it actually does
0:04:46on the other hand the error
0:04:48from the new network is lower
0:04:50so if you use this network to make decisions the decisions will be better
0:04:54than the old ones
0:04:55but the scores that it outputs
0:04:58cannot be interpreted as posteriors at all
0:05:00it cannot be interpreted as
0:05:02the certainty that the system has when it makes a decision
0:05:09this is actually a phenomenon that we see a lot in speaker recognition basically you
0:05:15have a badly calibrated model that is still
0:05:18very discriminative
0:05:20the problem is that such a model
0:05:22might be useless in practice depending on the scenario in which we plan to use it
0:05:28as i already said
0:05:29the scores
0:05:30from a miscalibrated system cannot be interpreted as the certainty
0:05:35that the system has in its decisions
0:05:40and the scores
0:05:42cannot be used to make optimal decisions
0:05:46without having data to
0:05:48learn how to make the decisions so that's what i'm gonna talk about in the next
0:05:51two slides
0:05:54so how do we make optimal decisions in general for binary classification
0:05:59we usually define a cost function
0:06:02and this is a very
0:06:03common cost function which has very nice properties
0:06:07it's a combination of two terms
0:06:09one for each class
0:06:13the main part here is the probability of making an error for that class so this is
0:06:18the probability of
0:06:19deciding zero
0:06:22when the true class
0:06:24was one
0:06:26we multiply this probability of error by the prior
0:06:29for that class one
0:06:31and then we further multiply by a cost which is what we think
0:06:36it is gonna cost us if we make this error
0:06:40this is very specific to the application in which we're gonna use the system
0:06:44and for the other class it is the same symmetric
0:06:49this is an expected cost
0:06:51the way to minimize this expected cost is to choose the following
0:06:57decision
0:06:59for a certain sample x
0:07:01the decided class should be one
0:07:03if this factor
0:07:05is larger than this factor and zero otherwise
0:07:09and this factor is composed of the cost
0:07:12the prior
0:07:14and the likelihood
0:07:16for the class one
0:07:19and this is the same for class zero
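The slide with the formulas is not part of the transcript, so the following is a reconstruction of the cost and decision rule as they are described in words here, not a copy of the speaker's notation. The symbols $C_{0|1}$ and $C_{1|0}$ (cost of deciding 0 when the class is 1, and the reverse) and $\tilde{P}_c$ (the prior of class $c$ expected in testing) are assumed names.

```latex
\begin{align}
C &= C_{0|1}\,\tilde{P}_1\,P(\text{decide } 0 \mid c = 1)
   + C_{1|0}\,\tilde{P}_0\,P(\text{decide } 1 \mid c = 0) \\
\hat{c}(x) &=
\begin{cases}
1 & \text{if } C_{0|1}\,\tilde{P}_1\,p(x \mid c = 1) > C_{1|0}\,\tilde{P}_0\,p(x \mid c = 0) \\
0 & \text{otherwise}
\end{cases}
\end{align}
```

Each side of the inequality is the "factor" mentioned in the talk: a cost, a prior and a likelihood for one class.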
0:07:23we see here that what we need to make optimal decisions is this likelihood
0:07:28p of x
0:07:29given c
0:07:34what we have
0:07:36is the likelihood that we learned
0:07:38with our model
0:07:39it is the likelihood learned
0:07:42on the training data
0:07:43that's why i'm using here the tilde to indicate that this in the cost
0:07:48this probability is the one we expect to see in testing
0:07:51the one we actually see in testing
0:07:55well we don't have that
0:07:57what we have is what we saw in training
0:08:00so let's say that we train a generative model then our generative model is gonna
0:08:05output directly this likelihood
0:08:07but it will be the likelihood we learned in training
0:08:10and that's fine we usually just assume
0:08:13in order to do anything at all in machine learning
0:08:15we assume that this will generalize to testing
0:08:20now we may not have the likelihood if we train a discriminative system
0:08:25in that case we may have the posterior
0:08:28discriminative systems
0:08:29trained for example with cross entropy output posteriors
0:08:34in that case what we need to do is convert those posteriors to likelihoods
0:08:38and for that we use bayes rule
0:08:41so basically we want to multiply by
0:08:43this p of x and divide by the prior
0:08:46and note here that again this is the prior in training
0:08:50is not the prior
0:08:52the p tilde that i put here in the cost which is the one we
0:08:55expect to see in testing
0:08:58and that's the whole
0:08:59point why we use likelihoods and not posteriors
0:09:03to make these optimal decisions
0:09:06because it gives us the flexibility
0:09:08to separate
0:09:09the prior in training from the prior in testing
0:09:14okay so
0:09:16going back to the
0:09:17to the optimal decisions
0:09:19we have this expression
0:09:21we can simplify this expression by defining the log-likelihood ratio
0:09:26which i'm sure everybody knows if
0:09:28you're working in speaker verification
0:09:30it's basically the ratio between
0:09:33the likelihood for class one and the likelihood for class zero
0:09:37and we take the logarithm of that
0:09:40and we can do a similar thing with the costs
0:09:43the factors that multiply these likelihoods here
0:09:47so we define this theta
0:09:49and with those definitions we can
0:09:53simplify the optimal decisions to look like this basically you decide class one
0:09:57if the llr is larger than theta
0:10:00and otherwise class zero
0:10:02and the llr can be computed from the system posteriors
0:10:06with this expression it is just
0:10:07bayes rule
0:10:08after taking the logarithm
0:10:11the p of x goes away
0:10:13because it was
0:10:15in both
0:10:18factors in both likelihoods
0:10:21and this is basically the log odds of the posterior minus the log odds of the prior
0:10:26which can be written this way using the logit function
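The conversion from a posterior to a log-likelihood ratio and the threshold decision can be written out directly. This is a small sketch under the assumption that the posterior and the training prior both refer to class one; the function names are illustrative, not taken from any toolkit.

```python
import math

def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def posterior_to_llr(posterior_class1, train_prior_class1):
    """LLR = log odds of the posterior minus log odds of the training prior."""
    return logit(posterior_class1) - logit(train_prior_class1)

def bayes_decision(llr, theta):
    """Decide class 1 if the LLR exceeds the threshold theta, else class 0."""
    return 1 if llr > theta else 0

# a posterior of 0.9 from a system trained with equal priors
llr = posterior_to_llr(0.9, 0.5)
print(llr, bayes_decision(llr, 0.0), bayes_decision(llr, 2.3))

# the same posterior from a system trained with 90 percent class-one data
# carries no evidence at all once the training prior is removed
print(posterior_to_llr(0.9, 0.9))
```

The second call illustrates why the training prior is subtracted: a posterior of 0.9 only means something relative to how often class one appeared in training.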
0:10:34okay so in speaker verification the feature x
0:10:39it's actually a pair of features or even
0:10:41a pair of sets of features
0:10:43one for enrollment and one for test
0:10:46then class one is the class for target or same speaker
0:10:52and class zero is the class for impostor or different speaker trials
0:10:57and we define the cost function which we usually call dcf in speaker verification
0:11:03using these names
0:11:06for the costs and priors
0:11:09we call the errors p miss and p false alarm
0:11:12a miss
0:11:14would be
0:11:15labeling a target trial as an impostor
0:11:19and a false alarm would be
0:11:21labeling an impostor as a target
0:11:26and the threshold
0:11:27looks like this using these names
0:11:30and if you notice what you
0:11:31only care about
0:11:32is actually this theta to make optimal decisions you don't care about the whole
0:11:40values of costs and priors altogether only about this theta
0:11:44so you could in fact simplify
0:11:47the family of cost functions to consider by just using a
0:11:52single parameter the effective p tar that is equivalent to having this
0:11:58triplet of parameters there are really just three because
0:12:02the impostor prior is a function of the target prior
0:12:06so we will be using that in the rest of the talk because it's much simpler
0:12:11and it helps a lot in the analysis
0:12:14basically we simplify all possible cost functions
0:12:18all combinations of
0:12:20costs and priors to a single
0:12:22effective p tar
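The collapse of costs and priors into one effective target prior can be checked numerically. The sketch below uses the standard definitions (threshold equal to the log of the ratio of cost-weighted priors); the function names and the example values are assumptions chosen for illustration, and the talk does not say which costs and priors produced the thresholds on its slides.

```python
import math

def bayes_threshold(p_tar, c_miss, c_fa):
    """Bayes-optimal LLR threshold for a given target prior and pair of costs."""
    return math.log(c_fa * (1.0 - p_tar) / (c_miss * p_tar))

def effective_prior(p_tar, c_miss, c_fa):
    """Single prior that, with unit costs, gives the same threshold."""
    return p_tar * c_miss / (p_tar * c_miss + (1.0 - p_tar) * c_fa)

def threshold_from_effective_prior(p_eff):
    return math.log((1.0 - p_eff) / p_eff)

# equal priors and equal costs: threshold 0
print(bayes_threshold(0.5, 1.0, 1.0))

# equal priors, false alarms ten times as costly as misses: threshold log(10), about 2.3
theta = bayes_threshold(0.5, 1.0, 10.0)
p_eff = effective_prior(0.5, 1.0, 10.0)
print(theta, p_eff, threshold_from_effective_prior(p_eff))

# mirrored costs give the mirrored threshold, about -2.3
print(bayes_threshold(0.5, 10.0, 1.0))
```

Any other combination of costs and priors with the same effective prior leads to exactly the same decisions, which is why the analysis only needs this one number.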
0:12:27so let's see some examples of applications that use different costs
0:12:33so the default
0:12:36the simplest cost function would be to have equal priors and equal costs
0:12:40and that would give you a threshold of zero
0:12:43that would be the optimal bayes threshold for this cost function
0:12:49now if you have an application like for example
0:12:53speaker authentication where
0:12:55your goal
0:12:56is to
0:12:59verify whether somebody
0:13:01is who they say they are
0:13:03using their voice
0:13:05for example to enter some system
0:13:09then you would expect that most of your cases are gonna be
0:13:13target trials
0:13:14because you don't have many impostors trying to get into your system
0:13:18on the other hand the cost of making a mistake
0:13:22is very high
0:13:23if you make a false alarm
0:13:25so you don't want any of those
0:13:27few impostors getting into the system
0:13:31that means you need to
0:13:32set a very high
0:13:34cost of false alarm
0:13:36compared to the cost of miss
0:13:38and that corresponds to a threshold of
0:13:40two point three
0:13:41so basically what you're doing is moving the threshold to the right so that
0:13:44this area here under the solid curve
0:13:49which is the distribution of scores
0:13:52for the impostor samples
0:13:55so everything above the threshold of two point three will be a false alarm
0:14:00by moving the threshold to the right we are minimizing this area
0:14:05another application that is actually the
0:14:07opposite in terms of costs and
0:14:10priors is speaker search
0:14:11in that case you're looking for a certain specific speaker within
0:14:16audio from many other speakers
0:14:19so in that case the probability of finding your speaker is actually low
0:14:23let's say one percent
0:14:26but the cost that you care about
0:14:29the errors that you want to avoid are the misses because
0:14:33you're looking for one specific speaker that is important to you for
0:14:37some reason so you don't want to miss them
0:14:40so in that case the optimal threshold is
0:14:43symmetric to the other one minus two point three
0:14:46and in that case what you're trying to minimize is the area under the dashed curve
0:14:51to the left
0:14:53of the threshold
0:14:54which is the probability of miss
0:15:02so to recap before moving on
0:15:07if we have llrs then
0:15:09i showed that we can trivially make optimal decisions for any possible cost function that you
0:15:14can imagine
0:15:16with the formula that i gave
0:15:19but of course these decisions will only be actually optimal if the system outputs are
0:15:25well calibrated
0:15:26otherwise they will not be
0:15:29so how do we figure out
0:15:31if we have a
0:15:32well calibrated system
0:15:35the idea is if you're gonna have your system make decisions using these thresholds that i
0:15:41showed before the thetas
0:15:43then that's how you should evaluate it have your system make those decisions using those thetas
0:15:50and see how well it does
0:15:52and then the further question is
0:15:54could we have made better decisions if we had calibrated the scores before making the decisions
0:16:01that will give us a measure of
0:16:03how well calibrated the system is
0:16:05to begin with
0:16:09we usually evaluate performance on binary classification tasks
0:16:15by using the cost
0:16:17for one given threshold
0:16:19so we prefix the threshold
0:16:22using bayes
0:16:23decision theory or not
0:16:26we just
0:16:26fix the threshold and then compute the p miss and p false alarm which are these areas
0:16:31under the two distributions
0:16:34and then compute the cost
0:16:37now we can also
0:16:40define metrics that depend on the whole distribution of test scores
0:16:45so for example the equal error rate
0:16:48is defined
0:16:49by finding the threshold that makes these two areas the same
0:16:54so basically to compute it you need the whole test distribution
0:16:59and a similar thing is the minimum dcf
0:17:01so what you're doing in that case is
0:17:04sweeping the threshold
0:17:07across the whole range of scores
0:17:10computing the cost
0:17:11for every possible threshold
0:17:13and then
0:17:15choosing the threshold that gives the minimum cost
0:17:20now that minimum cost is actually bounded
0:17:23and it is bounded by
0:17:25basically dummy decisions
0:17:27a system that makes dummy decisions
0:17:31if you put
0:17:32for example your threshold all the way to the right
0:17:35then you will only make
0:17:37mistakes that are misses
0:17:40everything will be a miss
0:17:42so you'll have a p miss of one and a p false alarm of zero
0:17:46in that case the cost that you will incur is this factor here
0:17:50on the other hand if you put the threshold all the way to the left
0:17:54then you will only make false alarms and the cost for that
0:17:59will be this factor here
0:18:02so basically the bound for the min dcf is
0:18:05the best of those
0:18:06two cases
0:18:08they're both dummy systems but one will be better than the other
0:18:12and we usually use this bound to normalize
0:18:16the dcf so in nist evaluations for example
0:18:20the cost that is defined is the normalized dcf
0:18:26and then finally another thing we can do is sweep the threshold
0:18:30compute the p miss and p false alarm for every possible value of the threshold
0:18:35and that gives us curves like these
0:18:37and if we transform the axes appropriately then we get the
0:18:43standard det curves we use for speaker verification
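The metrics described in this part (cost at a fixed threshold, minimum cost over a threshold sweep, the dummy-system bound used for normalization, and the equal error rate) can be sketched as follows. This is a simplified illustration on made-up scores: the function names are hypothetical, and the equal error rate is taken at the sweep point where the two error rates are closest, without the interpolation that evaluation toolkits typically apply.

```python
def error_rates(tar_scores, non_scores, threshold):
    """P_miss and P_fa when trials with score > threshold are accepted as targets."""
    p_miss = sum(1 for s in tar_scores if s <= threshold) / len(tar_scores)
    p_fa = sum(1 for s in non_scores if s > threshold) / len(non_scores)
    return p_miss, p_fa

def dcf(p_miss, p_fa, p_tar, c_miss=1.0, c_fa=1.0):
    return c_miss * p_tar * p_miss + c_fa * (1.0 - p_tar) * p_fa

def dummy_cost(p_tar, c_miss=1.0, c_fa=1.0):
    # best of the two dummy systems: reject everything or accept everything
    return min(c_miss * p_tar, c_fa * (1.0 - p_tar))

def sweep(tar_scores, non_scores):
    """Yield (threshold, p_miss, p_fa) for every distinct split of the scores."""
    thresholds = [float("-inf")] + sorted(set(tar_scores) | set(non_scores))
    for t in thresholds:
        yield (t,) + error_rates(tar_scores, non_scores, t)

def min_dcf(tar_scores, non_scores, p_tar, c_miss=1.0, c_fa=1.0):
    return min(dcf(pm, pf, p_tar, c_miss, c_fa) for _, pm, pf in sweep(tar_scores, non_scores))

def equal_error_rate(tar_scores, non_scores):
    # crude estimate: the sweep point where p_miss and p_fa are closest
    _, pm, pf = min(sweep(tar_scores, non_scores), key=lambda r: abs(r[1] - r[2]))
    return 0.5 * (pm + pf)

# made-up scores that separate the classes fairly well but are shifted away from zero
tar = [7.0, 6.5, 5.5, 8.0]
non = [4.0, 5.0, 6.0, 3.0]
p_tar = 0.5
theta = 0.0  # Bayes threshold for equal priors and costs

actual = dcf(*error_rates(tar, non, theta), p_tar)
minimum = min_dcf(tar, non, p_tar)
norm = dummy_cost(p_tar)
eer = equal_error_rate(tar, non)
print("actual DCF", actual, "normalized", actual / norm)
print("min DCF", minimum, "normalized", minimum / norm)
print("EER", eer)
```

On these toy scores the cost at the Bayes threshold is no better than the dummy system, while the minimum cost is much lower, which is the signature of a discriminative but badly calibrated system.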
0:19:02so the cost that i've been talking about can be decomposed
0:19:05into discrimination and calibration components
0:19:10so let's see how
0:19:12let's say we assume a cost with equal priors and equal costs
0:19:18in that case
0:19:19the optimal threshold will be zero
0:19:21the bayes optimal threshold would be zero
0:19:25we compute the cost using that threshold
0:19:28and we get this
0:19:30given that the priors and costs are the same then the cost will be given
0:19:34by the average of these two areas
0:19:36as shown here
0:19:38now we can also compute the min cost as i mentioned before
0:19:42basically sweep the threshold
0:19:44and choose the threshold that gives
0:19:46the minimum cost
0:19:47which again is the average between these two areas which you see is much smaller than
0:19:52the average between these two areas in this case
0:19:55and the difference between
0:19:57those two costs
0:19:59can be seen
0:20:00as the additional cost that you're incurring because your system was miscalibrated
0:20:06so this orange area here which is the difference between
0:20:09the sum
0:20:11of the areas here and the sum of the areas here
0:20:14is the cost due to miscalibration and that's one way of measuring
0:20:20how miscalibrated a system is
0:20:25so there's discrimination which is how well the scores
0:20:28separate the classes
0:20:30and there's calibration which is whether the scores can be interpreted probabilistically
0:20:34which implies that you can make optimal bayes decisions
0:20:37if they are calibrated
0:20:40and the key here is that discrimination is the part
0:20:43of the
0:20:45performance that cannot be changed
0:20:47if we transform the scores with an invertible transformation
0:20:52so here's a simple example let's say you have this distribution of scores
0:20:58and you have a threshold t that you chose for some reason
0:21:01could be the optimal or not
0:21:03and you transform the scores with
0:21:06any monotonic transformation
0:21:09whatever in this example it is just an affine transformation
0:21:13you transform it
0:21:15and you can also transform the threshold t
0:21:18with the same exact function
0:21:22that transformed threshold will correspond to exactly the same cost
0:21:27as the threshold t in the original domain
0:21:31so basically
0:21:33by doing a monotonic transformation of your scores you cannot change their discrimination
0:21:39the minimum cost
0:21:41that you will be able to find in both cases will be the same
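The invariance argument can be verified on a toy example: apply a strictly increasing function to both the scores and the threshold and the cost does not move. The sketch below assumes equal priors and unit costs; the two transformations and all names are arbitrary illustrative choices.

```python
import math

def cost_at(tar, non, t, p_tar=0.5, c_miss=1.0, c_fa=1.0):
    """Detection cost when trials with score > t are accepted."""
    p_miss = sum(1 for s in tar if s <= t) / len(tar)
    p_fa = sum(1 for s in non if s > t) / len(non)
    return c_miss * p_tar * p_miss + c_fa * (1.0 - p_tar) * p_fa

def min_cost(tar, non):
    thresholds = [float("-inf")] + sorted(set(tar) | set(non))
    return min(cost_at(tar, non, t) for t in thresholds)

def affine(s):
    return 0.5 * s - 3.0

def squash(s):
    return math.tanh(s / 10.0)

# made-up scores and an arbitrary threshold
tar = [7.0, 6.5, 5.5, 8.0]
non = [4.0, 5.0, 6.0, 3.0]
t = 5.2

checks = []
for f in (affine, squash):
    f_tar = [f(s) for s in tar]
    f_non = [f(s) for s in non]
    # same cost at the transformed threshold, and same minimum over all thresholds
    checks.append(cost_at(tar, non, t) == cost_at(f_tar, f_non, f(t)))
    checks.append(min_cost(tar, non) == min_cost(f_tar, f_non))
print(checks)
```

The error counts depend only on the ordering of scores relative to the threshold, and a strictly increasing function preserves that ordering.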
0:21:52the cost that i've been talking about measures the performance at a single operating point
0:21:57it evaluates the quality of the hard decisions for a certain theta
0:22:04a more comprehensive measure
0:22:06is the cross entropy which is given by this expression and you probably all know
0:22:11the empirical cross-entropy is the negative average
0:22:15of the logarithm of the posterior that the system gives
0:22:19to the correct class for each sample
0:22:22so you want this posterior to be as high as possible one
0:22:25if possible
0:22:27then you get
0:22:28a logarithm of zero and if that happens for every sample then you get a
0:22:33cross entropy of zero which is what you want
0:22:37now there's a weighted version of this cross entropy
0:22:40which is
0:22:41basically the same
0:22:43you split your samples into two terms
0:22:46the ones for
0:22:49class zero and the ones for class one
0:22:51and you weight
0:22:53these averages
0:22:55by a prior
0:22:56that is this effective prior that i talked about before
0:23:01so basically you make yourself independent of the priors that you see in the test data
0:23:07you can evaluate for any
0:23:09prior you want
0:23:13these posteriors are computed from the llrs
0:23:16and the priors
0:23:18using bayes rule
0:23:19and please note that these are the priors that you're applying here
0:23:23the ones that you need to use to compute the posteriors from the llrs
0:23:27okay and the famous cllr that we use in
0:23:30nist evaluations and many papers
0:23:32is defined as this weighted cross entropy when the priors are point five
0:23:37and it's normalized by the logarithm of two as i'll explain in the next slide
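The weighted cross entropy and Cllr can be computed directly from target and non-target LLRs. This is a minimal sketch following the description above: posteriors are obtained from the LLRs and the chosen prior with Bayes rule, each class is averaged separately, and the two averages are weighted by the prior. Function names and the toy scores are illustrative.

```python
import math

def softplus(x):
    # log(1 + exp(x)) computed without overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def weighted_cross_entropy(tar_llrs, non_llrs, p_tar):
    prior_logit = math.log(p_tar / (1.0 - p_tar))
    # target posterior is sigmoid(llr + prior_logit), so -log(posterior) = softplus(-(llr + prior_logit))
    tar_term = sum(softplus(-(l + prior_logit)) for l in tar_llrs) / len(tar_llrs)
    # non-target posterior is sigmoid(-(llr + prior_logit))
    non_term = sum(softplus(l + prior_logit) for l in non_llrs) / len(non_llrs)
    return p_tar * tar_term + (1.0 - p_tar) * non_term

def cllr(tar_llrs, non_llrs):
    """Weighted cross entropy at prior 0.5, normalized by log(2)."""
    return weighted_cross_entropy(tar_llrs, non_llrs, 0.5) / math.log(2.0)

# a system that always outputs llr = 0 knows nothing and gets exactly 1
print(cllr([0.0, 0.0], [0.0, 0.0]))

# made-up scores: well separated and centred, then the same kind of scores shifted upward
print(cllr([5.0, 4.0], [-5.0, -4.0]))
print(cllr([7.0, 6.5, 5.5, 8.0], [4.0, 5.0, 6.0, 3.0]))
```

The shifted scores still rank the trials well, but their Cllr is above one because they claim far more evidence for the target class than they should.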
0:23:44so the weighted cross entropy can be decomposed also
0:23:47like the cost
0:23:49into discrimination and calibration terms
0:23:52basically you compute the actual weighted cross entropy
0:23:56and you subtract
0:23:58the minimum
0:23:59weighted cross entropy
0:24:01now this minimum is not as trivial to obtain as for the cost you
0:24:05can't just choose the threshold because here we're evaluating the scores themselves not just
0:24:11the decisions
0:24:12so we need to actually warp the scores to get
0:24:16the best possible weighted cross entropy
0:24:19without changing the discrimination
0:24:21of the scores
0:24:22and that means
0:24:23using a monotonic transformation
0:24:26and there's an algorithm called pool adjacent violators
0:24:31that does exactly that so
0:24:34without changing the rank of the scores the order of the scores
0:24:38it does the best it can to minimize the weighted cross entropy
0:24:42and so that's what we use to compute
0:24:45this delta
0:24:47that measures how miscalibrated your system is
0:24:50in terms of weighted cross entropy
0:24:53and this weighted cross entropy is bounded the same as
0:24:56the cost
0:24:57by a dummy system that in this case is the system that outputs
0:25:02instead of
0:25:02the posteriors it outputs directly the priors so it's a system that doesn't know
0:25:06anything about its input
0:25:09but still does the
0:25:10best it can by
0:25:11outputting the priors
0:25:16that means that the worst
0:25:18min cllr
0:25:20is one point zero because we normalize by
0:25:23the log of two which is exactly this bound when you evaluate it at point five
0:25:29so this means that the
0:25:31minimum cllr
0:25:33will never be
0:25:34worse than one
0:25:36and if the actual cllr is worse than one then you know for sure
0:25:39that you're gonna have a difference here
0:25:41because this is never
0:25:43larger than one and this is larger than one and then it means you have
0:25:46a calibration problem
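A simplified version of the minimum Cllr computation with pool adjacent violators can be sketched as follows. It is an illustration under stated assumptions rather than a reference implementation: tied scores are grouped first, blocks containing only one class are given an infinite LLR of the correct sign (zero cost), and the LLR of a mixed block is its target-to-non-target ratio divided by the empirical prior odds. Toolkit implementations typically treat the extremes more carefully. All names are hypothetical.

```python
import math

def pav_blocks(tar_scores, non_scores):
    """Pool adjacent violators on the 0/1 labels ordered by score.

    Returns blocks [n_targets, n_trials] whose target fraction is non-decreasing.
    """
    trials = sorted([(s, 1) for s in tar_scores] + [(s, 0) for s in non_scores])
    tied = []
    prev_score = None
    for score, label in trials:
        if tied and score == prev_score:
            tied[-1][0] += label  # tied scores must share one fitted value
            tied[-1][1] += 1
        else:
            tied.append([label, 1])
        prev_score = score
    pooled = []
    for n_tar, n in tied:
        pooled.append([n_tar, n])
        # merge while the previous block has a larger target fraction than the last one
        while len(pooled) > 1 and pooled[-2][0] * pooled[-1][1] > pooled[-1][0] * pooled[-2][1]:
            last_tar, last_n = pooled.pop()
            pooled[-1][0] += last_tar
            pooled[-1][1] += last_n
    return pooled

def min_cllr(tar_scores, non_scores):
    n_tar_total, n_non_total = len(tar_scores), len(non_scores)
    prior_odds = n_tar_total / n_non_total
    tar_cost = non_cost = 0.0
    for n_tar, n in pav_blocks(tar_scores, non_scores):
        n_non = n - n_tar
        if n_tar and n_non:
            lr = (n_tar / n_non) / prior_odds  # likelihood ratio assigned to this block
            tar_cost += n_tar * math.log1p(1.0 / lr)
            non_cost += n_non * math.log1p(lr)
        # pure blocks contribute zero cost
    return 0.5 * (tar_cost / n_tar_total + non_cost / n_non_total) / math.log(2.0)

# made-up scores: shifted away from zero, with one target/non-target pair out of order
tar = [7.0, 6.5, 5.5, 8.0]
non = [4.0, 5.0, 6.0, 3.0]
print(min_cllr(tar, non))
print(min_cllr([1.0, 2.0], [-1.0, -2.0]))  # perfectly separable
print(min_cllr([1.0, 1.0], [1.0, 1.0]))    # no information at all
```

The last call shows the bound discussed above: scores that carry no information give a minimum Cllr of one, never more.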
0:25:51finally in terms of evaluation i wanted to mention these
0:25:55curves the applied probability of error curves ape
0:25:59the cllr gives a single summary number
0:26:02but you might want to actually see
0:26:04the performance across
0:26:06a range of operating points and that's what these curves do
0:26:10they basically show the cost
0:26:14as a function of the effective p tar
0:26:19which also defines the theta
0:26:22what we see here is
0:26:24the cost
0:26:25for prior decisions
0:26:27and the prior decisions are what i mentioned before
0:26:30basically just a dummy system that always outputs
0:26:34the priors instead of posteriors
0:26:38and the red is our system whatever that is
0:26:43calibrated or not
0:26:46and the dashed curve is the very best you can do if you were to
0:26:50warp your scores using the pav algorithm
0:26:54so basically for each theta the difference between the dashed and the red
0:27:00is the miscalibration at that
0:27:02operating point
0:27:05and the nice property of these curves is that the cllr
0:27:08is proportional to the area under the curves
0:27:12so the actual cllr is proportional to the area under the red curve
0:27:17and the min cllr is proportional to the area under the dashed
0:27:23and furthermore the equal error rate is the maximum
0:27:26of these
0:27:28a red curve
0:27:30and there are variants of these curves
0:27:32which you can find in these papers
0:27:35changing the way the axes are defined
0:27:40so let's say now that we
0:27:43already know our system has a calibration problem
0:27:45should we worry about it should we try to fix it
0:27:50there are some scenarios where there is
0:27:52no problem if you have a miscalibrated system there is no need to fix
0:27:56it for example
0:27:59if you know what the cost function is ahead of time
0:28:02and there's development data available
0:28:04then all you need to do is run the system on the development data
0:28:08and find the
0:28:10empirically best
0:28:14threshold on that data for that system and that cost function
0:28:17and you're done
0:28:20you also don't need to worry about calibration if
0:28:23you only care about ranking
0:28:25the samples so you want to output the most
0:28:29likely targets
0:28:30and nothing else
0:28:33on the other hand it may be very necessary to do calibration in many other scenarios
0:28:39one of them is for example if you don't know ahead of time what the
0:28:43system will be used for exactly what is the application
0:28:45that means
0:28:47you don't know the cost function and if you don't know the cost function
0:28:50you cannot optimize the threshold
0:28:52ahead of time
0:28:53so if you want to give the user of the system a knob
0:28:57that defines this effective p tar
0:29:01then the system has to be calibrated for the
0:29:05bayes optimal threshold to be
0:29:08really optimal
0:29:09to work well
0:29:11another case where you need to do calibration is if you want to
0:29:15get a probabilistic value
0:29:18from your system
0:29:19some measure of the uncertainty that the system has
0:29:23when it makes
0:29:24its decisions
0:29:25and you can use that uncertainty for example
0:29:28to reject samples when the system is uncertain
0:29:33if your llr is too close to the threshold that you were planning
0:29:36to use to make hard decisions
0:29:38then perhaps
0:29:39you want the system not to make a decision and tell the user i don't know
0:29:44i am uncertain
0:29:47and another case is when you actually don't want to make hard decisions
0:29:51you want to report a value
0:29:53that is interpretable
0:29:55as for example in forensic voice comparison
0:30:02okay so
0:30:03let's say we do want to fix
0:30:05calibration we are in one of those scenarios where it matters
0:30:10one very common approach to do this is to use linear logistic regression
0:30:15so this assumes that the llr
0:30:18the calibrated score
0:30:20is an affine transformation of whatever your system outputs
0:30:25and the parameters of this model are the w and b
0:30:30and it uses the weighted cross entropy as the loss function
0:30:37to compute the weighted cross entropy we need posteriors not llrs so
0:30:41we need to convert those llrs into posteriors and we use this expression
0:30:44that i showed before
0:30:46which is
0:30:47the llr is the log odds of the posterior minus the log odds of the prior
0:30:53and if we
0:30:55basically invert this expression we get the logistic function which is the inverse
0:31:00of the logit
0:31:04and finally after doing
0:31:09trivial computations we get this expression which is the standard
0:31:13linear logistic regression expression
0:31:15we then further plug this posterior into the expression of the weighted cross
0:31:20entropy to get the loss
0:31:21that we can then optimize however we wish
0:31:24and finally once we optimize this on
0:31:28some data
0:31:30we get the w and b
0:31:32that are optimal for that
0:31:34data
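The whole recipe (affine map from raw score to LLR, logistic function to get posteriors, weighted cross entropy as the loss) fits in a short sketch. The version below is only an illustration: it uses plain gradient descent with a fixed step on standardized scores for brevity, whereas a practical implementation would use a proper optimizer, and it assumes the scores are not all identical and the two classes overlap. The names `train_linear_calibration`, `step` and `n_iter` are hypothetical.

```python
import math

def sigmoid(x):
    # numerically stable logistic function, the inverse of the logit
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def cllr(tar_llrs, non_llrs):
    tar = sum(softplus(-l) for l in tar_llrs) / len(tar_llrs)
    non = sum(softplus(l) for l in non_llrs) / len(non_llrs)
    return 0.5 * (tar + non) / math.log(2.0)

def train_linear_calibration(tar_scores, non_scores, p_tar=0.5, step=1.0, n_iter=2000):
    """Fit llr = w * score + b by minimizing the weighted cross entropy."""
    pooled = list(tar_scores) + list(non_scores)
    mean = sum(pooled) / len(pooled)
    std = math.sqrt(sum((s - mean) ** 2 for s in pooled) / len(pooled))
    # standardizing is only a numerical convenience for the fixed-step descent
    tar = [(s - mean) / std for s in tar_scores]
    non = [(s - mean) / std for s in non_scores]
    tau = math.log(p_tar / (1.0 - p_tar))  # log odds of the prior
    w = b = 0.0
    for _ in range(n_iter):
        grad_w = grad_b = 0.0
        for s in tar:
            g = -p_tar * sigmoid(-(w * s + b + tau)) / len(tar)
            grad_w += g * s
            grad_b += g
        for s in non:
            g = (1.0 - p_tar) * sigmoid(w * s + b + tau) / len(non)
            grad_w += g * s
            grad_b += g
        w -= step * grad_w
        b -= step * grad_b
    # undo the standardization so that w and b apply to the raw scores
    return w / std, b - w * mean / std

# made-up raw scores: discriminative but shifted away from zero
tar = [7.0, 6.5, 5.5, 8.0]
non = [4.0, 5.0, 6.0, 3.0]
w, b = train_linear_calibration(tar, non)
cllr_raw = cllr(tar, non)
cllr_cal = cllr([w * s + b for s in tar], [w * s + b for s in non])
print(w, b, cllr_raw, cllr_cal)
```

Because the map is affine with positive slope, the ordering of the trials and therefore the discrimination is untouched; only the scale and offset of the scores change.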
0:31:39this is an affine transformation so it doesn't change the shapes
0:31:42of the distributions at all
0:31:45this looks like it did nothing
0:31:47but what it did is
0:31:50shift and shrink
0:31:53the axis so that the resulting scores
0:31:57are calibrated
0:32:01and in terms of cllr you can see that the raw scores
0:32:05which are these ones
0:32:07had a very high cllr actually higher than one so they were
0:32:10worse than one
0:32:12and after you calibrate them
0:32:14where all you did was really scale and shift
0:32:17you get a much better cllr
0:32:20this minimum here is the
0:32:25very best you can do
0:32:27so with the affine transformation we are actually doing almost as
0:32:32good as the very best
0:32:36which means that the affine assumption was actually in this case quite good
0:32:41this is a real case this is voxceleb two data processed with a
0:32:46plda system
0:32:50there are many other approaches to do calibration i'm not gonna cover them because it
0:32:54would take another
0:32:55another whole keynote
0:32:58there are nonlinear approaches
0:33:02that in some
0:33:04cases do better than linear
0:33:07when the affine assumption is not perfect
0:33:12then there are regularized and bayesian approaches that actually do quite well when you have very
0:33:16little data
0:33:18to train the calibration model
0:33:19and then there are approaches that go all the way
0:33:23to no labeled data
0:33:26so there's data but
0:33:28you don't know the labels
0:33:31and those work surprisingly well
0:33:36if we have a calibrated score then
0:33:39we know we can treat those scores as log-likelihood ratios which means
0:33:43that we can use them to make optimal decisions
0:33:46and we can also convert them to posteriors if we wanted to and if we
0:33:50had the priors
0:33:53and a very nice property of the llr is that
0:33:57if you were to compute
0:33:58the log-likelihood ratio
0:34:00of your
0:34:01already calibrated score then you would get
0:34:04the same thing
0:34:05so you can treat
0:34:07this score the llr as a feature
0:34:10and if you
0:34:11compute this ratio you would get the same value
0:34:16and this leads to some nice properties like for example
0:34:21for a calibrated score
0:34:24the two distributions have to cross exactly at zero
0:34:28because when the llr is zero
0:34:30this ratio is one
0:34:32which means that these two
0:34:34have to be the same
0:34:36and these two are exactly what we're seeing here the densities
0:34:40the probability density function of the score for each of the two classes
0:34:44they have to cross at zero
0:34:46and further if we assume that one of these two distributions is gaussian
0:34:51then the other distribution is forced to be gaussian
0:34:54with the same
0:34:55standard deviation and with symmetric means
0:34:59and this as i said is a real example and it's actually quite
0:35:03close to that assumption
0:35:05in this voxceleb two data
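The Gaussian property follows in a few lines from the statement that the LLR of a calibrated score is the score itself. The derivation below is a sketch with assumed notation: $s$ is the calibrated score and the target scores are taken to be Gaussian with mean $\mu$ and variance $\sigma^2$.

```latex
\begin{align}
\log \frac{p(s \mid \text{tar})}{p(s \mid \text{non})} = s
\quad &\Rightarrow \quad
p(s \mid \text{non}) = p(s \mid \text{tar})\, e^{-s} \\
p(s \mid \text{non})
&= \frac{1}{\sqrt{2\pi\sigma^2}}
   \exp\!\left(-\frac{(s-\mu)^2}{2\sigma^2} - s\right) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}}
   \exp\!\left(-\frac{\bigl(s-(\mu-\sigma^2)\bigr)^2}{2\sigma^2}\right)
   \exp\!\left(\frac{\sigma^2}{2} - \mu\right)
\end{align}
```

The non-target density is therefore Gaussian with the same variance, and it integrates to one only if $\sigma^2 = 2\mu$, in which case its mean is $\mu - \sigma^2 = -\mu$. Setting $s = 0$ in the first line also gives $p(0 \mid \text{tar}) = p(0 \mid \text{non})$, the crossing at zero.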
0:35:09okay so to recap this part before we go on
0:35:13what i've been saying is that det curves equal error rate min dcf
0:35:18measure only discrimination performance
0:35:21basically this means that they ignore the issue of threshold selection they ignore the issue of
0:35:26how to get
0:35:28to the actual decisions
0:35:29from the scores
0:35:31on the other hand the weighted cross entropy the actual dcf and the ape
0:35:36curves measure total performance
0:35:40that includes the issue of how to
0:35:43make the decisions
0:35:45and we can further use these metrics
0:35:48to compute the
0:35:49calibration loss
0:35:51so to see whether the system is well calibrated or not
0:35:57and if you find the calibration is actually not good then fixing these calibration issues
0:36:03is usually easy in ideal conditions you can train an invertible transformation
0:36:09using
0:36:10usually a small representative dataset
0:36:13which is enough because
0:36:15in many of the approaches the number of parameters is very small so
0:36:19you don't need a lot of data
0:36:23the key here though
0:36:25is that you need a representative dataset
0:36:27and that's
0:36:29what i'm gonna discuss in the next slides
0:36:33basically what we observe repeatedly is that calibration of our speaker verification
0:36:42systems is
0:36:43extremely fragile
0:36:45it is now for our current systems and it has always been
0:36:49at least since i've been working
0:36:51on speaker verification for
0:36:53almost twenty years now
0:36:57anything like language noise distortions duration they all affect
0:37:02the calibration parameters
0:37:05and that means that a calibration model trained on one condition
0:37:08is very unlikely to generalize to another condition
0:37:11on the other hand the discrimination performance is usually still reasonable
0:37:16on unseen conditions
0:37:17so if you train a system on telephone data and you try to use it
0:37:21on microphone data
0:37:22it may not be the best you can do
0:37:25but it will still be reasonable
0:37:28on the other hand if you train your calibration model on telephone data and try
0:37:32to use it on microphone data it may
0:37:34perform horribly
0:37:37and this is one example
0:37:40i'm training the calibration parameters
0:37:45on two different sets
0:37:47speakers in the wild and sre sixteen
0:37:51and applying those models
0:37:53on voxceleb two
0:37:55the
0:37:58scores are identical the raw scores and all i'm doing is changing the w and
0:38:02the b based on the calibration set
0:38:05what we see here
0:38:07is that the model that was trained with speakers in the wild
0:38:12is extremely good
0:38:13it's basically almost
0:38:17reaching the minimum while the model that was trained on sre sixteen is
0:38:20quite bad
0:38:21it is better than the raw scores but it is still quite bad
0:38:24compared to the best you can do
0:38:27and this is not surprising because
0:38:29voxceleb is actually quite close to speakers in the wild in terms of conditions
0:38:33while sre sixteen is not
0:38:36you may think maybe sre sixteen is just a bad set for doing calibration
0:38:41but that's not the case because if you evaluate on sre sixteen
0:38:45evaluation data
0:38:48then the opposite happens
0:38:50so the
0:38:51calibration model that is good in that case is the one that was trained on
0:38:54sre sixteen so you see this
0:38:58gives a much lower
0:39:00cllr than the one that was
0:39:02calibrated with speakers in the wild
0:39:06in this case again you're almost reaching the minimum
0:39:10so basically this tells us that the conditions on which the calibration model is trained
0:39:14determine
0:39:17where it is going to be good
0:39:20you have to match the conditions of your evaluation
0:39:27this goes even deeper
0:39:29if you
0:39:30zoom into a dataset you can actually find calibration issues within the dataset itself
0:39:38i'm showing
0:39:39results on the sre sixteen evaluation set
0:39:42where i train the calibration parameters on exactly that same set so this is
0:39:47a cheating calibration experiment
0:39:51i'm showing the
0:39:53cllr which is the solid bar and the min cllr which
0:39:57here are the same by construction and here is the
0:40:01relative difference between those two
0:40:04so for the full set i have no loss
0:40:06by construction as i said
0:40:08on the other hand if i start to subset
0:40:11this full set
0:40:13randomly or by gender
0:40:16or by condition
0:40:17i start to see
0:40:19more and more calibration loss
0:40:21so the random subset
0:40:24is well calibrated females and males are reasonably well calibrated
0:40:28but for the specific conditions
0:40:31that are defined by the language the gender
0:40:34and whether the two waveforms in the trial come from the same telephone number
0:40:39or not
0:40:41then we start to see calibration loss
0:40:43of up to almost twenty percent in this case
0:40:49looking at the distributions for the
0:40:52tagalog
0:40:53female same telephone number set
0:40:57we see that the distributions are in fact shifted
0:41:00they should be aligned with zero remember that these two distributions if
0:41:04they were calibrated should cross at zero
0:41:08but they don't
0:41:09so that means they are shifted to the right and that is reasonable because since there
0:41:13is the same telephone number for both
0:41:16sides of the trial
0:41:18then it means that
0:41:19they look very much the same
0:41:23more than if the channels were different
0:41:27every trial looks more target
0:41:30than it should
0:41:33or than it does in the overall distribution
0:41:36and the opposite happens on the different telephone number set
0:41:40they shift to the left
0:41:42and the final comment here is that this miscalibration within a dataset
0:41:49is also causing a discrimination problem
0:41:52because if you pool these
0:41:54trials as they are miscalibrated
0:41:56you will get poorer discrimination than if you were to first calibrate them
0:42:00and then pool them together
0:42:05there's an interplay here between calibration and discrimination
0:42:11when the miscalibration is happening
0:42:14for different sub conditions within the set
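The interplay can be illustrated with a small simulation. The sketch below is not from the talk: it assumes two synthetic sub conditions whose scores are individually well separated but shifted in opposite directions, and compares the equal error rate when pooling them as they are against pooling after removing each shift. The shift sizes, sample counts, and function names are hypothetical.

```python
import math
import random

def equal_error_rate(tar, non):
    """Sweep the threshold over all scores and return the rate where
    the miss and false alarm rates are closest to each other."""
    labeled = sorted([(s, 1) for s in tar] + [(s, 0) for s in non])
    misses, false_alarms = 0, len(non)
    best_gap, eer = float("inf"), None
    for _, is_target in labeled:
        # Moving the threshold just above this score rejects the trial.
        if is_target:
            misses += 1
        else:
            false_alarms -= 1
        p_miss = misses / len(tar)
        p_fa = false_alarms / len(non)
        if abs(p_miss - p_fa) < best_gap:
            best_gap = abs(p_miss - p_fa)
            eer = (p_miss + p_fa) / 2
    return eer

random.seed(0)
STD = math.sqrt(8.0)

def sample_condition(shift, n=1000):
    """Calibrated Gaussian LLRs plus a condition-specific shift."""
    tar = [random.gauss(4.0, STD) + shift for _ in range(n)]
    non = [random.gauss(-4.0, STD) + shift for _ in range(n)]
    return tar, non

tar_a, non_a = sample_condition(+3.0)  # e.g. a condition shifted to the right
tar_b, non_b = sample_condition(-3.0)  # e.g. a condition shifted to the left

eer_pooled_as_is = equal_error_rate(tar_a + tar_b, non_a + non_b)
eer_calibrated_first = equal_error_rate(
    [s - 3.0 for s in tar_a] + [s + 3.0 for s in tar_b],
    [s - 3.0 for s in non_a] + [s + 3.0 for s in non_b],
)
print(eer_pooled_as_is, eer_calibrated_first)
```

Within each sub condition the ranking of trials is identical in both cases. The only thing that changes is how the sub conditions line up against each other once they share a single threshold.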
0:42:21okay so there have been several approaches in the literature
0:42:25over the last decades at least
0:42:28that try to
0:42:31solve this problem of
0:42:33condition dependent miscalibration
0:42:38where the
0:42:39assumption of having a global
0:42:42calibration model
0:42:43that has a single w and a single b
0:42:46for all trials is actually not a good one
0:42:50so most of these approaches assume that there's an external class
0:42:54or vector representation
0:42:57that is either given by the metadata
0:42:59or estimated
0:43:02that represents the condition of the samples
0:43:04the enrollment and the test samples
0:43:07and these vectors
0:43:10are fed into the calibration stage and they are used to condition the parameters of
0:43:14this calibration stage
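In its simplest form, conditioning the calibration parameters could look like the sketch below, where a discrete condition label selects its own w and b and a global model is the fallback. This is a deliberately reduced illustration, not any of the published approaches: the class, the function, the condition label, and the parameter values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AffineCalibrator:
    """Affine score-to-LLR map: llr = w * score + b."""
    w: float
    b: float

    def __call__(self, score):
        return self.w * score + self.b

def calibrate_trial(score, condition, per_condition, global_calibrator):
    """Use the calibrator for this trial's condition, else the global one."""
    calibrator = per_condition.get(condition, global_calibrator)
    return calibrator(score)

# Hypothetical parameters: trials from the same telephone number get a
# negative offset to undo a shift of their scores to the right.
per_condition = {"same_number": AffineCalibrator(w=1.0, b=-2.0)}
global_calibrator = AffineCalibrator(w=0.5, b=1.0)

print(calibrate_trial(3.0, "same_number", per_condition, global_calibrator))
print(calibrate_trial(4.0, "unseen_condition", per_condition, global_calibrator))
```

Approaches that use a continuous condition vector replace the lookup with functions that map the vector to w and b, but the calibrated score keeps the same affine form.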
0:43:17here are some approaches if you are interested in taking a look
0:43:24overall these approaches are quite successful at
0:43:27making the
0:43:28final system better actually more discriminative
0:43:32because they align the distributions of the different sub conditions before putting them together
0:43:40another family of approaches is this one
0:43:43where they put the condition awareness
0:43:46in the backend itself rather than in the calibration stage
0:43:50there's again a condition extractor of some kind
0:43:54that affects the parameters of the backend
0:43:58the thing is that this approach doesn't necessarily fix calibration
0:44:02it improves discrimination in general
0:44:04but you may still need to do calibration if the backend is still for
0:44:08example a plda backend and i think in these cases
0:44:12what comes out of here is still miscalibrated
0:44:15so you still need a
0:44:18perhaps normal
0:44:19calibration model at the output
0:44:23okay and recently we proposed an approach that jointly trains the backend
0:44:30and a condition dependent
0:44:32calibrator
0:44:33where here we assume that the condition is extracted automatically as a function of the
0:44:38embeddings themselves
0:44:40and the whole thing
0:44:42is trained jointly to optimize weighted cross entropy
0:44:47this model actually gives
0:44:49excellent calibration performance across a wide range of conditions
0:44:53you can actually find the paper
0:44:56in the odyssey proceedings if you're interested
0:44:59and there's a very related paper also in odyssey
0:45:04by daniel garcia romero
0:45:06which i suggest you take a look at too if you're interested in these topics
0:45:13okay so
0:45:15to finish up
0:45:17i've been talking about two
0:45:20wide application scenarios for speaker verification technology
0:45:24one of them
0:45:25is where you assume that there's development data available for the evaluation conditions
0:45:32in that case
0:45:33as i said you can either calibrate the system on that data which
0:45:38is matched
0:45:39or just
0:45:40find the best threshold
0:45:43basically calibration in that
0:45:45scenario is not a big issue
0:45:49in fact most speaker verification papers
0:45:53operate under this scenario
0:45:55it's also the scenario of the nist evaluations where we usually get development data which
0:46:00is maybe not perfectly matching but
0:46:02pretty well matched to what we will see in the evaluation
0:46:07in this odyssey's proceedings i found thirty three speaker recognition papers
0:46:13of which twenty eight fall
0:46:15in this category
0:46:19they mostly report just equal error rate and min dcf some report actual values
0:46:25some don't
0:46:28and i think it's fine to just report min dcf in those cases because you are
0:46:32basically assuming that the
0:46:35calibration issue is
0:46:37easy to solve
0:46:39so that
0:46:39if you were to have
0:46:41development data
0:46:43you could train a calibration model and you would reach very close
0:46:47to the minimum
0:46:48dcf
0:46:49the actual performance would get very close to the minimum
0:46:54now the caveat there is that
0:46:57you may still have miscalibration problems within sub conditions and if you don't report
0:47:02actual dcf on sub conditions
0:47:05then that's
0:47:07hidden behind the overall performance
0:47:11the other big scenario is
0:47:13and the
0:47:15the one where we don't have development data
0:47:18for the evaluation conditions
0:47:21in that case we cannot calibrate or adjust the threshold
0:47:25on matched conditions we can only hope
0:47:28that our system will
0:47:30work well out of the box
0:47:35from the
0:47:36odyssey proceedings i only found five
0:47:39papers that operate under this scenario where they
0:47:42actually test
0:47:43a system that was trained on some conditions
0:47:47on data that is from different conditions
0:47:50and they do not assume that they have
0:47:52development data for that
0:47:54evaluation condition
0:47:58so basically we as a community
0:48:00are very heavily focused on the first scenario we always have been
0:48:05historically
0:48:09and i believe this may be why our current speaker verification technology
0:48:14cannot be used out of the box
0:48:16we are just
0:48:18used to
0:48:21asking for development data
0:48:22in order to tune at least the calibration stage of our system
0:48:28we know the calibration stage has to be tuned otherwise the system won't work
0:48:33or it may work very badly
0:48:36so my question is and maybe we can discuss this in the question and answer session
0:48:43wouldn't it be worth it for us as a community to pay more attention to this
0:48:48no development data available scenario
0:48:53i believe that the new end to end approaches have the
0:48:57potential to be quite good
0:48:59at generalizing
0:49:00and this is basically based on the
0:49:03paper that i mentioned that actually
0:49:05is not really end to end
0:49:09and it works
0:49:11quite well
0:49:12surprisingly well in terms of calibration across conditions on unseen conditions
0:49:18so i think it's doable
0:49:22maybe if we work on this as a community then maybe we can reduce or even
0:49:27eliminate
0:49:28if we're very optimistic
0:49:30the performance difference between the two scenarios so maybe we can end up with systems
0:49:36that are not so dependent on having development data
0:49:40and perhaps even having development data won't help much over
0:49:47the out of the box system
0:49:51so what would it entail to develop for this no development data scenario
0:49:56first of all
0:49:57we have to assume that we will need heterogeneous data for training of course because
0:50:02if you train a system on telephone data it is
0:50:05quite unlikely that it will generalize to
0:50:07many other conditions
0:50:11the second thing is one has to hold out
0:50:15some sets
0:50:16at least
0:50:18during development that are not even used for
0:50:20hyperparameter tuning
0:50:22because otherwise they would not be completely unseen
0:50:26so these sets ought to be truly
0:50:30held out until the very end until you just evaluate the system out of the
0:50:34box as in the scenario that we are imagining
0:50:38and of course you need to report actual metrics and not just minimum ones because in this
0:50:42case you cannot assume that you're going to be able to do calibration well
0:50:46you need to test whether the model
0:50:49as it stands
0:50:51is actually giving you
0:50:53good calibration and good decisions
0:50:56and finally it's probably a good idea to also report metrics
0:51:01on sub conditions of the set
0:51:04because the miscalibration issues within the sub conditions may be hidden
0:51:09within the full distribution of the whole set they compensate each other sometimes
0:51:15and by reporting
0:51:19metrics on sub conditions
0:51:21both actual and minimum you can actually tell
0:51:24if there's a calibration problem
0:51:28thank you very much for listening and i'm looking forward to your questions in the
0:51:32next session