0:00:15 Sorry, so this is what I want to talk about: I'll give a bit of the background, briefly describe the system, talk about the data we're playing with (fourteen distinct conditions we're trying to calibrate), cover a little on existing calibration methods, and then propose trial-based calibration.
0:00:34 So we've had a very good background already from the other talks, but even very accurate SID systems may not be well calibrated. This means that you might have very low equal error rates for conditions evaluated independently, but once you pool those together, a single threshold won't reach that operating point for all of them.
0:00:53 You can see this problem right here: the blue are distributions of target trials, the red ones are impostor trials, and you can see the yellow thresholds at the bottom vary quite a bit across the conditions. So calibrating correctly for each condition helps us to reduce this threshold variability, among many other benefits.
0:01:17 You probably don't need a refresher after the other talks: what we want, essentially, is calibrated scores that indicate the weight of evidence for a given trial; that is, the likelihood ratio: is this the person of interest or is it not? The obvious application is forensic evidence.
0:01:38 Subsequently, if we have calibrated scores, we can make confident threshold-based Bayes decisions. And this isn't trivial, as we've heard: without a representative calibration set, it's difficult to handle the various conditions with a single calibration model. Later in this talk we'll be measuring system performance with a number of metrics, mainly focusing on calibration loss.
0:02:01 Calibration loss indicates how close we are to performing the best we can at a particular operating point; in this work we're focusing on equal costs between misses and false alarms, so around the equal-error point. The other metric we're using is Cllr; this is a more stringent criterion, looking at how well calibrated we are across all points on the operating curve. We're also looking at the average equal error rate across the fourteen conditions, to make sure that in calibrating the system we're not losing speaker discriminability. And of course, for all metrics, lower is better.
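As a concrete illustration of the Cllr metric mentioned here, a minimal sketch (the scores are invented; this is the standard log-likelihood-ratio cost, not the exact evaluation code used in the study):

```python
import math

def cllr(target_llrs, impostor_llrs):
    """Cllr: average base-2 log cost of log-likelihood ratios over target and
    impostor trials; lower is better, 0 means perfect calibration and separation."""
    c_tar = sum(math.log2(1 + 2.0 ** -llr) for llr in target_llrs) / len(target_llrs)
    c_imp = sum(math.log2(1 + 2.0 ** llr) for llr in impostor_llrs) / len(impostor_llrs)
    return 0.5 * (c_tar + c_imp)

# Well-separated, well-calibrated LLRs score low...
good = cllr([4.0, 5.0, 6.0], [-4.0, -5.0, -6.0])
# ...while a constant shift (a miscalibration) is penalised heavily,
# even though discrimination, and hence EER, is unchanged.
shifted = cllr([14.0, 15.0, 16.0], [6.0, 5.0, 4.0])
print(good < shifted)  # True
```

This is why a system can have a low EER yet poor calibration metrics: a shift does not change the ranking of scores, only their interpretability as likelihood ratios.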
0:02:39 The aim here is to calibrate scores across those fourteen conditions such that the calibration loss is minimal, with a single system.
0:02:48 Here is a brief flow diagram of the system we're using in this study: a pretty standard i-vector PLDA setup with a large UBM and large i-vectors, trained on the PRISM set; you can look at the paper for the reference on that one. The two areas we're focusing on in this work are the orange boxes: calibration, obviously, and a box called universal audio characterization (UAC), which is a way of extracting meta-information, or side information, automatically from your i-vectors.
0:03:16 The evaluation dataset is a pooled-condition dataset given to us by the FBI, drawn from a variety of different sources. There are a number of different raw conditions here that we group into attribute conditions: cross-language, cross-channel, a mix of both, clean and noisy speech, and a variety of durations.
0:03:41 There are more details in the paper in terms of the speaker breakdown and the language breakdown. On the right-hand side you've got the equal error rates for my baseline system, just to show you that the difficulty increases as we go through the conditions.
0:03:58 Since we needed calibration datasets, we put together three different ones for this study. The first is called 'eval': this is essentially taking that FBI dataset and doing cross-validation, training the calibration model on one half, testing on the other half, swapping them around and doing it again, then pooling the results and computing the metrics.
0:04:20 The second dataset we labeled 'matched'. It isn't exactly matched to the evaluation data, but we've done the best we can from the SRE and Fisher data: trying to match languages, trying to get cross-channel and cross-language trials. However, we were lacking in cross-language cross-channel trials, the mixture of both, and a few languages weren't in there either.
0:04:43 Finally, for the large variability dataset we didn't actually put emphasis on trying to collect anything dedicated. We simply took a nice variation of SRE data, noised and reverberated SRE data, and RATS clean data from the DARPA RATS project: five languages of interest from that program; you can look at the paper for details. So the large variability dataset was meant to be a case of 'let's just throw what we can at the calibration model'.
0:05:19 We're going to be looking at three different calibration training schemes. The first is global, which is generative calibration or logistic regression: the standard approach many of us probably use. Then there's metadata-based calibration, ours implemented with discriminative PLDA and universal audio characterization; this is something that's been very prominent in past SRI evaluations for the BOLT and DARPA RATS programs, and very useful there. And finally we propose trial-based calibration, which is also based on universal audio characterization to provide the metadata.
0:05:54 Let's talk about the existing calibration methods, look at some results, and look at the shortcomings. Global, or generative, calibration learns a single shift and scale for converting a raw score to a likelihood ratio. On the slide you can see what happens when you've got the score distributions without calibration and then apply global calibration for the fourteen conditions: we're focusing the score distributions around the equal-error point, improving calibration loss, because that's the region we're targeting.
0:06:29 This calibration technique, as Niko explained, is effective for a single known condition, but once you put multiple conditions in, you're not actually reducing the variability of your thresholds, and that's a problem when you've got pooled-condition data.
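As a rough sketch of that global scheme, here is a single scale and shift fitted with logistic regression on labelled calibration scores (pure-Python gradient descent on invented toy scores; real systems typically use established calibration toolkits rather than this hand-rolled fit):

```python
import math

def train_global_calibration(scores, labels, lr=0.1, iters=2000):
    """Fit llr = a*score + b by minimising cross-entropy (logistic regression)
    on labelled calibration trials (label 1 = target, 0 = impostor)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))  # sigmoid of the candidate LLR
            ga += (p - y) * s / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# Toy calibration set: target scores sit higher than impostor scores.
scores = [2.1, 2.5, 3.0, -1.0, -1.5, -2.2]
labels = [1, 1, 1, 0, 0, 0]
a, b = train_global_calibration(scores, labels)
llr = a * 2.8 + b  # calibrated LLR for a new raw score
print(a > 0, llr > 0)
```

The key limitation described in the talk is visible in the model itself: `a` and `b` are single global numbers, so one mapping must serve every condition in the pool.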
0:06:48 A quick description of metadata-based calibration: this takes into account side information, or metadata, from each side of the trial, that is, the enrollment side and the test side, via the bilinear form on the slide, which maps that to a likelihood ratio; you can look at the paper for more details. It's trained with discriminative PLDA, which jointly minimizes a cross-entropy objective over the parameters at the bottom there, where b_e and b_t represent the UAC vectors for the enrollment and test sides.
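The exact bilinear form is on the slide and in the paper; purely as an illustrative sketch, a metadata-dependent offset could look like the following, where `W`, `u`, `v` and `w0` are hypothetical learned parameters rather than the paper's actual parameterisation:

```python
def bilinear_llr(score, b_e, b_t, W, u, v, w0):
    """Illustrative metadata-based calibration: the raw score is shifted by a
    bilinear function of the enrollment and test UAC vectors b_e and b_t."""
    offset = sum(b_e[i] * W[i][j] * b_t[j]
                 for i in range(len(b_e)) for j in range(len(b_t)))
    offset += sum(ui * bi for ui, bi in zip(u, b_e))
    offset += sum(vj * bj for vj, bj in zip(v, b_t))
    return score + offset + w0

b_e = [0.8, 0.2]   # e.g. condition posteriors for the enrollment side
b_t = [0.1, 0.9]   # and for the test side
W = [[0.0, 0.0], [0.0, 0.0]]
llr = bilinear_llr(1.5, b_e, b_t, W, u=[0.0, 0.0], v=[0.0, 0.0], w0=-0.5)
print(llr)  # 1.0: with zero metadata weights only the global shift w0 remains
```

With all metadata weights at zero this degenerates to global calibration, which is the sense in which the metadata terms are a condition-aware refinement.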
0:07:26 We proposed universal audio characterization, I think, at an Odyssey conference a few years back, and it's a very simple approach: take a training dataset, divide it into classes of interest (that might be language, channel, SNR), and model each of those classes with a Gaussian, so it's a Gaussian backend. When a test sample comes in, you compute the posteriors from each of those Gaussians and end up with a vector like the one on the right-hand side.
0:07:52 Say, for instance, that you trained the system on French and English to distinguish those two languages, and a Spanish test segment comes in. Our hypothesis is that the system might say, well, this sounds like eighty percent French and twenty percent English, and reflect that in the posteriors. That's the general idea.
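The Gaussian-backend idea just described can be sketched like this (one-dimensional toy features and two hypothetical 'language' classes; real UACs model i-vectors):

```python
import math

def gaussian_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def uac_vector(x, class_params):
    """One Gaussian per condition class (equal priors); the class posteriors
    for a test sample form its UAC vector."""
    logs = [gaussian_logpdf(x, m, v) for m, v in class_params]
    mx = max(logs)                        # log-sum-exp for numerical stability
    exps = [math.exp(l - mx) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Trained on two classes, say French-like (mean 0) and English-like (mean 4):
classes = [(0.0, 1.0), (4.0, 1.0)]
vec = uac_vector(1.0, classes)  # an unseen sample falling between the two
print(vec)  # soft assignment over the known classes, summing to 1
```

An out-of-set sample, like the Spanish segment in the example, gets a soft mixture over the known classes rather than a hard decision, which is exactly the behaviour the talk hypothesises.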
0:08:11 Let's take a look at the definition of the classes. What we want to do here, given an oracle experiment (so again using the FBI data via cross-validation), is see what we can get out of universal audio characterization. We picked three different classes, SNR, language and channel, and asked what calibration loss improvement we're going to get over global calibration. That's what's listed here. The bottom row shows what happens if you take each of those fourteen conditions, calibrate each one independently, and simply pool the results: essentially the best we could do.
0:08:51 So what we've done here is assess the potential of metadata-based calibration on our conditions; again, this relies on knowing the source conditions of the training data. We've chosen language and channel for this side information.
0:09:05 Let's look at the sensitivity to the universal audio characterization and to the training set used for the calibration model. The top two lines are what happens in the oracle experiment again (the details are in the paper), comparing global and metadata-based calibration. Basically, what you can see is that metadata-based calibration improves the calibration loss; we're also getting a slight reduction in equal error rate, and the Cllr is improving a little as well. So it's doing something there, which is nice to see.
0:09:42 If we then look at what happens when we bring in the matched dataset (remember, this is SRE and Fisher data that's meant to be similar to the FBI data conditions), we see something interesting. With global calibration, if we train the model on the matched data, we drastically degrade calibration performance compared to the oracle; I guess that's expected, in a sense, because we don't always have the data that we're evaluating on.
0:10:07 But once we look at metadata-based calibration: if we use the matched data to train the universal audio characterization and then use the actual FBI data to train the calibration model, we're not doing too badly; we're getting a subtle improvement in calibration loss. The problem occurs once we start using the matched data for the calibration model itself, that is, the discriminative PLDA I mentioned. We start to really degrade calibration performance, and the average equal error rate starts to blow out. So we've got a high sensitivity to the calibration training set here. One hypothesis we've got (there may be more than one explanation) is that this may be due to the lack of cross-language and cross-channel conditions in the linear discriminant space.
0:10:53 So how do we handle unseen trial conditions? If we ask forensic experts how they would do it: they select a representative calibration training set for each individual trial. Now, with the two point eight million trials in this database, that's not something that can be done by hand.
0:11:13 This is trial-based calibration. It's modeled on the approach of forensic experts, and it isn't meant to replace them by any means, but that was the motivation. The idea is that the system delays the choice of calibration training data until it knows the conditions of the trial. So, given a trial, we select a representative dataset for the enrollment sample, and then we construct trials against a subset that is representative of the test sample as well.
0:11:44 The challenge here is: how do we find that representative set? I'm going to work through the boxes, showing the process we use for selecting, for each individual trial, a small subset of a thousand target trials, plus however many impostor trials come with them.
0:12:02 The first thing we do is extract the UAC vectors from the enrollment side and the test side; this is essentially predicting the conditions of both sides of the trial. Then we rank-normalize those UAC vectors against the calibration UACs. We've got this candidate calibration dataset, which could be any of the three sets I explained, and we've extracted UACs for each of those, so we already know the conditions of the calibration data; these are system-specific.
0:12:34 For those who don't know, rank normalization is a very simple process where you replace the actual value in a given dimension of a vector with its rank against everything in the calibration set, so you need a reference set to rank against; there's more detail in the paper on this. The similarity measure is a very simple Euclidean distance from the rank-normalized calibration vectors (sorry, these here have also been rank-normalized against the same set). This allows us to find the most representative calibration segments for both the enrollment and the test sides.
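The rank-normalisation and nearest-neighbour steps just described might look like this minimal sketch (two-dimensional toy UAC vectors; the data is invented):

```python
import math

def rank_normalize(vec, reference_set):
    """Replace each dimension's value with its rank among the calibration
    set's values in that dimension (the set you rank against)."""
    out = []
    for d, val in enumerate(vec):
        out.append(sum(1 for ref in reference_set if ref[d] < val))
    return out

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

calib_uacs = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]   # candidate calibration segments
enroll = rank_normalize([0.85, 0.15], calib_uacs)    # enrollment-side UAC, ranked
ranked = [rank_normalize(u, calib_uacs) for u in calib_uacs]
# Most representative calibration segment for the enrollment side:
best = min(range(len(ranked)), key=lambda i: euclidean(enroll, ranked[i]))
print(best)  # -> 2, the segment whose conditions look most like the enrollment
```

Working in rank space rather than raw posterior space makes the distance insensitive to how the posterior mass happens to be scaled in each dimension, which is presumably the point of the normalization step.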
0:13:10 Then there's the sorting process. What we've done before reaching this point is take the candidate calibration segments and do an exhaustive score comparison using the SID system, giving a calibration score matrix. Now we sort the rows by similarity to the enrollment, and the columns by similarity to the test. What we end up with is the upper-left corner being most representative of the trial we've been given.
0:13:38 Selection then involves trying to get a thousand target trials: we simply keep adding candidates to the selected calibration set until we get there, each time taking the next most representative segment from the enrollment side or the test side, whichever scores closer on the similarity measure. An important note here is that segments without target trials are excluded as you go through this process; otherwise you might end up with cross-database impostor trials, which are actually quite easy, and that could bias the calibration model.
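Putting the sorted score matrix and the greedy gathering together, here is a toy sketch (the target-trial quota is shrunk from one thousand to two, all segment data is invented, and the expanding-corner order is one plausible reading of the selection described):

```python
def select_calibration_trials(scores, enroll_dist, test_dist,
                              spk_row, spk_col, n_target=2):
    """Sort rows by distance to the enrollment side and columns by distance to
    the test side, then expand the upper-left corner of the score matrix until
    n_target same-speaker (target) trials have been collected."""
    rows = sorted(range(len(scores)), key=lambda r: enroll_dist[r])
    cols = sorted(range(len(scores[0])), key=lambda c: test_dist[c])
    seen, trials, targets = set(), [], 0
    for k in range(1, max(len(rows), len(cols)) + 1):
        for r in rows[:k]:
            for c in cols[:k]:
                if (r, c) in seen:
                    continue
                seen.add((r, c))
                is_target = spk_row[r] == spk_col[c]
                trials.append((scores[r][c], is_target))
                targets += is_target
        if targets >= n_target:
            break
    return trials

scores = [[5.0, -2.0],   # calibration score matrix (rows: enrollment-like segments,
          [-1.5, 4.0]]   # columns: test-like segments)
trials = select_calibration_trials(scores, enroll_dist=[0.2, 0.9],
                                   test_dist=[0.1, 0.7],
                                   spk_row=["A", "B"], spk_col=["A", "B"])
print(len(trials), sum(t for _, t in trials))  # 4 trials, 2 of them targets
```

In the real system the quota is one thousand target trials, and segments that cannot contribute a target trial are dropped first, to avoid the easy cross-database impostors the talk warns about.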
0:14:14 There are some things to note about this. The intention is, first of all, to overcome the shortcomings of metadata-based calibration by selecting the most representative trials and then learning the calibration model from them. Representativeness is not guaranteed: we're not saying that, given this pool of data, we can always find something that's truly representative; that's not the case. We're finding the most representative subset available. In the case that there is nothing like the trial that's come across, it would probably revert to something more like a general, randomly selected calibration model, and that, we suppose, is better than overfitting to a poor match.
0:14:51 This is suitable for evaluation scenarios where you've got to have a decision for everything: you've got speech from both sides of the trial, and you need to produce a score for evaluation. That does not represent what forensic experts would do; if, for instance, they lacked data for a given trial, they might simply say 'we can't do this' and just not make the call. Those are just a few things to keep in mind.
0:15:16 Let's look at the results. This is on the matched data, with results relative to the global calibration technique. Across all fourteen conditions we're getting a nice improvement: an average thirty-five percent reduction in calibration loss. Not shown on the slide, but in the paper, there's also a twenty percent reduction in Cllr, the more stringent metric.
0:15:39 If we compare the three approaches now on the large variability data (the set pooled from many different sources just to throw at the system), we see that metadata-based calibration actually reduces the average calibration loss at the given operating point, but unfortunately increases Cllr and equal error rate as well.
0:16:02 Again, this probably comes down to the overfitting issue, or the lack of trials in certain conditions: if, for instance, a condition came into the metadata-based calibration technique that it had only a few trials, or few errors, for, it might say 'I'm pretty confident this is the way we should calibrate', when in fact it's quite mismatched to the data that's coming in. Trial-based calibration, however, improved the calibration metrics on both counts and also improved the discrimination power of the system; this, again, is probably something that should be expected, given that otherwise you're trying to apply a single threshold to get the equal error rate across fourteen conditions.
0:16:48 Pictorially (I found this kind of interesting), here's the threshold varying between the different conditions, trained on the large variability data. You can see that both metadata-based calibration and global calibration have this spread across the thresholds, while trial-based calibration, on the same scale down the bottom, is starting to cluster them close to zero. Obviously it's not perfect; we haven't succeeded in getting all the way to where we need to be, but it's a step in the right direction, let's suppose.
0:17:22 In conclusion, we can say that it's difficult to calibrate over a wide range of conditions. We showed that metadata-based calibration struggles when we haven't observed the training conditions, or have observed very few of them, so we proposed trial-based calibration to address that shortcoming. It selects the calibration training set at test time; it avoids overfitting to limited trials by requiring a minimum number of target trials (one thousand in our case); and it reverts to a more general calibration model if the conditions are unseen.
0:17:55 There's a lot of future work here. First, remove the computational bottleneck of calibrating two point eight million trials independently; one option might be a closed-form solution, perhaps along the lines of the one Jeff mentioned earlier. Second, some metric or indication of how representative the selected calibration set actually is for a given trial: if, for instance, the system says it has selected the most representative set, but that set is in fact something forensic experts wouldn't have chosen, the user would want to know that.
0:18:33 Can we incorporate phonetic information, relevant to the earlier talk, and is that something suitable in an i-vector framework? And finally, can we learn a way of approximating the calibration shift and scale using just the UAC vectors? And that concludes my talk; I'll just leave you with the flow diagram in case there are questions.
0:19:08 Q: So in your trial-based calibration there are two components: you have to train the universal audio characterization, and you train the calibration model. Is that correct? And in the results, did both of those use the matched dataset?
A: Yes, that's right.
Q: So both were matched. That case was quite bad, actually; I think there's a line where the performance dropped. So is it just applying the UAC that fails? You used the matched dataset for the UAC, but which data was used for the calibration model? You still have a mismatch there that the system needs to see.
0:20:03 Q: So, to be clear, it's just the UAC that's the issue?
A: When we look at this, the UAC is obviously playing a part in the drop in both the calibration loss and the average equal error rate. But if you use the matched data to train the UAC and the actual evaluation data for the calibration set, we're doing comparably to global in the same condition, so we're actually not losing too much from that sort of mismatch.
0:20:46 Q: I also saw in your future work that you're still thinking about measuring how representative the set really is. Do you have some ideas there? Because, to my limited mathematical mind, I would think of some sort of outlier detection for that case.
0:21:04 A: No, I haven't really thought about it at this point, to be honest. But it would definitely be something of interest, because ultimately this is a tool to go alongside a forensic expert. We know that automatic tools can be used in certain decisions, and to have a system that can dynamically calibrate and provide a better decision to the expert is already a benefit; but having confidence in the system's calibration is also important.