0:00:15 Sorry, so this is what I want to talk about: I'll give a bit of the background, briefly describe the system, talk about the data we're playing with (fourteen distinct conditions we're trying to calibrate), cover a little on existing calibration methods, and then propose trial-based calibration.
0:00:34 So we've had a very good background already from the other talks, but even very accurate SID systems may not be well calibrated. This means that you might have very low equal error rates for conditions evaluated independently, but once you pool those together, a single threshold won't reach that operating point for all of them.
0:00:53 You can see this problem right here: the blue are distributions of target trials, the red ones are impostor trials, and you can see the yellow thresholds at the bottom vary quite a bit across the conditions. So calibrating correctly for each condition helps us to reduce this threshold variability, among many other benefits.
0:01:17 You probably don't need a refresher after the other talks: what we want, essentially, is calibrated scores that indicate the weight of evidence for a given trial; that is, the likelihood ratio: is this the person of interest or is it not? The obvious application is forensic evidence.
0:01:38 Subsequently, if we have calibrated scores, we can make confident threshold-based Bayes decisions. And this isn't trivial, as we've heard: without a representative calibration set, it's difficult to handle the various conditions with a single calibration model. Later in this talk we'll be measuring system performance with a number of metrics, mainly focusing on calibration loss.
0:02:01 Calibration loss indicates how close we are to performing the best we can at a particular operating point; in this work we're focusing on equal costs between misses and false alarms, so around the equal-error point. The other metric we're using is Cllr; this is a more stringent criterion, looking at how well calibrated we are across all points on the operating curve. We're also looking at the average equal error rate across the fourteen conditions, to make sure that in calibrating the system we're not losing speaker discriminability. And of course, for all metrics, lower is better.
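As a concrete illustration of the Cllr metric mentioned here, a minimal sketch (the scores are invented; this is the standard log-likelihood-ratio cost, not the exact evaluation code used in the study):

```python
import math

def cllr(target_llrs, impostor_llrs):
    """Cllr: average base-2 log cost of log-likelihood ratios over target and
    impostor trials; lower is better, 0 means perfect calibration and separation."""
    c_tar = sum(math.log2(1 + 2.0 ** -llr) for llr in target_llrs) / len(target_llrs)
    c_imp = sum(math.log2(1 + 2.0 ** llr) for llr in impostor_llrs) / len(impostor_llrs)
    return 0.5 * (c_tar + c_imp)

# Well-separated, well-calibrated LLRs score low...
good = cllr([4.0, 5.0, 6.0], [-4.0, -5.0, -6.0])
# ...while a constant shift (a miscalibration) is penalised heavily,
# even though discrimination, and hence EER, is unchanged.
shifted = cllr([14.0, 15.0, 16.0], [6.0, 5.0, 4.0])
print(good < shifted)  # True
```

This is why a system can have a low EER yet poor calibration metrics: a shift does not change the ranking of scores, only their interpretability as likelihood ratios.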
0:02:39 The aim here is to calibrate scores across those fourteen conditions such that the calibration loss is minimal, with a single system.
0:02:48 Here is a brief flow diagram of the system we're using in this study: a pretty standard i-vector PLDA setup with a large UBM and large i-vectors, trained on the PRISM set; you can look at the paper for the reference on that one. The two areas we're focusing on in this work are the orange boxes: calibration, obviously, and a box called universal audio characterization (UAC), which is a way of extracting meta-information, or side information, automatically from your i-vectors.
0:03:16 The evaluation dataset is a pooled-condition dataset given to us by the FBI, drawn from a variety of different sources. There are a number of different raw conditions here that we group into attribute conditions: cross-language, cross-channel, a mix of both, clean and noisy speech, and a variety of durations.
0:03:41 There are more details in the paper in terms of the speaker breakdown and the language breakdown. On the right-hand side you've got the equal error rates for my baseline system, just to show you that the difficulty increases as we go through the conditions.
0:03:58 Since we needed calibration datasets, we put together three different ones for this study. The first is called 'eval': this is essentially taking that FBI dataset and doing cross-validation, training the calibration model on one half, testing on the other half, swapping them around and doing it again, then pooling the results and computing the metrics.
0:04:20 The second dataset we labeled 'matched'. It isn't exactly matched to the evaluation data, but we've done the best we can from the SRE and Fisher data: trying to match languages, trying to get cross-channel and cross-language trials. However, we were lacking in cross-language cross-channel trials, the mixture of both, and a few languages weren't in there either.
0:04:43 Finally, for the large variability dataset we didn't actually put emphasis on trying to collect anything dedicated. We simply took a nice variation of SRE data, noised and reverberated SRE data, and RATS clean data from the DARPA RATS project: five languages of interest from that program; you can look at the paper for details. So the large variability dataset was meant to be a case of 'let's just throw what we can at the calibration model'.
0:05:19 We're going to be looking at three different calibration training schemes. The first is global, which is generative calibration or logistic regression: the standard approach many of us probably use. Then there's metadata-based calibration, ours implemented with discriminative PLDA and universal audio characterization; this is something that's been very prominent in past SRI evaluations for the BOLT and DARPA RATS programs, and very useful there. And finally we propose trial-based calibration, which is also based on universal audio characterization to provide the metadata.
0:05:54 Let's talk about the existing calibration methods, look at some results, and look at the shortcomings. Global, or generative, calibration learns a single shift and scale for converting a raw score to a likelihood ratio. On the slide you can see what happens when you've got the score distributions without calibration and then apply global calibration for the fourteen conditions: we're focusing the score distributions around the equal-error point, improving calibration loss, because that's the region we're targeting.
0:06:29 This calibration technique, as Niko explained, is effective for a single known condition, but once you put multiple conditions in, you're not actually reducing the variability of your thresholds, and that's a problem when you've got pooled-condition data.
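As a rough sketch of that global scheme, here is a single scale and shift fitted with logistic regression on labelled calibration scores (pure-Python gradient descent on invented toy scores; real systems typically use established calibration toolkits rather than this hand-rolled fit):

```python
import math

def train_global_calibration(scores, labels, lr=0.1, iters=2000):
    """Fit llr = a*score + b by minimising cross-entropy (logistic regression)
    on labelled calibration trials (label 1 = target, 0 = impostor)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))  # sigmoid of the candidate LLR
            ga += (p - y) * s / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# Toy calibration set: target scores sit higher than impostor scores.
scores = [2.1, 2.5, 3.0, -1.0, -1.5, -2.2]
labels = [1, 1, 1, 0, 0, 0]
a, b = train_global_calibration(scores, labels)
llr = a * 2.8 + b  # calibrated LLR for a new raw score
print(a > 0, llr > 0)
```

The key limitation described in the talk is visible in the model itself: `a` and `b` are single global numbers, so one mapping must serve every condition in the pool.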
0:06:48 A quick description of metadata-based calibration: this takes into account side information, or metadata, from each side of the trial, that is, the enrollment side and the test side, via the bilinear form on the slide, which maps that to a likelihood ratio; you can look at the paper for more details. It's trained with discriminative PLDA, which jointly minimizes a cross-entropy objective over the parameters at the bottom there, where b_e and b_t represent the UAC vectors for the enrollment and test sides.
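The exact bilinear form is on the slide and in the paper; purely as an illustrative sketch, a metadata-dependent offset could look like the following, where `W`, `u`, `v` and `w0` are hypothetical learned parameters rather than the paper's actual parameterisation:

```python
def bilinear_llr(score, b_e, b_t, W, u, v, w0):
    """Illustrative metadata-based calibration: the raw score is shifted by a
    bilinear function of the enrollment and test UAC vectors b_e and b_t."""
    offset = sum(b_e[i] * W[i][j] * b_t[j]
                 for i in range(len(b_e)) for j in range(len(b_t)))
    offset += sum(ui * bi for ui, bi in zip(u, b_e))
    offset += sum(vj * bj for vj, bj in zip(v, b_t))
    return score + offset + w0

b_e = [0.8, 0.2]   # e.g. condition posteriors for the enrollment side
b_t = [0.1, 0.9]   # and for the test side
W = [[0.0, 0.0], [0.0, 0.0]]
llr = bilinear_llr(1.5, b_e, b_t, W, u=[0.0, 0.0], v=[0.0, 0.0], w0=-0.5)
print(llr)  # 1.0: with zero metadata weights only the global shift w0 remains
```

With all metadata weights at zero this degenerates to global calibration, which is the sense in which the metadata terms are a condition-aware refinement.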
0:07:26 We proposed universal audio characterization, I think, at an Odyssey conference a few years back, and it's a very simple approach: take a training dataset, divide it into classes of interest (that might be language, channel, SNR), and model each of those classes with a Gaussian, so it's a Gaussian backend. When a test sample comes in, you compute the posteriors from each of those Gaussians and end up with a vector like the one on the right-hand side.
0:07:52 Say, for instance, that you trained the system on French and English to distinguish those two languages, and a Spanish test segment comes in. Our hypothesis is that the system might say, well, this sounds like eighty percent French and twenty percent English, and reflect that in the posteriors. That's the general idea.
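The Gaussian-backend idea just described can be sketched like this (one-dimensional toy features and two hypothetical 'language' classes; real UACs model i-vectors):

```python
import math

def gaussian_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def uac_vector(x, class_params):
    """One Gaussian per condition class (equal priors); the class posteriors
    for a test sample form its UAC vector."""
    logs = [gaussian_logpdf(x, m, v) for m, v in class_params]
    mx = max(logs)                        # log-sum-exp for numerical stability
    exps = [math.exp(l - mx) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Trained on two classes, say French-like (mean 0) and English-like (mean 4):
classes = [(0.0, 1.0), (4.0, 1.0)]
vec = uac_vector(1.0, classes)  # an unseen sample falling between the two
print(vec)  # soft assignment over the known classes, summing to 1
```

An out-of-set sample, like the Spanish segment in the example, gets a soft mixture over the known classes rather than a hard decision, which is exactly the behaviour the talk hypothesises.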
0:08:11 Let's take a look at the definition of the classes. What we want to do here, given an oracle experiment (so again using the FBI data via cross-validation), is see what we can get out of universal audio characterization. We picked three different classes, SNR, language and channel, and asked what calibration loss improvement we're going to get over global calibration. That's what's listed here. The bottom row shows what happens if you take each of those fourteen conditions, calibrate each one independently, and simply pool the results: essentially the best we could do.
0:08:51 So what we've done here is assess the potential of metadata-based calibration on our conditions; again, this relies on knowing the source conditions of the training data. We've chosen language and channel for this side information.
0:09:05 Let's look at the sensitivity to the universal audio characterization and to the training set used for the calibration model. The top two lines are what happens in the oracle experiment again (the details are in the paper), comparing global and metadata-based calibration. Basically, what you can see is that metadata-based calibration improves the calibration loss; we're also getting a slight reduction in equal error rate, and the Cllr is improving a little as well. So it's doing something there, which is nice to see.
0:09:42 If we then look at what happens when we bring in the matched dataset (remember, this is SRE and Fisher data that's meant to be similar to the FBI data conditions), we see something interesting. With global calibration, if we train the model on the matched data, we drastically degrade calibration performance compared to the oracle; I guess that's expected, in a sense, because we don't always have the data that we're evaluating on.
0:10:07 But once we look at metadata-based calibration: if we use the matched data to train the universal audio characterization and then use the actual FBI data to train the calibration model, we're not doing too badly; we're getting a subtle improvement in calibration loss. The problem occurs once we start using the matched data for the calibration model itself, that is, the discriminative PLDA I mentioned. We start to really degrade calibration performance, and the average equal error rate starts to blow out. So we've got a high sensitivity to the calibration training set here. One hypothesis we've got (there may be more than one explanation) is that this may be due to the lack of cross-language and cross-channel conditions in the linear discriminant space.
0:10:53 So how do we handle unseen trial conditions? If we ask forensic experts how they would do it: they select a representative calibration training set for each individual trial. Now, with the two point eight million trials in this database, that's not something that can be done by hand.
0:11:13 This is trial-based calibration. It's modeled on the approach of forensic experts, and it isn't meant to replace them by any means, but that was the motivation. The idea is that the system delays the choice of calibration training data until it knows the conditions of the trial. So, given a trial, we select a representative dataset for the enrollment sample, and then we construct trials against a subset that is representative of the test sample as well.
0:11:44 The challenge here is: how do we find that representative set? I'm going to work through the boxes, showing the process we use for selecting, for each individual trial, a small subset of a thousand target trials, plus however many impostor trials come with them.
0:12:02 The first thing we do is extract the UAC vectors from the enrollment side and the test side; this is essentially predicting the conditions of both sides of the trial. Then we rank-normalize those UAC vectors against the calibration UACs. We've got this candidate calibration dataset, which could be any of the three sets I explained, and we've extracted UACs for each of those, so we already know the conditions of the calibration data; these are system-specific.
0:12:34 For those who don't know, rank normalization is a very simple process where you replace the actual value in a given dimension of a vector with its rank against everything in the calibration set, so you need a reference set to rank against; there's more detail in the paper on this. The similarity measure is a very simple Euclidean distance from the rank-normalized calibration vectors (sorry, these here have also been rank-normalized against the same set). This allows us to find the most representative calibration segments for both the enrollment and the test sides.
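The rank-normalisation and nearest-neighbour steps just described might look like this minimal sketch (two-dimensional toy UAC vectors; the data is invented):

```python
import math

def rank_normalize(vec, reference_set):
    """Replace each dimension's value with its rank among the calibration
    set's values in that dimension (the set you rank against)."""
    out = []
    for d, val in enumerate(vec):
        out.append(sum(1 for ref in reference_set if ref[d] < val))
    return out

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

calib_uacs = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]   # candidate calibration segments
enroll = rank_normalize([0.85, 0.15], calib_uacs)    # enrollment-side UAC, ranked
ranked = [rank_normalize(u, calib_uacs) for u in calib_uacs]
# Most representative calibration segment for the enrollment side:
best = min(range(len(ranked)), key=lambda i: euclidean(enroll, ranked[i]))
print(best)  # -> 2, the segment whose conditions look most like the enrollment
```

Working in rank space rather than raw posterior space makes the distance insensitive to how the posterior mass happens to be scaled in each dimension, which is presumably the point of the normalization step.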
0:13:10 Then there's the sorting process. What we've done before reaching this point is take the candidate calibration segments and do an exhaustive score comparison using the SID system, giving a calibration score matrix. Now we sort the rows by similarity to the enrollment, and the columns by similarity to the test. What we end up with is the upper-left corner being most representative of the trial we've been given.
0:13:38 Selection then involves trying to get a thousand target trials: we simply keep adding candidates to the selected calibration set until we get there, each time taking the next most representative segment from the enrollment side or the test side, whichever scores closer on the similarity measure. An important note here is that segments without target trials are excluded as you go through this process; otherwise you might end up with cross-database impostor trials, which are actually quite easy, and that could bias the calibration model.
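Putting the sorted score matrix and the greedy gathering together, here is a toy sketch (the target-trial quota is shrunk from one thousand to two, all segment data is invented, and the expanding-corner order is one plausible reading of the selection described):

```python
def select_calibration_trials(scores, enroll_dist, test_dist,
                              spk_row, spk_col, n_target=2):
    """Sort rows by distance to the enrollment side and columns by distance to
    the test side, then expand the upper-left corner of the score matrix until
    n_target same-speaker (target) trials have been collected."""
    rows = sorted(range(len(scores)), key=lambda r: enroll_dist[r])
    cols = sorted(range(len(scores[0])), key=lambda c: test_dist[c])
    seen, trials, targets = set(), [], 0
    for k in range(1, max(len(rows), len(cols)) + 1):
        for r in rows[:k]:
            for c in cols[:k]:
                if (r, c) in seen:
                    continue
                seen.add((r, c))
                is_target = spk_row[r] == spk_col[c]
                trials.append((scores[r][c], is_target))
                targets += is_target
        if targets >= n_target:
            break
    return trials

scores = [[5.0, -2.0],   # calibration score matrix (rows: enrollment-like segments,
          [-1.5, 4.0]]   # columns: test-like segments)
trials = select_calibration_trials(scores, enroll_dist=[0.2, 0.9],
                                   test_dist=[0.1, 0.7],
                                   spk_row=["A", "B"], spk_col=["A", "B"])
print(len(trials), sum(t for _, t in trials))  # 4 trials, 2 of them targets
```

In the real system the quota is one thousand target trials, and segments that cannot contribute a target trial are dropped first, to avoid the easy cross-database impostors the talk warns about.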
0:14:14 There are some things to note about this. The intention is, first of all, to overcome the shortcomings of metadata-based calibration by selecting the most representative trials and then learning the calibration model from them. Representativeness is not guaranteed: we're not saying that, given this pool of data, we can always find something that's truly representative; that's not the case. We're finding the most representative subset available. In the case that there is nothing like the trial that's come across, it would probably revert to something more like a general, randomly selected calibration model, and that, we suppose, is better than overfitting to a poor match.
0:14:51 This is suitable for evaluation scenarios where you've got to have a decision for everything: you've got speech from both sides of the trial, and you need to produce a score for evaluation. That does not represent what forensic experts would do; if, for instance, they lacked data for a given trial, they might simply say 'we can't do this' and just not make the call. Those are just a few things to keep in mind.
0:15:16 Let's look at the results. This is on the matched data, with results relative to the global calibration technique. Across all fourteen conditions we're getting a nice improvement: an average thirty-five percent reduction in calibration loss. Not shown on the slide, but in the paper, there's also a twenty percent reduction in Cllr, the more stringent metric.
0:15:39 If we compare the three approaches now on the large variability data (the set pooled from many different sources just to throw at the system), we see that metadata-based calibration actually reduces the average calibration loss at the given operating point, but unfortunately increases Cllr and equal error rate as well.
0:16:02 Again, this probably comes down to the overfitting issue, or the lack of trials in certain conditions: if, for instance, a condition came into the metadata-based calibration technique that it had only a few trials, or few errors, for, it might say 'I'm pretty confident this is the way we should calibrate', when in fact it's quite mismatched to the data that's coming in. Trial-based calibration, however, improved the calibration metrics on both counts and also improved the discrimination power of the system; this, again, is probably something that should be expected, given that otherwise you're trying to apply a single threshold to get the equal error rate across fourteen conditions.
0:16:48 Pictorially (I found this kind of interesting), here's the threshold varying between the different conditions, trained on the large variability data. You can see that both metadata-based calibration and global calibration have this spread across the thresholds, while trial-based calibration, on the same scale down the bottom, is starting to cluster them close to zero. Obviously it's not perfect; we haven't succeeded in getting all the way to where we need to be, but it's a step in the right direction, let's suppose.
0:17:22 In conclusion, we can say that it's difficult to calibrate over a wide range of conditions. We showed that metadata-based calibration struggles when we haven't observed the training conditions, or have observed very few of them, so we proposed trial-based calibration to address that shortcoming. It selects the calibration training set at test time; it avoids overfitting to limited trials by requiring a minimum number of target trials (one thousand in our case); and it reverts to a more general calibration model if the conditions are unseen.
0:17:55 There's a lot of future work here. First, remove the computational bottleneck of calibrating two point eight million trials independently; one option might be a closed-form solution, perhaps along the lines of the one Jeff mentioned earlier. Second, some metric or indication of how representative the selected calibration set actually is for a given trial: if, for instance, the system says it has selected the most representative set, but that set is in fact something forensic experts wouldn't have chosen, the user would want to know that.
0:18:33 Can we incorporate phonetic information, relevant to the earlier talk, and is that something suitable in an i-vector framework? And finally, can we learn a way of approximating the calibration shift and scale using just the UAC vectors? And that concludes my talk; I'll just leave you with the flow diagram in case there are questions.
0:19:08 Q: So in your trial-based calibration there are two components: you have to train the universal audio characterization, and you train the calibration model. Is that correct? And in the results, did both of those use the matched dataset?
A: Yes, that's right.
Q: So both were matched. That case was quite bad, actually; I think there's a line where the performance dropped. So is it just applying the UAC that fails? You used the matched dataset for the UAC, but which data was used for the calibration model? You still have a mismatch there that the system needs to see.
0:20:03 Q: So, to be clear, it's just the UAC that's the issue?
A: When we look at this, the UAC is obviously playing a part in the drop in both the calibration loss and the average equal error rate. But if you use the matched data to train the UAC and the actual evaluation data for the calibration set, we're doing comparably to global in the same condition, so we're actually not losing too much from that sort of mismatch.
0:20:46 Q: I also saw in your future work that you're still thinking about measuring how representative the set really is. Do you have some ideas there? Because, to my limited mathematical mind, I would think of some sort of outlier detection for that case.
0:21:04 A: No, I haven't really thought about it at this point, to be honest. But it would definitely be something of interest, because ultimately this is a tool to go alongside a forensic expert. We know that automatic tools can be used in certain decisions, and to have a system that can dynamically calibrate and provide a better decision to the expert is already a benefit; but having confidence in the system's calibration is also important.