well everyone, today i'm going to talk about the effects of the new testing paradigm in the nist sre twelve
this work was done in collaboration with many colleagues, including alvin, vince, john, george, and jack
so before talking about what changed in sre twelve, let's just remind ourselves of some things that stayed the same
the task in sre twelve was text-independent speaker detection
by speaker detection i mean: given some speech from a target speaker and some speech from a test speaker, determine whether the target speaker and the test speaker are the same person
the evaluation consisted of a long series of trials, where a target trial is one in which the target speaker and the test speaker were the same, and a non-target trial is one in which the target speaker and the test speaker were different; so that much was the same
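to make the trial structure concrete, here is a minimal sketch of trial-based speaker detection, assuming each speech segment has already been mapped to a fixed-length embedding and scored with cosine similarity; the embeddings, threshold, and speaker labels below are made up for illustration and this is not a description of any sre twelve system

```python
import numpy as np

def cosine_score(target_emb, test_emb):
    """similarity between a target speaker's model and a test segment"""
    a = np.asarray(target_emb, dtype=float)
    b = np.asarray(test_emb, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def decide(score, threshold=0.5):
    """accept the trial (say 'same speaker') if the score clears the threshold"""
    return score >= threshold

# a target trial pairs a target speaker's model with a segment from that same
# speaker; a non-target trial pairs it with a segment from a different speaker
target_model = np.array([0.9, 0.1, 0.2])     # hypothetical embeddings
same_speaker = np.array([0.8, 0.2, 0.1])
other_speaker = np.array([-0.1, 0.9, 0.3])
print(decide(cosine_score(target_model, same_speaker)))    # True  (target trial accepted)
print(decide(cosine_score(target_model, other_speaker)))   # False (non-target trial rejected)
```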
something that changed in sre twelve was that joint knowledge of the target speakers was allowed
so in the past each trial had to be processed independently of the others, but in sre twelve it was permissible to use knowledge of the other target speakers for a trial, and this gave rise to a distinction among the non-target speakers
namely, whether they were among the target speakers, in which case they were considered known non-target speakers, or whether they were not among the target speakers, in which case they were considered unknown non-target speakers
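as a small illustration of that distinction, and not any nist tooling, a non-target trial can be labelled known when its true speaker appears in the set of target speakers and unknown otherwise; the speaker ids below are hypothetical

```python
def label_nontarget_trial(trial_speaker_id, target_speaker_ids):
    """classify a non-target trial by whether its speaker is among the targets"""
    if trial_speaker_id in target_speaker_ids:
        return "known non-target"
    return "unknown non-target"

targets = {"spk_001", "spk_002", "spk_003"}        # hypothetical speaker ids
print(label_nontarget_trial("spk_002", targets))   # known non-target
print(label_nontarget_trial("spk_999", targets))   # unknown non-target
```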
also, there was more, and more varied, training data for each target speaker: the majority of the target speakers in the evaluation had more than one segment for training
and in those cases the training data itself was often varied, consisting of, for example, interviews recorded over a microphone, phone calls recorded over a microphone, or phone calls recorded over a telephone channel
most of the target speakers in sre twelve were used in prior evaluations, which was something very different: they were identified in advance, and all of their speech from these prior evaluations was made available
beyond those roughly eighteen hundred speakers, new data was collected from roughly three hundred and twenty speakers, and roughly seventy of them were not present in the prior evaluations; those speakers had a single phone conversation released at the time of the evaluation
so we made these changes, and one question that may come up is why
there were several reasons: among them were to explore methods utilising large quantities of training data, to allow participants an extended period of time to work on modeling techniques, to determine the benefit of allowing joint knowledge of target speakers, in particular the benefit for performance, and also to increase the efficiency of the data collection
in terms of the data, the target speaker training data broke down into two cases: if released in advance of the evaluation, the target speaker training data consisted of prior evaluation data collected as part of the ldc mixer one through six corpora, and if released at the start of the evaluation, the training data was a single phone conversation recorded as part of mixer seven
as for the test segments, most of them came from a newly collected corpus of phone calls recorded over a telephone channel from prior mixer speakers
and there was also a smaller number of phone conversations from the mixer seven corpus, and these were phone conversations recorded over either a telephone channel or a microphone channel
so there were many different types of trials included in the evaluation; for example, there were trials where the speech had noise added to it or was recorded in a naturally noisy environment
but among the trials we wanted to emphasise some subsets of particular interest to us, the so-called common conditions; there were five common conditions in the evaluation, and for today's presentation we're just going to focus on two
and those are interview speech in test without added noise, and telephone channel speech in test without added noise
so here we see, in very round numbers, the number of trials for each of these common conditions
for common condition one, again interview speech in test with no added noise, there were roughly three thousand target trials, forty-six thousand non-target trials from known non-target speakers, and two thousand trials from unknown non-target speakers in the core test, which was required of all participants
an optional extended test was essentially the same, just with a larger number of trials, and you see the numbers there likewise for target trials, known non-target speakers, and unknown non-targets
so let's look at some results
here we see common condition two, which is telephone channel speech in test without added noise, and these are the results from one leading system; the others are similar
as might be expected, better performance was observed for the known speakers, that's the red line, compared to the unknown speakers, that's the black line
one thing to note is that the known speakers had multiple telephone conversations, and sometimes even interviews, as their training data
so here we see the same system, but on common condition one, which is interview speech in test without added noise
unlike the last slide, there's not a lot of difference between the two curves
that was initially puzzling, and we wanted to know why; as it turns out, the known speakers for this common condition were only known from a single telephone channel recording
so whereas on the previous slide the known speakers had a large amount of training data by which to know them, here the speakers were only known by a single telephone channel recording
so in addition to having a small amount of data, the trials were cross-channel
so in addition to this concept of known and unknown non-target speakers, there are also known and unknown systems
what we mean here is that known systems presume that all of the non-target trials came from known non-target speakers, and unknown systems presume that all of the non-target trials were spoken by unknown non-target speakers
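one way to picture that difference, purely as an illustrative assumption and not as a description of any submitted system: an unknown system scores each trial in isolation, while a known system could compare the test segment against every target model and, for example, penalise a trial when some other target matches the segment better

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unknown_system_score(target_emb, test_emb):
    # scores the trial in isolation, as was required before sre twelve
    return cosine(target_emb, test_emb)

def known_system_score(target_id, test_emb, all_target_embs):
    # joint scoring: subtract the best score from any competing target model,
    # which only makes sense if non-target speakers are assumed to be known
    own = cosine(all_target_embs[target_id], test_emb)
    competitors = [cosine(emb, test_emb)
                   for tid, emb in all_target_embs.items() if tid != target_id]
    return own - (max(competitors) if competitors else 0.0)
```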
so here we see a regular system on the extended trials for common condition two
and we see the thin dotted lines, i'm not sure if you can actually see them, especially in the back, but those are ninety percent confidence bounds, which suggest that there was a significant difference in performance between the known non-targets and the unknown non-targets; again, red is the colour for the known non-targets, black for the unknown non-targets
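as a generic sketch of how such bounds can be computed (the evaluation's own procedure may differ), a ninety percent confidence interval on an error rate can be estimated by bootstrap resampling of the trials; the error counts below are hypothetical

```python
import numpy as np

def bootstrap_ci(errors, n_boot=1000, level=0.90, seed=0):
    """errors: array of 0/1 values, one per trial (1 = the system erred on that trial)"""
    errors = np.asarray(errors)
    rng = np.random.default_rng(seed)
    rates = [rng.choice(errors, size=len(errors), replace=True).mean()
             for _ in range(n_boot)]
    alpha = (1.0 - level) / 2.0
    return np.quantile(rates, [alpha, 1.0 - alpha])

# hypothetical false-alarm outcomes on known vs unknown non-target trials
known = np.r_[np.ones(30), np.zeros(970)]     # 3% false alarms
unknown = np.r_[np.ones(80), np.zeros(920)]   # 8% false alarms
print(bootstrap_ci(known), bootstrap_ci(unknown))  # non-overlapping intervals suggest a real difference
```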
so here we see an unknown system, again that's where the system always presumes that the non-target speaker is unknown, and as might be expected, there is little difference observed between the two curves
and here we see a known system, again that's where the system presumes that all of the non-target speakers are known, which is to say they are among the target speakers; all of these are from the same site
and here, compared to two slides back, the performance difference is enhanced
so, in summary, sre twelve was an experiment with a new protocol in how speakers were made known to the systems
for conversational telephone speech test segments, performance was improved when speakers were known to the system
for interview test segments such improvement was not observed, but that was just due to the setup of the evaluation; it was not observable there, which is not to say it would not be observed if the evaluation allowed for it
there's actually a lot more information in the paper, and in other papers covering things that we learned from the evaluation, so let me encourage you to look at those or to contact us at that address
in addition, in considering future evaluations there is a question of whether allowing joint knowledge of the target speakers is a good idea going forward
one thing to note is that joint knowledge of target speakers makes results increasingly dependent on the particular target speakers selected and introduces dependence between trials, so this makes estimating error rates more difficult
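to make that concern concrete, one common way to account for trials that share target speakers, and therefore are not independent, is to resample whole speakers rather than individual trials when estimating uncertainty; this is a rough sketch of that idea, not nist's actual procedure

```python
import numpy as np
from collections import defaultdict

def speaker_bootstrap_ci(trial_errors, trial_speakers, n_boot=1000, level=0.90, seed=0):
    """trial_errors[i] is 0/1; trial_speakers[i] is the target speaker of trial i"""
    rng = np.random.default_rng(seed)
    by_speaker = defaultdict(list)
    for err, spk in zip(trial_errors, trial_speakers):
        by_speaker[spk].append(err)
    speakers = list(by_speaker)
    rates = []
    for _ in range(n_boot):
        # draw speakers with replacement and pool all of their trials
        sample = rng.choice(speakers, size=len(speakers), replace=True)
        errs = [e for s in sample for e in by_speaker[s]]
        rates.append(float(np.mean(errs)))
    alpha = (1.0 - level) / 2.0
    return np.quantile(rates, [alpha, 1.0 - alpha])
```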
also something to consider is whether to continue having multi-session and multichannel training for the target speakers
so nist will resume the sre series, beyond the i-vector challenge, in the near future
some interest has been expressed within the community regarding performing testing in acoustic environments different from those of prior evaluations; joe made mention of some utility in that
also, one thing to note is that in order to be able to conduct these types of evaluations it is necessary to collect realistic and challenging speech data, which is both expensive and time-consuming
but in order to do that, and to have an even better evaluation, lessons learned from sre twelve will be taken into account and considered in the next evaluation; so i probably have lots of time for questions
so thank you
so looking at your cc one and cc two, common condition one and two: can you talk to the number of actual speakers that were involved in cc one versus cc two? not trials, but speakers
the short answer is yes, but not right now, because i don't have that information handy; we did look at that, but i can't recall precisely
i do recall that cc one had only on the order of forty to fifty speakers involved
so i think, in comparing those two for the effect of the known speakers, there are a couple of things changing simultaneously: the microphone versus the telephone channel, and the pool is much smaller, because i think it was only drawn from mixer seven
right, and that's actually a really excellent point that we tried to emphasise during the evaluation workshop but i neglected to mention here: the common conditions really were not comparable at all in this evaluation
the speakers were different and basically all the conditions changed, so, thank you for noting it, it's inappropriate to make those comparisons across common conditions
within a common condition it was interesting to look at some of the sub-factor performance
could you just comment on whether you're going to be following up on this as part of the nist sre process
so this is actually something we've been looking into pretty extensively; the short answer is it remains to be determined, but the long answer is that this is something we're seeking to do
okay
you said at the end of the presentation that there will be a focus on multichannel enrollment and training conditions
the question is whether, like in the last sre twelve results you presented at the workshop, where for microphone enrollment versus telephone enrollment it seemed like the focus wasn't there, maybe it just wasn't the emphasis this time around, but it still is a big challenge, so i just wondered if that was still going to be a factor in some continuing evaluations
well, that's a good question, and one of the things that we're very eager for is to get feedback from the community
one thing that is time consuming, if not expensive, is the difficulty of setting up the evaluation, even with the data, and so we're much more likely to include that again if people will actually participate
i've also got a second question, if i may: i'm not sure if you're aware of the dnn i-vector paradigm that has come out as a framework for sre twelve, with very impressive performance, particularly on telephone conditions
as you might know, for the dnn you need a lot of data for training, and it's very difficult to get to that level
one thing i'm afraid of is teams that might not have the infrastructure to do such a thing; how would they compete with the other teams that do have the infrastructure in future evaluations, and is there something that can be done about that, such as the i-vector challenge where the i-vectors were provided? i just wonder if you've got thoughts on that
in short, no, but that's a good question and something that we're perfectly willing to explore
i just want to comment on one of your conclusion points; i'm happy to know that you have in mind, of course, to extend the nist databases with new challenging conditions
but i think it's also interesting to push on the current actual conditions: we have a lot to do in speaker recognition by increasing the number of speakers, maybe by one order of magnitude, and by adding more data per speaker
of course that will force us to review the evaluation protocol and to look at the results per speaker, as was done in the past, and also to look at the differences: if you just select randomly one thousand tests from the database, you have some performance differences depending on which set you choose compared to the other sets, and a lot of things like that, i think