Speech Transcript - The 2011 BEST Speaker Recognition Interim Assessment

0:00:15	okay the third talk we have is on the two thousand eleven best speaker recognition interim
0:00:23	assessment and craig is going to present
0:00:42	okay so we have
0:00:43	seen results from best and how important is the calibrated test data
0:00:49	so now i'm going to give you the background information to help you interpret what
0:00:54	you saw previously
0:00:59	a quick word on the best program it stands for biometrics exploitation science and technology
0:01:06	it was an iarpa program that launched in two thousand and nine with the objective of
0:01:10	advancing the state of the art in biometrics including speaker recognition i won't say any more about it but if you
0:01:15	want more information here's a link
0:01:20	now nist managed what we call the best
0:01:25	which was the best evaluation speaker track for the best interim assessment the objective here
0:01:32	was to measure progress in speaker recognition relative to performance prior to the start of the program
0:01:38	and also to measure performance on data new to speaker recognition evaluations
0:01:45	okay so for those already here would you close your eyes
0:01:51	please
0:01:53	no one wants to close their eyes
0:01:56	well yeah the thing i'd like to know is
0:01:58	what people do not know so everyone close your eyes so you have anonymity here
0:02:03	who does not know what speaker detection is raise your hand
0:02:08	okay
0:02:10	how about a target speaker
0:02:12	anyone not know what this is or non-target
0:02:14	or test segment
0:02:16	anyone not know what this is
0:02:18	okay
0:02:20	it's just a segment of speech data containing one or more unknown speakers
0:02:25	and the last anyone not know what mixer is
0:02:30	okay this is a set of speech corpora collected by the ldc to support speaker recognition
0:02:38	okay
0:02:39	so the data used for the evaluation was very big
0:02:44	very complex and bigger than anything nist had done in speaker recognition prior to that
0:02:50	and that included over thousand speakers at two thousand audio segments and forty one million
0:02:55	trials
0:02:57	it made use of previous sre collections as well as newly collected mixer data which
0:03:02	we just below that was
0:03:04	including mixer one and two
0:03:06	which had five different languages arabic english mandarin russian and spanish
0:03:11	mixer five
0:03:13	so mixer five was used in sre eight for those who are familiar with the
0:03:17	collection these are the mixer five speakers that were not used in sre eight
0:03:22	mixer six likewise was used in sre ten so these are the mixer six speakers
0:03:27	that were not used in sre ten
0:03:29	we also used greybeard
0:03:31	so people may recall we used greybeard in sre ten also but did not release the key and it
0:03:37	was
0:03:38	to be able to use it in this evaluation that we were encouraged not
0:03:42	to release the key
0:03:43	as well as mixer seven which was newly collected with the objective of addressing new
0:03:50	sources of variability
0:03:52	to analyze as part of the best evaluation
0:03:56	and
0:03:57	so i'll go over these there were nine core conditions meant
0:04:01	to focus the research
0:04:04	first telephone where we train and test on telephone phone calls this is like the common
0:04:09	condition
0:04:10	that sres have used maybe since the beginning
0:04:14	microphone where we train and test on far-field microphone this is sort of the interview condition
0:04:22	a channel condition where we train on microphone and test on telephone but limit our
0:04:27	consideration to phone calls
0:04:29	another microphone condition where we train on far or near field mikes and test on
0:04:34	telephone also limited to phone calls
0:04:37	a speaking style condition where we train on interview and test on phone calls these are restricted
0:04:40	to microphones
0:04:42	a language condition where we train on multiple languages and test on multiple languages
0:04:48	second one where we train and test on two languages of the single both microphones
0:04:52	and phone calls
0:04:54	oh
0:04:55	and telephone
0:04:56	a multisession train
0:04:59	a second multisession train the difference between these two is one was tested on phone call
0:05:03	the other tested on interview
0:05:04	and that's really
0:05:07	so to look at
0:05:09	factors affecting performance we really wanted to focus the evaluation on
0:05:15	measuring system robustness
0:05:17	and as we saw yesterday the factors
0:05:21	were divided into three categories
0:05:24	intrinsic extrinsic and parametric
0:05:28	and these were all actually tested in the best evaluation
0:05:32	speech style whether it is an interview or phone call
0:05:35	vocal effort
0:05:36	whether there's normal vocal effort these recorded over a cell phone low vocal effort
0:05:41	or high vocal effort
0:05:43	in terms of the vocal effort
0:05:45	high vocal effort and low vocal effort were induced via headsets
0:05:49	worn and collected for internal phone calls
0:05:52	and in high vocal effort the headset had noise and that was the
0:05:57	intentional artifact in low vocal effort there was no noise but a high sidetone encouraging
0:06:03	the speaker to lower his or her voice
0:06:06	an unsettled question that remains is whether this is a realistic approach as well as what
0:06:12	actually is produced
0:06:14	regardless it seems the effects are interesting
0:06:18	so that was intrinsic in terms of extrinsic factors we had channel which is
0:06:23	microphone versus telephone different microphone types
0:06:26	and in telephone we had different transmission types and handsets
0:06:33	and
0:06:34	something new and
0:06:37	relatively interesting was changing the distance between interviewer and subject
0:06:44	to see if there was some additional vocal effort that could be elicited in that
0:06:48	way
0:06:49	reverberation here we have what was artificially added but in reality there were also two different
0:06:54	rooms one that was meant to be reverberant one that was meant to be not
0:06:58	reverberant
0:06:59	so there was both
0:07:01	natural reverberation
0:07:03	as well as added
0:07:05	reverberation and additive noise
0:07:11	in terms of additive noise and how we did this
0:07:14	oh sorry in terms of reverb how we did this was using a procedure that
0:07:17	was proposed by mitre
0:07:18	but actually implemented by mit lincoln labs
0:07:21	and was i don't of the participants at transfer
0:07:27	the method was to transform collected signals to have
0:07:32	the reverberation qualities of particular rooms
0:07:36	given a range of dimensions and surface conditions
0:07:39	as george showed there were seven different reverberation conditions
0:07:43	from point one six or so to one point three or so rt sixty
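As a rough illustration of what an rt sixty value means for a signal, and not the mitre procedure described in the talk, the sketch below assumes a synthetic impulse response made of Gaussian noise whose energy decays by 60 dB over the rt sixty time, and convolves it with the signal. The function names, the 8 kHz sample rate, and the noise-based impulse response are all assumptions made for illustration.

```python
import numpy as np


def synthetic_rir(rt60_s, sample_rate=8000, seed=0):
    """Hypothetical room impulse response: decaying Gaussian noise.

    The energy envelope drops 60 dB over rt60_s seconds, which is the
    definition of RT60. This is a stand-in, not a room simulation.
    """
    rng = np.random.default_rng(seed)
    n = int(rt60_s * sample_rate)
    t = np.arange(n) / sample_rate
    # 60 dB of energy decay is a factor of 10**-3 in amplitude at t = rt60_s.
    envelope = 10.0 ** (-3.0 * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir / np.sqrt(np.sum(rir ** 2))  # normalise to unit energy


def add_reverb(signal, rt60_s, sample_rate=8000):
    """Convolve a signal with the synthetic response, keeping its length."""
    rir = synthetic_rir(rt60_s, sample_rate)
    return np.convolve(signal, rir)[: len(signal)]


# The two ends of the range mentioned in the talk.
speech_like = np.random.default_rng(1).standard_normal(8000)
least_reverberant = add_reverb(speech_like, 0.16)
most_reverberant = add_reverb(speech_like, 1.3)
```

A longer rt sixty leaves more of the response energy in the late tail, which is what smears successive speech sounds into each other.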
0:07:50	additive noise was also implemented by mit lincoln labs
0:07:54	there were two noise types one of which is hvac which is heating ventilation and
0:08:02	air conditioning
0:08:03	so this is sort of standard office room background noise
0:08:07	as well as speech spectrum noise so this was
0:08:10	gaussian noise filtered to the spectrum of speech
0:08:16	and there were two different noise levels one fifteen db the other six db
0:08:22	and these were
0:08:23	c-message weighted
0:08:25	if i remember correctly
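Mixing noise into speech at a chosen level can be sketched as below. This is a minimal illustration, not the mit lincoln labs implementation: it assumes the level is a signal-to-noise ratio computed from plain broadband power, it does not apply c-message weighting, and white Gaussian noise stands in for the hvac and speech spectrum noise. The function name is hypothetical.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then mix.

    Power here is the unweighted mean square; a weighted measurement such as
    c-message weighting would filter both signals before measuring power.
    """
    noise = np.resize(noise, len(speech))  # repeat or truncate to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # placeholder for a speech segment
noise = rng.standard_normal(4000)    # placeholder for a noise recording
noisy_15_db = add_noise_at_snr(speech, noise, 15.0)
noisy_6_db = add_noise_at_snr(speech, noise, 6.0)
```

The 6 db mix carries roughly eight times the noise power of the 15 db mix, since the levels differ by 9 db.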
0:08:28	so here are a few
0:08:34	a speech spectrum fifteen db
0:08:53	and the six db
0:09:08	and now hvac six db
0:09:29	okay
0:09:30	yeah and in terms of parametric factors
0:09:33	there were five different languages
0:09:36	and there's also aging data from greybeard
0:09:39	the same data as in sre ten so this is the reason why
0:09:42	we were encouraged not to release the key
0:09:45	and there were multiple training sessions something that was new to best in the
0:09:50	past we had multiple training sessions that were phone calls but in this case
0:09:55	we had multiple training sessions over interviews sometimes the same speech over a microphone sometimes
0:10:00	maybe one or two microphones but different speech
0:10:12	so something we've seen a few times
0:10:15	but maybe i'll explain some is the primary metric was different for best it was originally conceived
0:10:21	in
0:10:21	nineteen ninety six
0:10:24	so it took
0:10:26	something like sixteen years to implement
0:10:29	it's the false alarm rate at the corresponding miss rate of ten percent
0:10:35	it's distinct but one of the advantages is that it's simple and clearly defined
0:10:40	and the false alarm rate may be viewed as representing the cost of the wasted
0:10:44	listening effort incurred by using the system at a specified miss rate
0:10:49	in contrast to equal error rate it does focus on the low false alarm region
0:10:53	which is likely to be of interest to a number of applications
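The metric described here, the false alarm rate at the threshold where the miss rate is ten percent, can be sketched as follows. This is an illustrative reading of the definition and not the official scoring code: the function name is hypothetical, higher scores are assumed to mean "more likely the target speaker", and the threshold is taken at the score of the first target that may not be missed.

```python
import numpy as np


def false_alarm_rate_at_miss(target_scores, nontarget_scores, miss_rate=0.10):
    """False alarm rate at the threshold giving (at most) the given miss rate.

    A trial is accepted when its score is >= threshold. Targets below the
    threshold are misses; non-targets at or above it are false alarms.
    """
    targets = np.sort(np.asarray(target_scores, dtype=float))
    nontargets = np.asarray(nontarget_scores, dtype=float)
    # Number of lowest-scoring target trials we are allowed to miss.
    n_miss = int(np.floor(miss_rate * len(targets)))
    threshold = targets[n_miss]
    return float(np.mean(nontargets >= threshold))


# Toy scores: 10 target trials, 10 non-target trials.
toy_targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
toy_nontargets = [0.0, 0.5, 1.5, 2.5, 3.0, -1.0, -2.0, -3.0, 0.1, 0.2]
toy_rate = false_alarm_rate_at_miss(toy_targets, toy_nontargets)
```

In the toy data one target (score 1.0) is missed, the threshold lands at 2.0, and two of the ten non-target scores sit at or above it.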
0:10:57	and we can use the rule of thirty to determine that with three hundred target trials
0:11:02	that we'd only need three hundred target trials or so to get the required miss
0:11:07	rate
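The rule of thirty is usually stated as needing about thirty observed errors before an error rate can be trusted. Under that reading, the arithmetic for a ten percent miss rate is thirty divided by point one. The helper below is a hypothetical sketch of that calculation, not code from the evaluation.

```python
import math


def min_trials_rule_of_30(error_rate, min_errors=30):
    """Smallest trial count expected to yield min_errors errors at error_rate."""
    # The small epsilon guards against floating-point results such as
    # 300.00000000000006 being rounded up to 301.
    return math.ceil(min_errors / error_rate - 1e-9)


target_trials_needed = min_trials_rule_of_30(0.10)
```

The same arithmetic at a one percent miss rate asks for ten times as many target trials, which is one practical reason to measure at ten percent.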
0:11:13	so let's look at just a few general performance trends i should note here that
0:11:16	we are sharing general trends there were some things that were system specific but what
0:11:23	we're trying to share here
0:11:24	is things that were common across the systems that were submitted for this
0:11:28	evaluation
0:11:32	so in terms of language
0:11:35	what we see is the green lines are the baseline system so this is a system
0:11:42	that was
0:11:43	meant to be the state of the art prior to the start of the evaluation
0:11:49	and the blue is a system submitted
0:11:52	for the best evaluation
0:11:54	the solid lines
0:11:57	are english
0:11:58	dashed lines are spanish
0:12:01	two things to see are that the system submitted for the
0:12:05	best evaluation shows better performance than the baseline across the range of operating points
0:12:11	also performance on english and spanish data were comparable
0:12:16	so not a big language effect
0:12:19	so this is meant to help
0:12:21	calibrate people's eyes
0:12:24	to the new metric
0:12:27	so here we see
0:12:30	the false alarm rate at ten percent miss rate
0:12:33	somewhere around twenty point eight
0:12:37	percent
0:12:39	the second one at around point one percent
0:12:43	and the third and fourth
0:12:49	are both at about point five percent
0:12:54	here's another look at language this is for mixer one and two
0:12:58	a single system
0:13:00	and the system performed mostly better on english and spanish than on the other languages
0:13:07	in fact some systems performed better on spanish than english why this would be is sort
0:13:12	of a mystery right
0:13:14	and another chance to
0:13:17	the chance to calibrate your eyes to the new metric
0:13:19	so we can look across the ten percent
0:13:21	the intersections are the primary metric for best
0:13:27	so for speaking style
0:13:29	we're looking at one system's performance one line is train and test on interview the
0:13:35	green line is train on interview test on phone call the red is train on phone call test
0:13:39	on phone call
0:13:41	and interview train and test gives best performance but there's a confounding factor here interview
0:13:49	test segments were actually longer
0:13:51	than the phone call test segments so this could explain why they performed better
0:13:58	so we did a test in sre
0:14:01	ten on vocal effort and found something somewhat surprising that low vocal effort performed better
0:14:07	than normal or high vocal effort we expected high vocal effort to perform worse but we also
0:14:11	expected low vocal effort to perform worse than normal vocal effort
0:14:16	similarly in the best evaluation high vocal effort stands out across systems as hard
0:14:21	and low vocal effort stands out across systems as easy
0:14:27	and this is consistent with what we saw in sre ten
0:14:33	so there were i think seven reverb conditions plus no reverb
0:14:40	so these vary widely in terms of how much reverb was applied for those who
0:14:48	are not
0:14:50	able to immediately imagine what something would sound like based on an rt sixty let
0:14:55	me play a couple examples
0:15:00	so this was the least reverberant
0:15:03	condition
0:15:17	and the most reverberant condition
0:15:33	and the thing to notice on this plot is that the degradation corresponds to rt sixty time and
0:15:39	that despite the mismatch that is that the train is on
0:15:45	not reverb speech and the test is on reverb speech despite this mismatch performance
0:15:50	seemed to be better with a small amount of reverb
0:15:54	than with no reverb at all
0:15:56	which was also somewhat surprising
0:16:00	here is additive noise again no noise in train but noise in test
0:16:06	and compared with testing without noise fifteen db noise had little effect and six db
0:16:13	more
0:16:14	one thing that was kind of interesting to us
0:16:17	was that if you look at the
0:16:21	for example the red and the dark blue line or the cyan and the
0:16:26	green line there's not a lot of difference
0:16:29	and this was a surprise we expected speech spectrum noise to be more difficult to
0:16:32	deal with than hvac noise
0:16:35	but as you can see there really was not a great deal of difference between the
0:16:39	two spectra
0:16:43	one other point i'd like to make there has been a suggestion that maybe one
0:16:48	percent is more appropriate under certain circumstances than ten percent in this condition performance
0:16:53	was
0:16:55	good enough that we were not able to measure
0:16:58	the primary metric because it never reached a ten percent miss rate
0:17:05	so in summary this was the largest and most complex by a fair amount nist speaker recognition
0:17:12	evaluation to date we examined several factors affecting performance and observed several surprising results some of which were
0:17:22	a re-creation of the results in sre ten but there were new conditions for example additive
0:17:26	noise and reverberation that also provided some surprises which i mentioned earlier
0:17:31	as to the use of synthetic data this is very exciting to us because it's
0:17:35	oftentimes difficult and expensive to collect data and
0:17:39	difficult to control
0:17:41	so if synthetic data turns out to be a reasonable way to evaluate systems in
0:17:47	the future this would be a very exciting and useful discovery
0:17:53	and finally there was improvement observed over the baseline on all conditions
0:17:58	so participants should be pleased with that
0:18:02	thank you
0:18:17	okay
0:18:32	yeah i have to do not sure
0:19:18	yeah
0:20:14	yes
0:20:17	yes so that was not explored in this evaluation but this is a fruitful area
0:20:21	for future
0:20:22	exploration
0:20:31	sure
0:20:39	yes cost
0:20:48	yes
0:20:56	yes
0:20:58	so i should probably admit at this point that we have not done deeper analysis
0:21:05	on basically any of this
0:21:09	it was a large and complex evaluation so in order to be able to handle
0:21:13	the main points
0:21:15	we had to do what are tied to read from the best
0:21:18	i actually would like to explore this more
0:21:28	yes
0:21:33	yes and i guess i should point that out not just in the no noise
0:21:36	but across all the reverbs we have the same exact speech just with different
0:21:43	yes
0:21:49	i think it was the same trend
0:21:52	which was surprising but
0:21:55	maybe people can offer some
0:21:59	explanation or some intuition as to why this might be
0:22:54	interested look
0:22:58	yes
0:23:01	i see so just to make sure i understand what you're saying is
0:23:04	that the target scores were too widely distributed you don't are simply sums the thought
0:23:21	yes
0:23:31	okay

The 2011 BEST Speaker Recognition Interim Assessment

SESSION 09: Speaker Recognition Evaluation

Craig Greenberg