0:00:28Good morning ladies and gentlemen,
0:00:30welcome to the third day of your Odyssey Workshop.
0:00:35Out of fifty one papers, twenty seven have been presented over the last two days
0:00:41and we have another twenty four to go, if I'm doing the calculation right. And
0:00:48yesterday the papers were mainly on i-vectors, so we can say yesterday
0:00:53was the i-vector day. Today, except for one paper, there are two major
0:01:00sessions: one is language recognition evaluation and the other is features for speaker recognition.
0:01:08My name is Ambikairajah, I'm from the University of New South Wales in Sydney, Australia.
0:01:12I have the pleasure of introducing to you our plenary speaker for today, doctor Alvin Martin.
0:01:19Alvin will speak about the NIST speaker recognition evaluation plan for two thousand twelve.
0:01:28He has coordinated the NIST series of evaluations since nineteen ninety six in the areas
0:01:35of speaker recognition and language and dialect recognition. The evaluation work he's involved in includes
0:01:41collection, selection and preprocessing of data, writing the evaluation plan, evaluating
0:01:50the results, coordinating the workshop, and many more tasks.
0:01:57He served as a mathematician in the Multimodal Information Group at NIST from nineteen
0:02:04ninety one to two thousand eleven.
0:02:08Alvin holds a Ph.D. degree in mathematics from Yale University. Please join me in
0:02:14welcoming doctor Alvin Martin
0:02:25Okay! Thank you! Thank you for that introduction and thank you for the invitation
0:02:32to do this talk. I'm here to talk about the speaker evaluations and, as you
0:02:39know, I have been
0:02:42at NIST
0:02:43and I remain
0:02:46associated with NIST for this workshop; however,
0:02:51I am here
0:02:53independently, so for everything I say,
0:02:58I'm responsible and no one else is; the opinions are all my own.
0:03:11I guess I... I don't think I'm subject to any restrictions, but
0:03:15I'll watch the clock.
0:03:25I'll stay close to this. An outline of the
0:03:29topics I hope to cover: I'm gonna talk about some early history, things that preceded the
0:03:35evaluations, and the current series of evaluations, the things that happened during the early times of
0:03:42the evaluations,
0:03:44giving kind of a history of the evaluations and, in part, of past Odysseys
0:03:52and who was involved. I should note my debt to Doug Reynolds, who gave a
0:04:01talk on these matters four years ago in Stellenbosch, and I will update one of
0:04:08the slides that
0:04:11he presented there. I'm gonna say some things from the point of view of an evaluation
0:04:18organiser, about evaluation organisation; say something about performance factors to look at, something
0:04:26about metrics, which we've already talked about at this workshop; say something about
0:04:34measuring progress over time;
0:04:36and then talk about the future, including the SRE twelve evaluation process currently going on,
0:04:44which will take place at the end of this year, and
0:04:47about what might happen after this year.
0:04:53The early history:
0:04:57the things I would mention...
0:04:59One thing in the background of the speaker recognition evaluations is the success of the speech recognition evaluations
0:05:10in... in the eighties and the early nineties. NIST was
0:05:13very much
0:05:15involved in these, and they showed the benefits of independent evaluation on common data sets.
0:05:21I'll show a slide of that in a minute.
0:05:24I will mention the collection of various early corpora that were appropriate for speaker recognition:
0:05:30TIMIT, KING and YOHO, but most especially Switchboard. It was a multi-purpose corpus that was
0:05:37collected around nineteen ninety one, so one of the purposes that they had in mind
0:05:41was speaker recognition, collected conversations from a large number of speakers so that you have
0:05:49multiple conversations for each speaker. Its success led to the later collection of Switchboard two and similar
0:06:00collections. And in fact, in the aftermath of Switchboard, the Linguistic Data Consortium was created
0:06:09in nineteen ninety two with the purpose of supporting further speech and also text
0:06:16collections in the... in the United States. And on to the first Odyssey; it wasn't called
0:06:23Odyssey, it was Martigny in nineteen ninety four, followed by several others. I will
0:06:30show pictures and make a few remarks on those. And there were early NIST evaluations.
0:06:36We date the current series of speaker evaluations to nineteen ninety six, but there were evaluations
0:06:41in ninety two and ninety five. There was a DARPA program evaluation at several sites as part of
0:06:47the DARPA program in ninety two. In ninety five there was a preliminary evaluation that
0:06:53used Switchboard one data at six sites. But in these earlier evaluations the emphasis
0:07:00was rather on speaker identification,
0:07:03on closed-set rather than on the open-set recognition that we've come... to know in ...
0:07:11in the series of evaluations.
0:07:17So here's this favourite slide on speech recognition, the Benchmark Test history. Here, you
0:07:28know, the word error rate is on a logarithmic scale,
0:07:34start from nineteen eighty eight
0:07:39and this shows best system performance in various evaluations, various conditions, in successive years,
0:07:47or years when evaluations were held. Worth pointing out, of course, is the big fall
0:07:52in error rates when multiple sites participated on common corpora and we looked at error
0:07:59rates, and
0:08:00with roughly fixed conditions we could see progress being evident; this is showing the
0:08:06early series.
0:08:10This is where we arrived at the evaluation cycle: research, collect data, evaluate, show
0:08:16progress. That gave inspiration to other evaluations, and in particular, the speaker evaluations.
0:08:28Okay, so now
0:08:31let's take a walk down memory lane.
0:08:35the first
0:08:36workshop of this series was Martigny in nineteen ninety four
0:08:43It was called the Workshop on Automatic Speaker Recognition, Identification and Verification,
0:08:48and that workshop, you know, was the very first of this series. It was
0:08:54reasonably well attended, but not as well as this one. And there were various presentations
0:08:59using many different corpora, many different performance measures, and it was very difficult
0:09:04to make meaningful comparisons. I present here one of the papers of
0:09:10interest from the NIST evaluation point of view: a paper on public databases
0:09:17for speaker recognition and verification was given there.
0:09:26And to recall another of the early ones... Avignon, nineteen ninety eight. Speaker Recognition
0:09:32and its Commercial and Forensic Applications is what it was called. It was also known
0:09:38as RLA2C from the French title,
0:09:43and one observation, in terms of the talks there, is that
0:09:47TIMIT was a preferred corpus,
0:09:52which for many was
0:09:53too clean, too easy a corpus. I remember Doug commenting that he didn't wanna listen
0:09:58anymore to papers that described results on TIMIT. It was also characterized by sometimes bitter debate over
0:10:08forensics and how good a job forensic experts could do at speaker recognition.
0:10:18There were
0:10:23NIST speaker evaluation related papers... actually, three of them, that were combined into
0:10:33one paper in Speech Communication.
0:10:36Of the three presentations, perhaps most memorable was the one by George Doddington, who told us
0:10:43all how to do speaker recognition evaluation.
0:10:48So, this was a talk that laid out various principles, and most of those
0:10:53principles have been kept and followed in our evaluation series. It included a discussion of the
0:11:00one golden rule, the rule of thirty.
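As usually stated, the rule of thirty says you need at least thirty errors of the rarer kind before an estimated error rate can be trusted to within about plus or minus thirty percent at ninety percent confidence. A minimal sketch of what that implies for test set size (the function name is my own):

```python
import math

def min_trials(expected_error_rate, min_errors=30):
    """Doddington's "rule of 30" (as usually stated): to be ~90% confident
    that the true error rate is within +/-30% of the observed one, the
    test should produce at least 30 errors, which implies a minimum
    number of trials for a given expected error rate."""
    return math.ceil(min_errors / expected_error_rate)

# To measure a 1% false alarm rate you'd need at least 3000 non-target trials:
print(min_trials(0.01))  # -> 3000
```

So the rarer the error you want to measure, the more trials the evaluation must contain.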
0:11:07Crete, two thousand one
0:11:10Two thousand one, A Speaker Odyssey: it took the official name, the Speaker Recognition Workshop. That was
0:11:15the first official Odyssey.
0:11:17It was characterized by more emphasis on evaluation. There was an evaluation track that was
0:11:22pursued, that NIST was
0:11:25involved with.
0:11:31So, one of the presentations, the NIST presentation, I think I
0:11:37gave it, covered
0:11:40the history of NIST evaluations up to that... that point, and I will actually
0:11:46show a slide from there later on.
0:11:51A key presentation was... was one by several people from the Department of Defense: Phonetic, idiolectal
0:11:58and acoustic speaker recognition. These were ideas that were being pursued at
0:12:03the time and that were influencing the course of research at that point. I think
0:12:07George had a lot to do with that. He had the paper
0:12:13on idiolectal techniques as well.
0:12:21Toledo in two thousand and four,
0:12:26I think was really where Odyssey came of age
0:12:32It was... it was well attended, I think it probably remains
0:12:37the most
0:12:39highly attended of the Odysseys. It was the first Odyssey in which we had the
0:12:45SRE workshop, held in conjunction at the same location. That was to be repeated in
0:12:51Puerto Rico in two thousand six and Brno in two thousand ten. It was also
0:12:57the first
0:12:58Odyssey to include language recognition sessions. It had two notable keynotes on forensic recognition,
0:13:10a topic debated earlier in Avignon; these were two excellent, well received talks. And since then, Odyssey
0:13:17has been an established biennial event, held every two years.
0:13:26And at this one there was a presentation, which I think Mark Przybocki and I gave, called The Speaker
0:13:32Recognition Evaluation Chronicles. And it was to be reprised, I think, about two years
0:13:39later in Puerto Rico. So, Odyssey has marched on.
0:13:49Two thousand six was in Puerto Rico; I found, incredibly, a picture of it. Two
0:13:55thousand and eight, Stellenbosch, hosted by Niko. Twenty ten, two years ago, we were in
0:14:02Brno; this is the logo designed by Honza's children. And now we're here in Singapore,
0:14:11and I think
0:14:12before we finish this workshop we will hear about plans for Odyssey in twenty fourteen.
0:14:22Okay! Let's move on to talk about organisation.
0:14:26to think about evaluation from the viewpoint of
0:14:30the part of the organisation responsible for organising evaluations. The questions are: which tasks are we
0:14:36to do, what are the key principles, and some of the milestones, taken directly
0:14:44from the different evaluations; and I'll talk about participation.
0:14:53So which speaker recognition problem? These are research evaluations, but what is the application environment
0:15:02in mind? Well, we know what we have done, but it wasn't necessarily obvious
0:15:08before we started. It could be access control, the important commercial application; that might
0:15:14have formed the model. It would raise the question of text independent or text dependent;
0:15:22for some problems, I think, we should do text-dependent. Part of
0:15:26access control is that the
0:15:27prior probability of the target tends to be high.
0:15:32There are forensic applications that could theoretically be the model, or there's speaker spotting,
0:15:39which of course is the way... sometimes the way we went. Inherently in speaker spotting
0:15:43the prior probability of the target is low, and it's text independent.
0:15:49Well, in ninety six... and we'll look at the ninety six evaluation plan... it was
0:15:53stated that the NIST evaluations would concentrate on speaker spotting, emphasising the low false alarm
0:16:01area of the
0:16:04performance curve.
0:16:08Some of the principles have been: speaker spotting
0:16:12is our primary task;
0:16:17we were research system oriented, you know, application inspired but aimed at research.
0:16:25NIST traditionally, with some exceptions, doesn't do product testing; you do the
0:16:31evaluations to advance the technology. We established the principle that we're gonna pool across
0:16:36target speakers:
0:16:38people had to
0:16:41get scores that would work independent of the target speaker, rather than having a performance
0:16:49curve for every speaker and then just averaging performance curves;
0:16:52and we emphasized the low false
0:16:58alarm rate region. Both scores and decisions were required, and in that context, as
0:17:09Niko suggested, and George is gonna talk about tomorrow, calibration matters. It is part
0:17:15... part of the problem to address.
0:17:20Some basics... Our evaluations were open to all willing participants, to anyone who,
0:17:27you know, followed the rules, could get... get the data, run all the trials
0:17:33and come to the workshop. We are research oriented; we have tried to
0:17:40discourage commercialised competition: we don't want people saying in advertisements that they won the NIST eval.
0:17:51Our evaluations are each featured with an evaluation plan that specifies all the rules and
0:17:58all the details of the evaluation; we'll look at one.
0:18:02Each evaluation is followed by a workshop.
0:18:05These workshops were limited to participants plus interested government organizations, and every site or team
0:18:12that participates is expected to be represented. At them we talk meaningfully about
0:18:18the evaluations and systems. The evaluation datasets are subsequently published, made publicly available by the
0:18:30LDC. That remains the aim... remains the case: the SRE
0:18:39o eight data is currently available. In particular, sites getting started in research may wanna
0:18:45obtain it, and are able to. Typically, we'd like to have not the
0:18:50most recent eval, but the next most recent eval, in this case that's o eight,
0:18:53available publicly. Probably next year SRE o ten will be made available; hopefully the LRE
0:19:02o nine, to mention a language eval, will soon become available.
0:19:13We have a web page for this;
0:19:20the page for the speaker evals lists past speaker evals, and for each year you
0:19:24can click and get the information on the evaluation for that year.
0:19:28It starts in nineteen ninety seven. For some reason, the nineteen ninety six evaluation plan seems
0:19:37to have been lost, but I asked Craig to search for it and he found it,
0:19:42so I hope that it will get put out.
0:19:47So what went into the evaluation plan, the first evaluation plan of the current series? In it
0:19:52we said the emphasis would be on issues of handset variation and test segment duration.
0:19:56The traditional goals, as stated, were to drive the technology forward, measure the state of the art, and find the
0:20:02most promising approaches.
0:20:06The task has been detection of a hypothesized speaker in a segment of conversational speech over the telephone.
0:20:11That's been expanded, of course, in recent years. Interestingly, are you surprised to see this?
0:20:18The research objective, given an overall ten percent miss rate for target speakers, is to
0:20:24minimize the overall false alarm rate.
0:20:29That is, actually, what we said in ninety six. It is not what we emphasized
0:20:33in the years since.
0:20:38This past year, as you heard, in the BEST evaluation, that was made the official metric...
0:20:43Craig is gonna talk about the BEST evaluation tomorrow. So in that sense we've come full circle.
0:20:54But this also mentions that performance is expressed in terms of the detection cost function,
0:21:00and that researchers should then minimize DCF. It also specifies a research objective that I would
0:21:06not naturally emphasize, and I don't think we achieved it: uniform performance across all target
0:21:11speakers. There have been some investigations about classes of speakers,
0:21:18sometimes attributed to Doddington: different
0:21:22types of speakers and their different levels of difficulty.
0:21:31So again, the task is: given a
0:21:34target speaker and a test segment,
0:21:37decide whether the hypothesis that it's that speaker is true or false.
0:21:43We measured performance in two related ways: detection performance from the decisions, and detection performance
0:21:50characterized by a ROC
0:21:53(the word then was ROC).
0:21:56Here is the DCF formula we're all familiar with. We have parameters: the cost of a miss,
0:22:06which was then expressed as ten, the cost of a false alarm as one, and the prior probability of a target,
0:22:13expressed as point zero one. We also, in those days, computed the DCF for a range
0:22:18of P target; in a sense we return to that idea in the current
0:22:24evaluation cycle.
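As a sketch of what that cost function computes, using the parameter values just mentioned (C_miss = 10, C_fa = 1, P_target = 0.01) and normalizing by the cost of the better trivial system:

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Normalized NIST-style detection cost function with the parameters
    described above (the values used from ninety six through o eight).
    The normalizer is the cost of the best "no-intelligence" system:
    one that always says "no" (cost C_miss * P_target) or always says
    "yes" (cost C_fa * (1 - P_target))."""
    raw = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return raw / min(c_miss * p_target, c_fa * (1.0 - p_target))

# A system with a 10% miss rate and a 1% false alarm rate:
print(round(dcf(0.10, 0.01), 3))  # -> 0.199
```

With these parameters the normalizer is the always-"no" system, whose normalized cost is one.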
0:22:29Here we say our ROC will be constructed by pooling decision scores;
0:22:35these scores will then be sorted and plotted on PROC plots.
0:22:42PROCs are ROCs plotted on normal probability
0:22:47plots. So this was, in nineteen ninety six, the term for what we now
0:22:53all refer to
0:22:56as DET plots.
0:23:01We talked about various conditions... results by duration and test type. The task
0:23:12required explicit decisions,
0:23:14and the scores of multiple target speakers are pooled before plotting the PROCs. So that
0:23:21requires score normalization across speakers. That was the key emphasis that was new in
0:23:27the ninety six evaluation.
0:23:30Now we honor the term DET curve, following the nineteen ninety seven Eurospeech paper,
0:23:38which introduced... used the term DET curve, for detection error tradeoff. I think George
0:23:45had a role in choosing that name.
0:23:49George is one person involved; another, you may know, is Tom Crystal, encouraging
0:23:56the use of... of this kind of curve that linearizes
0:24:01performance curves, assuming normal distributions.
0:24:08I was surprised to find that there's a Wikipedia page for DET plots. So, this
0:24:16is the page showing the linearizing effect.
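The warping is just the ROC's two error rates passed through the inverse normal CDF. A minimal sketch of computing DET-plot coordinates from raw trial scores (the function name is my own, not from any NIST tool):

```python
from statistics import NormalDist

def det_points(target_scores, nontarget_scores):
    """Sweep a decision threshold over the pooled scores and return
    (probit(P_fa), probit(P_miss)) pairs. On these warped axes a
    system whose score distributions are Gaussian traces a straight
    line, which is the linearizing effect of the DET plot."""
    probit = NormalDist().inv_cdf  # inverse of the standard normal CDF
    points = []
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if 0.0 < p_miss < 1.0 and 0.0 < p_fa < 1.0:  # probit undefined at 0 and 1
            points.append((probit(p_fa), probit(p_miss)))
    return points
```

Plotting these points, with axis ticks relabeled in probability units, gives the familiar DET plot.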
0:24:23Okay, now let's talk about milestones.
0:24:27These are the ones I sorted out; others may choose different ones. But, you know, we noted that
0:24:33we had earlier evaluations in ninety two and ninety five; the first in the series
0:24:37was in ninety six.
0:24:38Two thousand was the first in which we had a language other than English; we used
0:24:42Spanish data, along with other data. Two thousand one was,
0:24:48rather late, since we were in the United States, the first evaluation with cellular phone data. In two
0:24:55thousand one we also started providing ASR transcripts, errorful transcripts. We had a kind of limited
0:25:01forensic evaluation using a small FBI database in two thousand two. Also in two thousand two
0:25:08there was the SuperSID workshop, one of the Johns Hopkins workshop projects; it followed
0:25:14the SRE and helped to advance the technology. Other Baltimore workshops also followed up on
0:25:22speaker recognition; many people here participated. Two thousand five:
0:25:27first multiple languages, bilingual speakers
0:25:34in the eval... Also the first microphone recordings of telephone calls, which therefore included some
0:25:43cross-channel trials. Interview data, as with the Mixer corpora, came in two thousand eight
0:25:48and was used again in two thousand ten. Two thousand ten involved the
0:25:53new DCF, the cost function stressing even lower false alarm rates; a little more
0:25:58about that later. Also in two thousand ten, and there are lots of things coming out
0:26:03in the recent years, we have been collecting high and low vocal effort data, also
0:26:09some data that looks at aging. Two thousand ten also featured HASR, the human assisted
0:26:14speaker recognition evaluation, a small set that invited some systems involving humans as well as
0:26:23automatic systems.
0:26:25Twenty eleven was BEST. We had a broad range of test conditions, including added noise
0:26:30and reverb; Craig will be telling you about that tomorrow.
0:26:34Twenty twelve is gonna involve target speakers defined beforehand.
0:26:52To begin with... the number fifty eight... we have it in... these numbers are
0:26:57all a little fuzzy in terms of what's a site, what's a team, but I
0:27:03think of these numbers like... these are the ones that Doug used a few years
0:27:06ago, and I updated them. Fifty eight in twenty ten.
0:27:10Doug at MIT has provided... I think we're not doing physical notebooks anymore, but when
0:27:16we did, he provided the cover pictures for the workshop notebooks, as he
0:27:22surely wanted to. One thing to note, for understandable reasons, I guess, is the big
0:27:28increase in participation after two thousand one,
0:27:32and the point I should note is that handling
0:27:36the scores of participating sites becomes a management problem. It's a lot more work doing
0:27:41the evaluation with fifty eight participants than with one dozen participants. And, you know,
0:27:48this is actually a
0:27:50pun: handling the scores of the participants, that is, handling the
0:27:55trial scores of all these participants, or handling scores of participants, in the sense of
0:28:00many dozens of participants.
0:28:05So this is one of Doug's cover slides from two thousand four, showing logos of
0:28:11all the sites, and in the centre is a DET curve
0:28:15for the condition of primary interest, the common condition. Well,
0:28:23this one is from two thousand six.
0:28:27Thanks to Doug for those efforts.
0:28:29So here it is, the graph.
0:28:32Ninety two and ninety five were
0:28:34outside the series and had a limited number of participants. Twenty eleven was the BEST evaluation;
0:28:40it also was limited to a very few participants.
0:28:45Otherwise, you can see the trend... particularly the trend after two thousand one, growing
0:28:50to
0:28:52fifty eight in twenty ten. For the twenty twelve evaluation, registration is open, has been
0:29:00open over the summer, and the last count I had is thirty eight, and I expect
0:29:04that's going to grow.
0:29:09So, this is a slide from the
0:29:14two thousand one presentation at Odyssey that described the evaluations up to that point.
0:29:23In the center is the number of target speakers and trials. So the first, the ninety
0:29:28six evaluation, on Switchboard one, had forty speakers that had really a lot of conversations,
0:29:36and one of the trends in the later evals was toward more speakers, up to eight
0:29:39hundred by two thousand.
0:29:44We... in each case defined a primary condition,
0:29:50whether we were basing that on the number of handsets in training,
0:29:56or whether we... we emphasized same or different phone number trials. We were looking
0:30:01at the issues of electret versus
0:30:04carbon button, which was a big issue in the days of landline phones. So,
0:30:12this specifies the primary conditions and evaluation features for these early evaluations.
0:30:23Here is an attempt, without putting in numbers, to update some of that for the
0:30:31evaluations after two thousand one.
0:30:36We ended up calling the primary condition a common condition, one that everyone ran; that
0:30:44was true for the official chart, where we first evaluate over all conditions. When
0:30:52we introduced different languages, the common condition involved English only, and all kinds of
0:30:59handsets, so we could stay on known and well-understood problems.
0:31:06And on the right you see some of the other features that came in anew.
0:31:10Cellular data was added; multilingual data
0:31:15came in in two thousand five.
0:32:20In two thousand six we had some microphone tests
0:31:27and then
0:31:28things only got more complicated in the most recent evaluations
0:31:32In terms of common conditions: in two thousand eight we had eight common conditions;
0:31:37in two thousand ten we had nine common conditions; in two thousand twelve, five common conditions.
0:31:47So in o eight, we contrasted English and bilingual, and contrasted interview
0:31:53with conversational telephone speech.
0:31:56In two thousand ten we were contrasting different telephone channels, interview and conversational speech, and high,
0:32:02low and normal vocal effort. In two thousand twelve we get interview tests without noise,
0:32:08with added noise, or repeated with added noise, and conversational phone tests collected in
0:32:14a noisy environment.
0:32:19Two thousand eight and ten involved interviews collected over multiple microphone channels.
0:32:29Two thousand ten, of course, added high and low vocal
0:32:33effort, and aging with the Greybeard corpus. Two thousand ten also introduced HASR. Two thousand
0:32:40twelve offers more target speakers, specified in advance.
0:32:49So, something about performance factors.
0:32:52I'll try not to say too much on this, but in terms of what we've
0:32:55looked at over the years, we've tried to look at demographic factors
0:33:00like sex: in general, though there have been exceptions, performance has been a bit
0:33:05better on male speakers than on female. Early on I would look at age, and George
0:33:11more recently has done a study of age in a recent evaluation; he may say something
0:33:16about that tomorrow. Education is a factor we haven't looked into too much. One very interesting thing
0:33:21in the early evaluations was to look at mean pitch
0:33:27in test segments and training, and
0:33:31to split the non-target trials between
0:33:34those of similar pitch and those where the pitch is not similar, not close. The difference... and even
0:33:41more interesting, to look at target trials where the mean pitch was or was not
0:33:46similar for the same person; that difference was serious.
0:33:52Speaking style:
0:33:56conversational telephone versus interview, particularly... a lot of data has been collected on that. Vocal
0:34:04effort, more recently; there are questions about
0:34:06defining vocal effort and how to collect it. Aging, with the Greybeard corpus... limited;
0:34:14collecting it over time is difficult. These are the intrinsic factors, related to the speaker.
0:34:22The other category, extrinsic factors, relates to the collection, by microphone or telephone channel. Telephone
0:34:28channel: landline, cellular; VOIP is something to work on. In earlier times, as I said, carbon versus
0:34:36electret telephone handset types; various types of microphones in the recent evaluations, and matched
0:34:43versus mismatched microphones. Placement of the microphone relative to the speaker, and
0:34:49background noise and room reverberation.
0:34:53Craig will talk about that tomorrow, as it was key in BEST.
0:34:59And finally, parametric factors. Duration of training and test, and also the number of training segments,
0:35:06the training sessions:
0:35:10evaluations that had eight sessions of training for telephone speech showed greatly improved performance. We
0:35:16kept carrying along, for many years, ten seconds as the short duration condition, but there's
0:35:20also the increase in duration; especially in twenty twelve, we're gonna have lots of sessions
0:35:26and durations
0:35:28in training, and I think perhaps the emphasis now, more than before, is on
0:35:33seeing the effects of multiple sessions and more data in evaluation. English, of course, has
0:35:42been the predominant language, but several of the evaluations included a variety of other languages,
0:35:48and one of the hopes is that performance will be as good in every language as in English.
0:35:53We have suspected that the reason overall performance has been better in English is
0:35:58the regularity and greater quantity of the data available in English. Cross-language
0:36:04trials are a separate challenge.
0:36:09the metrics
0:36:13Mention equal error rate: it is with us, it's part of our lives, in a
0:36:17sense. I've tried to discourage it, but... It is easy to understand
0:36:28in some ways,
0:36:30and it requires the least amount of data,
0:36:33but, you know, it doesn't deal with calibration issues, and basically the operating point of equal
0:36:38error rate is not the operating point of applications.
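For reference, a minimal sketch of how the equal error rate is usually estimated from pooled trial scores (a simple threshold sweep; real toolkits interpolate more carefully, and the function name is my own):

```python
def eer(target_scores, nontarget_scores):
    """Estimate the equal error rate: sweep the decision threshold over
    the pooled scores and return the average of P_miss and P_fa at the
    point where the two error rates are closest to each other."""
    best_gap, best_eer = None, None
    for t in sorted(target_scores + nontarget_scores):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2.0
    return best_eer
```

Note that nothing here depends on the scores being calibrated, which is exactly why EER says nothing about calibration.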
0:36:44A high target
0:36:49prior... prior probabilities of the target may be high or may be low, not really equal. The decision cost has been our
0:36:58mainstay, our bread and butter; we'll hear more about that. CLLR has been championed by
0:37:04Niko; we talked about it on
0:37:07Monday. And we've talked about just looking at the false alarm rate at a fixed miss rate,
0:37:12which we returned to in BEST. So, you all know about the decision cost function:
0:37:18it's the sum weighted by the specified parameters.
0:37:22First we normalize it by the cost of a system that has no intelligence but
0:37:28simply always decides yes, or always decides no, so the worst possible normalized score is one.
0:37:36So the parameters that were mentioned in ninety six: these were the parameters from ninety
0:37:40six to two thousand eight.
0:37:43In twenty ten we changed them for the main conditions, for the core and extended tests:
0:37:48we changed to
0:37:49cost of a miss one, false alarm one, and probability of target point zero zero one.
0:37:56That was the driving force, and a lot of people
0:38:00were upset; there was scepticism about whether you could
0:38:06create systems for that. I think the outcome has been relatively satisfactory; I think people
0:38:14feel that they developed good systems
0:38:17for this.
0:38:22Niko talked about
0:38:24CLLR; he noted that George suggested limiting CLLR to
0:38:31a false alarm rate range; it covers a broad range of operating points.
0:38:37A fixed miss rate, we said, has its roots in ninety six, but
0:38:40is used in twenty twelve. It's practical for applications; it may be viewed as the cost
0:38:45of listening to false alarms. For some conditions... conditions that are really good, you see, you can't get
0:38:53a ten percent miss rate; maybe for those a one percent miss rate is appropriate.
0:38:59Measuring progress:
0:39:01how do we do that? It's always difficult to assure test set comparability; if you're
0:39:07collecting data the same way as before, is it really an equal test set? Well, we encourage
0:39:11participants in the evaluations to run their prior systems, their old systems,
0:39:15on new data, which gives us some measure.
0:39:18But, even more, it's been a problem with changing technologies. You know, in ninety six landline
0:39:24phones predominated, and we dealt with carbon and electret.
0:39:28Now the world is largely cellular, and we need to explore VOIP, the new
0:39:34channel. So, the technology keeps changing, and with progress we want to make the test harder.
0:39:41We always want to add new evaluation conditions, new bells and whistles:
0:39:44more channel types, more speaking styles, languages... and the size of the evaluation data increases.
0:39:52In two thousand eleven, we explored externally added noise and reverb. The noise will continue
0:39:58this year. So, Doug attempted in two thousand
0:40:04eight
0:40:06to look at this, to explore fixed conditions over the course of years and look at
0:40:10the best system.
0:40:12And here is an updated version of his slide, showing, for more or less fixed conditions,
0:40:20on a logarithmic scale,
0:40:24the DCF, I believe,
0:40:26where things stood. These numbers go up to two thousand six.
0:40:32With added data on the right, two thousand eight showed
0:40:37some continued progress on various test conditions. Then in twenty ten
0:40:43we threw in the new measure. That really messes things up: numbers went up, but
0:40:49they're not directly comparable. This is the current
0:40:58version of our history slide tracking progress.
0:41:02So let's, you know, turn to the future:
0:41:08SRE twelve.
0:41:10The target speakers,
0:41:11for the most part, are specified in advance. They are speakers from recent past evaluations; I think
0:41:17it's something on the order of two thousand. That is the list of potential
0:41:23target speakers. So, sites can know about these targets; they have all the data; they can
0:41:29develop their systems to take advantage of that. All prior speech is available for training.
0:41:34There will be some new target speakers with training data provided at evaluation time; that's
0:41:39one check on the effect of providing the targets in advance. We also have
0:41:46test segments that will include non-target speakers.
0:41:53That is the big change for twenty twelve. Also, new interview speech will be provided,
0:41:59as was mentioned yesterday, in sixteen-bit
0:42:01linear PCM.
0:42:04Some of the test phone calls are going to be collected specifically in noisy environments.
0:42:11And moreover, we're going to have artificially
0:42:14added noise in some of the test segments. Another challenge
0:42:23for this community. But will this be an effectively easier task
0:42:29because we define the targets in advance?
0:42:33It makes the trials partially closed-set: you are
0:42:40allowed to know not only about the one target, but about the two thousand other targets.
0:42:45Will that make a difference? We have open workshops,
0:42:49workshops where the participants debate these things. Last December this got debated: how much
0:42:57will this
0:42:58change the systems? Will it make the problem too easy?
0:43:04We could have conditions where people are asked to assume
0:43:10that the segment's speaker is among the targets, making things fully closed-set, or to assume no
0:43:15information about targets other than that of the actual trial.
0:43:20Clearly nothing like this was done in past evaluations, so as people do this, their results will provide a basis
0:43:25for comparison. This is what remains to be
0:43:31investigated, to be seen, in SRE twelve. In terms of metrics, log-likelihood ratios
0:43:40are now required. And since we're doing that, no hard decisions are asked for.
0:43:48In terms of the primary metric,
0:43:53we could just use the
0:43:56DCF of
0:43:57twenty ten, but Niko pointed out that you're not really required to calibrate your
0:44:06log-likelihood ratios if you're only using them at one operating point.
0:44:12So therefore,
0:44:17to require calibration and stability, we're actually going to have two DCFs and take the average
0:44:24of them. Also, Cllr is an alternative. Cllr-M10, which Niko referred to,
0:44:32limits the Cllr to trials with
0:44:41high miss rates.
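[To make the Cllr metric concrete, here is a sketch of the standard definition over natural-log likelihood-ratio scores; this illustration is the editor's, not part of the talk's materials.]

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood ratio, in bits: the average of the mean
    target-trial penalty and the mean non-target-trial penalty.
    Scores are natural-log likelihood ratios; lower Cllr is better,
    and a system that outputs llr = 0 everywhere scores exactly 1.0."""
    c_tar = sum(math.log2(1.0 + math.exp(-s)) for s in target_llrs) / len(target_llrs)
    c_non = sum(math.log2(1.0 + math.exp(s)) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non)
```

[A restricted variant such as the Cllr-M10 mentioned above would compute the same sums over a subset of trials, e.g. the high-miss-rate region; the exact restriction is not spelled out in the talk.]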
0:44:47This is the formula for the DCF. We have three parameters, but we're really working with the one
0:44:51parameter, beta, and so
0:44:55the cost function is the simple average of DCF one and DCF two, with costs of
0:45:01one, where the target priors are either point zero one, as in twenty ten, or point
0:45:07zero zero one.
0:45:11That will be
0:45:13the official metric.
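[A sketch of that averaged metric, based on the editor's reading of the talk: unit miss and false-alarm costs, target priors 0.01 and 0.001, and hard decisions taken at the Bayes threshold implied by each prior. Treat the exact parameter values as assumptions.]

```python
import math

def norm_dcf(target_llrs, nontarget_llrs, p_target):
    """Normalized detection cost at prior p_target with unit miss and
    false-alarm costs: P_miss + beta * P_fa, where beta folds the cost
    parameters into one factor, and decisions come from thresholding the
    log-likelihood-ratio scores at the Bayes threshold log(beta)."""
    beta = (1.0 - p_target) / p_target
    threshold = math.log(beta)
    p_miss = sum(1 for s in target_llrs if s < threshold) / len(target_llrs)
    p_fa = sum(1 for s in nontarget_llrs if s >= threshold) / len(nontarget_llrs)
    return p_miss + beta * p_fa

def primary_metric(target_llrs, nontarget_llrs):
    """Simple average of the two normalized DCFs, at target priors
    0.01 (as in 2010) and 0.001."""
    return 0.5 * (norm_dcf(target_llrs, nontarget_llrs, 0.01) +
                  norm_dcf(target_llrs, nontarget_llrs, 0.001))
```

[Because two operating points are scored at once, a system has to produce well-calibrated log-likelihood ratios rather than tune a single hard threshold, which is the motivation given in the talk.]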
0:45:17And finally, what does the
0:45:20future hold?
0:45:22That, of course,
0:45:24none of us knows, but
0:45:27the outcome of twenty twelve will determine whether this whole
0:45:34idea of prespecified targets is
0:45:37an effective one that doesn't make the problem too easy. And
0:45:43artificially added noise will be included now; added noise and reverb may be part of
0:45:49the future.
0:45:50HASR will be repeated as HASR twelve. HASR ten had two tests, of fifteen trials
0:45:56or a hundred and fifty. HASR twelve will have either twenty or two hundred, and
0:46:04anyone, especially those with forensic interests, but really anyone interested in human-assisted
0:46:10systems, is invited to participate in HASR twelve. I would like to get more participation
0:46:15this year.
0:46:17Where is SRE heading?
0:46:21The answer is:
0:46:24it just gets bigger.
0:46:27Fifty or more participating sites. Data volume is now getting up to terabytes,
0:46:34though the evaluation data won't grow that much this year, in twenty twelve, because
0:46:41mostly prior data will be reused as training data. But, you know, the numbers of test
0:46:48segments are in the hundreds of thousands, and the number of trials
0:46:53is going to be in the millions, tens of millions, even hundreds of millions for the
0:46:58optional full sets of trials.
0:47:05Likely you'll see the schedule moving to an every-three-years one, but the details really need
0:47:11to be
0:47:14worked out a lot more.
0:47:19I don't know, but I think that's where
0:47:23I'll finish.
0:48:00segments for a speaker
0:48:03know about our speakers
0:56:31I didn't say they are a normal curve.
0:58:08Right, so LDC has an agreement with the sponsors who support the LRE and SRE evaluations
0:58:14that we will keep the most recent evaluation set blind, holding it back from publication
0:58:20and the general LDC catalog until the new data set has been created, and so
0:58:25part of the timing of the publication of those eval sets is
0:58:30the requirement to have a new blind set,
0:58:35which is the current evaluation set.
0:58:39We can raise that issue with the sponsors and get back to you with what they say.
0:58:56Right, so the SRE twelve eval set is just being finished now;
0:59:02as soon as that's finalised,
0:59:05SRE ten will be put into the queue for publication.
0:59:13It sort of rolls along in a cycle.
0:59:21You'll have to ask the sponsor about that; I can't speak to their motivation, only that
0:59:25they're contractually obligated to delay the publication, as discussed.
0:59:54Right, well, LDC is also balancing the needs of the consortium as a whole, and
0:59:59so we are staging publications in the catalog, balancing a number of factors. The
1:00:05speaker recognition and language recognition communities
1:00:07are among the communities that we support. I hear your concern; we can
1:00:11certainly raise this issue with the sponsors and see if there's anything
1:00:16we can provide. But at this point, I think this is the
1:00:20strategy that stands.
1:03:02this one