0:00:28Good morning ladies and gentlemen,
0:00:30welcome to the third day of your Odyssey Workshop.
0:00:35Out of fifty one papers, twenty seven have been presented over the last two days
0:00:41and we have another twenty four to go, if I'm doing the calculation right. And
0:00:48yesterday the papers were mainly on i-vectors, so we can say yesterday
0:00:53was the i-vector day. Today, except for one paper, there are two major
0:01:00sessions: one is language recognition evaluation and the other is features for speaker recognition.
0:01:08My name is Ambikairajah, I'm from the University of New South Wales in Sydney, Australia.
0:01:12I have the pleasure of introducing to you our plenary speaker for today, doctor Alvin Martin.
0:01:19Alvin will speak about the NIST speaker recognition evaluation plan for two thousand twelve.
0:01:28He has coordinated the NIST series of evaluations since nineteen ninety six in the areas
0:01:35of speaker recognition and language and dialect recognition. The evaluation work he's involved in includes
0:01:41collection, selection and preprocessing of data, writing the evaluation plan, evaluating
0:01:50the results, coordinating the workshop, and many more tasks.
0:01:57He served as a mathematician in the Multimodal Information Group at NIST from nineteen
0:02:04ninety one to two thousand eleven.
0:02:08Alvin holds a Ph.D. degree in mathematics from Yale University. Please join me in
0:02:14welcoming doctor Alvin Martin
0:02:25Okay! Thank you! Thank you for that introduction and thank you for the invitation
0:02:32to do this talk. I'm here to talk about the speaker evaluations and, as you
0:02:39know, I have been
0:02:42at NIST
0:02:43and I remain
0:02:46associated with NIST for this workshop; however,
0:02:51I am here
0:02:53independently, so for everything I say,
0:02:58I'm responsible and no one else is; the opinions are all my own.
0:03:11I guess I... I don't think I'm subject to any restrictions, but
0:03:15I'll watch the clock.
0:03:25I'll stay close to this. An outline of the
0:03:29topics I hope to cover: I'm gonna talk about some early history, things that preceded the
0:03:35evaluations, and the current series of evaluations, the things that happened during the early times of
0:03:42the evaluations,
0:03:44giving kind of a history of the evaluations and, in part, of past Odysseys
0:03:52and who was involved. I should note my debt to Doug Reynolds, who gave a
0:04:01talk on these matters four years ago in Stellenbosch, and I will update one of
0:04:08the slides that
0:04:11he presented there. I'm gonna say some things from the point of view of an evaluation
0:04:18organiser, about evaluation organisation; say something about performance factors to look at, something
0:04:26about metrics, which we've already talked about at this workshop; say something about
0:04:34measuring progress over time;
0:04:36and then talk about the future, including the SRE twelve evaluation process currently going on,
0:04:44which will take place at the end of this year, and
0:04:47about what might happen after this year.
0:04:53The early history:
0:04:57the things I would mention...
0:04:59One thing in the background of the speaker recognition evaluations is the success of the speech recognition evaluations
0:05:10in... in the eighties and the early nineties. NIST was
0:05:13very much
0:05:15involved in these, and they showed the benefits of independent evaluation on common data sets.
0:05:21I'll show a slide of that in a minute.
0:05:24I will mention the collection of various early corpora that were appropriate for speaker recognition:
0:05:30TIMIT, KING and YOHO, but most especially Switchboard. It was a multi-purpose corpus that was
0:05:37collected around nineteen ninety one, so one of the purposes that they had in mind
0:05:41was speaker recognition, collected conversations from a large number of speakers so that you have
0:05:49multiple conversations for each speaker. Its success led to the later collection of Switchboard two and similar
0:06:00collections. And in fact, in the aftermath of Switchboard, the Linguistic Data Consortium was created
0:06:09in nineteen ninety two with the purpose of supporting further speech and also text
0:06:16collections in the... in the United States. And on to the first Odyssey; it wasn't called
0:06:23Odyssey, it was Martigny in nineteen ninety four, followed by several others. I will
0:06:30show pictures and make a few remarks on those. And there were early NIST evaluations.
0:06:36We date the current series of speaker evaluations to nineteen ninety six, but there were evaluations
0:06:41in ninety two and ninety five. There was a DARPA program evaluation at several sites as part of
0:06:47the DARPA program in ninety two. In ninety five there was a preliminary evaluation that
0:06:53used Switchboard one data at six sites. But in these earlier evaluations the emphasis
0:07:00was rather on speaker identification,
0:07:03on closed-set rather than on the open-set recognition that we've come... to know in ...
0:07:11in the series of evaluations.
0:07:17So here's this favourite slide on speech recognition, the Benchmark Test history. Here, you
0:07:28know, the word error rate is on a logarithmic scale,
0:07:34start from nineteen eighty eight
0:07:39and this shows best system performance in various evaluations, various conditions, in successive years,
0:07:47or years when evaluations were held. Worth pointing out, of course, is the big fall
0:07:52in error rates when multiple sites participated on common corpora and we looked at error
0:07:59rates, and
0:08:00with roughly fixed conditions we could see progress being evident; this is showing the
0:08:06early series.
0:08:10This is where we arrived at the evaluation cycle: research, collect data, evaluate, show
0:08:16progress. That gave inspiration to other evaluations, and in particular, the speaker evaluations.
0:08:28Okay, so now
0:08:31let's take a walk down memory lane.
0:08:35the first
0:08:36workshop of this series was Martigny in nineteen ninety four
0:08:43It was called the Workshop on Automatic Speaker Recognition, Identification and Verification,
0:08:48and that workshop, you know, was the very first of this series. It was
0:08:54reasonably well attended, but not as well as this one. And there were various presentations
0:08:59using many different corpora, many different performance measures, and it was very difficult
0:09:04to make meaningful comparisons. I present here one of the papers of
0:09:10interest from the NIST evaluation point of view: a paper on public databases
0:09:17for speaker recognition and verification was given there.
0:09:26And to recall another of the early ones... Avignon, nineteen ninety eight. Speaker Recognition
0:09:32and its Commercial and Forensic Applications is what it was called. It was also known
0:09:38as RLA2C from the French title,
0:09:43and one observation, in terms of the talks there, is that
0:09:47TIMIT was a preferred corpus,
0:09:52which for many was
0:09:53too clean, too easy a corpus. I remember Doug commenting that he didn't wanna listen
0:09:58anymore to papers that described results on TIMIT. It was also characterized by sometimes bitter debate over
0:10:08forensics and how good a job forensic experts could do at speaker recognition.
0:10:18There were
0:10:23NIST speaker evaluation related papers... actually, three of them, that were combined into
0:10:33one paper in Speech Communication.
0:10:36Of the three presentations, perhaps most memorable was the one by George Doddington, who told us
0:10:43all how to do speaker recognition evaluation.
0:10:48So, this was a talk that laid out various principles, and most of those
0:10:53principles have been kept and followed in our evaluation series. It included a discussion of the
0:11:00one golden rule, the rule of thirty.
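As usually stated, the rule of thirty says you need at least thirty errors of the rarer kind before an estimated error rate can be trusted to within about plus or minus thirty percent at ninety percent confidence. A minimal sketch of what that implies for test set size (the function name is my own):

```python
import math

def min_trials(expected_error_rate, min_errors=30):
    """Doddington's "rule of 30" (as usually stated): to be ~90% confident
    that the true error rate is within +/-30% of the observed one, the
    test should produce at least 30 errors, which implies a minimum
    number of trials for a given expected error rate."""
    return math.ceil(min_errors / expected_error_rate)

# To measure a 1% false alarm rate you'd need at least 3000 non-target trials:
print(min_trials(0.01))  # -> 3000
```

So the rarer the error you want to measure, the more trials the evaluation must contain.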
0:11:07Crete, two thousand one
0:11:10Two thousand one, A Speaker Odyssey: it took the official name, the Speaker Recognition Workshop. That was
0:11:15the first official Odyssey.
0:11:17It was characterized by more emphasis on evaluation. There was an evaluation track that was
0:11:22pursued, that NIST was
0:11:25involved with.
0:11:31So, one of the presentations, the NIST presentation, I think I
0:11:37gave it, covered
0:11:40the history of NIST evaluations up to that... that point, and I will actually
0:11:46show a slide from there later on.
0:11:51A key presentation was... was one by several people from the Department of Defense: Phonetic, idiolectal
0:11:58and acoustic speaker recognition. These were ideas that were being pursued at
0:12:03the time and that were influencing the course of research at that point. I think
0:12:07George had a lot to do with that. He had the paper
0:12:13on idiolectal techniques as well.
0:12:21Toledo in two thousand and four,
0:12:26I think was really where Odyssey came of age
0:12:32It was... it was well attended, I think it probably remains
0:12:37the most
0:12:39highly attended of the Odysseys. It was the first Odyssey in which we had the
0:12:45SRE workshop, held in conjunction at the same location. That was to be repeated in
0:12:51Puerto Rico in two thousand six and Brno in two thousand ten. It was also
0:12:57the first
0:12:58Odyssey to include language recognition sessions. It had two notable keynotes on forensic recognition,
0:13:10a topic debated earlier in Avignon; these were two excellent, well received talks. And since then, Odyssey
0:13:17has been an established biennial event, held every two years.
0:13:26And at this one there was a presentation, which I think Mark Przybocki and I gave, called The Speaker
0:13:32Recognition Evaluation Chronicles. And it was to be reprised, I think, about two years
0:13:39later in Puerto Rico. So, Odyssey has marched on.
0:13:49Two thousand six was in Puerto Rico; I found, incredibly, a picture of it. Two
0:13:55thousand and eight, Stellenbosch, hosted by Niko. Twenty ten, two years ago, we were in
0:14:02Brno; this is the logo designed by Honza's children. And now we're here in Singapore,
0:14:11and I think
0:14:12before we finish this workshop we will hear about plans for Odyssey in twenty fourteen.
0:14:22Okay! Let's move on to talk about organisation.
0:14:26to think about evaluation from the viewpoint of
0:14:30the part of the organisation responsible for organising evaluations. The questions are: which tasks are we
0:14:36to do, what are the key principles, and some of the milestones, taken directly
0:14:44from the different evaluations; and I'll talk about participation.
0:14:53So which speaker recognition problem? These are research evaluations, but what is the application environment
0:15:02in mind? Well, we know what we have done, but it wasn't necessarily obvious
0:15:08before we started. It could be access control, the important commercial application; that might
0:15:14have formed the model. It would raise the question of text independent or text dependent;
0:15:22for some problems, I think, we should do text-dependent. Part of
0:15:26access control is that the
0:15:27prior probability of the target tends to be high.
0:15:32There are forensic applications that could theoretically be the model, or there's speaker spotting,
0:15:39which of course is the way... sometimes the way we went. Inherently in speaker spotting
0:15:43the prior probability of the target is low, and it's text independent.
0:15:49Well, in ninety six... and we'll look at the ninety six evaluation plan... it was
0:15:53stated that the NIST evaluations would concentrate on speaker spotting, emphasising the low false alarm
0:16:01area of the
0:16:04performance curve.
0:16:08Some of the principles have been: speaker spotting
0:16:12is our primary task;
0:16:17we were research system oriented, you know, application inspired but aimed at research.
0:16:25NIST traditionally, with some exceptions, doesn't do product testing; you do the
0:16:31evaluations to advance the technology. We established the principle that we're gonna pool across
0:16:36target speakers:
0:16:38people had to
0:16:41get scores that would work independent of the target speaker, rather than having a performance
0:16:49curve for every speaker and then just averaging performance curves;
0:16:52and we emphasized the low false
0:16:58alarm rate region. Both scores and decisions were required, and in that context, as
0:17:09Niko suggested, and George is gonna talk about tomorrow, calibration matters. It is part
0:17:15... part of the problem to address.
0:17:20Some basics... Our evaluations were open to all willing participants, to anyone who,
0:17:27you know, followed the rules, could get... get the data, run all the trials
0:17:33and come to the workshop. We are research oriented; we have tried to
0:17:40discourage commercialised competition: we don't want people saying in advertisements that they won the NIST eval.
0:17:51Our evaluations are each featured with an evaluation plan that specifies all the rules and
0:17:58all the details of the evaluation; we'll look at one.
0:18:02Each evaluation is followed by a workshop.
0:18:05These workshops were limited to participants plus interested government organizations, and every site or team
0:18:12that participates is expected to be represented. At them we talk meaningfully about
0:18:18the evaluations and systems. The evaluation datasets are subsequently published, made publicly available by the
0:18:30LDC. That remains the aim... remains the case: the SRE
0:18:39o eight data is currently available. In particular, sites getting started in research may wanna
0:18:45obtain it, and are able to. Typically, we'd like to have not the
0:18:50most recent eval, but the next most recent eval, in this case that's o eight,
0:18:53available publicly. Probably next year SRE o ten will be made available; hopefully the LRE
0:19:02o nine, to mention a language eval, will soon become available.
0:19:13We have a web page for this;
0:19:20the page for the speaker evals lists past speaker evals, and for each year you
0:19:24can click and get the information on the evaluation for that year.
0:19:28It starts in nineteen ninety seven. For some reason, the nineteen ninety six evaluation plan seems
0:19:37to have been lost, but I asked Craig to search for it and he found it,
0:19:42so I hope that it will get put out.
0:19:47So what went into the evaluation plan, the first evaluation plan of the current series? In it
0:19:52we said the emphasis would be on issues of handset variation and test segment duration.
0:19:56The traditional goals, as stated, were to drive the technology forward, measure the state of the art, and find the
0:20:02most promising approaches.
0:20:06The task has been detection of a hypothesized speaker in a segment of conversational speech over the telephone.
0:20:11That's been expanded, of course, in recent years. Interestingly, are you surprised to see this?
0:20:18The research objective, given an overall ten percent miss rate for target speakers, is to
0:20:24minimize the overall false alarm rate.
0:20:29That is, actually, what we said in ninety six. It is not what we emphasized
0:20:33in the years since.
0:20:38This past year, as you heard, in the BEST evaluation, that was made the official metric...
0:20:43Craig is gonna talk about the BEST evaluation tomorrow. So in that sense we've come full circle.
0:20:54But this also mentions that performance is expressed in terms of the detection cost function,
0:21:00and that researchers should then minimize DCF. It also specifies a research objective that I would
0:21:06not naturally emphasize, and I don't think we achieved it: uniform performance across all target
0:21:11speakers. There have been some investigations about classes of speakers,
0:21:18sometimes attributed to Doddington: different
0:21:22types of speakers and their different levels of difficulty.
0:21:31So again, the task is: given a
0:21:34target speaker and a test segment,
0:21:37decide whether the hypothesis that it's that speaker is true or false.
0:21:43We measured performance in two related ways: detection performance from the decisions, and detection performance
0:21:50characterized by a ROC
0:21:53(the word then was ROC).
0:21:56Here is the DCF formula we're all familiar with. We have parameters: the cost of a miss,
0:22:06which was then expressed as ten, the cost of a false alarm as one, and the prior probability of a target,
0:22:13expressed as point zero one. We also, in those days, computed the DCF for a range
0:22:18of P target; in a sense we return to that idea in the current
0:22:24evaluation cycle.
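As a sketch of what that cost function computes, using the parameter values just mentioned (C_miss = 10, C_fa = 1, P_target = 0.01) and normalizing by the cost of the better trivial system:

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Normalized NIST-style detection cost function with the parameters
    described above (the values used from ninety six through o eight).
    The normalizer is the cost of the best "no-intelligence" system:
    one that always says "no" (cost C_miss * P_target) or always says
    "yes" (cost C_fa * (1 - P_target))."""
    raw = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return raw / min(c_miss * p_target, c_fa * (1.0 - p_target))

# A system with a 10% miss rate and a 1% false alarm rate:
print(round(dcf(0.10, 0.01), 3))  # -> 0.199
```

With these parameters the normalizer is the always-"no" system, whose normalized cost is one.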
0:22:29Here we say our ROC will be constructed by pooling decision scores;
0:22:35these scores will then be sorted and plotted on PROC plots.
0:22:42PROCs are ROCs plotted on normal probability
0:22:47plots. So this was, in nineteen ninety six, the term for what we now
0:22:53all refer to
0:22:56as DET plots.
0:23:01We talked about various conditions... results by duration and test type. The task
0:23:12required explicit decisions,
0:23:14and the scores of multiple target speakers are pooled before plotting the PROCs. So that
0:23:21requires score normalization across speakers. That was the key emphasis that was new in
0:23:27the ninety six evaluation.
0:23:30Now we honor the term DET curve, following the nineteen ninety seven Eurospeech paper,
0:23:38which introduced... used the term DET curve, for detection error tradeoff. I think George
0:23:45had a role in choosing that name.
0:23:49George is one person involved; another, you may know, is Tom Crystal, encouraging
0:23:56the use of... of this kind of curve that linearizes
0:24:01performance curves, assuming normal distributions.
0:24:08I was surprised to find that there's a Wikipedia page for DET plots. So, this
0:24:16is the page showing the linearizing effect.
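The warping is just the ROC's two error rates passed through the inverse normal CDF. A minimal sketch of computing DET-plot coordinates from raw trial scores (the function name is my own, not from any NIST tool):

```python
from statistics import NormalDist

def det_points(target_scores, nontarget_scores):
    """Sweep a decision threshold over the pooled scores and return
    (probit(P_fa), probit(P_miss)) pairs. On these warped axes a
    system whose score distributions are Gaussian traces a straight
    line, which is the linearizing effect of the DET plot."""
    probit = NormalDist().inv_cdf  # inverse of the standard normal CDF
    points = []
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if 0.0 < p_miss < 1.0 and 0.0 < p_fa < 1.0:  # probit undefined at 0 and 1
            points.append((probit(p_fa), probit(p_miss)))
    return points
```

Plotting these points, with axis ticks relabeled in probability units, gives the familiar DET plot.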
0:24:23Okay, now let's talk about milestones.
0:24:27These are the ones I sorted out; others may choose different ones. But, you know, we noted that
0:24:33we had earlier evaluations in ninety two and ninety five; the first in the series
0:24:37was in ninety six.
0:24:38Two thousand was the first in which we had a language other than English; we used
0:24:42Spanish data, along with other data. Two thousand one was,
0:24:48rather late, since we were in the United States, the first evaluation with cellular phone data. In two
0:24:55thousand one we also started providing ASR transcripts, errorful transcripts. We had a kind of limited
0:25:01forensic evaluation using a small FBI database in two thousand two. Also in two thousand two
0:25:08there was the SuperSID workshop, one of the Johns Hopkins workshop projects; it followed
0:25:14the SRE and helped to advance the technology. Other Baltimore workshops also followed up on
0:25:22speaker recognition; many people here participated. Two thousand five:
0:25:27first multiple languages, bilingual speakers
0:25:34in the eval... Also the first microphone recordings of telephone calls, which therefore included some
0:25:43cross-channel trials. Interview data, as with the Mixer corpora, came in two thousand eight
0:25:48and was used again in two thousand ten. Two thousand ten involved the
0:25:53new DCF, the cost function stressing even lower false alarm rates; a little more
0:25:58about that later. Also in two thousand ten, and there are lots of things coming out
0:26:03in the recent years, we have been collecting high and low vocal effort data, also
0:26:09some data that looks at aging. Two thousand ten also featured HASR, the human assisted
0:26:14speaker recognition evaluation, a small set that invited some systems involving humans as well as
0:26:23automatic systems.
0:26:25Twenty eleven was BEST. We had a broad range of test conditions, including added noise
0:26:30and reverb; Craig will be telling you about that tomorrow.
0:26:34Twenty twelve is gonna involve target speakers defined beforehand.
0:26:52To begin with... the number fifty eight... we have it in... these numbers are
0:26:57all a little fuzzy in terms of what's a site, what's a team, but I
0:27:03think of these numbers like... these are the ones that Doug used a few years
0:27:06ago, and I updated them. Fifty eight in twenty ten.
0:27:10Doug at MIT has provided... I think we're not doing physical notebooks anymore, but when
0:27:16we did, he provided the cover pictures for the workshop notebooks, as he
0:27:22surely wanted to. One thing to note, for understandable reasons, I guess, is the big
0:27:28increase in participation after two thousand one,
0:27:32and the point I should note is that handling
0:27:36the scores of participating sites becomes a management problem. It's a lot more work doing
0:27:41the evaluation with fifty eight participants than with one dozen participants. And, you know,
0:27:48this is actually a
0:27:50pun: handling the scores of the participants, that is, handling the
0:27:55trial scores of all these participants, or handling scores of participants, in the sense of
0:28:00many dozens of participants.
0:28:05So this is one of Doug's cover slides from two thousand four, showing logos of
0:28:11all the sites, and in the centre is a DET curve
0:28:15for the condition of primary interest, the common condition. Well,
0:28:23this one is from two thousand six.
0:28:27Thanks to Doug for those efforts.
0:28:29So here it is, the graph.
0:28:32Ninety two and ninety five were
0:28:34outside the series and had a limited number of participants. Twenty eleven was the BEST evaluation;
0:28:40it also was limited to a very few participants.
0:28:45Otherwise, you can see the trend... particularly the trend after two thousand one, growing
0:28:50to
0:28:52fifty eight in twenty ten. For the twenty twelve evaluation, registration is open, has been
0:29:00open over the summer, and the last count I had is thirty eight, and I expect
0:29:04that's going to grow.
0:29:09So, this is a slide from the
0:29:14two thousand one presentation at Odyssey that described the evaluations up to that point.
0:29:23In the center is the number of target speakers and trials. So the first, the ninety
0:29:28six evaluation, on Switchboard one, had forty speakers that had really a lot of conversations,
0:29:36and one of the trends in the later evals was toward more speakers, up to eight
0:29:39hundred by two thousand.
0:29:44We... in each case defined a primary condition,
0:29:50whether we were basing that on the number of handsets in training,
0:29:56or whether we... we emphasized same or different phone number trials. We were looking
0:30:01at the issues of electret versus
0:30:04carbon button, which was a big issue in the days of landline phones. So,
0:30:12this specifies the primary conditions and evaluation features for these early evaluations.
0:30:23Here is an attempt, without putting in numbers, to update some of that for the
0:30:31evaluations after two thousand one.
0:30:36We ended up calling the primary condition a common condition, one that everyone ran; that
0:30:44was true for the official chart, where we first evaluate over all conditions. When
0:30:52we introduced different languages, the common condition involved English only, and all kinds of
0:30:59handsets, so we could stay on known and well-understood problems.
0:31:06And on the right you see some of the other features that came in anew.
0:31:10Cellular data was added; multilingual data
0:31:15came in in two thousand five.
0:32:20In two thousand six we had some microphone tests
0:31:27and then
0:31:28things only got more complicated in the most recent evaluations
0:31:32In terms of common conditions: in two thousand eight we had eight common conditions;
0:31:37in two thousand ten we had nine common conditions; in two thousand twelve, five common conditions.
0:31:47So in o eight, we contrasted English and bilingual, and contrasted interview
0:31:53with conversational telephone speech.
0:31:56In two thousand ten we were contrasting different telephone channels, interview and conversational speech, and high,
0:32:02low and normal vocal effort. In two thousand twelve we get interview tests without noise,
0:32:08with added noise, or repeated with added noise, and conversational phone tests collected in
0:32:14a noisy environment.
0:32:19Two thousand eight and ten involved interviews collected over multiple microphone channels.
0:32:29Two thousand ten, of course, added high and low vocal
0:32:33effort, and aging with the Greybeard corpus. Two thousand ten also introduced HASR. Two thousand
0:32:40twelve offers more target speakers, specified in advance.
0:32:49So, something about performance factors.
0:32:52I'll try not to say too much on this, but in terms of what we've
0:32:55looked at over the years, we've tried to look at demographic factors
0:33:00like sex: in general, though there have been exceptions, performance has been a bit
0:33:05better on male speakers than on female. Early on I would look at age, and George
0:33:11more recently has done a study of age in a recent evaluation; he may say something
0:33:16about that tomorrow. Education is a factor we haven't looked into too much. One very interesting thing
0:33:21in the early evaluations was to look at mean pitch
0:33:27in test segments and training, and
0:33:31to split the non-target trials between
0:33:34those of similar pitch and those where the pitch is not similar, not close. The difference... and even
0:33:41more interesting, to look at target trials where the mean pitch was or was not
0:33:46similar for the same person; that difference was serious.
0:33:52Speaking style:
0:33:56conversational telephone versus interview, particularly... a lot of data has been collected on that. Vocal
0:34:04effort, more recently; there are questions about
0:34:06defining vocal effort and how to collect it. Aging, with the Greybeard corpus... limited;
0:34:14collecting it over time is difficult. These are the intrinsic factors, related to the speaker.
0:34:22The other category, extrinsic factors, relates to the collection, by microphone or telephone channel. Telephone
0:34:28channel: landline, cellular; VOIP is something to work on. In earlier times, as I said, carbon versus
0:34:36electret telephone handset types; various types of microphones in the recent evaluations, and matched
0:34:43versus mismatched microphones. Placement of the microphone relative to the speaker, and
0:34:49background noise and room reverberation.
0:34:53Craig will talk about that tomorrow, as it was key in BEST.
0:34:59And finally, parametric factors. Duration of training and test, and also the number of training segments,
0:35:06the training sessions:
0:35:10evaluations that had eight sessions of training for telephone speech showed greatly improved performance. We
0:35:16kept carrying along, for many years, ten seconds as the short duration condition, but there's
0:35:20also the increase in duration; especially in twenty twelve, we're gonna have lots of sessions
0:35:26and durations
0:35:28in training, and I think perhaps the emphasis now, more than before, is on
0:35:33seeing the effects of multiple sessions and more data in evaluation. English, of course, has
0:35:42been the predominant language, but several of the evaluations included a variety of other languages,
0:35:48and one of the hopes is that performance will be as good in every language as in English.
0:35:53We have suspected that the reason overall performance has been better in English is
0:35:58the regularity and greater quantity of the data available in English. Cross-language
0:36:04trials are a separate challenge.
0:36:09the metrics
0:36:13Mention equal error rate: it is with us, it's part of our lives, in a
0:36:17sense. I've tried to discourage it, but... It is easy to understand
0:36:28in some ways,
0:36:30and it requires the least amount of data,
0:36:33but, you know, it doesn't deal with calibration issues, and basically the operating point of equal
0:36:38error rate is not the operating point of applications.
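For reference, a minimal sketch of how the equal error rate is usually estimated from pooled trial scores (a simple threshold sweep; real toolkits interpolate more carefully, and the function name is my own):

```python
def eer(target_scores, nontarget_scores):
    """Estimate the equal error rate: sweep the decision threshold over
    the pooled scores and return the average of P_miss and P_fa at the
    point where the two error rates are closest to each other."""
    best_gap, best_eer = None, None
    for t in sorted(target_scores + nontarget_scores):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2.0
    return best_eer
```

Note that nothing here depends on the scores being calibrated, which is exactly why EER says nothing about calibration.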
0:36:44A high target
0:36:49prior... prior probabilities of the target may be high or may be low, not really equal. The decision cost has been our
0:36:58mainstay, our bread and butter; we'll hear more about that. CLLR has been championed by
0:37:04Niko; we talked about it on
0:37:07Monday. And we've talked about just looking at the false alarm rate at a fixed miss rate,
0:37:12which we returned to in BEST. So, you all know about the decision cost function:
0:37:18it's the sum weighted by the specified parameters.
0:37:22First we normalize it by the cost of a system that has no intelligence but
0:37:28simply always decides yes, or always decides no, so the worst possible normalized score is one.
0:37:36So the parameters that were mentioned in ninety six: these were the parameters from ninety
0:37:40six to two thousand eight.
0:37:43In twenty ten we changed them for the main conditions, for the core and extended tests:
0:37:48we changed to
0:37:49cost of a miss one, false alarm one, and probability of target point zero zero one.
0:37:56That was the driving force, and a lot of people
0:38:00were upset; there was scepticism about whether you could
0:38:06create systems for that. I think the outcome has been relatively satisfactory; I think people
0:38:14feel that they developed good systems
0:38:17for this.
0:38:22Niko talked about
0:38:24CLLR; he noted that George suggested limiting CLLR to
0:38:31a false alarm rate range; it covers a broad range of operating points.
0:38:37A fixed miss rate, we said, has its roots in ninety six, but
0:38:40is used in twenty twelve. It's practical for applications; it may be viewed as the cost
0:38:45of listening to false alarms. For some conditions... conditions that are really good, you see, you can't get
0:38:53a ten percent miss rate; maybe for those a one percent miss rate is appropriate.
0:38:59Measuring progress:
0:39:01how do we do that? It's always difficult to assure test set comparability; if you're
0:39:07collecting data the same way as before, is it really an equal test set? Well, we encourage
0:39:11participants in the evaluations to run their prior systems, their old systems,
0:39:15on new data, which gives us some measure.
0:39:18But, even more, it's been a problem with changing technologies. You know, in ninety six landline
0:39:24phones predominated, and we dealt with carbon and electret.
0:39:28Now the world is largely cellular, and we need to explore VOIP, the new
0:39:34channel. So, the technology keeps changing, and with progress we want to make the test harder.
0:39:41We always want to add new evaluation conditions, new bells and whistles:
0:39:44more channel types, more speaking styles, languages... and the size of the evaluation data increases.
0:39:52In two thousand eleven, we explored externally added noise and reverb. The noise will continue
0:39:58this year. So, Doug attempted in two thousand
0:40:04eight
0:40:06to look at this, to explore fixed conditions over the course of years and look at
0:40:10the best system.
0:40:12And here is an updated version of his slide, showing, for more or less fixed conditions,
0:40:20on a logarithmic scale,
0:40:24the DCF, I believe,
0:40:26where things stood. These numbers go up to two thousand six.
0:40:32With added data on the right, two thousand eight showed
0:40:37some continued progress on various test conditions. Then in twenty ten
0:40:43we threw in the new measure. That really messes things up: numbers went up, but
0:40:49they're not directly comparable. This is the current
0:40:58version of our history slide tracking progress.
0:41:02So let's, you know, turn to the future:
0:41:08SRE twelve.
0:41:10The target speakers,
0:41:11for the most part, are specified in advance. They are speakers from recent past evaluations; I think
0:41:17it's something on the order of two thousand. That is the list of potential
0:41:23target speakers. So, sites can know about these targets; they have all the data; they can
0:41:29develop their systems to take advantage of that. All prior speech is available for training.
0:41:34There will be some new target speakers with training data provided at evaluation time; that's
0:41:39one check on the effect of providing the targets in advance. We also have
0:41:46test segments that will include non-target speakers.
0:41:53That is the big change for twenty twelve. Also, new interview speech will be provided,
0:41:59as was mentioned yesterday, in sixteen-bit
0:42:01linear PCM.
0:42:04Some of the test phone calls are going to be collected specifically in noisy environments.
0:42:11And moreover, we're going to have artificially
0:42:14added noise in some of the test segments. Another challenge
0:42:23for this community. But will this be an effectively easier task
0:42:29because we define the targets in advance?
0:42:33It makes the trials partially closed-set: you are
0:42:40allowed to know not only about the one target, but about the two thousand other targets.
0:42:45Will that make a difference? We have open workshops,
0:42:49workshops where the participants debate these things. Last December this got debated: how much
0:42:57will this
0:42:58change the systems? Will it make the problem too easy?
0:43:04We could have conditions where people are asked to assume
0:43:10that the segment's speaker is among the targets, making things fully closed-set, or to assume no
0:43:15information about targets other than that of the actual trial.
0:43:20Clearly nothing like this was done in past evaluations, so as people do this, their results will provide a basis
0:43:25for comparison. This is what remains to be
0:43:31investigated, to be seen, in SRE twelve. In terms of metrics, log-likelihood ratios
0:43:40are now required. And since we're doing that, no hard decisions are asked for.
0:43:48In terms of the primary metric,
0:43:53we could just use the
0:43:56DCF of
0:43:57twenty ten, but Niko pointed out that you're not really required to calibrate your
0:44:06log-likelihood ratios if you're only using them at one operating point.
0:44:12So therefore,
0:44:17to require calibration and stability, we're actually going to have two DCFs and take the average
0:44:24of them. Also, Cllr is an alternative. Cllr-M10, which Niko referred to,
0:44:32limits the Cllr to trials with
0:44:41high miss rates.
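[To make the Cllr metric concrete, here is a sketch of the standard definition over natural-log likelihood-ratio scores; this illustration is the editor's, not part of the talk's materials.]

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood ratio, in bits: the average of the mean
    target-trial penalty and the mean non-target-trial penalty.
    Scores are natural-log likelihood ratios; lower Cllr is better,
    and a system that outputs llr = 0 everywhere scores exactly 1.0."""
    c_tar = sum(math.log2(1.0 + math.exp(-s)) for s in target_llrs) / len(target_llrs)
    c_non = sum(math.log2(1.0 + math.exp(s)) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non)
```

[A restricted variant such as the Cllr-M10 mentioned above would compute the same sums over a subset of trials, e.g. the high-miss-rate region; the exact restriction is not spelled out in the talk.]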
0:44:47This is the formula for the DCF. We have three parameters, but we're really working with the one
0:44:51parameter, beta, and so
0:44:55the cost function is the simple average of DCF one and DCF two, with costs of
0:45:01one, where the target priors are either point zero one, as in twenty ten, or point
0:45:07zero zero one.
0:45:11That will be
0:45:13the official metric.
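[A sketch of that averaged metric, based on the editor's reading of the talk: unit miss and false-alarm costs, target priors 0.01 and 0.001, and hard decisions taken at the Bayes threshold implied by each prior. Treat the exact parameter values as assumptions.]

```python
import math

def norm_dcf(target_llrs, nontarget_llrs, p_target):
    """Normalized detection cost at prior p_target with unit miss and
    false-alarm costs: P_miss + beta * P_fa, where beta folds the cost
    parameters into one factor, and decisions come from thresholding the
    log-likelihood-ratio scores at the Bayes threshold log(beta)."""
    beta = (1.0 - p_target) / p_target
    threshold = math.log(beta)
    p_miss = sum(1 for s in target_llrs if s < threshold) / len(target_llrs)
    p_fa = sum(1 for s in nontarget_llrs if s >= threshold) / len(nontarget_llrs)
    return p_miss + beta * p_fa

def primary_metric(target_llrs, nontarget_llrs):
    """Simple average of the two normalized DCFs, at target priors
    0.01 (as in 2010) and 0.001."""
    return 0.5 * (norm_dcf(target_llrs, nontarget_llrs, 0.01) +
                  norm_dcf(target_llrs, nontarget_llrs, 0.001))
```

[Because two operating points are scored at once, a system has to produce well-calibrated log-likelihood ratios rather than tune a single hard threshold, which is the motivation given in the talk.]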
0:45:17And finally, what does the
0:45:20future hold?
0:45:22That, of course,
0:45:24none of us knows, but
0:45:27the outcome of twenty twelve will determine whether this whole
0:45:34idea of prespecified targets is
0:45:37an effective one that doesn't make the problem too easy. And
0:45:43artificially added noise will be included now; added noise and reverb may be part of
0:45:49the future.
0:45:50HASR will be repeated as HASR twelve. HASR ten had two tests, of fifteen trials
0:45:56or a hundred and fifty. HASR twelve will have either twenty or two hundred, and
0:46:04anyone, especially those with forensic interests, but really anyone interested in human-assisted
0:46:10systems, is invited to participate in HASR twelve. I would like to get more participation
0:46:15this year.
0:46:17Where is SRE heading?
0:46:21The answer is:
0:46:24it just gets bigger.
0:46:27Fifty or more participating sites. Data volume is now getting up to terabytes,
0:46:34though the evaluation data won't grow that much this year, in twenty twelve, because
0:46:41mostly prior data will be reused as training data. But, you know, the numbers of test
0:46:48segments are in the hundreds of thousands, and the number of trials
0:46:53is going to be in the millions, tens of millions, even hundreds of millions for the
0:46:58optional full sets of trials.
0:47:05Likely you'll see the schedule moving to an every-three-years one, but the details really need
0:47:11to be
0:47:14worked out a lot more.
0:47:19I don't know, but I think that's where
0:47:23I'll finish.
0:48:00segments for a speaker
0:48:03know about our speakers
0:56:31I didn't say they are a normal curve.
0:58:08Right, so LDC has an agreement with the sponsors who support the LRE and SRE evaluations
0:58:14that we will keep the most recent evaluation set blind, holding it back from publication
0:58:20and the general LDC catalog until the new data set has been created, and so
0:58:25part of the timing of the publication of those eval sets is
0:58:30the requirement to have a new blind set,
0:58:35which is the current evaluation set.
0:58:39We can raise that issue with the sponsors and get back to you with what they say.
0:58:56Right, so the SRE twelve eval set is just being finished now;
0:59:02as soon as that's finalised,
0:59:05SRE ten will be put into the queue for publication.
0:59:13It sort of rolls along in a cycle.
0:59:21You'll have to ask the sponsor about that; I can't speak to their motivation, only that
0:59:25they're contractually obligated to delay the publication, as discussed.
0:59:54Right, well, LDC is also balancing the needs of the consortium as a whole, and
0:59:59so we are staging publications in the catalog, balancing a number of factors. The
1:00:05speaker recognition and language recognition communities
1:00:07are among the communities that we support. I hear your concern; we can
1:00:11certainly raise this issue with the sponsors and see if there's anything
1:00:16we can provide. But at this point, I think this is the
1:00:20strategy that stands.
1:03:02this one