0:00:15well everyone today i'm going to talk about the effects of the new testing paradigm
0:00:19on the nist sre twelve
0:00:21this work was done in collaboration with many colleagues
0:00:24including alvin events john george and jack
0:00:30so before talking about what changed in sre twelve let's just remind ourselves of some things that
0:00:34stayed the same
0:00:36the task in sre twelve was text independent speaker detection
0:00:40by speaker detection i mean
0:00:43given some speech from a target speaker and some speech from a non target speaker
0:00:47determine whether the target speaker and the non-target speaker are the same person
0:00:51the evaluation consisted of a long series of trials where a target trial is when the target
0:00:57speaker and the non-target speaker were the same and a non-target trial is where the target speaker and non-target
0:01:02speaker were different so that much was the same
0:01:05something that changed in sre twelve was that joint knowledge of target speakers was allowed
0:01:09so in the past each trial had to be processed independently from one another
0:01:14but in sre twelve
0:01:16it was permissible to use knowledge of other target speakers for a trial and this gave
0:01:23rise to a distinction in the non-target speakers
0:01:26namely whether they were among the target speakers in which case they were considered
0:01:30a known non-target speaker
0:01:32or if they were not among the target speakers they were considered an
0:01:37unknown non-target speaker
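The distinction just described can be sketched in a few lines of Python. This is only an illustration, not code from the evaluation: the function name `label_trial`, the speaker identifiers, and the label strings are hypothetical, and the speaker of the test speech is called the test speaker here.

```python
def label_trial(target_speaker, test_speaker, target_speakers):
    """Label one trial given the full set of target speakers.

    All names and label strings are illustrative assumptions.
    """
    if test_speaker == target_speaker:
        return "target"
    # A non-target speaker who is enrolled as some other target is "known".
    if test_speaker in target_speakers:
        return "known non-target"
    # Otherwise the system has never been given speech from this speaker.
    return "unknown non-target"


targets = {"spk_a", "spk_b", "spk_c"}
print(label_trial("spk_a", "spk_a", targets))
print(label_trial("spk_a", "spk_b", targets))
print(label_trial("spk_a", "spk_z", targets))
```

The only extra input compared with an independent trial is the set of all target speakers, which is exactly the joint knowledge the new protocol permits.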
0:01:41also there was more and more varied training data for each target speaker and
0:01:45the majority of the target speakers in the evaluation
0:01:47had more than one segment for training
0:01:51and in those cases often the training data itself was varied
0:01:57consisting of for example interviews recorded over a microphone phone calls recorded over a microphone
0:02:02or phone calls recorded over telephone channel
0:02:09most of the speakers in sre twelve that is the target speakers were used
0:02:14in prior evaluations which is something that was very different and they were identified in
0:02:18advance and all their speech from these prior evaluations was made available
0:02:25of those eighteen hundred additional new data was collected from three hundred and twenty
0:02:30speakers roughly and roughly seventy of them were not present in the prior evaluations and
0:02:35those speakers had a single phone conversation released at the time of the evaluation
0:02:43so given these changes one question that may come up is why
0:02:47and there were several reasons among them was to explore methods utilizing large quantities of
0:02:53training data
0:02:55to allow participants an extended period of time to work on modeling techniques
0:03:00to determine the benefit of allowing joint knowledge of target speakers in particular the benefit
0:03:05for performance and also to increase the efficiency of the data collection
0:03:13in terms of the data the target speaker training data broke down into two cases if
0:03:17released in advance of the evaluation the target speaker training data consisted of prior evaluation
0:03:24data collected as part of the ldc mixer one three six and if released
0:03:29at the start of the evaluation
0:03:31the training data was a single phone conversation recorded as part of mixer seven
0:03:37for the test segments most of them came from a newly collected corpus country next
0:03:41which were phone calls recorded over a telephone channel
0:03:45from prior mixture speakers
0:03:48and there were also a smaller number of phone conversations from the mixer seven corpus
0:03:54and these are phone conversations recorded either over a telephone channel or microphone channel
0:04:02so there were many different types of trials included in the evaluation for example there
0:04:09were trials where the speech had
0:04:14noise added to it or it was recorded in a naturally noisy environment
0:04:20but among the trials we wanted to emphasise some subsets of particular interest to us so
0:04:26these we call common conditions there were five common conditions in the evaluation for today's presentation
0:04:32we're just going to focus on two
0:04:34and those are interview speech in test without added noise and telephone channel speech in
0:04:40test without added noise
0:04:45so here we see in very round numbers the number of trials for each of
0:04:48these common conditions
0:04:50for common condition one again interview test with no added noise there were roughly three thousand
0:04:57target trials forty six thousand non-target trials from known non-target speakers sdc two thousand trials
0:05:04from unknown non-target speakers
0:05:08in the core test which was required of all participants and the optional test was essentially
0:05:14the same but just with a larger number of trials
0:05:21and you see the numbers there likewise
0:05:34so let's look at some results
0:05:38so here we see common condition two
0:05:41which is telephone channel speech in test without added noise
0:05:45and this is the results from one leading system the others are similar
0:05:51and as might be expected better performance was observed for known speakers that's the
0:05:56red line
0:05:58compared to the unknown speakers that's the black line
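Each point on curves like these comes from counting errors at a decision threshold, done separately for each kind of non-target trial. The sketch below is a minimal illustration under assumed conventions: higher scores mean "same speaker", and the function name `error_rates`, the label strings, and the toy scores are invented for the example and are not evaluation data.

```python
def error_rates(trials, threshold):
    """Return the error rate per trial label at one threshold.

    trials is a list of (score, label) pairs. For "target" trials the
    error is a miss (score below threshold); for any other label the
    error is a false alarm (score at or above threshold).
    """
    counts = {}
    errors = {}
    for score, label in trials:
        accepted = score >= threshold
        is_error = (not accepted) if label == "target" else accepted
        counts[label] = counts.get(label, 0) + 1
        errors[label] = errors.get(label, 0) + int(is_error)
    return {label: errors[label] / counts[label] for label in counts}


# Toy scores, made up purely to exercise the function.
toy_trials = [
    (2.0, "target"), (-1.0, "target"),
    (0.5, "known non-target"), (-2.0, "known non-target"),
    (-1.0, "known non-target"), (-0.2, "known non-target"),
    (1.5, "unknown non-target"), (1.0, "unknown non-target"),
    (-0.5, "unknown non-target"), (-3.0, "unknown non-target"),
]
print(error_rates(toy_trials, threshold=0.0))
```

Sweeping the threshold and plotting miss rate against each false alarm rate would give one curve for known and one for unknown non-target speakers.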
0:06:04one thing to note is that known speakers had multiple telephone conversations and sometimes even
0:06:11as their training data
0:06:14so here we see the same system
0:06:16but on common condition one which is interview speech in test without added noise
0:06:22but unlike the last slide we saw
0:06:25there's not a lot of difference between the two curves
0:06:29and that was initially puzzling we wanted to know why and as it turns out
0:06:36the known speakers for this common condition were only known from a single
0:06:41telephone channel recording
0:06:43so whereas in the previous slide the known speakers had a large amount of training
0:06:48data by which to know them here the speakers were only known by a
0:06:55single telephone channel
0:06:57so in addition to having a small amount of data the trials were cross channel
0:07:05so in addition to this concept of known and unknown non-target speakers
0:07:10there are also known and unknown systems
0:07:12and what we mean here is that known systems presume that all of the non-target
0:07:17trials came from known non-target speakers
0:07:21and unknown systems presume that all the non-target trials were spoken by unknown non-target speakers
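One way to picture the two kinds of system is through how each models the non-target hypothesis. The sketch below is an assumption made for illustration, not a description of any submitted system: it supposes a system already has, for one test segment, a log-likelihood ratio against an "unknown speaker" model for every target speaker, and mixes known and unknown non-target hypotheses with a presumed prior `p_known`. The function name `compound_llr` and the parameter names are hypothetical.

```python
import math


def compound_llr(llrs, i, p_known):
    """Log-likelihood ratio for target i with a mixed non-target model.

    llrs[j] is log p(x | target j) - log p(x | unknown speaker).
    p_known is the presumed probability that a non-target speaker is
    one of the other targets: 1.0 for a "known" system, 0.0 for an
    "unknown" system. Needs at least two target speakers.
    """
    others = [math.exp(s) for j, s in enumerate(llrs) if j != i]
    # Average likelihood ratio of the other (known) target speakers.
    known_part = sum(others) / len(others)
    denominator = p_known * known_part + (1.0 - p_known)
    return llrs[i] - math.log(denominator)


scores = [2.0, 0.0, 0.0]
print(compound_llr(scores, 1, p_known=0.0))  # unknown system: score unchanged
print(compound_llr(scores, 1, p_known=1.0))  # known system: pushed down
```

With `p_known` set to zero the other targets are ignored and each trial is scored independently, as in earlier evaluations; with `p_known` set to one, a strong match to a different target speaker lowers the score for this one.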
0:07:26so let's look at that
0:07:28here we see just a regular system on the extended trials for common condition two
0:07:36and we see the thin dotted lines
0:07:40i'm not sure if we can actually see that especially in the back
0:07:43but those are ninety percent confidence bounds which suggest that there was a significant difference
0:07:49in performance
0:07:51between the known non-targets and the unknown non-targets again red is the colour for
0:07:57the known non-targets black for the unknown non-targets
0:08:04so here we see a known system again that's where the system always presumes that
0:08:09the non-target speaker was known and as might be expected
0:08:14there is little difference observed between the two curves
0:08:21here we see an unknown system again that's where the system presumes that all of
0:08:26the non-target speakers are unknown which is to say they are not among the target speakers
0:08:32all of these are from the same site
0:08:36and here actually compared to two slides back
0:08:40the performance difference is enhanced
0:08:48in summary sre twelve was an experiment with a new protocol in how speakers were made
0:08:53known to the systems
0:08:54for conversational telephone speech segments performance was improved when speakers were known to the system
0:09:01for interview test segments such improvement was not observed that was just due to the setup
0:09:05of the evaluation
0:09:07it was not observable that's not to say that it wouldn't be observed if the evaluation allowed
0:09:13there's actually a lot more information that was covered in the paper and other papers
0:09:22covering things that we learned from the evaluation so let me encourage you to
0:09:27look at those or to contact us at that address
0:09:34in addition
0:09:36considering future evaluations there is a question of whether allowing joint knowledge of target
0:09:43speakers is a good idea going forward
0:09:46one thing to note is that
0:09:48joint knowledge of target speakers makes results increasingly dependent on the target speakers selected and reduces
0:09:54trial independence
0:09:56so this makes estimating
0:10:01error rates more difficult
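To see why trial dependence matters, recall that simple confidence bounds on an error rate usually treat every trial as an independent draw. A minimal sketch, assuming a Wilson score interval and z = 1.645 for a two-sided 90 percent bound; the talk does not say how its own bounds were computed, and the function name `wilson_interval` and the counts in the example are made up.

```python
import math


def wilson_interval(errors, trials, z=1.645):
    """Wilson score interval for an error rate.

    Assumes the trials are independent with a common error probability,
    which is exactly the assumption that joint knowledge of target
    speakers weakens.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


low, high = wilson_interval(errors=50, trials=1000)
print(low, high)
```

If many trials share the same few speakers, the effective number of independent observations is smaller than `trials`, so an interval computed this way would likely be too narrow.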
0:10:03also something to consider is whether to continue having multi session and multichannel
0:10:10training for the target speakers
0:10:15so nist will resume the series beyond the i-vector challenge in the near future
0:10:24some interest has
0:10:27been expressed within the community regarding performing testing in acoustic environments different
0:10:35from those of prior evaluations joe made mention that
0:10:40there's some utility in that
0:10:43also one thing to note is that
0:10:48in order to be able to conduct these types of evaluations it is necessary to
0:10:52collect realistic and challenging speech data
0:10:56which is both expensive and time-consuming
0:10:59but in order to do that and have an even better evaluation lessons learned from sre
0:11:07will be taken into account and considered in the next evaluation so i probably have
0:11:11lots of time for questions
0:11:15so thank you
0:11:30so looking at your
0:11:32c one and c two the common condition one and two can you talk to
0:11:38the number of actual speakers that were involved in c one versus c two
0:11:43not trials but speakers right
0:11:46the short answer is
0:11:49yes but not now 'cause i don't have that information handy but we did look
0:11:52at that i can't recall precisely
0:11:55well one of the things i did not yet another but i recall the c
0:11:59one had on the order of about fifty forty three speakers involved only
0:12:04i think when
0:12:06comparing those two about the effects of the known and unknown there's a couple things changing
0:12:11simultaneously the microphone and the telephone yes and the pool is much smaller 'cause
0:12:16i think it was only drawn from
0:12:20mixer seven
0:12:22so that's actually a really excellent point that we tried to emphasise during the evaluation workshop
0:12:26but i neglected to mention here
0:12:28is that the common conditions really were not comparable at all
0:12:34in this evaluation so the speakers were different and the
0:12:43basically all the conditions changed so thank you for noting that it's
0:12:47inappropriate to make those comparisons
0:12:49across common conditions within a common condition
0:12:52it was interesting to look at some of the sub
0:12:57some factor performance
0:13:08could you just comment on whether you're going to be following up on hasr
0:13:11as part of the nist sre process
0:13:15so this is actually something we've been looking into
0:13:19pretty extensively the short answer is it remains to be determined but the long answer is
0:13:24this is something we're seeking to do
0:13:32make it i've a practise it
0:13:36are criteria you said at the end of the presentation that there'll be a focus
0:13:39on multichannel enrollment and training conditions
0:13:43once the question
0:13:44whether question is like cyanide in the last sre twelve you present at the workshop
0:13:50i think those any one thing that the
0:13:52my kind enrollment or telephone and enrollment it seems like focus wasn't neto maybe that
0:13:58just wasn't nothing just this time ramp up to still is a big challenge awfully
0:14:02so i just wonder if that was still going to be a factor in continuing evaluations
0:14:08well that's a question and one of the things that we're very eager for is
0:14:12to get feedback from one
0:14:15from the community one thing that is
0:14:19time consuming and
0:14:21if not expensive the difficulty is setting up the evaluation even with the data
0:14:25and so
0:14:28we're much more likely to include that again if people will actually participate
0:14:34i've also got a second question i'm not sure if you're
0:14:37aware of the dnn i-vector paradigm that's come out as a framework for sre twelve
0:14:44very impressive performance in particular on telephone conditions as you might know with the dnn you
0:14:49need a lot of data for training and it's very difficult to get that level
0:14:53one thing i'm afraid of is
0:14:56teams that might not have the infrastructure to do such a thing
0:15:00how would they compare with the other teams that do have the infrastructure in future
0:15:04evaluations is there something that can be done about that such as the i-vector
0:15:08challenge where the i-vectors are presented
0:15:11just wondering if you've got thoughts on that
0:15:15in short no but that's a good question and
0:15:20something that we're
0:15:23perfectly willing to explore
0:15:31i just want to comment on one of your conclusion points i'm
0:15:35be happy to know but
0:15:37you have been mine with this point source of course to extend the v nist
0:15:41databases with new challenging conditions
0:15:45but i think it's also interesting to us
0:15:48increase the query actual conditions we have a lot of for to do on the
0:15:53act recognition by increasing cell use given number of speakers
0:15:57maybe buying one out of menu chewed
0:16:00and by adding
0:16:03in of the data per speaker of course it will
0:16:08for us to the reviewers over the evaluation protocol and look at the results per
0:16:14speaker like
0:16:16jodie the past the us also look also at the difference is that if you
0:16:24select randomly one thousand test
0:16:28in a lot that the bayes to do you have some performance differences if you
0:16:34choose one set compared to the other sets and a lot of things like
0:16:39i think