| 0:00:14 | this work was done in collaboration with a very large number of colleagues and |
|---|
| 0:00:20 | everyone did a large part of the work so |
|---|
| 0:00:24 | i'd like to thank |
|---|
| 0:00:27 | désiré george daniel jack tomi alvin alan mark and doug |
|---|
| 0:00:37 | so the goal of the challenge was to support and encourage development of new methods | 
|---|
| 0:00:42 | for speaker detection utilizing i-vectors and the intent was to explore new ideas in machine |
|---|
| 0:00:49 | learning for use in speaker recognition |
|---|
| 0:00:53 | and to try to make the field more accessible for people outside of the audio processing |
|---|
| 0:00:56 | community and to improve the performance of the technology | 
|---|
| 0:01:05 | the challenge format for people who don't know |
|---|
| 0:01:08 | was to use i-vectors so instead of distributing audio it was to distribute the |
|---|
| 0:01:12 | i-vectors themselves and it was all hosted on a web platform so it was entirely online |
|---|
| 0:01:18 | the registration the system submission and receiving results all online | 
|---|
| 0:01:26 | the reason for using i-vectors in the web platform was to attempt to expand the | 
|---|
| 0:01:32 | number and types of participants including ones from the ml community | 
|---|
| 0:01:36 | and to allow iterative submissions with a fast turnaround in order to support research progress |
|---|
| 0:01:43 | during the actual evaluation | 
|---|
| 0:01:48 | another thing that was different from what people may be accustomed to with the |
|---|
| 0:01:51 | regular sre was that a large development set of unlabeled i-vectors was distributed to be |
|---|
| 0:02:00 | used for dev data the intent there was to encourage new creative approaches to modelling |
|---|
| 0:02:06 | and in particular | 
|---|
| 0:02:08 | the use of clustering to improve performance | 
|---|
| 0:02:12 | in addition to these things one thing we were hoping to do was to set | 
|---|
| 0:02:16 | a precedent or at least have a proof of concept for future evaluations where there |
|---|
| 0:02:22 | can be web based registration data distribution potentially and results submission trying to make this | 
|---|
| 0:02:28 | more efficient and more user friendly | 
|---|
| 0:02:32 | for the community |
|---|
| 0:02:36 | so | 
|---|
| 0:02:38 | the objectives for the data selection were to include multiple training sessions for each target |
|---|
| 0:02:43 | speaker in the main evaluation test in recent sres |
|---|
| 0:02:47 | an optional test has involved multiple training sessions but |
|---|
| 0:02:52 | in this challenge we wanted to include that for everyone to do as the main |
|---|
| 0:02:59 | focus | 
|---|
| 0:03:01 | also to include same handset target trials and cross sex nontarget trials both of which are |
|---|
| 0:03:08 | unusual | 
|---|
| 0:03:10 | for the regular sre | 
|---|
| 0:03:12 | also something different was taking i-vectors from a log normal distribution as opposed to | 
|---|
| 0:03:20 | some discrete uniform | 
|---|
| 0:03:25 | durations the reason for this was it feels more realistic and it's a challenge that |
|---|
| 0:03:32 | people seemed eager to address | 
|---|
| 0:03:34 | and also well just varying the duration allows us to do | 
|---|
| 0:03:39 | post evaluation analysis | 
|---|
| 0:03:43 | so the task is speaker detection which hopefully everybody here knows by the third day |
|---|
| 0:03:48 | of odyssey |
|---|
| 0:03:49 | and the system was evaluated over a set of trials where each trial compared |
|---|
| 0:03:55 | a target speaker model in this case this was a set of five i-vectors and |
|---|
| 0:04:00 | a test speech segment comprised of a single i-vector |
|---|
| 0:04:05 | the system determines whether or not the speaker in the test segment is the target |
|---|
| 0:04:08 | speaker by outputting a single real number |
|---|
| 0:04:12 | and no decision was necessary | 
|---|
| 0:04:15 | the trial outputs are then compared to ground truth to compute a performance measure which |
|---|
| 0:04:19 | for the i-vector challenge was a dcf |
|---|
| 0:04:24 | hopefully people know what target trials and non-target trials and misses and false alarms are | 
|---|
| 0:04:29 | does anyone not know that |
|---|
| 0:04:33 | okay if not come see me afterwards the measure was dcf which is essentially just |
|---|
| 0:04:40 | the miss rate plus one hundred times the false alarm rate |
|---|
| 0:04:46 | and the official overall measure was the mindcf | 
|---|
| 0:04:54 | seen here | 
|---|
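The measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dcf` and `min_dcf` are hypothetical, a trial is accepted when its score is at or above the threshold, and the minimum is taken over every observed score plus one threshold that rejects everything.

```python
def dcf(target_scores, nontarget_scores, threshold, fa_weight=100.0):
    """Miss rate plus fa_weight times the false alarm rate at one threshold."""
    # A target trial scored below the threshold is a miss.
    misses = sum(1 for s in target_scores if s < threshold)
    # A non-target trial scored at or above the threshold is a false alarm.
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    p_miss = misses / len(target_scores)
    p_fa = false_alarms / len(nontarget_scores)
    return p_miss + fa_weight * p_fa


def min_dcf(target_scores, nontarget_scores, fa_weight=100.0):
    """Smallest dcf over all candidate thresholds (assumed sweep)."""
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    candidates.append(float("inf"))  # threshold that rejects every trial
    return min(
        dcf(target_scores, nontarget_scores, t, fa_weight) for t in candidates
    )
```

Because the false alarm rate is weighted one hundred times more heavily than the miss rate, the minimizing threshold tends to sit where almost no non-target trials are accepted.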
| 0:04:57 | so the challenge i-vectors were produced with a system developed jointly between johns hopkins |
|---|
| 0:05:02 | and mit lincoln labs and it uses the standard mfccs and deltas as the acoustic features |
|---|
| 0:05:10 | and use the gmm train set | 
|---|
| 0:05:17 | the source data were the ldc mixer corpora in particular mixers one through seven as well |
|---|
| 0:05:22 | as remix and included around sixty thousand telephone call sides from about six thousand speakers |
|---|
| 0:05:28 | and the durations of these calls were up to five minutes drawn from a log |
|---|
| 0:05:34 | normal distribution | 
|---|
| 0:05:35 | with a mean of nearly forty seconds |
|---|
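As a rough illustration of drawing durations this way, the sketch below samples from a log normal distribution with a mean of forty seconds and a five minute cap. The shape parameter `sigma=0.6`, the redraw-on-overflow truncation, and the function names are assumptions for illustration only; the talk does not give the actual parameters.

```python
import math
import random


def lognormal_params(mean_seconds, sigma):
    """Return (mu, sigma) so the untruncated log normal has the given mean."""
    # For a log normal, mean = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean_seconds) - 0.5 * sigma ** 2
    return mu, sigma


def sample_duration(rng, mu, sigma, max_seconds=300.0):
    """Draw one duration in seconds, redrawing anything over the cap."""
    while True:
        duration = rng.lognormvariate(mu, sigma)
        if duration <= max_seconds:
            return duration


# sigma is an assumed value, not taken from the challenge.
mu, sigma = lognormal_params(40.0, 0.6)
rng = random.Random(1)
durations = [sample_duration(rng, mu, sigma) for _ in range(1000)]
```

Redrawing values above the cap lowers the mean slightly relative to the untruncated distribution, which is one reason the parameters here should be read as placeholders.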
| 0:05:39 | for each selected segment participants were provided with the six hundred dimensional i-vector as well as |
|---|
| 0:05:45 | the duration of the speech from which the i-vector was extracted |
|---|
| 0:05:52 | so this is the data and then the data was partitioned into a development set | 
|---|
| 0:05:56 | and an enrollment and test set |
|---|
| 0:05:59 | for the development partition the calls were from speakers without test data |
|---|
| 0:06:04 | and consisted of around thirty six thousand telephone call sides from around five thousand speakers |
|---|
| 0:06:10 | and as i said earlier it was unlabeled so |
|---|
| 0:06:15 | no speaker labels were given with the development partition |
|---|
| 0:06:21 | for the enrollment and test | 
|---|
| 0:06:24 | calls were from speakers with at least five calls from different phone numbers and at | 
|---|
| 0:06:28 | least eight calls from a single phone number and consisted of about thirteen hundred target |
|---|
| 0:06:34 | speakers |
|---|
| 0:06:36 | i'm sorry target models |
|---|
| 0:06:38 | almost ten thousand test i-vectors and the target trials were limited to ten same and ten |
|---|
| 0:06:44 | different phone number calls per speaker and non-target trials came from other target speakers as |
|---|
| 0:06:49 | well as five hundred speakers who were not |
|---|
| 0:06:53 | other target speakers two hundred fifty males and | 
|---|
| 0:06:56 | two hundred fifty females |
|---|
| 0:07:00 | the trials consisted of all possible pairs of a target speaker and a test i-vector |
|---|
| 0:07:07 | about twelve and a half million trials |
|---|
| 0:07:10 | and included cross sex nontarget trials as well as same number |
|---|
| 0:07:14 | target trials | 
|---|
| 0:07:16 | the trials were divided into two randomly selected subsets and someone asked about this the |
|---|
| 0:07:21 | speakers did overlap between the progress subset and the evaluation subset | 
|---|
| 0:07:29 | forty percent was used for a progress subset which | 
|---|
| 0:07:36 | was what was used to monitor progress and people familiar or maybe not familiar i |
|---|
| 0:07:41 | should say with the |
|---|
| 0:07:42 | i-vector challenge there was |
|---|
| 0:07:46 | a progress board where people could see how they were doing and how other people |
|---|
| 0:07:52 | were doing |
|---|
| 0:07:53 | and that was actually | 
|---|
| 0:07:59 | updated using the progress set and sixty percent of the data was held out | 
|---|
| 0:08:04 | until the end of the evaluation period |
|---|
| 0:08:07 | and then the system submissions were scored for the official results | 
|---|
| 0:08:13 | using this remaining sixty percent | 
|---|
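A minimal sketch of this kind of split, assuming trials are identified by simple ids and shuffled uniformly at random. The function name `split_trials`, the seed, and the use of Python's `random` module are illustrative assumptions, not the organizers' procedure. Because the split is by trial and not by speaker, the same speaker can appear in both subsets, as noted in the talk.

```python
import random


def split_trials(trial_ids, progress_fraction=0.4, seed=0):
    """Randomly divide trials into a progress subset and a held-out subset."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(trial_ids)
    rng.shuffle(shuffled)
    cut = round(progress_fraction * len(shuffled))
    progress = shuffled[:cut]      # scored during the evaluation period
    evaluation = shuffled[cut:]    # scored only for the official results
    return progress, evaluation


progress, evaluation = split_trials(range(1000))
```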
| 0:08:18 | so some rules for the evaluation system output for each trial could be based only |
|---|
| 0:08:24 | on the trials model and test i-vectors as well as the durations provided and the | 
|---|
| 0:08:29 | provided development data | 
|---|
| 0:08:32 | normalization over multiple test segments or target speakers was not allowed |
|---|
| 0:08:36 | use of evaluation data for nontarget speaker modeling was not allowed |
|---|
| 0:08:41 | and training system parameters using data not provided as part of the challenge was also |
|---|
| 0:08:46 | not allowed one two and three these are |
|---|
| 0:08:51 | pretty typical for the nist evals four is actually new |
|---|
| 0:08:55 | and the intent was to remove data engineering and also encourage participation from sites |
|---|
| 0:09:00 | that don't have a lot of their own speech data |
|---|
| 0:09:05 | so in terms of participation there were about three hundred registrants from about fifty countries |
|---|
| 0:09:11 | a hundred and forty of the registrants from a hundred and five unique sites |
|---|
| 0:09:15 | made at least one valid submission so there were some |
|---|
| 0:09:19 | number of people who registered but weren't able to submit a system |
|---|
| 0:09:25 | the number of submissions actually exceeded eight thousand if we compare these numbers to sre twelve |
|---|
| 0:09:29 | we do see a really large increase in participation which we're excited to see |
|---|
| 0:09:38 | in addition to receiving data | 
|---|
| 0:09:40 | a baseline system was distributed with the evaluation | 
|---|
| 0:09:47 | it used a variant of cosine scoring here we see the five steps estimate a global mean |
|---|
| 0:09:53 | and covariance on the unlabeled data |
|---|
| 0:09:56 | use that mean and covariance to center and whiten the i-vectors project them onto |
|---|
| 0:10:01 | a unit sphere |
|---|
| 0:10:02 | and then for each model |
|---|
| 0:10:03 | average its five i-vectors and project those onto the unit sphere and then compute |
|---|
| 0:10:08 | the inner product |
|---|
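The five steps above can be sketched with NumPy as follows. This is not the distributed baseline code: the function names are hypothetical, the whitening transform is built from an eigendecomposition of the sample covariance (one of several equivalent choices), and the eigenvalue floor is an assumption added to guard against a rank-deficient development set.

```python
import numpy as np


def fit_whitener(dev_ivectors):
    """Step 1: estimate a global mean and whitening transform from unlabeled data."""
    mean = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 1e-10)   # assumed floor for numerical safety
    # Scale each eigenvector column so that whitener.T @ cov @ whitener = I.
    whitener = eigvecs / np.sqrt(eigvals)
    return mean, whitener


def unit_length(x):
    """Project vectors onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def prepare(ivectors, mean, whitener):
    """Steps 2 and 3: center, whiten, then length-normalize."""
    return unit_length((ivectors - mean) @ whitener)


def score_trial(model_ivectors, test_ivector, mean, whitener):
    """Steps 4 and 5: average the model i-vectors, renormalize, take the inner product."""
    model = unit_length(prepare(model_ivectors, mean, whitener).mean(axis=0))
    test = prepare(test_ivector, mean, whitener)
    return float(model @ test)
```

Since both sides of the inner product have unit length, the score is a cosine similarity in the whitened space and always lies between minus one and one.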
| 0:10:12 | one thing to note is because the dev data was unlabeled |
|---|
| 0:10:16 | wccn and lda were |
|---|
| 0:10:18 | not possible to use | 
|---|
| 0:10:21 | in addition to that there was an oracle system that was not provided but kept |
|---|
| 0:10:26 | at jhu |
|---|
| 0:10:28 | which had access to the development data speaker labels |
|---|
| 0:10:35 | and the |
|---|
| 0:10:37 | system was gender dependent with a four hundred dimensional speaker space all of the i-vectors |
|---|
| 0:10:44 | for each model were length normalized and then averaged |
|---|
| 0:10:49 | and it discarded i-vectors with duration less than thirty seconds which actually reduced the development |
|---|
| 0:10:55 | set | 
|---|
| 0:10:56 | quite a bit | 
|---|
| 0:11:00 | and here we see our first result so | 
|---|
| 0:11:04 | the red line is the oracle system |
|---|
| 0:11:08 | and the blue line is the baseline system the solid line is on the evaluation |
|---|
| 0:11:13 | set of trials so the sixty percent that were |
|---|
| 0:11:17 | held out and the dotted line is on the progress set |
|---|
| 0:11:22 | so basically the gap between these lines indicates the |
|---|
| 0:11:26 | potential value of having speaker labels |
|---|
| 0:11:30 | so the hope was to be able to use clustering techniques on the development set |
|---|
| 0:11:35 | to close this gap |
|---|
| 0:11:42 | here we see | 
|---|
| 0:11:44 | results | 
|---|
| 0:11:45 | so here |
|---|
| 0:11:48 | is the mindcf on the oracle system and on the baseline system the blue line |
|---|
| 0:11:55 | is the progress set |
|---|
| 0:11:56 | and the red line is the eval set |
|---|
| 0:11:58 | and here we see the |
|---|
| 0:12:00 | top ten performing systems and how they did on the progress set and on the |
|---|
| 0:12:06 | eval set |
|---|
| 0:12:08 | performance on the eval set was consistently better than progress | 
|---|
| 0:12:12 | not exactly sure why other than some random variation |
|---|
| 0:12:16 | and seventy five percent of participants submitted a system that outperformed the baseline which we were really |
|---|
| 0:12:21 | pleased to see as well |
|---|
| 0:12:23 | how are we doing on time |
|---|
| 0:12:26 | so okay great actually course so | 
|---|
| 0:12:31 | oops let's skip this |
|---|
| 0:12:34 | here we see progress over time |
|---|
| 0:12:37 | the green line is on the eval set |
|---|
| 0:12:42 | and the blue line is on the progress set |
|---|
| 0:12:44 | and the red line is on the progress set too so basically the green line |
|---|
| 0:12:47 | is the very best score observed to date |
|---|
| 0:12:51 | same with the blue line |
|---|
| 0:12:53 | and then the red line is for the system that ended up |
|---|
| 0:12:57 | with the top performance | 
|---|
| 0:13:00 | at the end so we see it's | 
|---|
| 0:13:05 | history of the performance over time | 
|---|
| 0:13:09 | a couple of things that we noted one was that performance levelled off after about six |
|---|
| 0:13:12 | weeks we ran this from december |
|---|
| 0:13:15 | through april |
|---|
| 0:13:17 | and basically after six weeks not much further progress was observed |
|---|
| 0:13:24 | and also interesting to note was that the leading system | 
|---|
| 0:13:28 | did not lead basically from december till february | 
|---|
| 0:13:32 | but by mid period had |
|---|
| 0:13:37 | taken the lead to stay there |
|---|
| 0:13:42 | here we see performance by gender on the left | 
|---|
| 0:13:47 | of each of these is the leading system | 
|---|
| 0:13:50 | and on the right is the baseline system | 
|---|
| 0:13:54 | one thing kind of interesting to note | 
|---|
| 0:13:56 | is the leading system did worse | 
|---|
| 0:13:58 | on same sex trials than on male only and female only which might |
|---|
| 0:14:04 | be unexpected but | 
|---|
| 0:14:07 | i think an explanation for this is that there were calibration issues | 
|---|
| 0:14:14 | here we see performance by same and different phone number |
|---|
| 0:14:18 | here the blue is the baseline |
|---|
| 0:14:21 | on the left is same number on the right is different number |
|---|
| 0:14:24 | and here i guess like with the gender |
|---|
| 0:14:27 | we see limited degradation in performance due to the change in phone number for the |
|---|
| 0:14:30 | leading system | 
|---|
| 0:14:32 | so this was very close | 
|---|
| 0:14:36 | even compared to the baseline which was fairly close | 
|---|
| 0:14:42 | so there's some additional information available you can see the odyssey paper for more results | 
|---|
| 0:14:49 | for example more information about the progress over time and gender effects as well as same |
|---|
| 0:14:55 | and different phone numbers |
|---|
| 0:14:57 | we also have an interspeech paper that does some analysis of participation |
|---|
| 0:15:01 | and gives some of these same results but on the progress set the odyssey |
|---|
| 0:15:04 | paper focuses entirely on the eval set |
|---|
| 0:15:08 | and there's lots of work to do |
|---|
| 0:15:11 | so we'll have future papers on duration age and other results so you can see |
|---|
| 0:15:15 | those things for additional information and please feel free to contact us |
|---|
| 0:15:20 | so some conclusions we thought that the process worked which was very exciting for us | 
|---|
| 0:15:26 | the website was brought up and stayed up which was good | 
|---|
| 0:15:33 | participation exceeded that of prior sres |
|---|
| 0:15:37 | which was one of the goals |
|---|
| 0:15:39 | and many sites significantly improved on the baseline system |
|---|
| 0:15:44 | further investigation and feedback will be needed | 
|---|
| 0:15:47 | in order to determine the extent to which the new participation was from outside of |
|---|
| 0:15:52 | the audio processing community | 
|---|
| 0:15:55 | for people who signed up we |
|---|
| 0:15:59 | eventually asked if they were from the audio processing community but we didn't think to |
|---|
| 0:16:04 | do that during the initial sign up so in all other cases we don't know |
|---|
| 0:16:10 | whether the additional participation came from outside the audio processing community or |
|---|
| 0:16:16 | not | 
|---|
| 0:16:18 | thousands of submissions provide data for further analysis which we look forward to doing |
|---|
| 0:16:25 | and | 
|---|
| 0:16:29 | these things include things like clustering of unlabeled data and gender differences across and within | 
|---|
| 0:16:34 | trials effects of handsets role of duration | 
|---|
| 0:16:41 | and speaking of future work | 
|---|
| 0:16:44 | we plan to enhance the online platform for example we'd like to put analysis tools |
|---|
| 0:16:50 | on the platform for participants to use | 
|---|
| 0:16:54 | we expect to | 
|---|
| 0:16:57 | offer further online challenges | 
|---|
| 0:17:00 | in part because they're more readily organized and also because |
|---|
| 0:17:05 | it's possible to efficiently reuse test data |
|---|
| 0:17:09 | but we expect that these results will affect full fledged evaluations as well or |
|---|
| 0:17:15 | the typical sres |
|---|
| 0:17:17 | as well for example we'd like to |
|---|
| 0:17:21 | have increasingly web based and user friendly procedures for |
|---|
| 0:17:25 | registration and for data distribution |
|---|
| 0:17:28 | and it's possible that we'll use separate evaluation datasets |
|---|
| 0:17:38 | one for iterated use to track performance and another held out with limited exposure |
|---|
| 0:17:43 | we've seen this used in |
|---|
| 0:17:46 | past |
|---|
| 0:17:48 | nist evaluations and it may |
|---|
| 0:17:51 | see renewed use in sres |
|---|
| 0:17:55 | thank you very much | 
|---|
| 0:18:07 | hi craig can you go back to slide twenty one okay |
|---|
| 0:18:12 | i'm wondering with those this seems just the weights is leading system | 
|---|
| 0:18:18 | is the leading system in those two conditions the same system |
|---|
| 0:18:22 | sure that is the same system in |
|---|
| 0:18:26 | in both | 
|---|
| 0:18:33 | i used in a reasonable idea oracle was and different directed and what you distribute | 
|---|
| 0:18:38 | which one six hundred twenty four hundred one so why you keep the same i-vector | 
|---|
| 0:18:45 | the two distributions i lincoln you may be addressed | 
|---|
| 0:18:50 | like | 
|---|
| 0:19:02 | craig in your final slide you mentioned | 
|---|
| 0:19:05 | in the last point a data set for iterated use |
|---|
| 0:19:12 | are you thinking of something similar to what you have now the | 
|---|
| 0:19:19 | the point i'm getting at is | 
|---|
| 0:19:21 | if you want to train for example calibration or fusion | 
|---|
| 0:19:26 | then it's very nice to have a bit of feedback for example |
|---|
| 0:19:32 | the derivatives of your system parameters with respect to those scores so |
|---|
| 0:19:40 | you think | 
|---|
| 0:19:41 | it would be possible to | 
|---|
| 0:19:45 | i'm not sure whether it's in the question is | 
|---|
| 0:19:48 | is this an issue of not having speaker labels for development or | 
|---|
| 0:19:58 | well | 
|---|
| 0:20:00 | we want to be able to train | 
|---|
| 0:20:02 | a fusion sure on the type that so can you see that happening or | 
|---|
| 0:20:08 | because if you would just give us the data we could do that but if | 
|---|
| 0:20:11 | the data stays | 
|---|
| 0:20:14 | on the other side on your side |
|---|
| 0:20:17 | then |
|---|
| 0:20:17 | that's more difficult and for sure more complex |
|---|
| 0:20:26 | right | 
|---|
| 0:20:32 | yes and one thing that maybe i should clarify is this was really meant in the |
|---|
| 0:20:37 | context of sre in other nist evaluations sometimes | 
|---|
| 0:20:41 | they reuse a dataset from one year to another year |
|---|
| 0:20:45 | they also have some |
|---|
| 0:20:48 | i guess what's called the progress set but they use it in a different sense than we |
|---|
| 0:20:52 | are using it here | 
|---|
| 0:20:54 | where people won't get | 
|---|
| 0:20:58 | the key for that | 
|---|
| 0:20:59 | but they will have the key for the review set | 
|---|
| 0:21:02 | does that address your question or |
|---|
| 0:21:04 | okay | 
|---|
| 0:21:09 | quick question i just wondered it's not relevant according to the rules but |
|---|
| 0:21:14 | those thirteen hundred models |
|---|
| 0:21:17 | were all the models from different speakers or were there some speakers |
|---|
| 0:21:23 | because there was a distortion weighted it would be or not | 
|---|