0:00:14this work was done in collaboration with a very large number of colleagues and
0:00:20everyone did a large part of the work so
0:00:24i'd like to thank
0:00:27désiré george daniel jack tomi alvin alan mark and doug
0:00:37so the goal of the challenge was to support and encourage development of new methods
0:00:42for speaker detection utilizing i-vectors and the intent was to explore new ideas in machine
0:00:49learning for use in speaker recognition
0:00:53and to try to make the field more accessible for people outside of the audio processing
0:00:56community and to improve the performance of the technology
0:01:05the challenge format for people who don't know
0:01:08was to use i-vectors so instead of distributing audio the idea was to distribute the
0:01:12i-vectors themselves and it was all hosted on a web platform so it was entirely online
0:01:18the registration the system submission and receiving results were all online
0:01:26the reason for using i-vectors in the web platform was to attempt to expand the
0:01:32number and types of participants including ones from the ml community
0:01:36and to allow iterative submissions with a fast turnaround in order to support research progress
0:01:43during the actual evaluation
0:01:48another thing that was different from what people may be accustomed to with the
0:01:51regular sre was that a large development set of unlabeled i-vectors was distributed to be
0:02:00used for dev data the intent there was to encourage new creative approaches to modelling
0:02:06and in particular
0:02:08the use of clustering to improve performance
0:02:12in addition to these things one thing we were hoping to do was to set
0:02:16a precedent or at least have a proof of concept for future evaluations where there
0:02:22can be web based registration data distribution potentially and results submission trying to make this
0:02:28more efficient and more user friendly
0:02:32for the community
0:02:38the objectives for the data selection were to include multiple training sessions for each target
0:02:43speaker in the main evaluation test in recent sres
0:02:47an optional test has involved multiple training sessions but
0:02:52in this challenge we wanted to include that for everyone to do as the main test
0:03:01also to include same handset target trials and cross sex nontarget trials both of which are different
0:03:10from the regular sre
0:03:12also something different was taking i-vectors from segments with durations drawn from a log normal distribution as opposed to
0:03:20some discrete uniform
0:03:25durations the reason for this was that it's more realistic and it's a challenge that
0:03:32people seemed eager to address
0:03:34and also just varying the duration allows us to do
0:03:39post evaluation analysis
0:03:43so the task is speaker detection which hopefully everybody here knows by the third day
0:03:48of the conference
0:03:49and the system was evaluated over a set of trials where each trial compared
0:03:55a target speaker model in this case this was a set of five i-vectors and
0:04:00a test speech segment comprised of a single i-vector
0:04:05the system determines whether or not the speaker in the test segment is the target
0:04:08speaker by outputting a single real number
0:04:12and no decision was necessary
0:04:15the trial outputs are then compared to ground truth to compute a performance measure which
0:04:19for the i-vector challenge was a dcf
0:04:24hopefully people know what target trials and non-target trials and misses and false alarms are
0:04:29does anyone not know that
0:04:33okay if not come see me afterwards the measure was a dcf which is essentially just
0:04:40the miss rate plus one hundred times the false alarm rate
0:04:46and the official overall measure was the mindcf
0:04:54seen here
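The cost function described above can be sketched in a few lines. This is an illustrative sketch rather than the official scoring tool: the function names, the threshold sweep, and the convention that a trial is accepted when its score is at or above the threshold are assumptions; only the form "miss rate plus one hundred times the false alarm rate" and the use of the minimum over thresholds come from the talk.

```python
def dcf(p_miss, p_fa, fa_weight=100.0):
    """Detection cost: miss rate plus fa_weight times the false alarm rate."""
    return p_miss + fa_weight * p_fa


def min_dcf(target_scores, nontarget_scores, fa_weight=100.0):
    """Minimum DCF over all decision thresholds (simple quadratic-time sweep).

    Assumed convention: a trial is accepted when its score >= threshold.
    """
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))

    best = float("inf")
    for t in thresholds:
        # A miss is a target trial that falls below the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        # A false alarm is a non-target trial at or above the threshold.
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        best = min(best, dcf(p_miss, p_fa, fa_weight))
    return best
```

With roughly twelve and a half million trials a real scorer would sort the scores once and sweep cumulatively; the quadratic loop here is only meant to make the definition explicit. Because the false alarm term is weighted by one hundred, rejecting every trial already gives a cost of 1.0, so a useful system needs a minimum DCF below that.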
0:04:57so the challenge i-vectors were produced with a system developed jointly between johns hopkins
0:05:02and mit lincoln labs and it uses the standard mfccs as the acoustic features
0:05:10and use the gmm train set
0:05:17the source data were the ldc mixer corpora in particular mixers one three seven as well
0:05:22as remix and included around sixty thousand telephone call sides from about six thousand speakers
0:05:28and the durations of these calls were up to five minutes drawn from a log
0:05:34normal distribution
0:05:35with a mean of nearly forty seconds
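As a rough illustration of what drawing durations this way looks like, the sketch below samples from a log-normal distribution whose mean is forty seconds and caps each draw at five minutes. The shape parameter `sigma`, the choice to cap rather than redraw, and the function names are assumptions made for the example; the talk only gives the mean and the five minute limit.

```python
import math
import random


def lognormal_mu(mean_seconds, sigma):
    """Location parameter mu such that exp(mu + sigma**2 / 2) == mean_seconds."""
    return math.log(mean_seconds) - sigma ** 2 / 2


def sample_durations(n, mean_seconds=40.0, sigma=1.0, max_seconds=300.0, seed=0):
    """Draw n durations in seconds from a log-normal, capped at max_seconds.

    sigma=1.0 is a hypothetical value; capping (instead of redrawing) is also
    an assumption and pulls the realized mean slightly below mean_seconds.
    """
    rng = random.Random(seed)
    mu = lognormal_mu(mean_seconds, sigma)
    return [min(rng.lognormvariate(mu, sigma), max_seconds) for _ in range(n)]
```

Compared with a few fixed duration conditions, a continuous distribution like this yields many short segments and a long tail of longer ones, which is what makes per-duration analysis after the evaluation possible.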
0:05:39for each selected segment participants were provided with the six hundred dimensional i-vector as well as
0:05:45the duration of the speech from which the i-vector was extracted
0:05:52so this is the data and then the data was partitioned into a development set
0:05:56and an enrollment and test set
0:05:59for the development partition the calls were from speakers without test data
0:06:04and consisted of around thirty six thousand telephone call sides from around five thousand speakers
0:06:10and as i said earlier it was unlabeled so
0:06:15no speaker labels were given with the development partition
0:06:21for the enrollment and test
0:06:24calls were from speakers with at least five calls from different phone numbers and at
0:06:28least eight calls from a single phone number it consisted of about thirteen hundred target
0:06:36i'm sorry target models
0:06:38almost ten thousand test i-vectors and the target trials were limited to ten same and ten
0:06:44different phone number calls per speaker and non-target trials came from other target speakers as
0:06:49well as five hundred speakers who were not
0:06:53target speakers two hundred fifty males and
0:06:56two hundred fifty females
0:07:00the trials consisted of all possible pairs of a target speaker and a test i-vector
0:07:07about twelve and a half million trials
0:07:10and included cross sex nontarget trials as well as same number
0:07:14target trials
0:07:16the trials were divided into two randomly selected subsets someone asked about this the
0:07:21speakers did overlap between the progress subset and the evaluation subset
0:07:29forty percent was used for a progress subset which
0:07:36was what was used to monitor progress and people familiar or maybe not familiar i
0:07:41should say with the
0:07:42i-vector challenge there was a
0:07:46a progress board where people could see how they were doing and how other people
0:07:52were doing
0:07:53and that was actually
0:07:59updated using the progress set and sixty percent of the data was held out
0:08:04until the end of the evaluation period
0:08:07and then the system submissions were scored for the official results
0:08:13using this remaining sixty percent
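The trial list and its two subsets can be pictured with the sketch below. It assumes the split was a plain random partition at the trial level, which matches the remark that speakers overlap between the subsets; the function names, the seed, and the use of integer ids are hypothetical.

```python
import random
from itertools import product


def make_trials(model_ids, test_ids):
    """All possible (target model, test i-vector) pairs."""
    return list(product(model_ids, test_ids))


def split_trials(trials, progress_fraction=0.4, seed=0):
    """Randomly partition trials into (progress, evaluation) subsets.

    The split is over trials, not speakers, so the same model or test segment
    can show up in both subsets.
    """
    rng = random.Random(seed)
    shuffled = trials[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(progress_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

With about thirteen hundred models and almost ten thousand test i-vectors, `make_trials` would produce on the order of the twelve and a half million trials mentioned in the talk, with roughly five million of them feeding the progress board.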
0:08:18so some restrictions on the evaluation system output for each trial could be based only
0:08:24on the trial's model and test i-vectors as well as the durations provided and the
0:08:29provided development data
0:08:32normalization over multiple test segments or target speakers was not allowed
0:08:36use of evaluation data for nontarget speaker modeling was not allowed
0:08:41and training system parameters using data not provided as part of the challenge was also
0:08:46not allowed one two and three these are
0:08:51pretty typical for the nist evaluations four is actually new
0:08:55and the intent was to remove data engineering and also encourage participation from sites
0:09:00that don't have a lot of their own speech data
0:09:05so in terms of participation there were about three hundred registrants from about fifty countries
0:09:11a hundred and forty of the registrants from a hundred and five unique sites
0:09:15made at least one valid submission so there were some
0:09:19some number of people who registered but weren't able to submit a system
0:09:25the number of submissions actually exceeded eight thousand if we compare these numbers to sre
0:09:29twelve we do see a really large increase in participation which we're excited to see
0:09:38in addition to receiving data
0:09:40a baseline system was distributed with the evaluation
0:09:47it used a variant of cosine scoring here we see the five steps estimate a global mean
0:09:53and covariance on the unlabeled data
0:09:56use that mean and covariance to center and whiten the i-vectors project them onto
0:10:01the unit sphere
0:10:02and then for each model
0:10:03average its five i-vectors and project those onto the unit sphere and then compute
0:10:08the inner product
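The five baseline steps can be sketched with NumPy as follows. This is not the distributed baseline code: the function names, the eigendecomposition-based whitening, and the eigenvalue floor are assumptions chosen to give a compact, runnable version of the steps as described.

```python
import numpy as np


def fit_whitener(dev_ivectors):
    """Step 1: global mean and whitening transform from unlabeled dev i-vectors."""
    mean = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 1e-10)   # assumed floor to avoid dividing by ~0
    whitener = eigvecs / np.sqrt(eigvals)  # scales column j by 1 / sqrt(eigval j)
    return mean, whitener


def length_norm(x):
    """Project vectors onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def preprocess(x, mean, whitener):
    """Steps 2 and 3: center, whiten, then length-normalize."""
    return length_norm((x - mean) @ whitener)


def score(model_ivectors, test_ivector, mean, whitener):
    """Steps 4 and 5: average the model's i-vectors, renormalize, inner product."""
    model = length_norm(preprocess(model_ivectors, mean, whitener).mean(axis=0))
    test = preprocess(test_ivector, mean, whitener)
    return float(model @ test)
```

Here `model_ivectors` would be the five enrollment i-vectors of one target speaker as a 5 by 600 array and `test_ivector` a single 600 dimensional vector. Nothing in the sketch needs speaker labels, which is why this kind of system could be built from the unlabeled development set alone.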
0:10:12one thing to note is because the dev data was unlabeled wccn
0:10:16and lda were
0:10:18not possible to use
0:10:21in addition to that there was an oracle system that was not provided but kept
0:10:26at jhu
0:10:28which had access to the development data speaker labels
0:10:35and the
0:10:37system was gender dependent with a four hundred dimensional speaker space all of the i-vectors
0:10:44for each model were length normalized and then averaged
0:10:49and it discarded i-vectors with duration less than thirty seconds which actually reduced the development
0:10:56set quite a bit
0:11:00and here we see our first result so
0:11:04the red line is the oracle system
0:11:08and the blue line is the baseline system the solid line is on the evaluation
0:11:13set of trials so the sixty percent that were
0:11:17held out and the dotted line is on the progress set
0:11:22so basically the gap between these lines indicates the
0:11:26potential value of having speaker labels
0:11:30so the hope was to be able to use clustering techniques on the development set
0:11:35to close this gap
0:11:42here we see
0:11:45so here
0:11:48is the mindcf on the oracle system and on the baseline system the blue line
0:11:55is the progress set
0:11:56and the red line is the eval set
0:11:58and here we see the
0:12:00top ten performing systems and how they did on the progress set and on the
0:12:06eval set
0:12:08performance on the eval set was consistently better than progress
0:12:12not exactly sure why other than some random variation
0:12:16and seventy five percent of participants submitted a system that outperformed the baseline which we were really
0:12:21pleased to see as well
0:12:23how are we doing on time
0:12:26so okay great actually course so
0:12:31oops let's skip this
0:12:34here we see progress over time
0:12:37the green line is on the eval set
0:12:42and the blue line is on the progress set
0:12:44and the red line is on the progress set too so basically the green line
0:12:47is the very best score observed to date
0:12:51same with the blue line
0:12:53and then the red line is for the system that ended up
0:12:57with the top performance
0:13:00at the end so we see its
0:13:05history of the performance over time
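A "best score observed to date" curve like the green and blue lines is a running minimum over the scores in submission order, since a lower minimum DCF is better. The sketch below shows that computation; the function name and the example values are made up for illustration, and how the plotted curves were actually produced is not stated in the talk.

```python
from itertools import accumulate


def best_to_date(scores_in_submission_order):
    """Running minimum: the best (lowest) score seen up to each submission."""
    return list(accumulate(scores_in_submission_order, min))


# Hypothetical minDCF values from five successive submissions.
history = best_to_date([0.40, 0.38, 0.39, 0.30, 0.35])
# history == [0.40, 0.38, 0.38, 0.30, 0.30]
```

A curve built this way can only stay flat or go down, so a long flat stretch corresponds to a period in which no submission beat the best score so far.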
0:13:09a couple of things that we noted one was the performance levelled off after about six
0:13:12weeks we ran this from december
0:13:15through april
0:13:17and basically after six weeks not much further progress was observed
0:13:24and also interesting to note was that the leading system
0:13:28did not lead basically from december till february
0:13:32but at about that period
0:13:37it took the lead and stayed there
0:13:42here we see performance by gender on the left
0:13:47of each of these is the leading system
0:13:50and on the right is the baseline system
0:13:54one thing kind of interesting to note
0:13:56is the leading system did worse
0:13:58on same sex trials than on male only and female only which might
0:14:04be unexpected but
0:14:07i think an explanation for this is that there were calibration issues
0:14:14here we see performance by same and different phone number
0:14:18here the blue is the baseline
0:14:21on the left is same number on the right is different number
0:14:24and here i guess like with the gender
0:14:27we see limited degradation in performance due to the change in phone number for the
0:14:30leading system
0:14:32so this was very close
0:14:36even compared to the baseline which was fairly close
0:14:42so there's some additional information available you can see the odyssey paper for more results
0:14:49for example more information about the progress over time and gender effects as well as same
0:14:55and different phone numbers
0:14:57we also have an interspeech paper that does some analysis of participation
0:15:01it gives some of these same results but on the progress set the odyssey
0:15:04paper focuses entirely on the eval set
0:15:08and there's lots of work to do
0:15:11so we'll have future papers on duration age and other results so you can see
0:15:15those things for additional information you can also please feel free to contact us
0:15:20so some conclusions we thought that the process worked which was very exciting for us
0:15:26the website was brought up and stayed up which was good
0:15:33participation exceeded that of prior sres
0:15:37which was one of the goals
0:15:39and many sites significantly improved on the baseline system
0:15:44further investigation and feedback will be needed
0:15:47in order to determine the extent to which the new participation was from outside of
0:15:52the audio processing community
0:15:55for people who signed up we
0:15:59eventually asked if they were from the audio processing community but we didn't think to
0:16:04do that during the initial sign up so in a lot of cases we don't know
0:16:10whether the additional participation came from outside the audio processing community or not
0:16:18thousands of submissions provide data for further analysis which we look forward to doing
0:16:29these things include things like clustering of unlabeled data and gender differences across and within
0:16:34trials effects of handsets role of duration
0:16:41and speaking of future work
0:16:44we plan to enhance the online platform for example would like to put analysis tools
0:16:50on the platform for participants to use
0:16:54we expect to
0:16:57offer further online challenges
0:17:00and in part because they're more readily organized and also because
0:17:05it's possible to efficiently reuse test data
0:17:09but we expect that the results will affect full fledged evaluations as well or
0:17:15the typical sres
0:17:17as well for example we'd like to
0:17:21have increasingly web based and user friendly procedures for
0:17:25registration and for data distribution
0:17:28and it's possible that we'll use separate evaluation datasets
0:17:38one for iterated use to track performance and another held out with limited exposure
0:17:43we've seen this used in
0:17:46past
0:17:48nist evaluations and it may
0:17:51see renewed use in sres
0:17:55thank you very much
0:18:07hi craig can you go back to slide twenty one okay
0:18:12i'm wondering with those this seems just the weights is leading system
0:18:18is the leading system in those two conditions the same system
0:18:22sure that is the same system in
0:18:26in both
0:18:33i used in a reasonable idea oracle was and different directed and what you distribute
0:18:38which one six hundred twenty four hundred one so why you keep the same i-vector
0:18:45the two distributions i lincoln you may be addressed
0:19:02craig in your final slide you mentioned
0:19:05in the last point a data set for iterated use
0:19:12are you thinking of something similar to what you have now the
0:19:19the point i'm getting at is
0:19:21if you want to train for example calibration or fusion
0:19:26then it's very nice to have some feedback for example
0:19:32the derivatives of your system parameters with respect to those scores so
0:19:40you think
0:19:41it would be possible to
0:19:45i'm not sure whether it's in the question is
0:19:48is this an issue of not having speaker labels for development or
0:20:00we want to be able to train
0:20:02a fusion sure on the type that so can you see that happening or
0:20:08because if you would just give us the data we could do that but if
0:20:11the data stays
0:20:14on the other side and all side
0:20:17that's more difficult and for sure more complex
0:20:32yes and one thing that maybe i should clarify is this was really meant in the
0:20:37context of sre in other nist evaluations sometimes
0:20:41they reuse a dataset from one year to another
0:20:45but they also have some
0:20:48i guess what's called a progress set but they use it in a different sense than we
0:20:52are using it here
0:20:54where people won't get
0:20:58the key for that
0:20:59but they will have the key for the review set
0:21:02does that answer your question or
0:21:09quick question i just wondered it's not relevant according to the rules but
0:21:14those thirteen hundred models
0:21:17are all models of different speakers or were there some speakers
0:21:23because there was a distortion weighted it would be or not