0:00:14 this work was done in collaboration with a very large number of colleagues, and everyone else did most of the work, so
0:00:24 i'd like to thank
0:00:27 disarray and george, daniel, jack, tommy, alvin, alan, mark, and doug
0:00:37 so the goal of the challenge was to support and encourage development of new methods
0:00:42 for speaker detection utilizing i-vectors. the intent was to explore new ideas in machine
0:00:49 learning for use in speaker recognition,
0:00:53 to try to make the field more accessible to people outside of the audio processing
0:00:56 community, and to improve the performance of the technology
0:01:05 the challenge format, for people who don't know,
0:01:08 was to use i-vectors rather than audio, that is, to distribute the
0:01:12 i-vectors themselves, and it was all hosted on a web platform, so it was entirely online:
0:01:18 the registration, the system submission, and receiving results were all online
0:01:26 the reason for using i-vectors and the web platform was to attempt to expand the
0:01:32 number and types of participants, including ones from the ml community,
0:01:36 and to allow iterative submissions with fast turnaround in order to support research progress
0:01:43 during the actual evaluation
0:01:48 another thing that was different from what people may be accustomed to with the
0:01:51 regular sre was that a large development set of unlabeled i-vectors was distributed to be
0:02:00 used for dev data. the intent there was to encourage new creative approaches to modelling,
0:02:06 and in particular
0:02:08 the use of clustering to improve performance
0:02:12 in addition to these things, one thing we were hoping to do was to set
0:02:16 a precedent, or at least have a proof of concept, for future evaluations where there
0:02:22 can be web-based registration, data distribution, and potentially results submission, trying to make this
0:02:28 more efficient and more user friendly
0:02:32 for the community
0:02:36so
0:02:38 the objectives driving the data selection were to include multiple training sessions for each target
0:02:43 speaker in the main evaluation test. in recent sres,
0:02:47 an optional test has involved multiple training sessions, but
0:02:52 in this challenge we wanted to include that for everyone as the main
0:02:59 focus
0:03:01 also of interest were same-handset target trials and cross-sex nontarget trials, both of which are
0:03:08 unusual
0:03:10 for the regular sre
0:03:12 also something different was drawing the i-vector durations from a log-normal distribution as opposed to
0:03:20 some discrete uniform
0:03:25 durations. the reason for this was that it is more realistic, and it is a challenge that
0:03:32 people seemed eager to address.
0:03:34 and also, varying the duration allows us to do
0:03:39 post-evaluation analysis
0:03:43 so the task is speaker detection, which hopefully everybody here knows by the third day
0:03:48 of the conference.
0:03:49 the systems were evaluated over a set of trials, where each trial compared
0:03:55 a target speaker model, in this case a set of five i-vectors, and
0:04:00 a test speech segment comprised of a single i-vector.
0:04:05 the system determines whether or not the speaker in the test segment is the target
0:04:08 speaker, outputting a single real number,
0:04:12 and no hard decision was necessary
0:04:15 the trial outputs are then compared to ground truth to compute a performance measure, which
0:04:19 for the i-vector challenge was the dcf.
0:04:24 hopefully people know what target trials and non-target trials and misses and false alarms are.
0:04:29 does anyone not know that?
0:04:33 okay, if not, come see me afterwards. the measure was dcf, which is essentially just
0:04:40 the miss rate at a threshold plus one hundred times the false alarm rate,
0:04:46 and the official overall measure was the mindcf,
0:04:54 seen here
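the measure just described can be sketched in a few lines. this is a minimal illustration, not NIST's scoring code: it sweeps every threshold over the sorted scores and returns the minimum of the miss rate plus one hundred times the false alarm rate.

```python
import numpy as np

def min_dcf(scores, labels, fa_weight=100.0):
    """Minimum detection cost: min over thresholds of P_miss + 100 * P_fa.

    scores: real-valued system outputs, higher means more target-like.
    labels: 1 for target trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)          # sweep thresholds from low to high
    labels = labels[order]
    n_tgt = labels.sum()
    n_non = len(labels) - n_tgt
    # After rejecting the k lowest-scoring trials:
    #   misses = targets among those k; false alarms = non-targets kept.
    miss = np.concatenate(([0], np.cumsum(labels == 1))) / n_tgt
    fa = np.concatenate(([n_non], n_non - np.cumsum(labels == 0))) / n_non
    dcf = miss + fa_weight * fa
    return dcf.min()
```

with perfectly separable scores the minimum cost is zero; when one target and one non-target overlap, the cheapest error is the single miss (cost 1) rather than the false alarm (cost 100).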
0:04:57 so the challenge i-vectors were produced with a system developed jointly between johns hopkins
0:05:02 and mit lincoln labs. it uses standard mfccs and deltas as the acoustic features
0:05:10 and a trained gmm
0:05:17 the source data were the ldc mixer corpora, in particular mixer one through seven as well
0:05:22 as remakes, and included around sixty thousand telephone call sides from about six thousand speakers.
0:05:28 the durations of these calls were up to five minutes, drawn from a log-
0:05:34 normal distribution
0:05:35 with a mean of nearly forty seconds
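as an illustration of the duration sampling just described, here is a sketch of drawing log-normal durations with a mean near forty seconds, capped at the five-minute maximum. the shape parameter sigma is an assumption; the talk only quotes the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

target_mean = 40.0   # seconds; the talk quotes a mean of nearly forty seconds
sigma = 1.0          # shape parameter: an assumption, not stated in the talk
# For a log-normal, E[X] = exp(mu + sigma**2 / 2), so pick mu to hit the mean:
mu = np.log(target_mean) - sigma**2 / 2

durations = rng.lognormal(mean=mu, sigma=sigma, size=100_000)
durations = np.minimum(durations, 300.0)  # calls were capped at five minutes
# The cap trims the long right tail, so the sample mean lands a bit below 40 s.
```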
0:05:39 for each selected segment, participants were provided with a six hundred dimensional i-vector as well as
0:05:45 the duration of the speech from which the i-vector was extracted
0:05:52 so that is the data, and the data was partitioned into a development set
0:05:56 and an enrollment and test set.
0:05:59 for the development partition, the calls were from speakers without test data
0:06:04 and consisted of around thirty-six thousand telephone call sides from around five thousand speakers.
0:06:10 and as i said earlier, it was unlabeled, so
0:06:15 no speaker labels were given with the development partition
0:06:21 for the enrollment and test partition,
0:06:24 calls were from speakers with at least five calls from different phone numbers and at
0:06:28 least eight calls from a single phone number. it consisted of about thirteen hundred target
0:06:34 speakers,
0:06:36 and the same number of target models,
0:06:38 and almost ten thousand test i-vectors. the target trials were limited to ten same and ten
0:06:44 different phone number calls per speaker, and non-target trials came from other target speakers as
0:06:49 well as five hundred speakers who were not
0:06:53 target speakers: two hundred fifty males and
0:06:56 two hundred fifty females
0:07:00 the trials consisted of all possible pairs of a target speaker and a test i-vector,
0:07:07 about twelve and a half million trials,
0:07:10 and included cross-sex nontarget trials as well as same-number
0:07:14 target trials
0:07:16 the trials were divided into two randomly selected subsets. since someone asked about this: the
0:07:21 speakers did overlap between the progress subset and the evaluation subset.
0:07:29 forty percent was used for the progress subset, which
0:07:36 was what was used to monitor progress. for people familiar, or maybe not, i
0:07:41 should say, with the
0:07:42 challenge, there was
0:07:46 a progress board where people could see how they were doing and how other people
0:07:52 were doing,
0:07:53 and that was
0:07:59 updated using the progress set. sixty percent of the data was held out
0:08:04 until the end of the evaluation period,
0:08:07 and then the system submissions were scored for the official results
0:08:13 using this remaining sixty percent
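the split just described can be mimicked with a trivial random mask, purely as an illustration; the seed and exact trial count here are assumptions, not values from the challenge.

```python
import numpy as np

rng = np.random.default_rng(2014)  # hypothetical seed, for reproducibility

n_trials = 12_500_000                    # about twelve and a half million trials
progress = rng.random(n_trials) < 0.40   # ~40% progress subset
evaluation = ~progress                   # remaining ~60% held out

# The progress board would be updated from scores on `progress`;
# official results come from `evaluation` only.
```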
0:08:18 so, some strictures on the evaluation: system output for each trial could be based only
0:08:24 on the trial's model and test i-vectors, as well as the durations provided and the
0:08:29 provided development data.
0:08:32 normalization over multiple test segments or target speakers was not allowed.
0:08:36 use of evaluation data for nontarget speaker modeling was not allowed.
0:08:41 and training system parameters using data not provided as part of the challenge was also
0:08:46 ruled out. rules one, two, and three are
0:08:51 pretty typical for the nist sres; the fourth is actually new,
0:08:55 and the intent was to remove data engineering and also to encourage participation from sites that
0:09:00 don't have a lot of their own speech data
0:09:05 so in terms of participation, there were about three hundred registrants from about fifty countries.
0:09:11 a hundred and forty of the registrants, from a hundred and five unique sites,
0:09:15 made at least one valid submission, so there were
0:09:19 some number of people who registered but weren't able to submit a system.
0:09:25 the number of submissions actually exceeded eight thousand. if we compare these numbers to the sres,
0:09:29 we see a really large increase in participation, which we were excited to see
0:09:38 in addition to receiving data,
0:09:40 a baseline system was distributed with the evaluation.
0:09:47 it used a variant of cosine scoring, consisting of five steps: estimate a global mean
0:09:53 and covariance on the unlabeled data;
0:09:56 update the i-vectors by centering and whitening them; project them onto
0:10:01 the unit sphere;
0:10:02 and then, for each model,
0:10:03 average its five i-vectors and project those onto the unit sphere; and then compute
0:10:08 the inner product
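the five baseline steps above can be sketched roughly as follows with numpy. this is an illustrative reconstruction from the description in the talk, not the distributed baseline code; the function names and details are assumptions.

```python
import numpy as np

def whiten_params(dev_ivectors):
    """Step 1: global mean and covariance from the unlabeled dev i-vectors."""
    mean = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    # Inverse matrix square root of the covariance, for whitening.
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return mean, w

def unit_sphere(x, mean, w):
    """Steps 2-3: center, whiten, and length-normalize i-vectors (rows)."""
    y = (x - mean) @ w
    return y / np.linalg.norm(y, axis=-1, keepdims=True)

def score(model_ivectors, test_ivector, mean, w):
    """Steps 4-5: average the model's five whitened i-vectors, project the
    average onto the unit sphere, and take the inner product (cosine score)."""
    m = ((model_ivectors - mean) @ w).mean(axis=0)
    m = m / np.linalg.norm(m)
    t = unit_sphere(test_ivector, mean, w)
    return float(m @ t)
```

since both vectors end up on the unit sphere, the inner product is a cosine similarity bounded by one, and a test i-vector equal to the model average scores exactly one.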
0:10:12 one thing to note is that because the dev data was unlabeled, techniques like wccn
0:10:16 and lda were
0:10:18 not possible to use
0:10:21 in addition to that, there was an oracle system that was not provided but kept
0:10:26 at jhu,
0:10:28 which had access to the development data speaker labels.
0:10:35 this
0:10:37 system was gender dependent with a four hundred dimensional speaker space. all of the i-vectors
0:10:44 for each model were length-normalized and then averaged,
0:10:49 and it discarded i-vectors with duration less than thirty seconds, which actually reduced the development
0:10:55 set
0:10:56 quite a bit
0:11:00 and here we see our first result.
0:11:04 the red line is the oracle system
0:11:08 and the blue line is the baseline system. the solid line is on the evaluation
0:11:13 set of trials, the sixty percent that were
0:11:17 held out, and the dotted line is on the progress set.
0:11:22 so basically the gap between these lines indicates the
0:11:26 potential value of having speaker labels,
0:11:30 so the hope was to be able to use clustering techniques on the development set
0:11:35 to close this gap
0:11:42 here we see
0:11:44 results.
0:11:45 so here
0:11:48 is the mindcf on the oracle system and on the baseline system. the blue line
0:11:55 is the progress set
0:11:56 and the red line is the eval set,
0:11:58 and here we see the
0:12:00 top ten performing systems and how they did on the progress set and on the
0:12:06 eval set.
0:12:08 performance on the eval set was consistently better than on the progress set;
0:12:12 i'm not exactly sure why, other than some random variation.
0:12:16 and seventy-five percent of participants submitted a system that outperformed the baseline, which really
0:12:21 pleased us as well.
0:12:23 how are we doing on time?
0:12:26 okay, great.
0:12:31 oops, we'll skip this
0:12:34 next is progress over time.
0:12:37 the green line is on the eval set,
0:12:42 the blue line is on the progress set,
0:12:44 and the red line is on the progress set too. basically the green line
0:12:47 is the very best score observed to date,
0:12:51 same with the blue line,
0:12:53 and then the red line is for the system that ended up
0:12:57 with the top performance
0:13:00 at the end. so we see a
0:13:05 history of the performance over time
0:13:09 a couple of things that we noted: the performance levelled off after about six
0:13:12 weeks. we ran this from december
0:13:15 through april,
0:13:17 and basically after six weeks not much further progress was observed.
0:13:24 also interesting to note was that the leading system
0:13:28 did not lead, basically, from december till february,
0:13:32 but by the end of that period it had
0:13:37 taken the lead and stayed there
0:13:42 here we see performance by gender. on the left
0:13:47 of each of these is the leading system
0:13:50 and on the right is the baseline system.
0:13:54 one thing kind of interesting to note
0:13:56 is that the leading system did worse
0:13:58 on pooled same-sex trials than on male-only and female-only trials, which might
0:14:04 be unexpected, but
0:14:07 i think an explanation for this is that there were calibration issues
0:14:14 next is performance by same and different phone number.
0:14:18 here the blue is the baseline;
0:14:21 on the left is same number, on the right is different number.
0:14:24 and here, as with gender,
0:14:27 we see limited degradation in performance due to the change in phone number for the
0:14:30 leading system.
0:14:32 so this was very close,
0:14:36 even compared to the baseline, which was fairly close
0:14:42 so there's some additional information available. you can see the odyssey paper for more results,
0:14:49 for example more information about the progress over time and gender effects, as well as same
0:14:55 and different phone numbers.
0:14:57 we also have an interspeech paper that does some analysis of participation.
0:15:01 it gives some of these same results but on the progress set; the odyssey
0:15:04 paper focuses entirely on the eval set.
0:15:08 and there's lots of work still to do,
0:15:11 so look for future papers on duration, aging, and other results.
0:15:15 for additional information, please feel free to contact us
0:15:20 so, some conclusions. we thought that the process worked, which was very exciting for us:
0:15:26 the website was brought up and stayed up, which was good.
0:15:33 participation exceeded that of prior sres,
0:15:37 which was one of the goals,
0:15:39 and many sites significantly improved on the baseline system.
0:15:44 further investigation and feedback will be needed
0:15:47 in order to determine the extent to which the new participation was from outside of
0:15:52 the audio processing community
0:15:55 for people who signed up, we
0:15:59 eventually asked if they were from the audio processing community, but we didn't think to
0:16:04 do that during the initial sign-up, so in all other cases we don't know
0:16:10 whether the additional participation came from outside the audio processing community or
0:16:16 not
0:16:18 the thousands of submissions provide data for further analysis, which we look forward to doing.
0:16:25 these
0:16:29 analyses include things like clustering of unlabeled data, gender differences across and within
0:16:34 trials, effects of handsets, and the role of duration
0:16:41 and speaking of future work,
0:16:44 we plan to enhance the online platform. for example, we'd like to put analysis tools
0:16:50 on the platform for participants to use.
0:16:54 we expect to
0:16:57 offer further online challenges,
0:17:00 in part because they're more readily organized and also because
0:17:05 it's possible to efficiently reuse test data.
0:17:09 but we expect that these results will affect full-fledged evaluations as well,
0:17:15 the typical sres.
0:17:17 for example, we'd like to
0:17:21 have increasingly web-based and user-friendly procedures for
0:17:25 registration and for data distribution.
0:17:28 and it's possible that we'll use separate evaluation datasets:
0:17:38 one for iterations, to graph performance, and another held out with limited exposure.
0:17:43 we've seen this used in
0:17:46 past
0:17:48 nist evaluations, and it may
0:17:51 see renewed use in the sres
0:17:55thank you very much
0:18:07 craig, can you go back to slide twenty-one? okay.
0:18:12 i'm wondering, with those two sides, is the leading system
0:18:18 the same leading system in those two conditions, the same sets?
0:18:22 i'm sure that it is the same system in
0:18:26 both
0:18:33 the dimension you used in the oracle was different than what you distributed:
0:18:38 one was six hundred and one was four hundred. so why didn't you keep the same i-vectors for
0:18:45 the two? lincoln may be best placed to address
0:18:50 that
0:19:02 craig, in your final slide you mentioned,
0:19:05 in the last point, a data set for iterated use.
0:19:12 are you thinking of something similar to what you have now?
0:19:19 the point i'm getting at is,
0:19:21 if you want to train, for example, calibration or fusion,
0:19:26 then it's very nice to have as feedback, for example,
0:19:32 the derivatives of your system parameters with respect to those scores. so
0:19:40 do you think
0:19:41 it would be possible to
0:19:45 i'm not sure what the question is.
0:19:48 is this an issue of not having speaker labels for development, or
0:19:58 well,
0:20:00 we want to be able to train
0:20:02 a fusion score on that data, so can you see that happening?
0:20:08 because if you would just give us the data we could do that, but if
0:20:11 the data stays
0:20:14 on the other side, on your side,
0:20:17 then
0:20:17 that's more difficult and much more complex
0:20:26right
0:20:32 yes, and one thing that maybe i should clarify is that this was really meant in the
0:20:37 context of sre. in other nist evaluations, sometimes
0:20:41 they reuse a dataset from one year to another,
0:20:45 and they also have some-
0:20:48 thing, i guess, called the progress set, but they use it in a different sense than we
0:20:52 are using it here,
0:20:54 where people won't get
0:20:58 the key for that,
0:20:59 but they will have the key for the reused set.
0:21:02 does that answer your question, or
0:21:04okay
0:21:09 one more question. i just wondered, it's not relevant according to the rules, but
0:21:14 those thirteen hundred models:
0:21:17 are all the models from different speakers, or were there some speakers
0:21:23 with more than one model? because there was a discussion of whether it would be or not