Speech Transcript - Summary of the 2015 NIST Language Recognition i-Vector Machine Learning Challenge

0:00:14	she thank you also
0:00:18	so the language recognition i-vector challenge had three main goals
0:00:26	first to attract people from outside the regular community
0:00:32	and to make
0:00:35	this
0:00:37	work that we do more accessible to them
0:00:39	and the idea behind that was for people to explore new approaches and methods
0:00:47	from machine learning in language recognition with the overall goal of improving performance in language
0:00:52	recognition
0:00:55	the task was open set language identification so given an audio segment determine which of n
0:01:00	languages the audio segment is spoken in or whether it was an
0:01:06	unknown language
0:01:09	the data used was from previous nist lres as well as
0:01:14	from the iarpa babel program
0:01:17	and the data was selected in such a manner such that multiple sources were used
0:01:25	for each language in order to reduce
0:01:27	the source language effect
0:01:30	and was also selected in order to have highly confusable languages included in the
0:01:37	dataset
0:01:40	here we see the size of the data there were fifty languages in train and sixty five
0:01:44	in dev and test
0:01:47	about three hundred segments per language in the training and about a
0:01:53	hundred
0:01:53	in the dev and test
0:01:55	and we see the total number of segments all the way in the right hand column
0:02:01	fifteen thousand for training about sixty four hundred for dev and about sixty five
0:02:05	hundred for test
0:02:06	and the training set did not include data that was from out of set
0:02:12	the development set included unlabeled out of set data
0:02:15	and the test set was divided into progress and evaluation subsets which we'll
0:02:21	cover in just a moment
0:02:23	people were able to upload their system outputs and receive some feedback on how that
0:02:29	went and that was done using the progress set
0:02:32	and then at the end of the evaluation period
0:02:36	feedback was given on the evaluation set and it was a partition so there's
0:02:40	no overlap
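As a rough illustration of the non-overlapping progress and evaluation subsets described above, the following sketch partitions a list of test segment ids. The function name, the `progress_fraction` value, and the seed are hypothetical and chosen only for illustration; the talk does not say how the actual split was drawn or how large each subset was.

```python
import random

def split_progress_evaluation(segment_ids, progress_fraction=0.3, seed=0):
    """Partition test segment ids into disjoint progress and evaluation subsets.

    progress_fraction and seed are assumptions for illustration only.
    """
    ids = sorted(segment_ids)          # fixed order so the split is reproducible
    random.Random(seed).shuffle(ids)   # shuffle with a local, seeded generator
    cut = int(len(ids) * progress_fraction)
    return set(ids[:cut]), set(ids[cut:])

# About sixty five hundred test segments, as stated in the talk.
progress, evaluation = split_progress_evaluation(range(6500))
```

Because the two sets come from slicing one shuffled list at a single cut point, every segment lands in exactly one subset, which is what makes the split a partition.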
0:02:44	here we see data sources for each language
0:02:48	on the
0:02:50	right hand side i'm sure it's not easy to see
0:02:53	you can see different corpora labels i think at a high level we can say
0:02:58	blue is conversational telephone speech green is
0:03:04	broadcast narrowband speech and yellow is a combination of the two
0:03:09	i think
0:03:10	one thing to say is that if you look across
0:03:13	the training data which is i guess your leftmost column
0:03:17	the dev data which is in the middle and the test data which is furthest to
0:03:20	the right
0:03:21	the distribution across sources is very similar per language there are a few exceptions
0:03:27	and as we mentioned there was no out of set
0:03:29	data in the training
0:03:36	and here we see speech duration
0:03:41	in train dev and test
0:03:43	training is beige dev is green and test is blue
0:03:48	and we see again a similar distribution among train dev and test
0:03:53	this was log normal
0:03:59	the performance metric was error rate split into out of set languages and within set
0:04:04	languages
0:04:06	where the prior probability of an out of set language was point two three
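One way to read the metric described above is as a weighted combination of the in-set error rates and the out-of-set error rate, with the out-of-set prior of 0.23 as the weight. The sketch below assumes the in-set languages share the remaining probability mass equally; that equal weighting and the function name are assumptions for illustration, not something stated in the talk.

```python
def challenge_cost(in_set_error_rates, oos_error_rate, p_oos=0.23):
    """Combine per-language error rates with the out-of-set error rate.

    Assumes each in-set language gets an equal share of (1 - p_oos).
    """
    n = len(in_set_error_rates)
    in_set_term = (1.0 - p_oos) / n * sum(in_set_error_rates)
    return in_set_term + p_oos * oos_error_rate

# Hypothetical error rates: 20% on each of fifty in-set languages, 50% on out of set.
cost = challenge_cost([0.2] * 50, 0.5)
```

Under this reading a system that gets every segment wrong has a cost of one, and a system that ignores the out-of-set class entirely pays at least 0.23.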
0:04:15	participation was
0:04:18	wonderful more than what we typically see in an lre
0:04:23	it was from international sites six continents and thirty one countries
0:04:30	about eighty participants downloaded the data and a little over fifty five participants submitted
0:04:34	results
0:04:36	from
0:04:37	forty four unique organisations
0:04:41	during the evaluation period a little over seventy i'm sorry thirty seven hundred valid submissions
0:04:46	were submitted
0:04:49	and that number continues to grow
0:04:54	after which
0:04:59	i mentioned that we
0:05:01	had more participation in the i-vector challenge than we typically do with our
0:05:05	lre and we can see some other comparisons
0:05:09	i guess i'd say one of the main differences between the i-vector challenge
0:05:15	and a traditional lre is in the data that we distribute
0:05:19	in the traditional lre we send audio segments as input to systems and in the i-vector
0:05:26	challenge we send i-vectors instead
0:05:30	the task was different the i-vector challenge was open set identification instead of detection
0:05:37	in the i-vector challenge the cost was based on a kind of total error rate per
0:05:43	language and in the traditional lre it's on miss and false alarm rates
0:05:48	a larger number of target languages a different
0:05:52	distribution of speech duration i mentioned that was log normal in the i-vector challenge in the
0:05:57	traditional lre it's three ten and thirty second bins traditionally
0:06:02	the challenge lasted much longer in the i-vector challenge
0:06:07	and it
0:06:08	but also the i-vector challenge results were
0:06:12	feedback was given during the challenge period which is also not something we
0:06:16	do in traditional evaluations
0:06:19	and last there was an evaluation platform that was online
0:06:27	and this was something that we
0:06:30	focused on for the i-vector challenge
0:06:33	in particular the goal was to facilitate
0:06:36	the evaluation process with limited human involvement
0:06:40	all evaluation activities were conducted via this platform including receiving the data
0:06:47	uploading submissions and being able to see how things went
0:06:56	and now looking at some results on the y-axis we see
0:07:01	cost
0:07:03	and on the x-axis time
0:07:06	the first
0:07:07	first dip i think is around may seventeenth the choice certainly first
0:07:12	and the second
0:07:14	large dip is on may twenty first so
0:07:18	about half roughly half of the progress made during the evaluation took place during
0:07:25	the first
0:07:25	two or three weeks or so
0:07:28	and then during the remainder of the four months the rest of the progress was made
0:07:37	here we also see cost on the y-axis on the x-axis we see
0:07:43	participant id so these are really discrete it's sorted by best cost
0:07:49	obtained on the evaluation
0:07:50	subset
0:07:52	and so we see most of the sites beat the baseline
0:07:59	which is trained and a few sites beat an oracle system so i guess speaking
0:08:03	to both of these the baseline i believe is a simple
0:08:13	a simple
0:08:17	system that used cosine distance and the oracle system used plda
0:08:25	so it's called oracle because there was unlabeled data that was distributed to the participants
0:08:30	but the oracle system used those labels
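A cosine distance system of the kind mentioned above can be sketched as follows: average the length-normalized training i-vectors for each language, then assign a test i-vector to the language whose average is closest in angle. This is a minimal sketch under stated assumptions, not the actual baseline distributed with the challenge; the function names are hypothetical, and the sketch never predicts out of set.

```python
import math

def unit(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def train_language_models(ivectors, labels):
    """Average the unit-length i-vectors of each language, then renormalize."""
    sums = {}
    for vec, lang in zip(ivectors, labels):
        u = unit(vec)
        acc = sums.setdefault(lang, [0.0] * len(u))
        for i, x in enumerate(u):
            acc[i] += x
    return {lang: unit(acc) for lang, acc in sums.items()}

def classify(ivector, models):
    """Return the language whose model has the highest cosine similarity."""
    u = unit(ivector)
    scores = {lang: sum(a * b for a, b in zip(u, m)) for lang, m in models.items()}
    return max(scores, key=scores.get)

# Toy two-dimensional vectors standing in for real i-vectors.
models = train_language_models(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    ["a", "a", "b", "b"],
)
```

Because all vectors are unit length, the dot product is the cosine similarity, so choosing the highest score is the same as choosing the smallest cosine distance.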
0:08:38	and here we see the number of submissions per participant
0:08:42	in general
0:08:43	participants who did well submitted more systems but there were
0:08:48	a few exceptions i think now is a reasonable time to mention
0:08:54	participant id and
0:08:56	site id the distinction between participant and site so
0:09:02	a participant is someone who signed up and maybe there were multiple participants per site so
0:09:08	ids are not necessarily unrelated for example section three may have also been by thirty
0:09:15	just
0:09:20	and here we see results by target language we have error rate on the y-axis
0:09:27	on the x-axis we see language the lowest error rate was received on
0:09:39	parameters and highest on hindi
0:09:42	what was surprising was english also had a high error rate
0:09:47	it was actually second from the right
0:09:51	and the blue was the out of set languages somewhere in the middle of the pack
0:09:58	and here we see results by speech duration i guess no surprise that as you
0:10:04	get more audio
0:10:08	you tend to do better
0:10:10	one thing that
0:10:12	is also maybe interesting is there seems to be some diminishing marginal returns
0:10:17	so if for example you had three seconds and you could get ten you do
0:10:26	maybe
0:10:27	we
0:10:28	point two better but if you went from
0:10:34	ten to twenty
0:10:36	the difference is not so great
0:10:38	just as an example
0:10:42	so some lessons learned
0:10:44	wonderful participation we're all very grateful for those of you in the audience who participated
0:10:51	as well as those who couldn't join us today
0:10:55	a number of systems beat the baseline and surprisingly six were actually better than the oracle
0:10:59	system which we're hoping to learn more about
0:11:03	half of the improvement was made early on which may suggest we reconsider
0:11:09	the timeline
0:11:11	surprisingly top systems did not all do so well on english
0:11:18	performance on out of set languages also was not as poor as we might have
0:11:22	expected
0:11:25	we did not receive many system descriptions so it's unclear how many of the participants
0:11:32	attempted novel techniques although
0:11:34	later in the session we'll hear from
0:11:38	tops is thus able to capture stated in the a team that created top system
0:11:44	that did develop novel techniques and we'll see more of that
0:11:48	and the web platform is still up so please feel free to visit and participate in the
0:11:54	challenge now
0:11:57	and see how see how you're doing
0:12:00	and a quick plug for upcoming activities there's sre sixteen and a workshop
0:12:06	where the task is speaker detection on telephone speech recorded over a variety of handsets
0:12:13	similar to lre fifteen for those who are familiar there's now a fixed training condition as
0:12:17	well as an open condition
0:12:20	you can see some other details there about the evaluation and there's also a twenty sixteen
0:12:25	lre analysis workshop and all of this will be co-located with slt sixteen in
0:12:30	san diego
0:12:32	so it looks like we have time for
0:12:35	for questions

Summary of the 2015 NIST Language Recognition i-Vector Machine Learning Challenge

NIST 2015 Language Recognition i-Vector Machine Learning Challenge

Audrey Tong, Craig Greenberg, Alvin Martin, Desire Banse, John Howard, Hui Zhao, George Doddington, Daniel Garcia-Romero, Alan McCree, Douglas Reynolds, Elliot Singer, Jaime Hernandez-Cordero, Lisa Mason