0:00:15 And I'll talk about the NIST Language Recognition Evaluations, past and future. This is work done with colleagues Alvin, John, George, and Jack.
0:00:27 So there are two tasks in language recognition: identification, which is to choose among N specified target languages, and detection, which is to decide whether the speech is in the target language.
0:00:42 And the LRE tasks that have been part of the NIST evaluations have evolved over time. The early LREs in 1996, 2003, and 2005 focused on identification, and the recent LREs focused on detection. The most recent LRE and the next LRE focus on detection limited to language pairs.
0:01:11 And the rationale for the change is that we believe the two-class problem is conceptually simpler and represents the fundamental challenge, and that improved performance over time has required ever-increasing data to reliably estimate error rates.
0:01:32 There are three category distinctions in LRE: dialect, which might be thought of as the speech patterns of a particular group; language, which is a dialect with an army and a navy; and linguistic variety, a way to dodge the issue.
0:01:57 Like the task, the category distinctions, what we're actually trying to recognize, have changed over time. In the earlier LREs there was a distinction between languages and dialects, and in fact there were separate dialect and language tests in those years. In recent years, and in the next LRE, we've made no distinction between languages and dialects, and instead test confusable linguistic variety clusters.
0:02:26 Among the reasons for the change is that there are no accepted language-dialect criteria, and that "dialect" is used in inconsistent ways. For example, Chinese dialects are not mutually intelligible, while Hindi and Urdu distinctions are primarily non-linguistic.
0:02:57 There are three data collection approaches that have been used in LRE. One we might refer to as callee, where someone is paid to make a single phone call and his or her speech is used; a claque-based model, where we pay someone to make many calls and the speech of the interlocutor is used; and then broadcast, where you find narrowband speech in radio broadcasts.
0:03:23 Early LREs took the callee approach; the recent LREs in 2009 and 2011, and the next LRE, combine the claque and broadcast approaches.
0:03:35 The reason for the change is that a large number of unique speakers is needed for each language, and single-speaker phone calls have become increasingly expensive to collect. An experiment showed that broadcast data could be used in a language recognition evaluation to produce comparable performance results.
0:04:03 So there are two broad classes of metrics that have been used: Cdet, which we see here, a weighted linear combination of the miss and false alarm rates; and Cdet language-pair, a linear combination of miss and false alarm rates, but computed for each language pair.
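The cost just described can be sketched as follows. This is a minimal illustration, not the official NIST scoring code, and the parameter values (the cost weights and the target prior) are assumed defaults rather than figures quoted in the talk.

```python
# Minimal sketch of a Cdet-style detection cost: a weighted linear
# combination of the miss and false alarm rates. The cost weights and
# target prior below are illustrative assumptions, not values from the talk.

def c_det(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted linear combination of miss and false alarm probabilities."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

def c_det_language_pair(pair_errors):
    """Per-pair variant: one cost for each language pair.

    pair_errors: dict mapping (target, non_target) -> (p_miss, p_fa).
    """
    return {pair: c_det(p_miss, p_fa)
            for pair, (p_miss, p_fa) in pair_errors.items()}
```

For example, with equal weights and a 0.5 target prior, a 10% miss rate and 20% false alarm rate give a cost of 0.15.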
0:04:21 The very early LREs used Cdet, the more recent LREs used an average Cdet, and the most recent LRE used an average Cdet over language pairs.
0:04:36 And the primary reason the metric has changed has been to reflect the new task focuses.
0:04:46 So here we see the average Cdet for 30 seconds, 10 seconds, and 3 seconds, where the red line is 30 seconds of speech, and the other lines are 10 seconds and 3 seconds of speech.
0:05:04 Then we see performance improvements over the years, with some caveats, in particular the ones we just discussed: the task changed from identification to detection, the languages changed from year to year, and the data sources changed from solely calls in the earlier years to calls and broadcasts in 2009. And we see in 2009, for example, on the 30-second speech segments, that there were few errors observed in leading systems.
0:05:43 So here we see how leading systems did for a language pair, American English and Indian English. This is the most studied pair in the sense that it started back in 2005. And we see a good performance improvement over time, where blue is the min Cdet language-pair for LRE07, red is LRE09, and green is LRE11, and here we see 30 seconds, 10 seconds, and 3 seconds: a consistent improvement.
0:06:19 For Hindi-Urdu the picture is less rosy. The language pair remains challenging, especially for the shorter durations, and the improvement we've seen over time is limited, again especially for the 3-second condition. We suspect that's in large part due to the problematic language distinction; although human tests showed some consistency with annotator judgments, there were also some consistency issues observed.
0:06:54 Here we see results for Dari-Farsi, and we see improvement from LRE09 to LRE11 in the 30-second and the 3-second conditions.
0:07:12 And here we see the Russian-Ukrainian language pair, where we see a reversal of the trend: LRE11 actually showed worse performance. We suspect that this may have been due to a change in data source between the training and evaluation data.
0:07:37 So in summary, NIST has coordinated LREs since 1996 and has emphasized detecting target language classes of interest in recent years, but the nature of the language classes has evolved. Earlier evaluations achieved high performance on broad language classes with separate dialect tests, and this led to the change in later LREs: moving away from the language-dialect distinction towards pairwise testing of closely related varieties.
0:08:10 So for future evaluations: the next language recognition evaluation is planned for 2015, with pairwise testing within six broad language clusters, utilizing newly collected CTS and broadcast news speech, that is, broadcast narrowband speech. The system output will be a vector of log-likelihoods, which is a change from past evaluations. For each cluster we'll average performance over all the pairs in the cluster, and the overall measure will be the mean of the six cluster actual-decision costs.
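The scoring plan just described, averaging over the pairs in each cluster and then over the six clusters, amounts to the following arithmetic; this is an illustrative sketch only, not NIST's scoring tool, and any cluster names and cost values used with it are hypothetical.

```python
# Sketch of the planned overall measure: average a per-pair cost within
# each language cluster, then take the mean of the per-cluster averages.
# This illustrates the arithmetic only, not the official scoring software.

def cluster_average(pair_costs):
    """Mean of the per-pair costs within one cluster."""
    return sum(pair_costs) / len(pair_costs)

def overall_measure(clusters):
    """Mean of the per-cluster averages.

    clusters: dict mapping a cluster name to a list of per-pair costs.
    """
    per_cluster = [cluster_average(costs) for costs in clusters.values()]
    return sum(per_cluster) / len(per_cluster)
```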
0:08:48 And it's open to all participants, so for more information please join the LRE mailing list by contacting us there.
0:08:57 Thank you very much.
0:09:17 What about the pairwise measure?
0:09:22 So the pairwise measure is actually going to be different in the next LRE than in the last one, but we will continue to emphasize language pairs as a research task. We believe this is focusing on the core problem in language recognition.
0:09:49 I want to say that solving the Chinese-English distinction is no longer interesting, but maybe distinguishing two varieties of English is more interesting.
0:10:15 I wasn't there in 2011, and I would be interested to see: do you still make the DET plots? Because you were talking about Cdet, which is fine, but do you make the plots as well?
0:10:27 I'm trying to recall, but I want to say 2011 was the first where we reported without any DET plots; that's correct.
0:10:38 But could you still draw DET plots for detection? Yes. Then I would be curious to see what you put along the axes.
0:10:49 I think the point is: are you going to say probability of false alarm, or are you going to say probability of Indian English given that it's American English?
0:11:01 I would go for the latter one.
0:11:06 Thank you.
0:11:08 I still want to go back to one point, and maybe I'm just not getting it: give me a system that operates that way. What do you buy by saying that? Basically a detection system is used, I have data labeled by language; where does the pairwise thing come into that? I understand it maybe from a research perspective, but does one actually operate more than one system that way at the system level?
0:11:43 That's an interesting question; it's difficult for me to answer. I think there's a tradeoff between being application focused and being research focused. Not to say that they're entirely different, but I think in this case it's a tradeoff, and we lean more towards the research currently.
0:12:18 So you said you are going to ask us to give you a vector of language log-likelihoods? Yes. And then you're going to subtract two of those to get the score that would differentiate between a pair of languages? So that's very nice, because the single vector of log-likelihoods is a lot smaller than all the possible pairs, so that's a nice compact score format. Yes; I think the only request is that you submit all pairs.
0:12:55 Oh sorry, I was just making a joke, sorry.
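The score format discussed in this exchange, one log-likelihood per language with pairwise scores obtained by subtraction, could look like the sketch below; the language names in the example are hypothetical, not from the evaluation plan.

```python
# Sketch of deriving all pairwise scores from a single vector of
# per-language log-likelihoods, as discussed above: the score for a
# pair (a, b) is simply loglik[a] - loglik[b].
from itertools import combinations

def pairwise_scores(loglik):
    """Map each pair of languages to a log-likelihood difference.

    loglik: dict mapping language name -> log-likelihood.
    """
    return {(a, b): loglik[a] - loglik[b]
            for a, b in combinations(sorted(loglik), 2)}
```

Note that N log-likelihoods determine all N(N-1)/2 pair scores, which is the compactness the questioner points out.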
0:13:03 Are you going to concentrate again on hard decisions? So you're going to have a Cdet set up with a threshold of zero, so the criterion is then just going to depend on whether the score is on one side or the other of the threshold? But then it's not going to matter what the scale of the log-likelihood vector is, so you lose that one dimension of calibration; it's just the location of that vector in log-likelihood space that matters, but not the scale.
0:13:36 Yes, I understand you.
0:13:38 If you somehow do multiple operating points, like you did in the SRE, then you would get a handle on the scale factor as well.
0:13:50 Okay, thank you; this is something to consider when planning.
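The calibration point raised here can be illustrated with a small sketch, under the assumption that hard decisions are taken by the sign of pairwise log-likelihood differences: rescaling the whole vector by any positive constant changes no decision, so a zero threshold cannot probe the scale of the scores.

```python
# Sketch of the scale-invariance issue discussed above: with a threshold
# of zero on log-likelihood differences, multiplying the vector by a
# positive constant leaves every hard decision unchanged, so that
# dimension of calibration goes unmeasured.

def hard_decisions(loglik_vec):
    """Sign-based pairwise decisions: is language i preferred over j?"""
    n = len(loglik_vec)
    return [(i, j, loglik_vec[i] - loglik_vec[j] > 0.0)
            for i in range(n) for j in range(n) if i != j]

llrs = [-0.2, 1.7, -3.1]            # illustrative scores
scaled = [10.0 * x for x in llrs]   # same location, different scale
assert hard_decisions(llrs) == hard_decisions(scaled)
```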
0:14:15 In past years we had this out-of-set language problem, and now that the new evaluations have come out, do you allow people to work on this topic?
0:14:28 So with the detection task it is still possible to have an out-of-set alternative, so you can have French or whatever languages, so that it is not a closed set; you can have an unknown language as well.
0:14:46 Right; I want to say we can do both. If there were, say, twenty languages, you could have a twenty-dimensional vector for the closed set and a twenty-one-dimensional vector for the open set.
0:15:03 Do you have further information on the timeline, on the schedule? Yes, so right now we're deliberating between having a spring workshop and a summer workshop, so it would be the first half of the year in the case of the spring workshop, or the second half in the case of the summer workshop.