0:00:15 So, good morning to all. I'm from the Institute for Infocomm Research. The paper I'm going to present describes our submission to the 2015 NIST Language Recognition i-Vector Challenge, and the work we did in putting that submission together.
0:00:37 Okay, this is the outline of the presentation. First I will give a brief overview of the i-vector challenge, which will be from the perspective of the organizer. Then I will talk about the out-of-set (OOS) detection strategies, which constitute the major part of our work in the i-vector challenge. Then we move on to the description of the subsystems, because the final submission is in fact a fusion of multiple systems. After that I will present the experimental results, and finally the conclusions.
0:01:19 Okay, so the i-vector challenge consists of i-vectors extracted from fifty target languages, plus some unknown languages. All of these i-vectors are derived from conversational telephone speech and narrowband broadcast speech. From the perspective of the participants there are three major challenges. The first is that it is an open-set language identification task: in addition to the fifty target languages, we have to model an additional class to detect the out-of-set languages. On top of this, the set of OOS languages is unknown, and it has to be learned from the unlabeled development data. The third difficulty is that the unlabeled development data consists of both the target languages and the OOS languages, so we have to select the OOS i-vectors from the unlabeled development set quite carefully.
0:02:36 Okay, so these are the three datasets that were provided to the participants. The first one is the training set; this is labeled, and it consists of fifteen thousand i-vectors covering the fifty target languages, so we have about three hundred i-vectors per language. Next is the development set, which is unlabeled and consists of both target and non-target languages. Most of our work in fact concerns how to select the OOS i-vectors from this development set. Finally, there is the test set, which consists of six thousand five hundred i-vectors, split thirty and seventy percent into the progress set and the evaluation set.
0:03:28 Okay, NIST provided a baseline; it is the i-vector cosine scoring baseline, consisting of three steps. The first one is whitening, followed by length normalization. The whitening parameters are estimated from the unlabeled development set; this is possible because whitening is unsupervised training, and we just need the mean and the covariance matrix. The i-vectors are then scored with cosine scoring. Here w̄_k is the average, the mean of the three hundred i-vectors of a specific target language, with k running from one to fifty, and w is the i-vector of the test segment. The cosine score is given by this equation, and of course after length normalization the two norms are equal to one.
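The equation on the slide is the standard cosine similarity; reconstructed from the description, it reads

```latex
S(\mathbf{w}, \bar{\mathbf{w}}_k)
  = \frac{\langle \mathbf{w}, \bar{\mathbf{w}}_k \rangle}
         {\lVert \mathbf{w} \rVert \, \lVert \bar{\mathbf{w}}_k \rVert},
\qquad k = 1, \ldots, 50,
```

where w̄_k is the mean of the 300 i-vectors of target language k and w is the test i-vector; after length normalization the norms in the denominator are one, so the score reduces to a plain inner product.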
0:04:21 For the case of language identification, what we have to do is select the language that gives the highest score. As we can see here, in the i-vector cosine scoring baseline there is no OOS class; the OOS class is not included. But as we will see later, if we include an additional class for the OOS, we get quite a big improvement compared to the baseline.
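To make the baseline pipeline concrete, here is a minimal NumPy sketch under assumptions of my own (synthetic data, and Cholesky-based whitening as one standard choice; the official baseline may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)
dev = rng.standard_normal((6500, 400))       # unlabeled development i-vectors (synthetic stand-in)
train = rng.standard_normal((50, 300, 400))  # 50 target languages x 300 i-vectors each
test_iv = rng.standard_normal(400)           # one test-segment i-vector

# Whitening is unsupervised: only the mean and covariance of the unlabeled set are needed.
mu = dev.mean(axis=0)
cov = np.cov(dev, rowvar=False)
W = np.linalg.cholesky(np.linalg.inv(cov))   # whitening transform

def length_norm(x):
    x = (x - mu) @ W
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Class means over the 300 whitened, length-normalized i-vectors per language.
means = length_norm(train.reshape(-1, 400)).reshape(50, 300, 400).mean(axis=1)
w = length_norm(test_iv)

# Cosine scoring; w already has unit norm, so only the class-mean norm remains.
scores = means @ w / np.linalg.norm(means, axis=1)
print("identified language:", int(np.argmax(scores)))
```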
0:04:56 Okay, now for how submissions are evaluated. The cost is defined as the average identification error rate across the fifty target languages and the OOS class, where the error rate is defined as one minus the fraction of correctly identified segments. Now, if you put K equal to fifty, the number of target languages, along with the prior of the OOS class, into this formula, we can see that the weight given to an OOS error, that is, failing to detect an out-of-set segment, is much higher compared to the target classes. This means that, as far as the cost is concerned, OOS detection is a very important thing to do to reduce the cost.
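Written out, the metric is as follows; this is a reconstruction from the challenge's evaluation plan, with P_oos = 0.23 as the out-of-set prior used there:

```latex
\mathrm{Cost}
  = \frac{1 - P_{\mathrm{oos}}}{50} \sum_{k=1}^{50} P_{\mathrm{error}}(k)
  + P_{\mathrm{oos}} \, P_{\mathrm{error}}(\mathrm{oos}),
\qquad P_{\mathrm{oos}} = 0.23 .
```

Each target class thus carries a weight of (1 − 0.23)/50 ≈ 0.015, while the OOS class carries 0.23, roughly fifteen times more, which is why OOS detection dominates the cost.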
0:05:52 So that is what this part of the talk is about. Okay, so to investigate different strategies for performing OOS detection, we designed a so-called simulated unlabeled set from the labeled training data we have. The labeled training set consists of fifteen thousand i-vectors for the fifty target languages. What we did was a forty-ten split: we keep forty target languages, and the remaining ten act as the OOS languages. This is a random selection; there is no particular preference for any of the languages. The i-vectors of those ten languages are used as the OOS languages in our experiments, and we select fifty of the three hundred i-vectors per language as the unlabeled target portion of the simulated unlabeled set. And of course we perform LDA to reduce the dimensionality, so that we could investigate different strategies in this simulated setup.
0:06:12 Okay, basically we investigated two strategies; in fact we investigated many strategies, and these two we found particularly useful. The first one we call least-fit target, and the second one is best-fit OOS. Least-fit target means that we train a classifier on the target languages; those i-vectors that least fit the target classes are taken as the OOS i-vectors. As for best-fit OOS, we train a forty-plus-one (or, in the actual setup, fifty-plus-one) class classifier, that is, with one additional class for the OOS, and we select those i-vectors that best fit the OOS class.
0:08:03 Okay, so this is the idea of least-fit target. What happens is that we take the target languages and train a multi-class SVM; for the case of our simulated unlabeled development set we have forty classes, whereas for the actual one we have fifty classes. So we train a multi-class SVM, and then we score all the i-vectors in the unlabeled development set. What we have, for each of the i-vectors, is the posterior probability for each of the classes. Then we take the max over the K classes that we have. The OOS set will then be those i-vectors whose maximum posterior probability is less than a given threshold.
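As a minimal sketch of the least-fit-target selection (synthetic data and scikit-learn throughout; probability calibration is my stand-in for however the posteriors were actually obtained from the SVM scores, and the threshold is a placeholder):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X_train = rng.standard_normal((2000, 49))   # labeled target i-vectors after LDA (synthetic)
y_train = rng.integers(0, 40, 2000)         # 40 target classes in the simulated setup
X_unlab = rng.standard_normal((3000, 49))   # simulated unlabeled development set

# Multi-class SVM with calibrated outputs so we can read off posterior probabilities.
svm = CalibratedClassifierCV(LinearSVC()).fit(X_train, y_train)
max_post = svm.predict_proba(X_unlab).max(axis=1)   # fit to the best-matching target class

threshold = 0.5                             # placeholder; tuned on the simulated set
oos_least_fit = X_unlab[max_post < threshold]
print(len(oos_least_fit), "i-vectors flagged as out-of-set")
```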
0:09:01 As for the case of best-fit OOS, we train a K-plus-one multiclass SVM, with K plus one classes. Now the question is how we are going to get the additional class, given that we don't have the labels. What we did is this: we have the fifty target languages, and then we throw in the unlabeled development set, assuming that all of those unlabeled i-vectors belong to the OOS class, and we train the multiclass SVM with that. Of course there are target i-vectors inside the unlabeled development set. But using the multiclass SVM trained in this manner, we compute the posterior probability with respect to the OOS class, and then we select those i-vectors with the highest probability, which is what we call the best-fit OOS. In this way we are actually discarding the target i-vectors in the unlabeled development set.
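The corresponding sketch for best-fit OOS, with the same caveats; the selection size n_keep is likewise a placeholder:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(1)
X_tgt = rng.standard_normal((2000, 49))     # labeled target i-vectors (synthetic)
y_tgt = rng.integers(0, 50, 2000)           # 50 target classes
X_unlab = rng.standard_normal((3000, 49))   # unlabeled development set

# Treat the whole unlabeled set as a provisional (K+1)-th "OOS" class,
# even though it still contains target i-vectors.
OOS = 50
X = np.vstack([X_tgt, X_unlab])
y = np.concatenate([y_tgt, np.full(len(X_unlab), OOS)])

svm = CalibratedClassifierCV(LinearSVC()).fit(X, y)
p_oos = svm.predict_proba(X_unlab)[:, list(svm.classes_).index(OOS)]

# Keep the unlabeled i-vectors that best fit the OOS class; target i-vectors
# hidden in the unlabeled set tend to get a low p_oos and are discarded.
n_keep = 1000                               # placeholder
best_fit_oos = X_unlab[np.argsort(p_oos)[::-1][:n_keep]]
```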
0:10:22 [inaudible aside]
0:10:31 Okay, so this is a comparison of the two methods, best-fit OOS and least-fit target, in terms of precision versus recall. We can see here that best-fit OOS gives better precision at every recall value compared to least-fit target. And these diagrams illustrate the same thing on a two-dimensional plot: best-fit OOS can detect the OOS i-vectors from the unlabeled development set considerably better.
0:11:28 Okay, so on top of the best-fit OOS class, we then apply an iterative purification step to improve the OOS detection. The point is that, based on the detections we have from best-fit OOS, we rank the i-vectors from top to bottom: the top ones are the most likely to be OOS, and the bottom ones are the most likely to be target. That gives us two groups of i-vectors; we take the mean of each group, score it against all the unlabeled i-vectors, and then we re-rank, taking a larger N, with increased confidence, at each iteration, and collect the result. Doing this iteratively, the best result we obtained is this: we grow the selection to 41.3 percent, roughly forty percent, of the six-thousand-odd unlabeled i-vectors that we have.
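One possible reading of the purification loop, as a sketch; the two-prototype scoring and the growth schedule are assumptions of mine, since the talk only says that the ranking is refreshed and the selection grows at each iteration:

```python
import numpy as np

def purify(ranked, n_oos, n_iter=10, grow=1.1):
    """Iteratively refine the OOS ranking of the unlabeled i-vectors.

    ranked : (N, d) length-normalized unlabeled i-vectors, initially ordered
             by the best-fit-OOS posterior (most OOS-like first).
    n_oos  : size of the initial OOS selection.
    """
    for _ in range(n_iter):
        top = ranked[:n_oos].mean(axis=0)       # prototype of likely-OOS i-vectors
        bottom = ranked[-n_oos:].mean(axis=0)   # prototype of likely-target i-vectors
        # Re-score every unlabeled i-vector against the two prototypes and
        # re-rank by how much more OOS-like than target-like it looks.
        score = ranked @ top - ranked @ bottom
        ranked = ranked[np.argsort(score)[::-1]]
        n_oos = min(int(n_oos * grow), len(ranked) // 2)   # enlarge each round
    return ranked[:n_oos]

u = np.random.default_rng(6).standard_normal((6500, 49))
u /= np.linalg.norm(u, axis=1, keepdims=True)
selected = purify(u, n_oos=1000)
```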
0:12:55 Okay, so as I said, the final submission is in fact a fusion of multiple classifiers, and it consists of pretty simple and standard classifiers. The first one is a Gaussian backend followed by multiclass logistic regression. Then we have two versions of SVMs: one is based on what we call polynomial expansion, and the other one is based on an empirical kernel map. We also investigated using a multilayer perceptron to expand the i-vectors nonlinearly, with an SVM on top. And last, we also have a DNN classifier that takes the i-vector as input, and whose output layer has fifty-one classes: the fifty target languages plus one out-of-set class. The fusion itself is something very simple, a linear fusion; the way we learn the weights is by submitting the results of the individual systems, observing their scores on the progress set, and adjusting the weights accordingly (see the sketch below).
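As a sketch, the linear fusion is just a weighted sum of the subsystems' score vectors; the weights and scores below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
weights = [0.3, 0.25, 0.25, 0.1, 0.1]     # one per subsystem; tuned via progress-set feedback
subsystem_scores = [rng.standard_normal(51) for _ in weights]  # 50 targets + 1 OOS class

fused = sum(a * s for a, s in zip(weights, subsystem_scores))
decision = int(np.argmax(fused))          # 0..49 = target language, 50 = out-of-set
```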
0:14:11 Okay, so for the first classifier, what we did essentially is to train one Gaussian distribution for each of the target languages; for the case of the fifty target languages, we trained fifty Gaussian distributions. Here the means are estimated separately, per class, whereas for the covariance matrix we actually estimate a global covariance matrix, which, with a smoothing factor of 0.1, is then adapted to the individual target classes. On top of the Gaussian backend, in the score space, we included the OOS cluster as one additional class. Okay, this is quite standard in language recognition. This is followed by score calibration using multiclass logistic regression. And of course, with multiclass logistic regression we can convert any log-likelihoods into posteriors; this way we can actually control the class priors, so we can, for example, put more prior onto the OOS class, because as we have seen, OOS detection is very important in reducing the cost.
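A compact sketch of such a backend, under assumptions of my own (synthetic data; a convex combination of the global and per-class covariances is one common way to realize the smoothing and adaptation described):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.standard_normal((15000, 49))   # i-vectors after LDA (synthetic stand-in)
y = rng.integers(0, 50, 15000)         # 50 target languages

alpha = 0.1                            # smoothing factor mentioned in the talk
mu = np.stack([X[y == k].mean(axis=0) for k in range(50)])
cov_global = np.cov(X, rowvar=False)
covs = [(1 - alpha) * cov_global + alpha * np.cov(X[y == k], rowvar=False)
        for k in range(50)]            # global covariance adapted towards each class

def gaussian_scores(w):
    """Log-likelihood of one i-vector under each class Gaussian."""
    out = []
    for m, C in zip(mu, covs):
        d = w - m
        _, logdet = np.linalg.slogdet(C)
        out.append(-0.5 * (d @ np.linalg.solve(C, d) + logdet))
    return np.array(out)

# Calibration: multiclass logistic regression on the score vectors; in the
# full system an OOS class is added here and its prior can be raised.
scores = np.stack([gaussian_scores(w) for w in X[:500]])
calib = LogisticRegression(max_iter=1000).fit(scores, y[:500])
posteriors = calib.predict_proba(scores[:1])
```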
0:15:35 Okay, next, the polynomial SVM that we used. We do a simple polynomial expansion, using terms up to the second order. This expands the 400-dimensional i-vectors to about 80K dimensions, with the expanded terms appropriately scaled. We then center the expanded vectors to a global mean, normalize them to unit norm, and perform NAP, where the rank of the nuisance subspace we reject is quite small compared to the dimensionality we have. Okay, and then, to include the OOS class, we have fifty-one classes in total, and we used two strategies: one is one-versus-all, and the other one is a pairwise strategy. The final score is a combination of these two strategies used to train the SVM.
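A sketch of the second-order expansion: the resulting 80,600 dimensions match the "about 80K" figure in the talk, while the sqrt(2) scaling of the cross terms (the standard trick for reproducing a polynomial kernel with a linear SVM) is my guess at the garbled "scaled by" remark:

```python
import numpy as np

def poly2_expand(w):
    """Second-order monomial expansion: 400 dims -> 400 + 400*401/2 = 80,600 dims."""
    outer = np.outer(w, w)
    iu = np.triu_indices(len(w), k=1)        # strictly upper triangle: cross terms
    return np.concatenate([w,                # linear terms
                           np.diag(outer),   # squared terms
                           np.sqrt(2.0) * outer[iu]])

v = poly2_expand(np.random.default_rng(3).standard_normal(400))
v = v / np.linalg.norm(v)                    # unit norm (after centering, in the real system)
print(v.shape)                               # (80600,)
```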
0:16:33 Okay, so the other one is what we call the empirical kernel map. What we did is this: we take the polynomial vectors that we have, and construct what we call a basis matrix, using all the training data we have as well as the OOS i-vectors we had detected. Then, for each of the i-vectors we are going to score, we do a mapping simply by multiplication with that matrix; we are effectively transforming the polynomial vectors into the score space, obtaining what we call score vectors. This is followed, again, by centering to the global mean, normalizing to unit norm, and the same training strategies as before. So we have two kernels in use: the first is the polynomial expansion, and the second is the empirical kernel map, both with SVM.
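And a sketch of the empirical kernel map; the basis composition (the training vectors plus the detected OOS vectors) follows the description, while the reduced dimensionality, the illustrative basis size, and the per-vector centering are simplifications of mine:

```python
import numpy as np

rng = np.random.default_rng(4)
# Basis matrix: expanded, normalized vectors of all training i-vectors plus
# the detected OOS i-vectors (illustrative count; dims shrunk to keep it light).
B = rng.standard_normal((16700, 600))
B /= np.linalg.norm(B, axis=1, keepdims=True)

def empirical_kernel_map(phi_w):
    """Map an expanded i-vector into the score space spanned by the basis:
    one inner-product score per basis row."""
    s = B @ (phi_w / np.linalg.norm(phi_w))
    s -= s.mean()                         # stand-in for centering on the global mean
    return s / np.linalg.norm(s)          # unit norm, as in the talk

score_vec = empirical_kernel_map(rng.standard_normal(600))
print(score_vec.shape)                    # (16700,): the input to the SVM
```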
0:17:36 Now the results. First of all, we would like to compare how the polynomial vectors and the score vectors perform relative to the raw i-vectors. The first line is the baseline, i-vectors followed by cosine scoring, at a cost of 0.3959. If we simply change the cosine scoring to an SVM, what we get is about a 7.8 percent improvement compared to the baseline. Then, if we change to the polynomial expansion of the i-vectors, we get 0.34, which is a 14 percent improvement. And if we move from the polynomial vectors to the empirical kernel map, that is, to the score vectors, we get a 16 percent improvement.
0:18:22 Okay, so next we assess the OOS detection strategies. Here we compare least-fit target against best-fit OOS, for both the polynomial SVM and the empirical kernel map. This first row is what we get when we do not include any OOS class: the 14 percent improvement due to the classifier alone, compared to the baseline. If we use least-fit target, we get around a 32 percent improvement. Best-fit OOS does better than this, and if, on top of best-fit OOS, we apply the iterative purification, we get a 45 percent improvement. The picture is similar for the case of the empirical kernel.
0:19:21 Alright, so this is how our final submission performs: we get about a 55 percent improvement on the progress set, and a 54 percent improvement, compared to the baseline, on the evaluation set. The improvements essentially come from two places: a better classifier (besides the SVM and multiclass logistic regression, we used the DNN and the MLP), but I think the larger part of the contribution is from the OOS detection strategies, which by themselves give us around a 40 percent improvement compared to the baseline.
0:20:02 Okay, another thing to mention: what we found on the progress set is that the number of segments we detect as OOS is about one thousand seven hundred. I think this is much higher than the real number of OOS segments, or i-vectors, in the test set. But given how heavily the cost function weights the OOS class, if you miss an OOS detection you are going to lose a lot in terms of the cost, so it is better to say an i-vector is OOS than to miss a true OOS.
0:20:41 Okay, so this is how our performance progressed over the course of the challenge. From the baseline system, we first improved the classifier. Then we found that least-fit target is a good strategy for OOS detection, and we got a boost in performance; the best-fit OOS strategy gave us another boost, and the iterative purification a further one. And then finally we have the fusion, which gets us to 0.17 in terms of the cost.
0:21:20 Okay, so in conclusion, we obtained about a 50 percent improvement compared to the baseline, with the major contributions coming from the fusion of multiple classifiers and from the OOS detection strategies. The following OOS detection strategies were found to be useful: least-fit target, best-fit OOS, and iterative purification. What we have not managed to find is a good strategy for extracting useful target i-vectors from the unlabeled development set; we believe that if we had a better strategy for doing that, it would give us a further improvement.
0:22:11 Okay, we have time for some questions.
0:22:22 [Audience] Thanks. For the best-fit OOS, isn't it of limited use to model all the out-of-set data with a single class, given that by definition it is distributed across different languages? Did you try using more than K plus one classes, say K plus two, and then choosing the best posterior?
0:22:53 Well, that's a fair comment. Frankly, we didn't try it, because during the evaluation we do not know how many languages there are in the OOS class; maybe one, maybe two. Since we had no idea how many languages the OOS class contains, we did not actually explore that option; what we do is simply take the maximum over the classes and reject accordingly. From the data we reject, one could conceivably see that some of it is close to, say, Japanese, and some of it is closer to the Italian family, and group it that way; we probably should have done something along the lines of the language tree.
0:23:39 [Audience] And the second question: did you look at the confusion matrix to see which languages are the most confused with one another, in terms of how they pool together?
0:23:52 So, not exactly, but maybe I can take this opportunity to talk about what worked well and what did not. Overall, what we did for the i-vector challenge was not only the OOS detection; there were a lot of other aspects that we explored. For example, the target detection is actually not very good: if you look at our final submission closely, even though overall it gives more than a fifty percent improvement compared to the baseline, the target detection itself was about level with the baseline.
0:24:34 Thank you, thank you.
0:24:44 [Audience] About this study: the i-vector challenge preceded the NIST LRE 2015 evaluation, right? So how much of this work was carried over into your LRE 2015 effort?
0:25:11 Unfortunately, because LRE 2015 is a closed-set identification task, whereas here we have an open-set problem where the OOS class becomes very important, that part was not really applicable to LRE 2015. We did, in fact, reuse what we called the empirical kernel map there. But of course, for LRE what was actually important was the use of bottleneck features: we previously used SDC features, and once you replace SDC with bottleneck features you get around a fifty percent improvement automatically, without doing anything else. So there we were focusing more on the feature level, and not so much on the i-vector level, whereas for the i-vector challenge the focus, as in this presentation, was on the OOS detection.
0:26:15 Okay, I think we're out of time, so let's thank the speaker again.