0:00:17 Good morning, everyone. In this talk I will present our submission to the 2015 NIST Language Recognition i-vector Challenge; the paper describes the work behind that submission.
0:00:37 Here is the outline of the presentation. For a start, I will give a brief overview of the i-vector challenge, from the perspective of the organiser. Then I will move on to our out-of-set (OOS) detection strategies, which constitute the main part of our work in the i-vector challenge. After that I will describe the subsystems behind our final submission, which is in fact a fusion of multiple systems. Then I will present the experimental results, followed by the conclusions.
0:01:19 Okay, so the i-vector challenge consists of i-vectors extracted from fifty target languages plus some unknown out-of-set languages, and all these i-vectors come from conversational telephone speech and narrowband broadcast speech. From the perspective of the participants there are three major challenges. The first one is that it is an open-set language identification task: in addition to the fifty target languages, we have to model an additional class to detect the out-of-set languages. On top of this, the set of out-of-set languages is unknown, and it has to be learned from the unlabeled development data. Finally, the unlabeled development data consists of both the target languages and the out-of-set languages, so we have to select the out-of-set i-vectors quite carefully from the unlabeled development set.
0:02:36 Okay, so these are the three datasets provided to the participants. The first one is the training set; this is labeled, and it consists of fifteen thousand i-vectors covering the fifty target languages, so we have three hundred i-vectors per language. Next is the development set, which is unlabeled and consists of both target and non-target languages; most of our work is in fact about how to select the out-of-set i-vectors from this development set. Finally, there is the test set, which consists of six and a half thousand i-vectors, split thirty and seventy percent into the progress set and the evaluation set.
0:03:28 Okay, NIST provided a baseline: the i-vector cosine scoring baseline, consisting of three steps. The first one is whitening, followed by length normalization. The whitening parameters are estimated from the unlabeled development set; because this is unsupervised training, we just need the mean and the covariance matrix. This is followed by cosine scoring: we have w_k, which is the mean of the three hundred i-vectors for a specific target language, with k running from 1 to 50, and w is the i-vector of the test segment. The cosine score is given by this equation, and of course, after length normalization, the denominator would be equal to one. Since this is a language identification task, what we have to do is select the language that gives the highest score. As we can see from here, in the i-vector cosine scoring baseline there is no out-of-set class; the OOS class is not included.
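The baseline scoring chain just described can be sketched as follows; this is illustrative code with made-up toy vectors, not the actual challenge implementation, and for brevity the covariance part of the whitening is omitted, showing only length normalization plus cosine scoring.

```python
import math

def length_norm(v):
    """Project a vector onto the unit sphere (length normalization)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_score(w_test, w_lang):
    """Cosine score; after length normalization this is just a dot product."""
    return sum(a * b for a, b in zip(length_norm(w_test), length_norm(w_lang)))

def identify(w_test, lang_means):
    """Closed-set identification: pick the language with the highest score."""
    scores = {lang: cosine_score(w_test, m) for lang, m in lang_means.items()}
    return max(scores, key=scores.get)

# Toy example: 3 "language means" in 2-D (hypothetical numbers).
means = {"fra": [1.0, 0.1], "zho": [0.1, 1.0], "ara": [-1.0, 0.2]}
print(identify([0.9, 0.2], means))  # closest in angle to "fra"
```

Note that nothing in this rule can ever output "out-of-set", which is exactly the gap the talk addresses next.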
0:04:42 Okay, so as we will see later, if we include an additional class for the out-of-set languages, we get quite a significant improvement compared to the baseline.
0:04:56 Okay, now, we evaluate with the cost, which is defined as the average identification error rate across the fifty target languages and the out-of-set class. Now, if you put K equal to fifty, which is the number of target languages, and the value of the OOS prior, 0.23, into this formula, we can see that the weight given to an OOS error, that is, failing to detect the out-of-set class, is much higher than the weight given to each target class. This means, as the cost emphasizes, that OOS detection is a very important thing to do to reduce the cost.
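The weighting argument can be made concrete with a small sketch. The formula below follows the published challenge definition with an OOS prior of 0.23; the error rates plugged in are made-up numbers for illustration.

```python
def challenge_cost(target_error_rates, oos_error_rate, p_oos=0.23):
    """Average identification cost: the target classes share (1 - p_oos),
    while the single OOS class alone carries p_oos, which is why OOS
    errors are weighted so heavily."""
    n = len(target_error_rates)
    return (1.0 - p_oos) / n * sum(target_error_rates) + p_oos * oos_error_rate

# With 50 target languages, each target error carries weight
# (1 - 0.23) / 50 = 0.0154, while the OOS error carries 0.23 (~15x more).
per_target_weight = (1 - 0.23) / 50
print(round(0.23 / per_target_weight, 1))

# Hypothetical error rates: 10% on every target, 40% on OOS.
print(round(challenge_cost([0.10] * 50, 0.40), 4))  # 0.077 + 0.092 = 0.169
```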
0:05:52 So that is what this talk will address. Okay, so to investigate different strategies to perform OOS detection, we designed a so-called simulated unlabeled set from the labeled training data we have. The labeled training data consists of fifteen thousand i-vectors for the fifty target languages. What we did was a forty-ten split: forty languages are kept as target languages, and the remaining ten act as the OOS languages. This is a random selection, with no particular preference for any of the languages. So the i-vectors of these ten languages are used as the OOS languages, and the three hundred i-vectors of each of the forty languages serve as the target languages in the simulated unlabeled set. And of course we performed LDA to reduce the dimensionality, so that we could investigate different strategies for the OOS detection.
0:07:12 Okay, basically we investigated two strategies; in fact, we investigated many strategies, and these two we found to be quite useful. The first one we call least-fit target, and the second one is best-fit OOS. Least-fit target means that we train a classifier on the target languages, and those i-vectors that least fit the target classes are taken as the OOS i-vectors. As for best-fit OOS, we train a classifier with fifty plus one classes (forty plus one in the simulated setup), so that we have one class for the OOS, and we select those i-vectors that best fit the OOS class.
0:08:03 Okay, so this is the idea of the least-fit target. What happens is that, with the target languages, we train a multiclass SVM; for the case of our simulated unlabeled development set we have forty classes, and for the actual one we have fifty classes. So we train a multiclass SVM, and then we score those i-vectors in the unlabeled development set: for each of the i-vectors we have the posterior probabilities for each of the classes, and then we take the max among the K classes we have. The least-fit targets are then those i-vectors whose maximum posterior probability is less than a given threshold.
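As an illustration, the least-fit selection rule can be sketched as follows; the toy posterior table here stands in for the multiclass SVM output, and the threshold value is arbitrary.

```python
def least_fit_target(posteriors, threshold):
    """Return indices of i-vectors whose best target posterior is below the
    threshold: they fit none of the K target classes well, so they are
    flagged as out-of-set candidates."""
    return [i for i, p in enumerate(posteriors) if max(p) < threshold]

# Each row: posteriors over K = 3 target classes for one unlabeled i-vector.
posteriors = [
    [0.90, 0.05, 0.05],  # confidently a target -> keep
    [0.40, 0.35, 0.25],  # fits no class well   -> OOS candidate
    [0.10, 0.85, 0.05],  # confidently a target -> keep
]
print(least_fit_target(posteriors, threshold=0.5))  # [1]
```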
0:09:01 As for the case of the best-fit OOS, we train a multiclass SVM with K plus one classes. Now, the question is how we are going to get the additional class, given that we do not have the labels. What we do is this: we have the fifty target languages, and then we throw in the unlabeled development set, assuming that all of the unlabeled i-vectors belong to the OOS class, and we train the multiclass SVM with that. Of course there are target i-vectors inside the unlabeled development set as well, but, using the multiclass SVM trained in this manner, we compute the posterior probability with respect to the OOS class, and we select those i-vectors with the highest posteriors as what we call the best-fit OOS. In this way we are actually discarding the target i-vectors in the unlabeled development set.
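The best-fit selection step can be sketched in the same way, assuming posteriors from the (K+1)-class model are already available; the numbers below are illustrative.

```python
def best_fit_oos(posteriors, oos_index, top_n):
    """Rank unlabeled i-vectors by their posterior for the OOS class and keep
    the top_n best-fitting ones, implicitly discarding the target i-vectors
    that leaked into the unlabeled set."""
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: posteriors[i][oos_index],
                    reverse=True)
    return ranked[:top_n]

# Columns: [target_1, target_2, OOS]; the last column is the OOS posterior.
posteriors = [
    [0.70, 0.20, 0.10],
    [0.05, 0.10, 0.85],
    [0.15, 0.25, 0.60],
    [0.80, 0.15, 0.05],
]
print(best_fit_oos(posteriors, oos_index=2, top_n=2))  # [1, 2]
```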
0:10:31 Okay, so this is a comparison of the two methods, the best-fit OOS and the least-fit target, in terms of precision versus recall. We can see here that the best-fit OOS gives a better precision at all recall values compared to the least-fit target. And these diagrams illustrate the same on a two-dimensional projection: the best-fit OOS can better separate the OOS i-vectors from the unlabeled development set.
0:11:28 Okay, so on top of the best-fit OOS class, we do an iterative purification step to improve the OOS detection. The point is that, based on the detection scores we have from the best-fit OOS, we rank the i-vectors from top to bottom: the top are the most likely to be OOS, and the bottom are the most likely to be target. Then, from these two groups of i-vectors, we take the means and score them against all the unlabeled i-vectors; then we re-rank, and we grow the list, enlarging it with each iteration, collecting a list of high-scoring candidates. If we do this iteratively, the best result we can achieve is to increase the number of detected OOS i-vectors to about forty percent of the six-thousand-plus unlabeled i-vectors that we have.
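A toy sketch of the purification loop, assuming length-normalized i-vectors so that dot products act as cosine similarities; the fixed growth schedule and the simple two-means scoring below are simplifications, not the exact procedure used in the submission.

```python
def mean_vec(vecs):
    """Elementwise mean of a list of equal-length vectors."""
    d = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(d)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def purify(ivectors, init_scores, iters=3, grow=1):
    """Iterative purification: rank by score, use the top (likely OOS) and
    bottom (likely target) of the list to form two means, rescore every
    unlabeled i-vector against the two means, and enlarge the list each
    round. Returns the indices of the final OOS candidates."""
    scores = list(init_scores)
    n = grow
    for _ in range(iters):
        order = sorted(range(len(ivectors)), key=lambda i: scores[i], reverse=True)
        oos_mean = mean_vec([ivectors[i] for i in order[:n]])
        tgt_mean = mean_vec([ivectors[i] for i in order[-n:]])
        # New score: closer to the OOS mean than to the target mean.
        scores = [dot(v, oos_mean) - dot(v, tgt_mean) for v in ivectors]
        n += grow  # enlarge the candidate list each iteration
    order = sorted(range(len(ivectors)), key=lambda i: scores[i], reverse=True)
    return order[:n - grow]

# Two tight clusters: indices 0, 1 are "OOS", 2, 3 are "target"; the initial
# detector (hypothetical scores) only caught index 0 with confidence.
ivecs = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95]]
init = [0.9, 0.1, 0.3, 0.2]
print(sorted(purify(ivecs, init, iters=2, grow=1)))  # [0, 1]
```

The purification recovers index 1, which the initial scores had missed, because it sits close to the mean of the confident OOS seed.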
0:12:55 Okay, so our final submission is in fact a fusion of multiple classifiers. It consists of pretty simple, by now classic, classifiers. The first one is a Gaussian backend followed by multiclass logistic regression. Then we have two versions of SVMs: one is based on what we call polynomial expansion, and the other is an empirical kernel map. We also investigated using a multilayer perceptron to expand the i-vectors in a nonlinear way, with an SVM on top of it. And we also have a DNN classifier that takes the i-vector as input and whose output is fifty-one classes: the fifty target languages and one out-of-set class. Our fusion itself is something very simple, a linear fusion: the way we learn the weights is by submitting the results of the individual systems, observing their scores on the progress set, and adjusting the weights accordingly.
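The linear fusion is just a weighted sum of the per-class scores of the subsystems; here is a sketch with made-up weights (in the submission, as described above, the weights were tuned by probing the progress set).

```python
def linear_fusion(system_scores, weights):
    """Fuse per-class scores from several systems with a weighted sum.
    system_scores: one score vector per system (same class ordering)."""
    n_classes = len(system_scores[0])
    return [sum(w * s[c] for w, s in zip(weights, system_scores))
            for c in range(n_classes)]

# Two hypothetical subsystems scoring 3 classes for one trial.
svm_scores = [0.2, 0.7, 0.1]
dnn_scores = [0.3, 0.5, 0.2]
fused = linear_fusion([svm_scores, dnn_scores], weights=[0.6, 0.4])
print([round(x, 2) for x in fused])  # [0.24, 0.62, 0.14]
```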
0:14:11 Okay, so for the first subsystem, what we do is train a Gaussian distribution for each of the target languages; for the case of fifty target languages, we train fifty Gaussian distributions. Here the means are estimated separately per class, whereas for the covariance matrices we first estimate a global covariance matrix and then smooth it; with the smoothing, it is adapted to the individual target classes. Then, on top of this backend, in the score space, we include the OOS cluster as one additional class. Okay, this is quite standard in language recognition. This is followed by a score calibration using multiclass logistic regression. Of course, with the multiclass logistic regression we can convert any score into log-likelihoods, and this means we can actually control the priors of the trials: we can put more prior onto the OOS class, because, as we have seen, OOS detection is very important in reducing the cost.
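A sketch of such a Gaussian backend, simplified to diagonal covariances so it stays short; the smoothing factor alpha below is a free parameter of the sketch, not the value used in the submission, and the toy class data is made up.

```python
import math

def class_stats(vectors):
    """Per-class mean and diagonal variance (maximum-likelihood estimates)."""
    d, n = len(vectors[0]), len(vectors)
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    var = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n for i in range(d)]
    return mean, var

def smoothed_var(class_var, global_var, alpha):
    """Interpolate the class variance with the global one: the smoothing
    adapts the shared covariance toward each individual target class."""
    return [alpha * c + (1 - alpha) * g for c, g in zip(class_var, global_var)]

def log_gaussian(v, mean, var):
    """Diagonal-covariance Gaussian log-likelihood."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(v, mean, var))

# Two toy classes in 2-D and a test point near class A.
cls_a = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
cls_b = [[1.0, 1.0], [1.2, 1.0], [1.0, 1.2], [1.2, 1.2]]
mean_a, var_a = class_stats(cls_a)
mean_b, var_b = class_stats(cls_b)
_, var_g = class_stats(cls_a + cls_b)
sv_a = smoothed_var(var_a, var_g, alpha=0.5)
sv_b = smoothed_var(var_b, var_g, alpha=0.5)
point = [0.1, 0.1]
print(log_gaussian(point, mean_a, sv_a) > log_gaussian(point, mean_b, sv_b))
```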
0:15:38 Okay, next, the polynomial SVM that we have. We do a simple polynomial expansion, up to the second order. This expands the four-hundred-dimensional i-vectors to about eighty thousand dimensions, with the cross terms scaled appropriately. Then we take these expanded vectors, center them to the global mean, normalize them to unit norm, and perform NAP, where the rank we use is quite small compared to the dimensionality we have. Okay, and then, to include the OOS class, we have fifty-one classes, and we use two strategies: one is one-versus-all, and the other one is a pairwise strategy. The final score is a combination of these two strategies used to train the SVMs.
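The dimensionality bookkeeping can be checked with a small sketch. The sqrt(2) scaling of the cross terms is a common convention assumed here (so that dot products of expanded vectors match the second-order polynomial kernel), not necessarily the exact scaling used in the submission.

```python
import math

def poly_expand(v):
    """Second-order polynomial expansion: the vector itself plus all
    monomials v_i * v_j for i <= j, with cross terms scaled by sqrt(2)."""
    out = list(v)
    for i in range(len(v)):
        for j in range(i, len(v)):
            scale = 1.0 if i == j else math.sqrt(2.0)
            out.append(scale * v[i] * v[j])
    return out

# A d-dimensional vector expands to d + d * (d + 1) / 2 dimensions:
# for d = 400 that is 400 + 80200 = 80600, close to the ~80k quoted above.
print(len(poly_expand([0.0] * 400)))  # 80600
```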
0:16:33 Okay, so the other one is what we call the empirical kernel map. What we did is use the polynomial vectors that we have; then we construct what we call a basis matrix, using all the training data we have as well as the OOS i-vectors we have detected. Then, for each of the i-vectors that we are going to score, we do a mapping, simply by multiplying it with the matrix we have; this amounts to transforming the polynomial vectors into the score space, obtaining what we call score vectors. This is followed by centering to the global mean and normalizing to unit norm, and the same SVM training strategies as before. So we have two kernels that we use: the first is the polynomial expansion, and the second is the empirical kernel map, both with SVMs.
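A sketch of the mapping step with a small hypothetical basis; centering to the global mean is omitted here for brevity, keeping only the projection and the length normalization.

```python
import math

def length_norm(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def empirical_kernel_map(x, basis):
    """Map a (polynomial-expanded) vector into the 'score space': one dot
    product against every basis vector, i.e. against every training and
    detected-OOS vector in the basis matrix."""
    return [sum(a * b for a, b in zip(x, col)) for col in basis]

# Hypothetical basis of 4 expanded training vectors in 3-D: the mapped
# vector has one coordinate per basis vector, then is length-normalized.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
mapped = length_norm(empirical_kernel_map([0.5, 0.5, 0.0], basis))
print(len(mapped))  # 4
```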
0:17:36 Here are the results. First of all, we would like to compare how the polynomial vectors and the score vectors perform compared to the raw i-vectors. The first line is the baseline, raw i-vectors followed by cosine scoring, at a cost of 0.3959. If we simply change the cosine scoring to an SVM, what we get is about a 7.8 percent improvement compared to the baseline. Then, if we change to the second-order polynomial expansion of the i-vectors, we get 0.34, which is a fourteen percent improvement. And if we go from the polynomial vectors to the empirical kernel map, that is, the score vectors, we get a sixteen percent improvement.
0:18:22 Okay, so next we look at the results of the OOS detection strategies, where we compare the least-fit target and the best-fit OOS, for both the polynomial SVM and the empirical kernel map. The first row is what we get when we do not include any OOS model: a fourteen percent improvement, due to the classifier, compared to the baseline. If we use the least-fit target, we get a further improvement; the best-fit OOS does better still; and if, on top of the best-fit OOS, we do the iterative purification, we get a forty-five percent improvement. The picture is similar for the case of the empirical kernel map.
0:19:21 Alright, so this is how our final submission performs. We get about a fifty-five percent improvement on the progress set, and a fifty-four percent improvement compared to the baseline on the evaluation set. The improvements essentially come from two sources. One is better classifiers: besides the SVM and multiclass logistic regression, we used the DNN and the MLP. But I think the larger part of the contribution is from the OOS detection strategies, which give us around forty percent or so of the improvement compared to the baseline.
0:20:02 Okay, one more thing worth mentioning: in our final submission, the number of OOS trials detected is about one thousand seven hundred, which I think is much higher than the real number of OOS segments, or i-vectors, in the test set. But given the way the cost is defined, if you miss-detect an OOS trial you are going to lose much more in terms of the cost, so it is better to say an i-vector is OOS than to miss a true OOS.
0:20:41 Okay, so this is how our performance progressed. We started from the baseline system. Then we found that the least-fit target is a good strategy for the OOS detection, and we got a boost in performance. Then the best-fit OOS strategy gave us another boost, and the iterative purification a further one. And finally we have the fusion, which gets us to zero point one seven in terms of the cost.
0:21:20 Okay, so in conclusion, we have obtained about a fifty percent improvement compared to the baseline, with the major contributions from the fusion of multiple classifiers and the OOS detection strategies. The following OOS detection strategies were found to be useful: the least-fit target, the best-fit OOS, and the iterative purification. However, we were not actually able to find a good strategy to retrieve useful target i-vectors from the unlabeled development set; we believe that if we had a better strategy for doing that, it would give us a further improvement.
0:22:11 [Session chair] Okay, we have time for some questions.
0:22:22 [Question] Thank you. From your two-dimensional plot we observed that the out-of-set data is not well modeled as a single class, since it is distributed between different languages. Based on this observation, did you try using not K plus one but K plus several classes, and then choosing the best posterior among them?
0:22:54 [Answer] A good comment. Unfortunately we didn't try that, because during the evaluation we did not know how many OOS languages there are: in the OOS class there may be one, there may be two; we had no idea how many languages were in the class, so we did not actually explore that option. What we did observe, from the i-vectors that we rejected, is that some of them lie close to particular target languages, for example close to Japanese or to the Italian family, so perhaps we should have grouped them along a language tree.
0:23:39 [Question] And a second question: did you look at the confusion matrix, to see which languages are confused with the out-of-set class, in terms of somehow pooling them?
0:23:53 [Answer] Not exactly, but maybe I can take this opportunity to talk about the things that did not work so well. Overall, what we did for the i-vector challenge was not only the OOS detection; of course there were a lot of other aspects that we explored. For example, the target detection is actually not very good: if you look at our final submission, even though we gained about fifty percent improvement compared to the baseline, the target detection actually degraded compared to the baseline, if you look closely.
0:24:34 Thank you. Thank you.
0:24:44 [Question] This i-vector challenge preceded the NIST LRE 2015 evaluation, right? So how much of this work was actually leveraged in your LRE 2015 submission?
0:25:11 [Answer] Unfortunately not much, because LRE 2015 is a closed-set identification task, whereas here we have an open-set verification problem where the OOS class becomes very important. For LRE 2015 that part is not applicable, although we did in fact use what we call the empirical kernel map there as well. For LRE 2015, what was actually important was the use of bottleneck features: we used to use SDC features, and once you replace SDC with bottleneck features you get around fifty percent improvement without doing anything else. So there we were focusing more on the feature level, whereas for the i-vector challenge, by the task definition, the focus was on the OOS detection, as in this presentation.
0:26:15 [Session chair] Okay, I think we are out of time, so let's thank the speaker.