So, good afternoon everyone. I am coming from the Institute for Infocomm Research, A*STAR. The paper I am going to present describes our submission to the 2015 NIST Language Recognition i-Vector Challenge, and it summarizes the work we did collectively for that submission.

Okay, here is the outline of my presentation. First, I will give a brief overview of the i-vector challenge, which you may already have heard about from the perspective of the organizers. Then I will talk about the out-of-set (OOS) detection strategies, which constitute the major part of our work in the i-vector challenge. After that, we move on to the description of the subsystems; our final submission is in fact a fusion of multiple systems. Then I will present the evaluation results, and finally the conclusions.

Okay, so the i-vector challenge consists of i-vectors extracted from fifty target languages, plus some unknown languages. All these i-vectors come from conversational telephone speech and narrowband broadcast speech.

From the perspective of the participants, there are three major challenges. The first one is that this is an open-set language identification task, so in addition to the fifty target languages we have to model an additional class to detect the out-of-set languages. On top of that, the set of OOS languages is unknown, and it has to be learned from the unlabeled development data. What makes this harder is that the unlabeled development data consists of both the target languages and the OOS languages, so we have to select the OOS languages from the unlabeled development set quite carefully.

Okay, so these are the three datasets provided to the participants. The first one is the training set; this is labeled and consists of fifteen thousand i-vectors covering the fifty target languages, so we have about three hundred i-vectors per language. Next is the development set, which is unlabeled and consists of both target and non-target languages. Most of our work in fact consists of how to select the OOS i-vectors from this development set. Finally, there is the test set, which has six thousand five hundred i-vectors, split thirty and seventy percent into the progress set and the evaluation set.

Okay, NIST provided a baseline, which is an i-vector cosine scoring baseline consisting of three steps. The first one is whitening, followed by length normalization; the whitening parameters are estimated from the unlabeled development set. Because this is unsupervised training, we just need the mean and the covariance matrix. After that comes the cosine scoring. We have the mean i-vector here, which is the average of the three hundred i-vectors for a specific target language, with k running from one to fifty, and the test i-vector for the segment to be scored. The cosine score is given by this equation, and of course after the length normalization the denominator is equal to one. In the case of language identification, what we have to do is select the language that gives the highest score. As we can see here, in the i-vector cosine scoring baseline the OOS class is not included.
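For reference, the baseline scoring the speaker describes can be written out as follows (this is a reconstruction of the slide equation; the symbols are not taken from the paper):

```latex
s_k(\mathbf{w}) \;=\; \frac{\mathbf{w}^{\top}\,\bar{\mathbf{w}}_k}{\lVert \mathbf{w} \rVert\,\lVert \bar{\mathbf{w}}_k \rVert},
\qquad k = 1,\dots,50,
\qquad \hat{k} \;=\; \arg\max_{k}\, s_k(\mathbf{w}),
```

where \(\bar{\mathbf{w}}_k\) is the mean of the whitened, length-normalized training i-vectors of target language \(k\) and \(\mathbf{w}\) is the whitened, length-normalized test i-vector; after length normalization both norms in the denominator equal one.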

Okay, so as we will see later, if we include an additional class for the OOS, we get quite a substantial improvement compared to the baseline.

Okay, now we come to the evaluation metric, the cost, which is defined as the average identification error rate across the fifty target languages and the OOS class. Now, if you put fifty, the number of target languages, and the OOS prior into this formula, you can see that the weight given to the OOS error, that is, to failing to detect out-of-set segments, is much higher than the weight given to each of the target classes. This means, as the cost suggests, that OOS detection is a very important thing to do in order to reduce the cost, and that is what most of this talk is about.
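For reference, the cost the speaker refers to has the following form in the challenge evaluation plan (quoted from memory, so the exact prior should be checked against the official plan):

```latex
\mathrm{Cost} \;=\; \frac{1 - P_{\mathrm{oos}}}{n} \sum_{k=1}^{n} P_{\mathrm{error}}(k) \;+\; P_{\mathrm{oos}}\, P_{\mathrm{error}}(\mathrm{oos}),
\qquad n = 50, \quad P_{\mathrm{oos}} = 0.23,
```

so each target language contributes with weight \((1-P_{\mathrm{oos}})/50 \approx 0.015\), while the out-of-set class contributes with weight \(0.23\), which is why OOS errors dominate the cost.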

Okay, so to investigate different strategies for OOS detection, we designed a simulated unlabeled set from the labeled training data we have. The labeled training data consists of fifteen thousand i-vectors for the fifty target languages. What we did was a forty-ten split: forty languages are kept as target languages, and the remaining ten are assumed to act as the OOS languages. This is a random selection; there is no particular preference for any of the languages. These i-vectors are used as the OOS languages in our experiments, and for each of the forty target languages we select fifty out of the three hundred i-vectors as the unlabeled target i-vectors in this simulated unlabeled set. And of course we perform LDA to reduce the dimensionality, so that we could investigate different strategies for OOS detection on this set.
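A minimal sketch of how such a simulated unlabeled set can be carved out of the labeled training data. The 40/10 language split and the 50-of-300 selection follow the talk; everything else (the RNG usage, keeping the remaining i-vectors as labeled training data) is an assumption made for illustration:

```python
import numpy as np

def make_simulated_unlabeled_set(X, y, rng=np.random.default_rng(0)):
    # X: (N, D) labeled training i-vectors, y: (N,) language labels (50 languages).
    languages = rng.permutation(np.unique(y))        # random split, no language preference
    target_langs, oos_langs = languages[:40], languages[40:50]

    oos_idx, unl_target_idx, train_idx = [], [], []
    for lang in oos_langs:
        oos_idx.extend(np.where(y == lang)[0])       # all i-vectors of the 10 "OOS" languages
    for lang in target_langs:
        idx = rng.permutation(np.where(y == lang)[0])
        unl_target_idx.extend(idx[:50])              # 50 of ~300 go to the unlabeled pool
        train_idx.extend(idx[50:])                   # the rest stay as labeled training data (assumption)

    X_unlabeled = X[np.concatenate([np.array(oos_idx), np.array(unl_target_idx)])]
    return X[train_idx], y[train_idx], X_unlabeled
```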

Okay, basically we investigated two strategies; in fact we investigated many strategies, but these two we found to be pretty useful. The first one we call least-fit target, and the second one is best-fit OOS. Least-fit target means that we train a classifier for the target languages, and those i-vectors that least fit the target classes are taken as the OOS i-vectors. As for the best-fit OOS, we train something like a forty-plus-one or fifty-plus-one class classifier, that is, we have one additional class for the OOS, and we select those i-vectors that best fit the OOS class.

"'kay" so no we this is the radius of philosophy okay so what happens that

we to the a target languages that we train a multi-class svm soul for the

case of our seamlet the unlabeled a development set we have forty classes therefore the

actual we have fifty classes

so what is that

we train a multi-class svm

and then be scores look in those i-vectors in the unlabeled development set

so

what have like what is of the i-vectors where have one probably the posterior probabilities

for each of the classes

then we take the max

a amount of k classes we have which if these

so then this okay will be those i-vectors having the posterior probability

nasa then a given

try show

right
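A minimal sketch of the least-fit target selection, assuming a scikit-learn SVM with probability outputs; variable names and the threshold are illustrative, not from the actual system:

```python
import numpy as np
from sklearn.svm import SVC

def least_fit_target(X_train, y_train, X_unlabeled, threshold=0.5):
    # Multi-class SVM over the K target languages, with posterior estimates.
    svm = SVC(kernel="linear", probability=True)
    svm.fit(X_train, y_train)

    # Posterior probabilities of each unlabeled i-vector for each target class.
    posteriors = svm.predict_proba(X_unlabeled)      # shape: (N, K)

    # Confidence = the best (maximum) posterior among the K target classes.
    max_post = posteriors.max(axis=1)

    # I-vectors that fit none of the target classes well become OOS candidates.
    return np.where(max_post < threshold)[0]
```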

As for the case of the best-fit OOS, we train a multi-class SVM with K plus one classes. Now the question is how we are going to get the additional class, given that we don't have the labels. What we do is this: we have the fifty target languages, and then we treat the whole unlabeled development set as the OOS class, assuming that all the unlabeled development i-vectors belong to the OOS, and we train the multi-class SVM on that. Of course there are target i-vectors inside the unlabeled development set as well, so this labeling is noisy. Using the multi-class SVM trained in this manner, we compute the posterior probability with respect to the OOS class, and we select those i-vectors with the highest probability, which are what we call the best-fit OOS. In this way we are actually discarding the target i-vectors present in the unlabeled development set.
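A minimal sketch of the best-fit OOS selection under the same scikit-learn assumptions; the provisional OOS label and the number of selected vectors are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

def best_fit_oos(X_train, y_train, X_unlabeled, n_select=1000):
    OOS_LABEL = -1                                   # provisional label for the OOS class
    # Treat the whole unlabeled development set as the (K+1)-th class, even though
    # it also contains target i-vectors (noisy labeling, as described in the talk).
    X = np.vstack([X_train, X_unlabeled])
    y = np.concatenate([y_train, np.full(len(X_unlabeled), OOS_LABEL)])

    svm = SVC(kernel="linear", probability=True)
    svm.fit(X, y)                                    # (K+1)-class SVM

    # Posterior of the OOS class for each unlabeled i-vector.
    oos_col = list(svm.classes_).index(OOS_LABEL)
    p_oos = svm.predict_proba(X_unlabeled)[:, oos_col]

    # Keep the i-vectors that best fit the OOS class; the rest are likely targets.
    return np.argsort(-p_oos)[:n_select]
```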


Okay, so this is a comparison of the two methods, the best-fit OOS and the least-fit target, in terms of precision versus recall. We can see here that the best-fit OOS gives better precision at all recall values compared to the least-fit target, and these diagrams illustrate the same thing on a two-dimensional plot. So the best-fit OOS can actually give a better detection of the OOS i-vectors from the unlabeled development set.

Okay, so after the best-fit OOS selection, we do an iterative purification step to further improve the OOS detection. The idea is the following: based on the detection scores that we have from the best-fit OOS, we rank the i-vectors from top to bottom, the top being the most likely to be OOS and the bottom the most likely to be target. Then, from this ranked list of i-vectors, we take the mean of the top N, score it against all the unlabeled i-vectors, and form the ranking again to get a new ordering. We enlarge N, increasing it at each iteration, so that we collect an increasingly reliable list. Doing this iteratively, in the best case we could grow N to around forty percent of the roughly six thousand five hundred unlabeled i-vectors that we have.
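A minimal sketch of the iterative purification step; the cosine re-scoring, step sizes, and iteration count are assumptions made for illustration, not the actual configuration:

```python
import numpy as np

def iterative_purification(X_unlabeled, oos_scores, n_start=500, n_step=250, n_iters=5):
    scores = np.asarray(oos_scores, dtype=float)     # initial best-fit OOS scores
    n = n_start
    for _ in range(n_iters):
        order = np.argsort(-scores)                  # rank: most likely OOS first
        centroid = X_unlabeled[order[:n]].mean(axis=0)

        # Re-score every unlabeled i-vector against the current OOS centroid
        # (cosine similarity used here purely for illustration).
        num = X_unlabeled @ centroid
        den = np.linalg.norm(X_unlabeled, axis=1) * np.linalg.norm(centroid) + 1e-12
        scores = num / den

        n += n_step                                  # enlarge N at each iteration
    order = np.argsort(-scores)
    return order[:n - n_step], scores                # purified OOS candidate indices
```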

Okay, so our final submission is in fact a fusion of multiple classifiers, and it consists of fairly standard classifiers. The first one is a Gaussian backend followed by multiclass logistic regression. Then we have two variants of SVMs: one is based on what we call polynomial expansion, and the other one uses an empirical kernel map. We also investigated using a multilayer perceptron to expand the i-vectors nonlinearly, with an SVM trained on top of that. And we also have a DNN classifier that takes the i-vector as input and outputs the class posteriors, one for each of the fifty target languages and one for the OOS languages. The fusion itself is very simple: it is a linear fusion. The way we learned the weights is by submitting the results of the individual systems, looking at the scores on the progress set, and adjusting the weights accordingly.
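A minimal sketch of the linear score fusion; the weights here are placeholders, since in the challenge they were tuned by hand against the progress-set feedback:

```python
import numpy as np

def linear_fusion(system_scores, weights):
    # system_scores: list of (N_trials, N_classes) score matrices, one per subsystem.
    # weights: one scalar weight per subsystem.
    fused = sum(w * s for w, s in zip(weights, system_scores))
    return np.argmax(fused, axis=1)   # identified class per trial (assuming the OOS class is the last column)
```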

Okay, so for the first classifier, what we did is to train a Gaussian distribution for each of the target languages; for the case of fifty target languages we trained fifty Gaussian distributions. The means are estimated separately, whereas for the covariance matrix we actually estimate a global covariance matrix and then, with a smoothing factor, adapt it to the individual target classes. On top of this Gaussian backend, in the score space, we include the OOS cluster as one additional class. This is quite standard in language recognition, and it is followed by score calibration using multiclass logistic regression. With the multiclass logistic regression we can convert the scores into calibrated log-likelihoods or posteriors, and in this way we can actually control the priors, so we can put more prior on the OOS class, because as we have seen, OOS detection is very important in reducing the cost.
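A minimal sketch of a Gaussian backend with a per-class covariance smoothed towards a shared global covariance; the smoothing weight alpha is an assumed placeholder, not the value used in the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_backend(X_train, y_train, X_test, alpha=0.1):
    classes = np.unique(y_train)
    global_cov = np.cov(X_train, rowvar=False)           # shared covariance over all data
    scores = np.zeros((len(X_test), len(classes)))
    for j, c in enumerate(classes):
        Xc = X_train[y_train == c]
        mean_c = Xc.mean(axis=0)                          # per-class mean, estimated separately
        cov_c = np.cov(Xc, rowvar=False)
        # Smooth the class covariance towards the global one.
        cov = alpha * cov_c + (1.0 - alpha) * global_cov
        scores[:, j] = multivariate_normal(mean_c, cov, allow_singular=True).logpdf(X_test)
    return scores                                         # log-likelihoods, fed to the calibration stage
```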

Okay, now let me come to the polynomial SVM that we used. We do a simple polynomial expansion up to the second order, which expands the four-hundred-dimensional i-vectors to about eighty thousand dimensions, with appropriate scaling of the terms. The expanded vectors are then centered to a global mean and normalized to unit norm, and we perform NAP with a rank that is quite small compared to the dimensionality we have. Then, to include the OOS class, we have fifty-one classes, and we used two strategies to train the SVM: one is one-versus-all, and the other one is a pairwise strategy. The final score is a combination of these two strategies.
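A minimal sketch of the second-order polynomial expansion and the centering/length normalization; the sqrt(2) scaling of the cross terms is an assumption made so that inner products of expanded vectors match a second-order polynomial kernel:

```python
import numpy as np

def poly2_expand(w):
    # w: 1-D numpy array, e.g. a 400-dimensional i-vector.
    d = len(w)
    cross = np.sqrt(2.0) * np.array([w[i] * w[j] for i in range(d) for j in range(i + 1, d)])
    return np.concatenate([w, w ** 2, cross])    # ~80k dimensions for d = 400

def center_and_length_norm(Phi, global_mean):
    # Center to the global mean and normalize each expanded vector to unit norm.
    Phi = Phi - global_mean
    return Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
```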

Okay, so the other one is what we call the empirical kernel map. What we did is to take the polynomial vectors that we have and construct what we call a basis matrix, using all the training data we have as well as the OOS i-vectors we have detected. Then, for each of the vectors that we are going to score, we do a mapping by simply multiplying it with this matrix. In this way we are actually transforming the polynomial vectors into the score space, and the output is what we call score vectors. This is followed by centering to the global mean and normalizing to unit norm, and the same SVM training strategy applies. So we have two kernels that we use: the first is the polynomial expansion, and the second is the empirical kernel map with the SVM.
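A minimal sketch of the empirical kernel map as described; the names and the choice of a plain inner product against the basis matrix are illustrative assumptions:

```python
import numpy as np

def empirical_kernel_map(Phi, B, global_mean=None):
    # Phi: (N, D) polynomial-expanded vectors to be scored.
    # B:   (M, D) basis matrix built from the training vectors plus detected-OOS vectors.
    S = Phi @ B.T                                      # (N, M) score-space representation
    if global_mean is None:
        global_mean = S.mean(axis=0)                   # in practice, estimated on the training data
    S = S - global_mean                                # center to the global mean
    return S / np.linalg.norm(S, axis=1, keepdims=True)  # "score vectors" fed to the SVM
```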

Here are the results. First of all, we would like to compare how the polynomial vectors and the score vectors perform compared to the raw i-vectors. The first line is the baseline: i-vectors followed by cosine scoring, with a cost of 0.3959. If we simply change the cosine scoring to an SVM, what we get is about a 7.8 percent improvement compared to the baseline. If we then change to the polynomial expansion of the i-vectors, we get about 0.34, which is a fourteen percent improvement. And if we go from the polynomial vectors to the empirical kernel map, that is, the score vectors, we get a sixteen percent improvement.

Okay, so next we look at the effect of the OOS detection strategies, where we compare the least-fit target and the best-fit OOS, for both the polynomial SVM and the empirical kernel map. This is what it looks like: when we do not include any OOS class, there is the fourteen percent improvement due to the classifier compared to the baseline. If we use the least-fit target, we get a further improvement; the best-fit OOS improves on that again; and if, on top of the best-fit OOS, we do the iterative purification, we get a forty-five percent improvement. It is similar for the case of the empirical kernel map.

Alright, so this is how our final submission performs: we get about a fifty-five percent improvement on the progress set and fifty-four percent on the evaluation set, compared to the baseline. The improvements essentially come from two places. One is a better classifier, that is, the SVMs, the multiclass logistic regression, the DNN and the MLP that we used. But I think the biggest part of the contribution is from the OOS detection strategies, which give us around forty percent of the improvement compared to the baseline.

Okay, now let us examine the submission itself. The number of OOS segments we detected is about one thousand seven hundred, and I think this is much higher than the real number of OOS segments, or i-vectors, in the test set. But given the way the cost is defined, if you miss the detection of an OOS segment, you are going to lose much more in terms of cost, so it is better to declare an i-vector as OOS than to miss a true OOS.

Okay, so this is how our progress went over the course of the evaluation. We start from the baseline system; then we have the better classifiers; then we found that the least-fit target is a good strategy for the OOS detection, and we get a boost in performance; then the best-fit OOS strategy gives another boost; the iterative purification gives a further one; and finally we have the fusion, which takes us to 0.17 in terms of the cost.

Okay, so in conclusion, we have obtained about a fifty percent improvement compared to the baseline, with the major contributions coming from the fusion of multiple classifiers and from the OOS detection strategies. The following OOS detection strategies were found to be useful: the least-fit target, the best-fit OOS, and the iterative purification. However, we were not actually able to find a good strategy to extract useful target i-vectors from the unlabeled development set. We believe that if we had a better strategy for doing that, it would give us a further improvement.

Okay, we have time for some questions.

Thank you.

For the OOS detection, did you observe that it is not very useful to model the out-of-set data as a single class? The OOS i-vectors are distributed across several different languages, so based on that, did you try, instead of K plus one, something like K plus several clusters, and then choose among the OOS posteriors?

For the challenge, unfortunately we didn't try that, because during the evaluation we do not know how many languages there are in the OOS class; maybe one, maybe two. We had no idea of how many languages are in that class, so we did not actually explore that option.

You could still take the best match for the rejected i-vectors and see which target language they are closest to, for example whether they are closer to, say, Japanese or to the Italian family, and group them accordingly, perhaps using a language tree.

And the second question: did you look at the confusion matrix to see which languages are confused more, perhaps to pool them somehow?

Not exactly, but maybe let me take this opportunity to talk about this. Overall, what we did for the i-vector challenge was mostly OOS detection; of course there are other aspects that we explored. For example, the target detection is actually not very good: if you look at the final submission, even though we got an overall fifty percent improvement compared to the baseline, the target detection actually got worse compared to the baseline.

thank you thank you

I have another question. This i-vector challenge preceded the NIST LRE 2015 evaluation, right? So how much of this work was carried over into your effort for that evaluation?

Unfortunately not much, because LRE 2015 is a closed-set identification task, whereas here we have an open-set problem where the OOS class becomes important; for LRE 2015 that is not really applicable. We did in fact use what we call the empirical kernel map there as well. But of course for LRE, what turned out to be really important was the use of bottleneck features: we used to use SDC features, and once you replace SDC with bottleneck features you get around a fifty percent improvement without doing anything else. So for LRE we were focusing more on the feature level, whereas for the i-vector challenge the focus was on OOS detection, as in this presentation.

Okay, I think we're out of time, so let's thank the speaker again.