Hello, I'm from the University of Eastern Finland.

Well, it's my pleasure to present my work in this workshop. I don't know if it's good to be among the last speakers or not, but


in the following fifteen to twenty minutes I will present an effective and simple out-of-set detection method over the i-vector space, in the context of language identification.


Language identification can be done in two ways. One is closed-set, where the language of a test segment corresponds to one of the in-set, or target, languages. The other is open-set, where the language of a test segment may not be any of the target languages; there the task is to classify the test segment either into one of the in-set languages or into an out-of-set model.
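As a sketch, the open-set decision can be pictured as a thresholded maximum over in-set scores. This is only an illustration of the shape of the decision (the threshold-based back-off, the language codes and the scores are assumptions, not the system from this talk, which trains an explicit out-of-set model):

```python
def classify_open_set(scores, threshold):
    """Open-set decision: pick the best-scoring in-set language,
    but back off to 'out-of-set' when no score is high enough.
    `scores` maps language name -> similarity score (higher = better)."""
    best_lang = max(scores, key=scores.get)
    if scores[best_lang] < threshold:
        return "out-of-set"
    return best_lang

# Example: a segment whose best in-set score is still low.
print(classify_open_set({"rus": 0.2, "pol": 0.1, "ukr": 0.15}, threshold=0.5))
# -> out-of-set
```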


One way to perform open-set language identification is to train an out-of-set model from additional data. But the pool of such data is huge, and we cannot simply use all of it to build the model.

The practical key question is how to select the most representative out-of-set data to build this out-of-set model; in other words, how to obtain high-quality out-of-set, or additional, data to train the out-of-set model.


In the context of language identification, good candidates for out-of-set data have some properties. The main properties are these. First, out-of-set candidates should come from language families different from those of the in-set languages. By language families I mean groups of languages that share a common ancestor; for example, Russian, Ukrainian and Polish are all from the Slavic language family. The second property is that the out-of-set candidates should be diverse, some close to the in-set languages and others far away, because we aim at a general out-of-set model which better represents the whole world of out-of-set data, or out-of-set languages.


There are some classical ways to do this. One is the one-class SVM, where the idea is to enclose the data with a hypersphere, and to classify new data as in-set if it falls within this hypersphere and as out-of-set otherwise. Two other classical approaches are k-nearest neighbours, where for each data point the sum of the distances between that point and its k nearest neighbours is computed, and the higher this sum is, the more confident we are that the point is an outlier, that is, out-of-set; and distance to class means, where, if we assume the data is Gaussian, those points that lie more than two or three standard deviations below or above the class mean are considered out-of-set.
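A minimal pure-Python sketch of the two simpler baselines just mentioned. The Euclidean distance, the value of k and the two-standard-deviation cut-off are assumptions for illustration; the talk does not state the exact settings used:

```python
import math

def knn_outlier_score(x, data, k=3):
    """kNN outlier score: sum of distances from x to its k nearest
    neighbours; the larger the sum, the more likely x is out-of-set."""
    dists = sorted(math.dist(x, d) for d in data)
    return sum(dists[:k])

def mean_distance_outlier(x, class_data, n_std=2.0):
    """Distance-to-class-mean rule (per dimension, assuming roughly
    Gaussian data): flag x as out-of-set if it lies more than n_std
    standard deviations from the class mean in any dimension."""
    n = len(class_data)
    for j in range(len(x)):
        col = [d[j] for d in class_data]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n)
        if abs(x[j] - mu) > n_std * sd:
            return True
    return False
```

Both scores are computed per test point against the pooled (kNN) or per-class (mean distance) in-set data.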

What we consider in this study is the use of a nonparametric statistical test known as the Kolmogorov-Smirnov (KS) test. The idea is this: given two samples, we estimate whether they have the same underlying distribution by computing the maximum difference between their empirical cumulative distribution functions. As you can see in this picture, this maximum difference is known as the KS value; if it is greater than a critical value, it indicates that the two samples are from different distributions, or, in our case, from different classes.
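The KS statistic itself is easy to compute from the two empirical CDFs; a small sketch (the standard two-sample statistic, as also provided by `scipy.stats.ks_2samp`, which additionally returns a p-value):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the empirical CDFs of samples a and b."""
    grid = sorted(set(a) | set(b))
    def ecdf(sample, x):
        # fraction of the sample that is <= x
        return sum(1 for v in sample if v <= x) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in grid)

# Two samples from clearly separated ranges give a KS value of 1.
print(ks_statistic([1, 2, 3, 4], [10, 11, 12, 13]))  # -> 1.0
```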

Okay, how did we adapt this to our open-set language identification task? Given an unlabeled i-vector w_i and all the i-vectors in class, or language, l, we compute the KS value between w_i and each of those i-vectors. If we have n samples in language l, we come up with n individual KS values; we take the average of these individual KS values and obtain an average KS value that corresponds to the outlier score of w_i in language l. We repeat this for all L target languages, obtain L average KS values, and then take the minimum value as the final outlier score for w_i, this unlabeled i-vector.
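The scoring just described can be sketched as follows. This assumes the KS value is computed between the components of the unlabeled i-vector and the components of each class i-vector, which is my reading of the description; the toy three-dimensional "i-vectors" are purely illustrative:

```python
def ks_statistic(a, b):
    """Two-sample KS statistic (maximum ECDF difference)."""
    grid = sorted(set(a) | set(b))
    def ecdf(s, x):
        return sum(1 for v in s if v <= x) / len(s)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in grid)

def outlier_score(w, languages):
    """Outlier score of unlabeled i-vector w: for each language,
    average the KS values between w and every i-vector of that
    language, then take the minimum average over all languages.
    A high score means w matches no in-set language well."""
    avg_ks = []
    for ivectors in languages.values():
        ks_vals = [ks_statistic(w, v) for v in ivectors]
        avg_ks.append(sum(ks_vals) / len(ks_vals))
    return min(avg_ks)
```

The minimum over languages matters: an i-vector is only an outlier if it fails to match every in-set language, not just some of them.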


It is interesting to look at the distribution of these KS values, shown in this picture. The red bars show the in-set KS values: within a given class, the data that truly belong to that class are used to compute them. The blue bars show the out-of-set KS values, computed from data that do not belong to that class. Interestingly, the in-set KS values tend toward values close to zero, and the out-of-set KS values tend toward values close to one. We could not see this separation by looking at the data directly at the beginning, but now we have a tool that shows how in-set and out-of-set data are separated. So we applied it in our open-set language identification task.


We applied the idea in the NIST language i-vector challenge 2015. The training set contains fifteen thousand utterances from fifty in-set languages; the development set has six thousand five hundred unlabeled utterances, and the test set has the same amount. The data was balanced across languages, and the dimensionality of the i-vectors was four hundred. We did some post-processing on the i-vectors, namely within-class covariance normalisation and linear discriminant analysis.
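The within-class covariance normalisation step can be sketched in a few lines. This assumes the standard WCCN formulation (whitening by the Cholesky factor of the inverse average within-class covariance), since the talk gives no details, and it leaves out the LDA step:

```python
import numpy as np

def wccn_projection(ivectors, labels):
    """Within-class covariance normalisation (WCCN): average the
    per-class covariance matrices, then return the Cholesky factor
    of the inverse, used to whiten i-vectors (apply as ivectors @ B)."""
    labels = np.asarray(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    classes = sorted(set(labels.tolist()))
    for c in classes:
        W += np.cov(ivectors[labels == c], rowvar=False)
    W /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(W))
```

After this projection, the average within-class covariance of the transformed i-vectors is the identity, which is the point of WCCN before scoring.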


To evaluate the out-of-set detection methods we need labeled data, and since the development set was not labeled, we used the training set: we segmented the training set into three portions, training, development and test. We assigned thirty in-set languages and twenty out-of-set languages, and the test portion has all thirty in-set languages plus the twenty out-of-set ones. There was no overlap of data between the three portions.


Here is an example of the labeling for the out-of-set evaluation. Those data whose true language was one of the in-set languages, for example the data with id one, we labeled as in-set; those data whose true language was not one of the in-set languages, we labeled as out-of-set.
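The relabeling rule is simply a membership test against the in-set language list; a tiny sketch (the three-language in-set is hypothetical, the study used thirty):

```python
IN_SET = {"rus", "pol", "ukr"}  # hypothetical in-set; the study used 30 languages

def relabel(true_language):
    """Ground-truth label for out-of-set evaluation: a segment whose
    true language is one of the in-set languages is 'in-set',
    everything else is 'out-of-set'."""
    return "in-set" if true_language in IN_SET else "out-of-set"

print(relabel("rus"))  # -> in-set
print(relabel("fra"))  # -> out-of-set
```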

Here are the results of the out-of-set detection methods and our proposed method. The KS method outperforms the other classical approaches; for example, against the one-class SVM and kNN we obtain fourteen and sixteen percent relative equal error rate reductions in out-of-set detection.

We then fused these baseline systems with the KS method, and fusing KS with them improved all the individual systems. The best performance came from fusing KS with the one-class SVM, which resulted in a twenty percent equal error rate: from around twenty-eight percent for the individual KS system, we dropped the equal error rate to twenty percent.
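Score-level fusion of two detectors can be sketched as a weighted sum of normalised scores. The min-max normalisation and the equal weighting are assumptions about the general shape of the fusion; the talk does not state how the systems were combined:

```python
def fuse_scores(score_a, score_b, weight=0.5):
    """Linear score-level fusion of two outlier detectors after
    min-max normalisation of each score list. `weight` is a
    placeholder; in practice it would be tuned on development data."""
    def minmax(scores):
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) for s in scores]
    a, b = minmax(score_a), minmax(score_b)
    return [weight * x + (1 - weight) * y for x, y in zip(a, b)]

print(fuse_scores([0.0, 5.0, 10.0], [1.0, 2.0, 3.0]))
# -> [0.0, 0.5, 1.0]
```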


Let us look at the open-set language identification results. The rows of the table differ in the data selected for out-of-set modeling: random selection, all of the training set, all of the development set, the combination of training and development sets, and, in the last row, the proposed selection method. For reference purposes we also include the challenge baseline results. Our results are based on an SVM classifier, and the baseline numbers are reported directly from the NIST evaluation website.


The proposed selection method, in terms of identification cost (sorry, I did not mention the metric earlier), achieves a cost of around twenty-six, which outperforms the NIST baseline by thirty-three percent relative improvement; the best relative improvement was fifty-five percent.


Looking at the first rows, additional data did help to reduce the identification cost, but it was not better than selecting the out-of-set data with the proposed selection method.


Here we compare the KS method with the other out-of-set detection methods in open-set language identification. All of them outperform the challenge baseline results, but the KS test is the winning system, with an identification cost of twenty-six.


We had one thousand five hundred out-of-set data items in the test set, and with this method we were able to correctly detect around one thousand of them. The important thing in this challenge was to detect out-of-set data well: the identification cost improves when you correctly detect out-of-set data.

In conclusion: in this study we proposed a simple and effective method to detect out-of-set data over the i-vector space. We showed that the KS values of the proposed method have a nicely separated distribution, and that, when the method was integrated into the open-set language identification system, we achieved a thirty-three percent relative reduction in identification cost compared to the baseline.

Okay, thank you for your attention.

Q: If you go back to slide fifteen: did you try different partitions of in-set and out-of-set languages, and did this make much of a difference for your results?

A: Well, no; we selected the twenty out-of-set languages there, you see.

Q: So this is on the next slide: you had the thirty and twenty, and you didn't try different portions. Do you think this would have made a difference in your out-of-set detection?

A: I am not sure what you mean by making a difference; the numbers might be different, but the outcome would be the same: the KS system would still be the strongest among the other systems.

Q: I see, but maybe the amount by which one is better would be different, had you selected differently.

A: We ran the split at random, not supervised, on the selected target languages: thirty in-set and twenty out-of-set.

Chair: Are there other questions?

Q: For the one-class SVM, what kernel did you use? Was it linear?

A: A polynomial kernel.

Q: Between the two methods you used, the KS test and the one-class SVM, which one is more efficient, which was the faster one?

A: My method was fast, and kNN was also fast. I didn't look carefully at the speed, but I think the Gaussian distance-to-class-means, the one-class SVM and kNN were more or less the same; I didn't check the speed step by step, though.

Chair: If there are no other questions, let's thank the speaker again, please.