
Well, this is how we organised it: all three authors are here, but the first author did the work individually; it covers the general idea of source normalization. That comes, I think, right after the previous overview.

If, after this talk, you think "this is fantastic, I am going to implement this tomorrow or next week", then the credit goes to him: he did the work, and he made the slides, because he made the slides for me. I would be very happy with that. If instead you afterwards think "why didn't I get this idea before", then that is probably due to me not being able to convey the message. And if you had already thought of the same thing beforehand, then you are sort of ahead of what we have. Right. Anyway. So.

This is a sort of automatically generated summary of my presentation today, which I think is kind of pointless in this particular case, because it contains lots of acronyms that can only be explained later.

So, on to the motivation for this work. The idea is that in speaker recognition we all know, from the NIST evaluations where things change from year to year, that we often get into the situation of receiving new data of a kind we haven't seen before. Noisy data, and you never know what kind of noise; maybe some people know, but most of us don't.

And, well, how are we going to deal with that? I don't know. Sometimes I have to make predictions and say what I am going to talk about, and every once in a while they are actually rubbish, I guess, because I haven't seen the data they apply to. But anyway, the basic idea is that if the conditions in train and test are of a different kind, you would have liked to have seen that before. But what do you do if you know that you won't have seen it? One way of dealing with that is this idea of source normalization, which I'll try to explain.

Oh, here are some slides about i-vectors; I think I'll skip these two, since you probably know this material much better than I do. The basic idea, just to recap the i-vector for this particular presentation, is that it is a very low-dimensional representation of the entire utterance, containing, apart from speaker information, other information as well.

Essential to the idea of source normalization is that one of the things we do in the standard approach, the within-class covariance normalization (WCCN) before the PLDA, needs to be changed. It is with the training data that the within-class and between-class scatter matrices are computed, and that is where the source normalization takes place.
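The standard WCCN step mentioned here can be sketched in a few lines; this is my own minimal Python illustration under simplifying assumptions (equal weighting per speaker), not the authors' actual system:

```python
import numpy as np

def wccn_transform(ivectors, speaker_labels):
    """Within-class covariance normalization: estimate the average
    within-speaker covariance W and return a matrix B such that the
    transformed i-vectors y = x @ B have within-speaker covariance I."""
    x = np.asarray(ivectors, dtype=float)
    labels = np.asarray(speaker_labels)
    W = np.zeros((x.shape[1], x.shape[1]))
    speakers = np.unique(labels)
    for spk in speakers:
        xs = x[labels == spk]
        xs = xs - xs.mean(axis=0)      # centre each speaker's i-vectors
        W += xs.T @ xs / len(xs)
    W /= len(speakers)
    # the Cholesky factor of the inverse covariance whitens the space
    return np.linalg.cholesky(np.linalg.inv(W))
```

Applying `B` to development and test i-vectors alike makes the average within-speaker covariance the identity, which is the normalization the subsequent modelling relies on.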

So here we note that we actually need to estimate those scatter matrices. This is the mathematics, just to stay in line with the previous talks and have at least some mathematics on the screen. This is the expression for the within-speaker scatter matrix, and this is what source normalization is going to try to estimate in a better way.
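In the usual notation (my reconstruction, not necessarily the exact formula on the slide), the within-speaker scatter matrix of the i-vectors is

```latex
S_w = \sum_{s=1}^{S} \sum_{i=1}^{n_s}
      \left(\mathbf{w}_i^{(s)} - \bar{\mathbf{w}}_s\right)
      \left(\mathbf{w}_i^{(s)} - \bar{\mathbf{w}}_s\right)^{\top}
```

where $\mathbf{w}_i^{(s)}$ is the $i$-th of the $n_s$ i-vectors of speaker $s$, and $\bar{\mathbf{w}}_s$ is that speaker's mean i-vector.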

Because what is the problem with WCCN in this particular matter? The issue is that not all relevant kinds of variation are observed in the training data, and this happens more often if you don't have much data.

So here is another graphical representation of what typically happens. Here we look at a specific kind of data, with a particular label in mind: the language. You have lots of English-language data, and every once in a while we get some tests where the language is not English. We had that, I think, in 2006, and before that as well, and 2008 also contained some; you never know what you'll get in 2012. So maybe language itself is not so relevant for the current data, but it is a good example of where things change.

An important point here is that even if we have some training data for these languages, we will not have all the different languages for all speakers. Typically the speakers are decoupled from the languages: for some languages you have some speakers, and for other languages you have other speakers. So then you have the problem that in the end, in your recognition, you may have to compare one segment in one language with another segment in the other language, where it might actually be the same speaker.

So what I want to show now is why this kind of difference in language labels is going to influence the within-class scatter matrix. This is one way of viewing how the i-vectors might be distributed in this space. So,

these three big circles denote the different sources; in this case a source might be a language. Each has its own mean, and there is a global mean, which would be the overall mean, I guess.

And then you have the speakers. For one speaker you have a little bit of variability, and he comes from one source; another speaker, she, comes from another source; and we have a few more speakers in the last source. You can imagine that if you are going to compute the between-speaker variation, you actually include a lot of between-source variation, and that is probably not a good thing, because what you want to model is the variation between different speakers, not between sources. So the WCCN is going to do this all by itself, based on this information. And related to this is the fact that the source variance is not correctly observed: the variance between the sources is not explicitly modelled. So that is another problem for WCCN.

So this just summarises again what the problems are. Now let's move on to the solution; I think this is much more interesting, to see how we tackle this problem. These sources hang around globally different means in this i-vector space, and the solution is very simple: compute these means for every source.

So here you look at the scatter matrix conditioned on the source: we simply compute the mean for every source, and before we compute the scatter matrix, we subtract these means. The effect, basically, is that for all these three sources (the picture is still from the case of microphone and telephone data, but it also holds for languages) you subtract the mean per label, per language.

And then this scatter matrix will be estimated better. The mathematics then says: okay, that is very nice for the within-class variation, but we still have the between-class variation, and we'll just obtain that one as the difference from the total variability. Or maybe it is the other way around, but it doesn't matter: the idea is that you compensate one scatter matrix, and because you have the total variability, you can compute the other one as the difference from the total variability.
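The two steps just described can be put in a short sketch; this is my own minimal Python, with made-up names, showing the per-source mean subtraction and the computation of the other scatter as the remainder of the total variability:

```python
import numpy as np

def source_normalized_scatters(ivectors, speakers, sources):
    """Source-normalized scatter estimation: speaker means are taken
    relative to the mean of their source (e.g. language), so that shifts
    between sources do not inflate the between-speaker scatter; the
    within-speaker scatter is then the remainder of the total variability."""
    x = np.asarray(ivectors, dtype=float)
    spk = np.asarray(speakers)
    src = np.asarray(sources)
    mu = x.mean(axis=0)
    S_tot = (x - mu).T @ (x - mu)          # total scatter, global mean
    S_b = np.zeros_like(S_tot)
    for s in np.unique(src):
        mu_s = x[src == s].mean(axis=0)    # per-source mean
        for p in np.unique(spk[src == s]):
            xp = x[(src == s) & (spk == p)]
            d = xp.mean(axis=0) - mu_s     # speaker mean w.r.t. its source
            S_b += len(xp) * np.outer(d, d)
    S_w = S_tot - S_b                      # the other scatter as the difference
    return S_w, S_b
```

The source labels (here the languages) are only consumed inside this estimation step.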

One thing I'd like to stress about this idea is that you only need the language labels (in the case where it is applied to language) for the development set. So you need language labels for your development data: in training your system you have all kinds of labels on your data, and in this case we consider the language label. But in applying the system you do not need the languages, because they are only used to make a better transform for the WCCN.

How can you actually see that it works? Well, one way of doing that is to look at the distribution of i-vectors after WCCN when you do not apply this source-normalization technique; that is shown on the left. Here the different colours encode the label that we are interested in, in this case the language; these languages might be familiar to the language recognition people here. What you see is that the languages seem to occupy different places. This is after a dimension reduction down to two dimensions, just to visualise it. And you see that with this language normalization, source normalization by language, all these different labels become much more similar, so the basic assumptions that the i-vector system is based on should hold a little better.

Okay, now the system and results, because we need to have some tables in the presentation. First, what kind of experiments did we do? We used most of the NIST databases for training, the ones the i-vector extractor makes use of, but we added one specific database, CallFriend. That is a fairly old database, used for the first language recognition evaluations, so it contains a variety of languages, twelve languages, I believe.

And as for the evaluation data, we chose two data sets, the NIST 2010 dataset and the 2008 one. For 2010 you might think: why would you do that, there wasn't actually much language different from English in there. That is true, but we use it for two purposes: one is for training the calibration, and another reason is to see whether what we do doesn't hurt the basic English performance too much. The 2008 data, of course, is going to be used as the test data, where there are trials from different languages; and there is also the condition "English only", so that we can compare whether we actually hurt ourselves.

The durations are the simple standard ones; you will have seen these kinds of numbers before, so there is nothing new here. And these are the breakdown numbers per language for the training data.

Then these, finally, are the results. Now, here I'll try to explain the table: red means the result is worse, and bold figures mean the result is better. The first condition shows the performance on all trials for SRE'08, measured in equal error rate and minimum detection cost, so not in calibration here. You see these numbers go down, so for '08, where we do see some other languages, it works, I believe.

If we look at English only, then the numbers go up a little bit, so it does hurt our system there, but it doesn't hurt much. And the same for SRE'10, as we can see: the system gets hurt a little bit there too. That is the basic conclusion there.

Here we have a breakdown where we look at the English and non-English languages for SRE'08, where we look at the different conditions there are in the trials: whether the two sides are the same language or different languages, and whether English is involved. The top row, which ought to show the best performance because it still contains the many English trials, is where the baseline system works best. But this includes both English and non-English, so you can break it down further. For instance, where you say: I want different languages in the trial, so the two sides of a target trial are in different languages, we see that the new figures, on the right, are slightly better than the red ones on the left.

And the same holds with respect to the non-English condition: you can specifically look at the non-English trials, where there is the restriction that neither side is English, and there it helps. For the same-language trials, where you actually restrict the trials to be the same language but not English, it still helps, though there is one condition where, for whatever reason, it does not.

So that is a bit different; that is something we don't quite understand, I suppose. And that is for the non-English trials where you specify that the two sides are different-language trials. So usually it seems to work, except for that one particular place where it doesn't. But I should say that there are actually not that many trials there, and it didn't show in the graph very nicely either, so I don't know how accurate this measurement is.

Now, one more aspect is the calibration. Part of the goal of this kind of experiment I am looking at is to make the calibration more robust across languages. And for that we use a different measure, a measure also used by the keynote speaker: the Cllr. One way of looking at how good your calibration is, how small the calibration loss is, is to look at the difference between the Cllr and the minimum attainable Cllr, the min Cllr.
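For reference, the Cllr itself is simple to write down; this is my own minimal sketch (the minimum attainable Cllr, needed for the calibration loss Cllr minus min Cllr, additionally requires an oracle recalibration such as the PAV algorithm, which is not reproduced here):

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cost of the log-likelihood ratio: the average information loss,
    in bits per trial, when the scores are taken at face value as
    calibrated log likelihood ratios."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-tar)))  # cost on target trials
    c_non = np.mean(np.log2(1.0 + np.exp(non)))   # cost on non-target trials
    return 0.5 * (c_tar + c_non)
```

A system that outputs llr = 0 for every trial, knowing nothing, gets Cllr = 1 bit; well-calibrated and well-separated scores drive it towards 0.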

So you see two kinds of scores here, with different names: mismatched and matched. I was actually thinking that we might have named the mismatched and matched sets a bit differently, but that might have been too hard for you guys. Anyway.

The new thing that we tried is marked here, and black is the old approach; again, bold marks the better figures. We show it separately for female and male as well.

Now, generally, for this mismatch condition (by mismatch we mean that we calibrate on English only, that is, SRE'10 for calibration, and then apply it to SRE'08; it could also have been the other way around, but we did it this way in order to be able to calibrate on English and test on the other languages), in this particular condition it works always. In the matched condition, that is, only looking at English scores, so it is really calibrated on English and tested on English, you see that it doesn't always help; but there is one condition where it helps too. So the miscalibration itself, the loss you add in calibration, becomes less. That is for calibration; there is still some loss for English only, but for the discrimination figures it doesn't hurt. Alright, I hope that explains the numbers well enough.

Maybe it is easier to draw this: for the same dataset, the miscalibration, which is just the amount of information you lose by not being able to produce proper likelihood ratios, goes down for the conditions where we applied this language normalization, but for English-only trials you don't notice the difference.

So, I have some slight conclusions here. We used source normalization, which is a more general framework, and I have to say it has been applied before; there should be three or four conference papers about this technique, applied with the definition of source being microphone, or interview, or telephone.

And we even applied it, I should say as an aside, to the source being the sex of the speaker. Even though speakers generally don't change sex, at least not within these evaluations, you can use this approach to compensate for situations where you might not have enough data. So for telephone conditions this didn't make much difference, but for conditions where there wasn't really much data it did help to pool the male and female i-vectors and make a single gender-independent recognition system, and apply source normalization where the speaker's sex is the label of the i-vector, and normalize that way. And then in your recognition you don't need the labelling any more: you can basically ignore the second column of your trial list. Okay. But here we applied it to languages, and that seems to work reasonably, and it doesn't hurt the English trials too much, which is basically what there is to it.


In most speaker recognition cases we do not try to use language as a feature for discriminating speakers. In this research, of course, you could see that working very well, but we took it as a challenge that you should be able to recognize speakers even if the speaker speaks a different language than seen before in the training. Of course, you could make it easier by saying: a different language means different speakers, or the same language, the same speaker.


Yeah, and I remember calibration was one of the major problems in 2006, where, if you had non-English trials, the discrimination performance was actually reasonable but the calibration was poor. I am not so sure whether that still holds for the systems of today, though; systems nowadays are generally behaving better.


No, I don't think that is what we want to say. What I think it says is that the between-channel variation, estimated as part of the total variance, is inflated by the fact that segments have a different language, and you don't observe that in the within-speaker variability. So it attributes the between-language variability to the channel variability, and that does not compensate, in this case, for different languages from the same speaker.
