So, good afternoon and thank you, Patrick.

Well, I am Carlos Vaquero from Agnitio, in Spain, and I'm presenting our work on

dataset shift in PLDA-based speaker verification,

which is, actually, an analysis on

several techniques

that can be used in

PLDA systems to mitigate the effect of dataset shift. But it is also an analysis of the limitations that PLDA systems have when dealing with dataset shift.

So, dataset shift is the mismatch that may appear between the joint distributions of inputs

and outputs

for training and testing.

Okay? In general, we have three types of dataset shift. The first one is covariate shift, which appears when the distribution of the inputs differs from training to testing. It is the most usual type of dataset shift, since it is related to channel variability, session variability or language mismatch. But there are also other types of dataset shift: for example, prior probability shift, which is related to variations in the operating point; or concept shift, which is related to adversarial environments, which in speaker verification would be spoofing attempts.
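Stated formally (with inputs $x$ and outputs $y$), the three types of shift above can be summarised as:

```latex
% Covariate shift: the input distribution changes
p_{\text{train}}(x) \neq p_{\text{test}}(x), \quad p_{\text{train}}(y \mid x) = p_{\text{test}}(y \mid x)

% Prior probability shift: the output (class) priors change
p_{\text{train}}(y) \neq p_{\text{test}}(y), \quad p_{\text{train}}(x \mid y) = p_{\text{test}}(x \mid y)

% Concept shift: the input-output relation itself changes
p_{\text{train}}(y \mid x) \neq p_{\text{test}}(y \mid x)
```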

In this ... in this work we're focusing on covariate shift.

Covariate shift has been widely studied in speaker verification.

We know that there are several techniques developed to compensate for channel/session variability or language mismatch. But most of these techniques work under the assumption that large datasets are available for training.

The thing is: what happens in real situations, where we face a completely new and unknown situation and we don't have data to train these approaches? For example, here we have some results.

We are considering a JFA system facing condition one of the NIST SRE 08, which is interview-interview, and we don't use any microphone data for training the channel subspace. So, we can see that JFA, when not using microphone data, is not much better than classical MAP, which doesn't use any compensation at all. Then, once we add the microphone data, we get a huge improvement.

So the thing is: what can we do in real scenarios that are unknown and unseen?

Well, if we don't have any data, it's hard to do anything, but usually we can expect that some small amount of matched data is provided. So, there is something that we could do. We can define some probabilistic framework, so that it is possible to perform an adaptation, even of a model trained on mismatched development data: given some matched data, we can adapt the model parameters so that the system can work as soon as possible in this new scenario.

But to do this in a natural way, and to derive it easily, we would expect the speaker verification system to be a monolithic system that provides a single probabilistic framework to compute the likelihood of the model parameters given the data.

Well, the first approaches up to JFA were monolithic, so they provided a framework in which, given a small amount of data, it would be possible to define a way to adapt these parameters.

But current state-of-the-art PLDA systems are modular, so we have several model levels. We start with the first level, the UBM: we train the UBM separately, and it provides sufficient statistics. We use those to train the i-vector extractor, a total variability subspace; then we obtain i-vectors, and we use them to train the PLDA model, but we use them as features. The PLDA model has no knowledge of how these features were obtained, just the prior distribution they have.

So this model has its advantages, because it's very easy to keep improving: we can fix the UBM and work on the total variability matrix, which is fast to train, so we can try many things and improve it. And also, with the i-vector extractor fixed, we can work a lot, and very quickly, on the PLDA model, and keep improving it.

But, in terms of adapting this model to new situations, it has some problems. Either we work at the highest model level, that is, PLDA, and we adapt the PLDA parameters to face the new situation, or, if we want to work at lower model levels, we will need to retrain the whole system.

For example, if we have adapted the UBM, our i-vector extractor is not valid anymore, so we will need to retrain it on the whole data. And this is not feasible in many applications, for example an application that you want to learn online as you get more data in a new situation: you would need to have all the development data every time you adapt the UBM, and it would take a long time to adapt it for even a small set of recordings. So that's not feasible in many applications.

Well, in any case, there are several techniques that we can apply in a PLDA system. The first thing we could do is adapt the UBM, and then the subsequent model levels, but we will need to retrain the whole system. We can do it by pooling all the available data, the development data and the matched data, or we could do it by weighting the datasets. But this will not be feasible in many applications.

So, we can also work on the i-vector extractor. One thing that has been done is to train a new total variability matrix on the matched data and stack it with the original total variability matrix. Well, this approach has been shown to work, but usually you need quite a large amount of data to train the matched total variability matrix. And also, it will require retraining the PLDA model.
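As a rough numerical sketch of the stacking idea (the shapes and matrices here are hypothetical stand-ins; a real total variability matrix is estimated with EM from Baum-Welch statistics, not sampled at random):

```python
import numpy as np

# Hypothetical dimensions: supervector dimension D, subspace ranks R1, R2
D, R1, R2 = 1000, 400, 100

rng = np.random.default_rng(0)
T_orig = rng.standard_normal((D, R1))     # stand-in for the TV matrix trained on development data
T_matched = rng.standard_normal((D, R2))  # stand-in for the TV matrix trained on the small matched set

# Stacking: concatenate the two subspaces column-wise, so the i-vectors gain
# R2 extra dimensions covering the variability seen only in the matched data.
T_stacked = np.hstack([T_orig, T_matched])
print(T_stacked.shape)  # (1000, 500)
```

Because the i-vector dimension grows (here from 400 to 500), the PLDA model has to be retrained on the new i-vectors, which is the drawback just mentioned.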

So it has some problems. We can also work on the PLDA model. Here, what we are proposing is simply to use length normalization, but with some sort of i-vector adaptation: centering using the i-vector mean from the matched dataset.
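A minimal sketch of this adaptation, under the assumption that the only change is replacing the development-set mean with the matched-set mean before length normalization (array names and data are hypothetical):

```python
import numpy as np

def adapt_and_length_normalize(ivectors, matched_mean):
    """Center i-vectors on the matched-dataset mean, then project
    them onto the unit hypersphere (length normalization)."""
    centered = ivectors - matched_mean
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

rng = np.random.default_rng(0)
ivecs = rng.standard_normal((5, 400)) + 3.0  # a shifted i-vector population
matched_mean = ivecs.mean(axis=0)            # mean estimated on matched data

adapted = adapt_and_length_normalize(ivecs, matched_mean)
print(np.linalg.norm(adapted, axis=1))  # all ones: unit-length i-vectors
```

Centering with the matched mean spreads the shifted population back around the origin, so after normalization it covers the hypersphere instead of collapsing into a small region of it.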

What else is there to say? Here there should be a reference to the related work done by Jesús, which is another approach that can be used to compensate for covariate shift. So, these techniques address these problems, but always working on the PLDA model, so the UBM and the i-vector extractor are not modified.

To test these techniques, what we do is simulate covariate shift in the form of language mismatch. So we assume that our system has been trained completely on English data, and we will evaluate it on mismatched groups of languages: we will consider Chinese, Hindi-Urdu and Russian. As development data we will use the NIST data from zero four to zero six, the Switchboard data and the Fisher data.

Here we have the number of sessions and speakers that we have for each language: for Chinese we have quite a large amount of data, while for Hindi-Urdu, for example, we don't have much development data. We will evaluate these approaches on the NIST SRE zero eight telephone-telephone condition, and we will consider all-to-all trials. Here we have the number of models and speakers for each language.

As the speaker verification system we will consider an i-vector PLDA system, with a gender-dependent i-vector extractor of dimension four hundred. And then we'll consider a gender-dependent PLDA, which is a mixture of two PLDA models, one trained with male data, one trained with female data, with a full covariance matrix for the session component and a speaker subspace of dimension one hundred and twenty. And the results are analyzed in terms of EER and minDCF.

So, the first thing we do is analyze the effect of covariate shift on the data, and what we have done is analyze the i-vectors we have for the different languages. We have computed the Mahalanobis distance between the population of English i-vectors and the populations of the other languages' i-vectors, and we have seen that these distances are very large. So, this means that when we perform i-vector length normalisation on a language which is different from English, we project it onto a small region of the hypersphere of unit radius. So the distribution will not be as expected: all the i-vectors will be concentrated in a small region of the hypersphere.
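The distance measurement described above can be sketched like this (toy data; we assume the Mahalanobis distance is taken between population means, using the covariance of the English population):

```python
import numpy as np

def mahalanobis_between_populations(X_ref, X_other):
    """Mahalanobis distance from the mean of X_other to the mean of
    X_ref, under the covariance of the reference population."""
    diff = X_other.mean(axis=0) - X_ref.mean(axis=0)
    cov = np.cov(X_ref, rowvar=False)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

rng = np.random.default_rng(0)
english = rng.standard_normal((500, 20))     # reference population
other = rng.standard_normal((500, 20)) + 0.8 # shifted "other language"

d_same = mahalanobis_between_populations(english, english[:250])
d_shifted = mahalanobis_between_populations(english, other)
print(d_same, d_shifted)  # the shifted population is much farther away
```

A large distance like `d_shifted` is exactly the situation described in the talk: after length normalization, the shifted population lands on a small patch of the unit hypersphere far from where the model expects it.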

So this will have an effect on the accuracy, and not only through the distribution of i-vectors, because we are also missing information in the UBM. In the end, we see that it has an effect on the accuracy of the system, as we can see in this table, where only English data has been used for development: the other languages obtain worse results than English. It is true that we don't know the accuracy that we would get for these languages if we had enough data to train a complete evaluation system with them. But there is also no reason to believe that these languages are harder for a speaker verification system than English. So we could expect to get an accuracy which is somehow similar to English; maybe better, maybe worse, but somehow similar.

Well, here we are comparing the minDCF obtained for the proposed techniques for the three groups of languages. The first column for each language is the baseline, using only English development data. The second column uses the stacked total variability matrices. The third uses i-vector adaptation. The fourth uses s-norm. And the last three columns are combinations of these techniques.

So, we can see that most of these techniques work, in the sense that they improve the results of the system, but the improvement is quite small: if we wanted to reach an accuracy close to English, which is here, we are still too far.
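Of the techniques in the table, s-norm can be sketched as follows (a symmetric score normalization against a cohort drawn from the matched data; the names and score values here are hypothetical):

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric normalization: average the raw score z-normalized with
    enrollment-side and with test-side cohort statistics."""
    mu_e, sd_e = enroll_cohort_scores.mean(), enroll_cohort_scores.std()
    mu_t, sd_t = test_cohort_scores.mean(), test_cohort_scores.std()
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)

rng = np.random.default_rng(0)
# Impostor scores of the enrollment model and of the test segment against a
# matched-language cohort; under dataset shift these are globally shifted.
enroll_cohort = rng.normal(2.0, 1.0, 200)
test_cohort = rng.normal(2.0, 1.0, 200)

print(s_norm(5.0, enroll_cohort, test_cohort))
```

Because the cohort comes from the matched data, its statistics absorb the global shift in the score distribution for that language.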

So, this can also be seen in these DET curves, where we are representing the DET curves obtained for Chinese. We have the DET curve obtained using only English data for development; the blue curve uses matched training data to perform i-vector adaptation; and the black curve uses matched Chinese data to perform i-vector adaptation and s-norm. We see that we get a slight improvement, but we are still too far from English, that is, from the results we would like to get.

There is also another important effect that covariate shift introduces: in the presence of covariate shift, we will find a misalignment in the score distributions. This is something that is widely known, and you can see this effect here, in the example we have: we have represented the English and Chinese score distributions, and we can see that the Chinese score distributions are shifted to the right, towards higher scores, which is probably related to the fact that the i-vectors are concentrated in a small region.

So, if we have a small amount of matched data, it is mandatory to use part of it for calibration. This is something that everybody knows and that we have been doing: in all NIST evals we always calibrate each condition separately, and we also use calibration techniques with side info, where we add the language as a condition. But it's important because, if we only have a small amount of data and we need to use an independent part of the data for calibration, we will not have much data left for adaptation.

So, here we are representing the DCF for our languages: in red, the actual DCF when we use English data for calibration, and next to it the actual DCF when we use matched data. It is clear that it's mandatory to use matched data for calibration.
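As a toy sketch of the kind of calibration step this implies (plain logistic regression on the scores; real systems typically use prior-weighted linear logistic regression, e.g. as in the FoCal/BOSARIS tools, which this simplified version does not reproduce):

```python
import numpy as np

def train_linear_calibration(scores, labels, lr=0.5, iters=5000):
    """Fit s' = a*s + b by gradient descent on the logistic loss, so that
    calibrated scores separate targets and non-targets sensibly."""
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels                 # derivative of loss w.r.t. logit
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(0)
tar = rng.normal(4.0, 1.0, 300)   # matched-data target scores (shifted up)
non = rng.normal(1.0, 1.0, 3000)  # matched-data non-target scores
scores = np.concatenate([tar, non])
labels = np.concatenate([np.ones_like(tar), np.zeros_like(non)])

a, b = train_linear_calibration(scores, labels)
print(a, b)  # calibrated score for a trial: a * raw_score + b
```

Training `a` and `b` on matched scores is what repositions the shifted score distribution so that a fixed threshold behaves as intended; doing it on English scores would leave the Chinese distributions misaligned.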

So, as conclusions of this work, we'll say that dataset shift is usual in speaker recognition. There are many techniques developed to compensate for it, but most of them need a large amount of data to work properly, while in many real cases little data is provided.

So, if we had monolithic systems, they would enable us to perform some sort of adaptation. But state-of-the-art techniques tend towards modularity, since development is much easier when we have a modular system, as in PLDA. There are techniques that can work with these modular systems, but they obtain only a slight increase in accuracy; there is still a huge gap to close.

And finally, it's important to keep in mind that matched data is mandatory for calibration, so if we have a small amount of data for adaptation, we will need to use part of this data for calibration.

So, that's all, thank you very much.

You mean, in this work?

You mean this work or in the literature?

I'm not sure, but you can see that, for example, your i-vectors don't match your distribution, your expected prior distribution, and need a new one; or even, at lower levels, your statistics, or your MFCCs. But yes.

But it would be interesting. I think the problem is that, if you want to have a compensation basis, it would be interesting to have at some point a JFA- or maybe eigenchannel-based system that is described as a probabilistic framework that you could adapt, and define some technique. It would be interesting to do it.

So you mean using a lower-dimensional i-vector extractor?

okay

But in any case, if you adapt your i-vector extractor, you will need to retrain your PLDA system.

Yeah, yeah. Have you tried to remove the specific means of the specific channel conditions? For example, to remove the telephone mean from the telephone data and the microphone mean from the microphone data?

No, I haven't tried that.

Sounds risky. It may work, but it's like assuming that there is no rotation in the i-vectors, just a shift; if there is rotation, it will not work. I don't know.

It is interesting to try. I've tried that and it was helping

It was helping? Ok, that's interesting.

okay

Well, especially in those languages where I don't have much matched data yet. Yeah, that might be... I think in most languages it's pretty balanced, but there are some languages... I remember that, for example, Hindi had... Hindi-Urdu had... seven speakers in total. So that was... I remember it is probably quite unbalanced, but maybe we have female speakers. Well, okay.

Well, not for Chinese, for example. It depends on the language, but I would say that i-vector adaptation is the one that works, as it always gives some improvement. It's not much, but... yeah.

The matched data. So, when I work I use... these techniques try to use the matched data, but in our work they barely improve the accuracy of the system. Not much; I don't think the improvement was significant, if there was improvement. Maybe there were some losses.

So you mean that, if I get my model speakers from English, it will also help if we perform some of these techniques to adapt to them?

okay

okay

I see that sometimes you can't do anything without the data, because there are certain sources of variability... variability in the first place. It's a general comment to all of us.

Yeah, okay. Well, in fact, there are techniques that provide more robustness, like the results presented in the last talk, based on integrating out the PLDA parameters so as to model the uncertainty of these parameters; that should be more robust to dataset shift. But, you see, the point here is: if you have some amount of data, it's better to use it. But you're right. You are completely right, of course.