This whole session is about compact representations for speaker identification, and the title of the first talk is "A Small Footprint i-Vector Extractor".

So, to repeat, this talk is about a small footprint i-vector extractor, one that requires only a modest amount of memory. The trouble we have is that the standard algorithms for extracting i-vectors are quadratic, in both their memory and their computational requirements.

What we present is a follow-up on a paper that Ondrej Glembek presented at last year's ICASSP, an approximation showing how i-vectors could be extracted with minimal memory overhead. After discussing that work, what I intend to do is to show how that idea can be taken further.

As for the principal motivation for doing this work: it's well known that you can introduce approximations at run time with only minor degradation in recognition performance. However, these approximations generally do cause problems if you want to do training. So, the motivation for aiming at exact posterior computations was to be able to do training, and, in particular, to be able to do training on a very large scale. Traditionally, most work with i-vectors has been done with dimensions of four or six hundred. In other areas of pattern recognition, principal components analyzers of much higher dimension than that are constructed, so the purpose of the paper was to be able to run experiments with very high-dimensional i-vector extractors.

As it happens, this didn't pay off, but the experiments needed to be done, you know, in any case.

Okay, so the point of i-vectors, then, is that they provide a compact representation of an utterance: typically a vector of four or eight hundred dimensions, independently of the length of the utterance. So the time dimension is banished altogether, which greatly simplifies the problem. Essentially, it now becomes a traditional biometric pattern recognition problem without the complication introduced by arbitrary duration. So, many standard techniques apply, and joint factor analysis becomes vastly simpler; so simple that it now has another name: it's called probabilistic linear discriminant analysis.

And, of course, the simplicity of this representation has enabled fruitful research in other areas, like language recognition, and even speaker diarization, for i-vectors can be extracted from speaker turns as short as just one second.

So, the basic idea is that there is an implicit assumption that, given an utterance, it can be represented by a Gaussian mixture model. If that GMM were observable, then the problem of extracting the i-vector would simply be a matter of applying a standard probabilistic principal components analysis to the GMM supervector. So, the basic assumption is that the supervectors lie in a low-dimensional space; the basis of that space is known as the eigenvoices, and the coordinates of the supervector relative to that basis are the i-vector representation. So, the idea is that the components of the i-vector should represent high-level aspects of the utterance which are independent of the phonetic content. Because all of this apparatus is built on top of the UBM, the UBM can play the role of modelling the phonetic variability in the utterance, and the i-vector should then capture things like speaker characteristics, the room impulse response, and other global aspects of the utterance.

So, the problem that arises is that the GMM supervector is not observable. The way to get around the problem is by thinking of the Baum-Welch statistics, which are typically collected with the universal background model, as summarising a noisy observation of the GMM supervector. From the mathematical point of view, the only difference between this situation and standard probabilistic principal components analysis is that in the standard situation you get to observe every component of the vector exactly once, whereas in this situation you observe different parts of the vector different numbers of times. Other than that, there is nothing mysterious in the derivation.

So, this is the mathematical model: the supervectors are assumed to be confined to a low-dimensional subspace of the supervector space. The vector y is assumed, in the prior, to have a standard normal distribution. Now, the problem is, given the Baum-Welch statistics, to produce a point estimate of y, and that is the i-vector representation of the utterance. You can also write the model in terms of the individual components of the GMM; the standard assumption is that the covariance matrix here remains unchanged, it's the same for all utterances.
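For reference, since the slide itself is not in the transcript, the standard model being described can be written as follows (my reconstruction, not a transcription of the slide):

$$
s(u) = m + T\,y(u), \qquad y(u) \sim \mathcal{N}(0, I),
$$

or, component by component for UBM component $c$,

$$
\mu_c(u) = m_c + T_c\,y(u), \qquad \Sigma_c \ \text{fixed for all utterances},
$$

where $s(u)$ is the GMM supervector of utterance $u$, the columns of $T$ are the eigenvoices, and the point estimate of $y(u)$ is the i-vector.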

Attempting to make that utterance-dependent seems to lead to insuperable problems in practice; nobody, to my knowledge, has ever made any progress with that problem.

One aspect that is common to most implementations is that some of the parameters, namely the mean vectors and the covariance matrices, are copied from the UBM into the probabilistic model. That actually leads to a slight improvement in the performance; I'll report some results later. The main advantage, though, is that you can simplify the implementation by performing an affine transformation of the parameters, which enables you to take the mean vectors to be zero and the covariance matrices to be the identity, and that enables you to handle UBMs with full covariance matrices in the simplest way. It's well known that using full covariance matrices does help.

So, these are the standard equations for extracting the i-vectors, assuming that the model parameters are known.
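Written out, since the slide is not in the transcript (again a reconstruction rather than a transcription), the posterior computation from the zero-order statistics $N_c$ and the centred first-order statistics $f_c$ is

$$
L = I + \sum_c N_c\, T_c^{\top} \Sigma_c^{-1} T_c, \qquad
\langle y \rangle = L^{-1} \sum_c T_c^{\top} \Sigma_c^{-1} f_c,
$$

where $L$ is the posterior precision matrix of $y$ and the posterior mean $\langle y \rangle$ is the i-vector.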

The problem is accumulating this matrix here, which involves the zero-order statistics that are extracted with the UBM. The standard procedure is to precompute these terms; they're symmetric matrices, so you only need to store the upper triangle. The problem, from the memory point of view, is that because these are quadratic in the i-vector dimension, you have to pay a heavy price in terms of memory.

So, those are some typical figures for a fairly standard sort of configuration.
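To make the memory issue concrete, here is a minimal numpy sketch of the standard approach under the simplifying assumptions above (statistics whitened so that the covariances are the identity); the array names and the example figures in the comment are mine, not the ones from the slide.

```python
import numpy as np

def precompute_quadratic_terms(T):
    """Standard approach: pre-compute G_c = T_c' T_c for every UBM component.

    T : (C, F, M) eigenvoice matrix per component, statistics assumed whitened.
    Returns C symmetric M x M matrices: the term whose storage is quadratic in M.
    """
    return np.einsum('cfi,cfj->cij', T, T)

def extract_ivector_exact(T, N, F, G):
    """Exact posterior mean: y = L^{-1} T'f with L = I + sum_c N_c G_c."""
    _, _, M = T.shape
    L = np.eye(M) + np.einsum('c,cij->ij', N, G)   # posterior precision
    b = np.einsum('cfm,cf->m', T, F)               # T'f
    return np.linalg.solve(L, b)

# Illustrative footprint (my own figures): with C = 2048 components, M = 400,
# single precision, and only the upper triangles stored, the pre-computed terms
# take 2048 * (400 * 401 / 2) * 4 bytes, roughly 0.66 GB, growing quadratically in M.
```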

These are the standard training algorithms; the only point I wanted to make in putting up this equation here is that, both in training and in extracting the i-vector, which was on the previous slide, the principal computation is a matter of calculating the posterior distribution of that factor y. So, that is the problem: calculating the posterior distribution of y, not just the point estimate.

So, the contribution of this paper is to use a variational Bayes implementation of the probability model in order to solve this particular problem of calculating the posterior.

So, the standard approach is to assume that the posterior distribution that you're interested in factorizes; in other words, that there is a statistical independence assumption that you can impose. Estimating these terms is carried out by a standard variational Bayes update procedure, which you can find in the references, or in Bishop's book. This notation here means that you take the vector y and calculate an expectation over all of its components other than the particular component that you happen to be interested in when you're updating that particular term.

These update rules are guaranteed to increase the variational lower bound, and that's a useful property. So, this is an iterative method; a single iteration consists of cycling over the components of the i-vector.

This is just to explain that the computation is actually brought down to something very simple: the assumptions are Gaussian, so the factors in the variational factorization are also Gaussian, and to get the normal factors you just evaluate this expression. The point about the memory, then, is that, just as in the full posterior calculation, pre-computing these matrices here enables you to speed up the computation, but the things you have to pre-compute here are just the diagonal versions of these things, and for that the memory overhead is negligible.
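As a rough illustration of what one such iteration can look like, here is a minimal numpy sketch of component-wise updates of this kind, again assuming whitened statistics; it is my own reconstruction of the idea rather than the author's code, and the names T, N and F are assumptions.

```python
import numpy as np

def extract_ivector_vb(T, N, F, n_iter=3):
    """Point estimate of the i-vector obtained by cycling over its components.

    T : (C, F_dim, M) whitened eigenvoices, N : (C,) zero-order statistics,
    F : (C, F_dim) centred first-order statistics.
    Only the diagonal of the posterior precision is pre-computed, so the extra
    memory is O(M) rather than quadratic in M.
    """
    C, F_dim, M = T.shape
    diag_L = 1.0 + np.einsum('c,cfm,cfm->m', N, T, T)  # diag of I + sum_c N_c T_c'T_c
    b = np.einsum('cfm,cf->m', T, F)                   # T'f

    y = np.zeros(M)
    u = np.zeros((C, F_dim))                           # u_c = T_c y, kept up to date
    for _ in range(n_iter):                            # a few sweeps suffice in practice
        for i in range(M):
            # i-th row of the posterior precision applied to y, without forming the matrix
            Ly_i = y[i] + np.einsum('c,cf,cf->', N, T[:, :, i], u)
            delta = (b[i] - Ly_i) / diag_L[i]
            y[i] += delta
            u += T[:, :, i] * delta
    return y
```

Each sweep costs on the order of C times F times M operations, which is where the linear scaling with the i-vector dimension discussed later comes from.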

So, this is all based on the assumption that we can take the posterior covariance matrix to be diagonal. It's explained in the paper why, even if that assumption turns out to be wrong, the variational Bayes method is guaranteed to find the point estimate of the i-vector exactly. So, the only error that's introduced here is in the posterior covariance matrix, which is assumed to be diagonal; there's no error in the point estimate of the posterior.

If you're familiar with numerical methods, the mechanics correspond to something known as the Gauss-Seidel method, whose convergence in this case happens to be guaranteed because of the variational Bayes argument. So the method is exact; the only real issue is how efficient it is. That turns out to raise the question of how good the assumption is that the posterior covariance matrix can be treated as diagonal.

There are two points to bear in mind here, in order to show why the assumption is reasonable. The first is that the i-vector model is not uniquely defined: you can perform a rotation of the i-vector coordinates, and provided that you perform a corresponding transformation on the eigenvoices, the model remains unchanged. The prior on the factor y continues to be a standard normal distribution, so you have freedom in rotating the basis.

The other point is the one that Ondrej Glembek made in his ICASSP paper last year: that, in general, this is a good approximation to the posterior precision matrix, provided you have sufficient data. Those w's there are just the mixture weights in the universal background model, and taking the number of frames in, for example, a scenario like the core condition, you have sufficiently many frames that this approximation here would be reasonable.
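In symbols (my reconstruction of the approximation being referred to, not a transcription of the slide): if $N(u)$ is the total number of frames in the utterance and $w_c$ are the UBM mixture weights, then $N_c(u) \approx N(u)\,w_c$, so that

$$
I + \sum_c N_c(u)\, T_c^{\top} \Sigma_c^{-1} T_c \;\approx\; I + N(u) \sum_c w_c\, T_c^{\top} \Sigma_c^{-1} T_c .
$$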

If you combine those two things together, you can say that by diagonalizing this sum here, which you form just once using the mixture weights, you will produce a basis of the i-vector space with respect to which all the posterior precision matrices are approximately diagonal. That's the justification for the diagonal assumption: you have to use a preferred basis in order to do the calculations.
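A minimal sketch of how such a preferred basis could be computed, again under the whitening assumption and with my own array names:

```python
import numpy as np

def preferred_basis(T, w):
    """Rotate the i-vector basis so that posterior precisions are roughly diagonal.

    T : (C, F_dim, M) whitened eigenvoices, w : (C,) UBM mixture weights.
    The weighted sum of T_c' T_c is diagonalized once; because the prior on y is
    standard normal, this orthogonal rotation leaves the model unchanged.
    """
    W = np.einsum('c,cfi,cfj->ij', w, T, T)   # sum_c w_c T_c' T_c  (M x M)
    _, R = np.linalg.eigh(W)                  # orthogonal matrix of eigenvectors
    T_rot = np.einsum('cfm,mk->cfk', T, R)    # eigenvoices expressed in the rotated basis
    return T_rot, R
```

In this basis, the posterior precision $I + \sum_c N_c\, T_c^{\top} T_c$ is approximately diagonal whenever $N_c \approx N w_c$, which is exactly the justification given above.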

And using this basis guarantees that the variational Bayes algorithm will converge very quickly. Typically, three iterations are enough, independently of the dimensionality of the i-vector. And that's the basis of my contention that this algorithm's computational requirements are linear in the i-vector dimension. So, the memory overhead is negligible and the computation scales linearly rather than quadratically.

If you're using this, bear in mind that the preferred basis is going to change whenever the eigenvoices are updated, so you should not overlook that.

Whenever you have a variational Bayes method, you have a variational lower bound, which is very similar to an EM auxiliary function and which is guaranteed to increase on successive iterations of your algorithm. So, it's useful to be able to evaluate this, and the formula is given in the paper. It's guaranteed to increase on successive iterations of variational Bayes, so it can be used for debugging. In principle, it can be used to monitor convergence, but it actually turns out that the overhead of using it for that purpose slows down the algorithm, so it's not used for that in practice.

The thing is that it can be used to monitor convergence when you are training an i-vector extractor with variational Bayes. The point here is that the exact evidence, which is the thing you would normally use to monitor convergence, cannot be used in this particular case: if you're assuming that the posterior is diagonal, then you have to modify the calculation.

Okay, so a few examples of questions that I dealt with in the paper. One is: how accurate is the variational Bayes algorithm? To be clear here, there is no issue at run time; you are guaranteed to get the exact point estimate of your i-vector, provided you run enough iterations. The only issue is the approximation you make when you treat the posterior precision or covariance matrix as diagonal. And those posterior precisions do enter into the training updates, so it's conceivable that making this assumption about the posterior precisions could affect the way training behaves. So, that's one point that needs to be checked.

As I mentioned at the beginning, this is well known, but I think it needed to be tested: if you make the simplifying transformation which allows you to take the mean vectors to be zero and the covariance matrices to be the identity, you're copying some parameters from the UBM into the probabilistic model for i-vectors. There is a question as to whether that is the right thing to do, although obviously there is a plausible reason for it.

How efficient is variational Bayes? Obviously, there's going to be some price to be paid. In the standard implementation you can reduce the computational burden at the cost of several gigabytes of memory; here, you no longer have the opportunity of using all that memory. So that's the question about efficiency.

And finally, there is the issue of training very high-dimensional i-vector extractors. You cannot train very high-dimensional i-vector extractors exactly using the standard approach; the variational Bayes approach does enable you to do it, but there is the question of whether there is any point in doing that.

Okay, so the testbed was female det two trials, telephone speech, from the extended core condition of the NIST two thousand and ten evaluation. The extended core condition has millions of trials, a very much larger number than the original evaluation protocol. I used the standard front end and the standard UBM with diagonal covariance matrices, trained on the usual data. In other respects, the classifier was quite standard; I used heavy-tailed PLDA.

These were results obtained with the JFA executables, which is the way i-vectors were originally built; they were produced just as a benchmark. There was a problem with the activity detection, which explains why the error rates were a little higher than expected.

With variational Bayes I actually got marginally better results, which turned out to be due to a more effective treatment of the covariance matrices: the effect of copying the covariance matrices is to reduce the estimated variances.

I need to get to efficiency. My figures for extracting a four-hundred-dimensional i-vector are typically about half a second. Almost all of the time is spent in BLAS routines; accumulating the posterior covariance matrices takes seventy-five percent of the time. There is an estimate of a quarter of a second, which suggests that compiler optimization may be helpful, since everything is going on inside the BLAS. For the variational Bayes method, I've got an estimate of point nine seconds instead of point five, with the number of iterations fixed at five.

The variational Bayes method really comes into its own when you work with higher-dimensional i-vector extractors.

This is one last table: I did try several dimensions, up to sixteen hundred, and got a good indication of the performance. Okay, thank you.

Well, it depends on the dimensionality of the i-vector extractor. A couple of gigabytes; it's as big as a large vocabulary continuous speech recognizer. It's not as intelligent, but it's as big.

Just the eigenvoices, okay, together with the stuff you have to pre-compute in order to extract the i-vectors efficiently.

Yeah, that's what's required. You still have to store the eigenvoices, but, you know, that's not the big part. The big part is the bunch of triangular matrices that we store in order to calculate i-vectors efficiently using the standard approach. The point of this was to use variational Bayes to avoid that computation.

I'm afraid that we've just used up the time for questions here, and so I guess that Patrick will get lots of questions offline, and maybe even at the end of the session he can answer those questions together with Sandro, who is giving the next talk.