Exploring similarity and fusion of i-vector and sparse representation based speaker verification systems


I'm Haris from the Indian Institute of Technology Guwahati, and I am here to present our work on exploring the similarity and fusion of i-vector based and sparse representation based speaker verification systems.

As we all know, i-vector based systems form the current state of the art in speaker verification. Recently, some works have also explored the use of sparse representation for speaker verification and for speaker recognition tasks in general.

In the sparse representation based works, both exemplar dictionary based and learned dictionary based techniques have been explored. In this work we try to find the similarity between the sparse representation based and the i-vector based speaker verification systems. We also propose a feature-level combination, or fusion, of these two systems that exploits the advantages of both.

So, again, I will give a short review of the i-vector based system, which we're all familiar with. The i-vector based speaker verification system can be interpreted as obtaining a compact representation of the high-dimensional supervectors by projecting them onto a low-rank matrix called the total variability matrix. The i-vector can be estimated using the given equation. We find the i-vector representations of the training and the testing utterances, and find the similarity between the two using the cosine kernel.

Now, before going to the basics of the sparse representation based speaker verification system, we'll have a look at the fundamentals of sparse representation. Sparse representation tries to represent a vector, y, using a dictionary, or matrix, as a linear combination of the columns of the dictionary matrix. We also put a constraint on the number of columns used; that's how we find the sparse representation vector, which takes only a few non-zero values.

The columns of the dictionary are often called atoms in the sparse representation community, and the transformation matrix is called the dictionary.
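To make the idea of representing y with only a few atoms concrete, here is a minimal NumPy sketch of greedy orthogonal matching pursuit, one of the sparse solvers named later in the talk. The function name `omp`, the parameter `n_atoms` and the stopping tolerance are illustrative choices, not taken from the talk.

```python
import numpy as np

def omp(D, y, n_atoms):
    """Approximate y as a combination of at most n_atoms columns (atoms) of D."""
    residual = y.astype(float).copy()
    support = []                      # indices of the atoms selected so far
    coef = np.zeros(0)
    for _ in range(n_atoms):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = -1.0   # never pick the same atom twice
        support.append(int(np.argmax(correlations)))
        # Least-squares refit of y on all atoms selected so far
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

# Toy usage: a random dictionary with unit-norm atoms (assumed data)
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 30))
D /= np.linalg.norm(D, axis=0, keepdims=True)
y = 3.0 * D[:, 4]
x = omp(D, y, n_atoms=2)
```

The returned vector x has as many entries as the dictionary has atoms, and only the selected atoms carry non-zero coefficients.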

Applications of sparse representation range from compression and de-noising to classification. It has been well explored in image processing and in many other areas of signal processing. The basic idea behind sparse representation based classification is that a test example from a class can be approximated as a linear combination of the training examples of that class.

One of the very first works on this was for a face recognition task. The dictionary was formed using training examples from different classes, and classification was performed by finding the sparse representation of the test example over the dictionary created from the training examples. We call these approaches exemplar based, because the dictionaries are created using examples from the different classes.

Motivated by this work, a few works explored the same idea for speaker identification and then for speaker verification. As we know, speaker verification is not a closed-set task, so that work used a set of background speakers to create the exemplar dictionary. The claimed speaker's training examples and a set of background utterances, all in the form of vectors, which can be supervector or i-vector representations, form the exemplar dictionary. The test example is then represented by finding its sparse representation over this dictionary. Many algorithms are available for finding the sparse solution. It can be l0-minimization based or l1-minimization based. An example of l0 minimization is orthogonal matching pursuit (OMP), or we can go for basis pursuit or Lasso; any of these algorithms.

How do we do the scoring for speaker verification from this sparse representation vector? We take the sum of the coefficients in the sparse vector that correspond to the claimed speaker's training examples. The ratio of this sum to the total of the coefficients in the sparse vector, claimed plus background, is taken as the verification score. This was actually proposed in an earlier work. If there are multiple examples for a claimed speaker, we take the sum of the coefficients corresponding to them. Otherwise we have only one example, so we take the coefficient corresponding to that particular example.
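The exemplar-dictionary scoring just described can be sketched as follows. This is one possible reading, not the cited work's exact formula: the ratio form and the use of absolute values (which keeps the score in [0, 1]) are assumptions, and `exemplar_score` and `claimed_idx` are hypothetical names.

```python
import numpy as np

def exemplar_score(x, claimed_idx):
    """Share of the sparse-coefficient mass that falls on the claimed speaker's atoms.

    x           : sparse representation of the test example over the exemplar dictionary
    claimed_idx : column indices of the claimed speaker's training examples
    """
    x = np.asarray(x, dtype=float)
    total = np.sum(np.abs(x))          # claimed plus background coefficients
    if total == 0.0:
        return 0.0
    return float(np.sum(np.abs(x[claimed_idx])) / total)

# Toy usage: atom 1 belongs to the claimed speaker, the rest are background
x = np.array([0.0, 0.6, 0.0, 0.2, 0.2])
score = exemplar_score(x, [1])
```

With several training examples for the claimed speaker, `claimed_idx` simply lists all of their columns, so their coefficients are summed.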


As an improvement over this work, in our previous work we proposed the use of a learned dictionary for sparse representation based speaker verification. Here, similar to the i-vector formulation, we learn a dictionary, which in our notation is D.


We use the centered, mean-shifted supervectors as the speaker representation. The training as well as the testing examples, that is, the corresponding supervectors, are represented over the learned dictionary, which is itself learned from supervector representations. The sparse representations of the training and testing examples are extracted using orthogonal matching pursuit. As in the i-vector system, we find the similarity between these two representations using the cosine kernel. In this work we refer to this system as SRSV.


We have used a couple of methods for learning the dictionary. One is the well-known KSVD algorithm for learning a dictionary for sparse approximation. We also used a modified version of KSVD, the S-KSVD algorithm, which is a supervised version of KSVD. KSVD is an iterative method run on the development data. As shown in the figure, there are two phases: the sparse coding phase and the dictionary update phase. We initialize the process with a random dictionary, or one created from randomly chosen examples, and find the sparse representation of the development data over the dictionary. In the next stage we update the columns of the dictionary using singular value decomposition. This is done iteratively to get the optimized dictionary. In the sparse coding stage any sparse coding algorithm can be used. In our case we have used OMP.
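The two alternating phases can be sketched in NumPy as below. This is a compact illustration of the standard KSVD loop under simplifying assumptions (toy random data, a fixed number of iterations, unused atoms left untouched); it is not the speakers' implementation, and the names `omp`, `ksvd`, `n_dict_atoms` and `sparsity` are illustrative.

```python
import numpy as np

def omp(D, y, n_atoms):
    """Greedy sparse coding: at most n_atoms non-zero coefficients."""
    residual, support, coef = y.astype(float).copy(), [], np.zeros(0)
    for _ in range(n_atoms):
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = -1.0               # skip atoms already selected
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def ksvd(Y, n_dict_atoms, sparsity, n_iter=10, seed=0):
    """Learn a dictionary D such that Y is approximately D @ X with sparse columns in X."""
    rng = np.random.default_rng(seed)
    # Initialise the dictionary with randomly chosen, normalised examples
    idx = rng.choice(Y.shape[1], size=n_dict_atoms, replace=False)
    D = Y[:, idx].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # Sparse coding phase: code every development example over the current D
        X = np.column_stack([omp(D, Y[:, i], sparsity) for i in range(Y.shape[1])])
        # Dictionary update phase: refit one atom at a time with a rank-1 SVD
        for k in range(n_dict_atoms):
            users = np.flatnonzero(X[k])       # examples that use atom k
            if users.size == 0:
                continue
            # Representation error with atom k's own contribution added back
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, users] = s[0] * Vt[0]
    return D, X

# Toy usage on random "development data" (assumed, for illustration only)
rng = np.random.default_rng(1)
Y = rng.standard_normal((10, 40))
D, X = ksvd(Y, n_dict_atoms=15, sparsity=3, n_iter=3)
```

Each atom update only touches the examples that already use that atom, so the sparsity pattern found in the coding phase is preserved.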

The difference in the S-KSVD algorithm is that it uses the class labels of the training examples. In KSVD, the basic goal is to minimize the representation error subject to the sparsity constraint. In S-KSVD, apart from minimizing the representation error, we also put a constraint on class separability. That means we minimize the representation error while maximizing the separability of the representations, for which we use a Fisher criterion. So this can be considered a discriminative dictionary, something like LDA incorporated into the dictionary learning concept. These are the two methods we have used for learning the dictionary. Our experiments were done on the NIST 2003 database. When we were doing these experiments we had access only to the 2003 data; we are trying to get results on the latest databases.

Now, coming to this work. If I go back to the slide, the formulation here is very much similar to the i-vector formulation. The difference lies in the representation that is extracted, which is sparse here, whereas in the case of the i-vector it is a full vector. The other difference is the way we learn the dictionary. T matrix learning is something similar to PPCA, whereas in this method the dictionaries are learned with a sparsity constraint, expecting a sparse representation.

Also, as I said, we have used OMP. OMP is a greedy approach for minimizing the l0 norm of the vector x. Here we can have either a constraint on the representation error or a constraint on the sparsity.
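Written out, the two options look like this, with y the vector being represented, D the dictionary and x the sparse vector. The error tolerance \epsilon and the atom budget k are symbols introduced here for illustration; the talk does not name them.

```latex
% Error-constrained form: use as few atoms as possible within a tolerance
\min_{x} \; \|x\|_0 \quad \text{subject to} \quad \|y - Dx\|_2 \le \epsilon

% Sparsity-constrained form: best fit using at most k atoms
\min_{x} \; \|y - Dx\|_2 \quad \text{subject to} \quad \|x\|_0 \le k
```

The experiments that follow vary the number of atoms selected, which corresponds to the second form.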

In this figure we examine the effect of sparsity on the final speaker verification results. There are two sparse representation processes involved: one in the dictionary learning phase and one in the decoding, or testing, phase. So the question is which sparsity should be used while learning the dictionary and which at the time of testing. We try to find the best value for this sparsity constraint.

While learning the dictionary, we are taking the sparse representation of seen data. That is, we initialize the dictionary, represent the same data over the dictionary, and update it, so the whole process runs on seen data. We can therefore expect a more compact representation than when doing the sparse representation of unseen data. We have observed that in the dictionary learning phase we should use very high sparsity, that is, a very small number of atoms selected, whereas in the testing phase we should relax the constraint and allow a higher number of atoms, because that is done on unseen data; the development and evaluation data are also different. In this particular work we observed that selecting five atoms during dictionary learning and fifty atoms during representation gives the best results.



Here I compare the results obtained from the i-vector based system and the sparse representation based system with the KSVD dictionary. In terms of equal error rate, the i-vector system is better by one percent than the sparse representation based system using the KSVD dictionary. Now look at the distribution of scores, to see how exactly these two systems differ.

These are the true-trial and false-trial score distributions. The red curve is the false-trial score distribution and the blue one is the true-trial score distribution. This one corresponds to the i-vector system, and this one to the sparse representation. Here you can see that the false-trial scores are peaked at zero compared to the i-vector system.

That's expected for sparse representations. Unlike with i-vectors, in a false trial there is a high chance of getting orthogonal representations. In the false trials the speakers are different, so the atoms selected for the two speakers can be different. The cosine kernel will then give zero scores in many of the false trials, which leads to a distribution like this.
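A tiny numerical illustration of this effect: when two sparse vectors select disjoint sets of atoms, their inner product, and hence their cosine score, is exactly zero. The vectors below are made-up toy values, and `cosine_score` is an illustrative helper name.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine kernel between two representation vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0.0 else float(a @ b / denom)

# Two sparse codes over a 6-atom dictionary with no selected atom in common
train = np.array([0.0, 1.2, 0.0, 0.0, -0.5, 0.0])   # uses atoms 1 and 4
test = np.array([0.7, 0.0, 0.0, 0.3, 0.0, 0.0])     # uses atoms 0 and 3

false_trial_score = cosine_score(train, test)   # disjoint supports
same_vector_score = cosine_score(train, train)  # identical supports
```

Dense representations such as i-vectors almost never have an exactly zero inner product, which is why their false-trial scores spread around zero instead of piling up on it.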

The mean of the true trials has shifted towards the right, which is good, but at the same time the variance of the true-trial scores has increased, which ultimately makes the system perform worse than the i-vector system. So the peaky false-trial distribution is good, and the shift towards the right is good, but the increased variance makes the system perform inferior to the i-vector based system.

There is another work which does the sparse representation over the T matrix. In the original work they used the Lasso algorithm for finding the sparse representation, instead of the i-vector, over the T matrix. Here we have repeated this experiment using OMP, to match our previous experiments. Unlike Lasso, OMP has an explicit sparsity constraint, so we examined how performance changes with sparsity for both the T matrix and the KSVD dictionary. The blue curve shows the equal error rates of the SRSV system with the T matrix as the dictionary, and the green curve shows KSVD. As the third one we have shown the classical i-vector system.

Here you can see that, since the T matrix is not learned with a sparsity constraint, it gives really bad performance when a small number of atoms is selected for the representation, whereas the KSVD based system gives a decent result, comparable with the i-vector, with a small number of atoms selected. When we increase the number of columns selected, it ultimately approaches the i-vector performance, because the i-vector is also a full representation. As noted before, SRSV with KSVD performs slightly inferior to the i-vector system.


So the conclusion is: the SRSV system with the T matrix as the dictionary performs poorly with high sparsity and approaches the i-vector performance with lower sparsity. With the use of all atoms, the performance of the T-SRSV system matches that of the i-vector system.

Next, we tried to find the effect of dictionary size, which the previous talks also touched on with larger-dimension i-vectors. Basically, we go for a larger number of columns in the T matrix. This slide shows the effect of the size of the dictionary on the sparse representation based systems. The blue curve shows the performance of the i-vector system for various dictionary sizes, the green is SRSV, and the red is SRSV with the T matrix as the dictionary. Here the number of atoms, or columns, selected matches the KSVD one, and the i-vector, of course, is a full representation.

With these atoms selected, the T-SRSV performs very ... I mean, the performance is good compared to the other two. In the case of the i-vector, three hundred or four hundred columns hardly make any difference, but for KSVD the best performance ... So we tried increasing the number of atoms selected for the representation in the T-SRSV case. We found that when we increase the number, for the various dictionary sizes as well, it approaches the performance of the i-vector system. Still, the i-vector based system gives the best performance.

Now, motivated by the performance of these three systems, the KSVD dictionary based SRSV, the T dictionary based SRSV (T-SRSV) and the i-vector, we tried to use the strengths of both approaches and proposed a fusion of these representations. To do this, we found the i-vector representation of the supervector using the T matrix in the conventional way, and then resynthesized the supervector from it. We termed this T smoothing. We know that projection onto a lower-dimensional space reduces the dimension and removes small nuisances.


This helps in classification. This intensity plot shows the results of a control experiment with twenty-five speakers, each having five examples. We find the cosine-kernel similarity between the supervectors before smoothing and after smoothing. For all combinations, the cosine-kernel similarity improves after smoothing, but the improvement in the within-class case is much larger than in the between-class cases, which of course helps classification performance. We then used these smoothed supervectors for learning the dictionary and for the sparse representation.
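The smoothing step, projecting a supervector onto the T matrix and resynthesizing it, can be sketched roughly as below. This is a deliberately simplified stand-in: real i-vector extraction works from Baum-Welch statistics and UBM covariances, whereas this sketch assumes identity covariances and uses a ridge-regularized least-squares estimate. The names `t_smooth`, `s`, `m` and `w` are hypothetical.

```python
import numpy as np

def t_smooth(s, T, m):
    """Project a supervector onto the span of T and resynthesise it.

    s : supervector of one utterance
    T : low-rank total variability matrix (supervector dim x i-vector dim)
    m : mean (UBM) supervector used for centring
    """
    rank = T.shape[1]
    # Simplified i-vector point estimate (identity covariances assumed)
    w = np.linalg.solve(T.T @ T + np.eye(rank), T.T @ (s - m))
    # Resynthesised ("smoothed") supervector
    return m + T @ w, w

# Toy usage with random data standing in for a trained T matrix
rng = np.random.default_rng(0)
T = rng.standard_normal((50, 5))
m = rng.standard_normal(50)
s = m + T @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)
s_smooth, w = t_smooth(s, T, m)
```

The smoothed supervector keeps the original dimension, so it can be fed to dictionary learning and sparse coding exactly like an unsmoothed one.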


Now, coming to the results. Here we can compare the performance of the various methods we have tried. Comparing the i-vector with the KSVD dictionary system, as I mentioned, there is a difference of approximately one percent in equal error rate between the two. Then there is the T matrix based sparse representation with all atoms selected; its best performance came with the full representation.

Then another thing, which I already mentioned: with the use of the discriminative dictionary we get a large improvement in performance over KSVD. We need to consider that an LDA type of thing is incorporated into it, which justifies the improvement in performance. With smoothing using the T matrix, we recorded approximately thirty percent relative improvement for both dictionaries.

We also did some channel and session variability compensation using joint factor analysis, LDA and WCCN. Here we have the i-vector with LDA and WCCN as the channel compensation methods. Joint factor analysis we used as a preprocessing of the supervectors before doing the sparse representation. A combination of the two, sparse representation with LDA and WCCN, has also been tried. We also tried score-level fusion of the best performing systems, and this ended up with an equal error rate of 0.99 on the NIST 2003 database.

To summarize: we have highlighted the close similarity between the i-vector and sparse representation based SV systems. We have studied the use of the total variability matrix as a dictionary, with orthogonal matching pursuit as the algorithm for sparse representation. We found that, compared to the KSVD dictionary, the T matrix can be used as the dictionary with better results, but only with a high number of atoms selected. Among all the dictionaries, we found that the supervised one performed much better than the others. We also proposed a feature-level fusion of the i-vector and sparse representation based systems. And among the channel and session compensation methods, we found that for sparse representation the joint factor analysis based preprocessing helped more.

Time for questions, any questions?

So let me ask one: for the sparse representations, do you always work with mean supervectors from the system? Did you ever try to reconstruct ... if you obtain the supervectors by adaptation, you lose some information before ...

We actually normalized the supervectors with the cov...

What is the motivation for using sparse representation here, besides the fact that it's a technique which is available?

They are good speakers, and are representing all s

There can be some dimensions which

are closer to the particular

And just a question... you sure there?

You need not to be.

The representation

sparse representation

thank you

In my experiments I did not find


Let's thank the speaker again.