Speech Transcript - I-vector transformation and scaling for PLDA based speaker recognition

okay last undo

i'm going to present our work on i-vector transformation and scaling for plda based

speaker recognition

and the goal of this work

is to present a way to transform our i-vectors so that they better fit the plda

assumptions

and at the same time introduce a way

to perform some sort of dataset mismatch compensation similar to what length normalization

does for plda

as we all know plda assumes that the latent variables are gaussian which

results in i-vectors which if we assume they are independently sampled would

follow a gaussian distribution

now we all know this is not really the case

indeed

we have two main problems first of all

our

i-vectors do not really look like they should if they were sampled from a gaussian

distribution

for example here on the right

i'm plotting one dimension of the i-vectors the dimension with the highest skewness

i plot the histogram and it's quite clear that

the histogram doesn't really resemble anything like a gaussian distribution but it's even almost multimodal

then the other problem is that we have

a quite evident mismatch between development and evaluation

i-vectors

for example if we look at the left

there is a plot of the histogram of the squared i-vector norms for both

our development set which is sre ten female set

and evaluation which is the condition five female set of sre ten

and we can see two things first of all

the distributions of the norms for evaluation and development set are

quite different among themselves

and none of them resembles what we should expect

if these i-vectors were really sampled from a standard normal distribution
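As a reference point for the histograms discussed here: if i-vectors really were samples from a D-dimensional standard normal, their squared norms would follow a chi-square distribution with D degrees of freedom, with mean D and variance 2D. The sketch below checks this on synthetic data only. The dimensionality (150, the figure mentioned later in the Q&A) and the sample count are illustrative assumptions, not the experimental setup of the talk.

```python
import random
import statistics

def squared_norms(num_vectors, dim, seed=0):
    """Squared norms of synthetic vectors drawn from a standard normal."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
        for _ in range(num_vectors)
    ]

DIM = 150            # illustrative dimensionality (assumption)
NUM_VECTORS = 2000   # illustrative sample count (assumption)

norms = squared_norms(NUM_VECTORS, DIM)
mean_sq_norm = statistics.mean(norms)
var_sq_norm = statistics.variance(norms)

# For truly standard normal data these should be close to DIM and 2 * DIM.
print("mean squared norm:", mean_sq_norm)
print("variance of squared norm:", var_sq_norm)
```

Squared-norm histograms that sit far from this reference, or that differ between two sets, are the kind of mismatch described above.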

now

up to now we have

mainly two ways to approach

the issues i've presented

the first one was heavy-tailed plda by patrick kenny which mainly tries to deal with the

non gaussian behaviour

in that it removes the gaussian assumption

and assumes that i-vector distributions are heavy tailed

and the second one is length norm

which in our opinion is not really making things more gaussian but is really mainly

dealing with the dataset mismatch that we have between evaluation and development i-vectors

indeed here i'm doing the same plot that i was doing on the most skewed

dimension of the i-vectors before and after length norm and we can see that even if we apply

length norm this cannot compensate things like

the multimodal distribution of our i-vectors

it might actually compensate for heavy tailed behaviour that's for sure but still we

don't get things which are really

gaussian like

now in this work we want to address

the problem of doing both a transformation of i-vectors so that they better fit the

plda assumption so we try to gaussianize somewhat our i-vectors

and at the same time we propose

a way to perform the dataset compensation similar to length normalization the difference being that

this dataset compensation is tuned

for our transformation

and we estimate both at the same time

okay so

how do we perform these

let's first focus on how we

can transform i-vectors so that they better fit the gaussian assumption

to do that we assume that i-vectors are sampled from a random variable phi

whose pdf we don't know however we assume that we can express this random variable phi as

a function

of a standard normal random variable

now if we do this then we can express the log pdf of this random

variable phi as

the log pdf of

samples

which are mapped through the transformation and evaluated under the standard normal

summed with the log determinant of the jacobian of the transformation
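The change-of-variables rule described here can be illustrated in one dimension. The sketch below assumes a map g that sends an observed sample to the standard normal domain (the gaussianizing direction) and evaluates log p(x) = log N(g(x); 0, 1) + log |g'(x)|. The affine choice of g and the names `MU`, `SIGMA`, `g` and `log_abs_dg` are assumptions made for illustration; with this g the result must coincide with the closed-form normal log-density, which makes the formula easy to verify.

```python
import math

def std_normal_logpdf(y):
    """Log-density of a one-dimensional standard normal at y."""
    return -0.5 * (y * y + math.log(2.0 * math.pi))

def transformed_logpdf(x, g, log_abs_dg):
    """Change of variables: standard normal log-pdf of g(x) plus log |g'(x)|."""
    return std_normal_logpdf(g(x)) + log_abs_dg(x)

# Hypothetical example: data distributed as N(MU, SIGMA^2), gaussianized by an affine map.
MU, SIGMA = 1.5, 2.0

def g(x):
    return (x - MU) / SIGMA

def log_abs_dg(x):
    # g'(x) = 1 / SIGMA everywhere, so the log-determinant term is constant.
    return -math.log(SIGMA)

def normal_logpdf(x, mu, sigma):
    """Closed-form log-density of N(mu, sigma^2), used only as a cross-check."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

x_test = 0.3
lp_change_of_variables = transformed_logpdf(x_test, g, log_abs_dg)
lp_closed_form = normal_logpdf(x_test, MU, SIGMA)
print(lp_change_of_variables, lp_closed_form)
```

In the multivariate case the derivative term becomes the log of the absolute determinant of the Jacobian matrix.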

now the good thing is that we can

do two things with this model first of all we can estimate the function f

so as to maximize the likelihood of our i-vectors

and in that way we would obtain something which

gives us the pdf of i-vectors which is not anymore standard gaussian but depends

on the transformation

and the other thing is that we can also employ this function to transform

i-vectors so that the samples which follow the distribution of phi

become transformed into samples which follow

a standard normal distribution

to

model this unknown function we decided to follow a

framework which is quite similar to the neural network framework

that is we assume that we can express this transformation function as a composition of

several simple functions

which can be interpreted as layers of a neural network

now

the only constraint that we have with respect to the standard neural network here is

that we want to work with functions which are invertible so our layers have the

same size and the transformation they

produce needs to be invertible
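A minimal sketch of such a cascade, in one dimension and with stand-in layers: each layer exposes a forward map, its inverse, and the log of the absolute derivative, so that the log-determinant of the whole chain is the sum of the per-layer terms. The `Affine` and `Sinh` classes and the parameter values are illustrative assumptions, not the layers used in the talk.

```python
import math

class Affine:
    """Invertible affine layer y = a * x + b (requires a != 0)."""
    def __init__(self, a, b):
        if a == 0.0:
            raise ValueError("a must be non-zero for the layer to be invertible")
        self.a, self.b = a, b
    def forward(self, x):
        return self.a * x + self.b
    def inverse(self, y):
        return (y - self.b) / self.a
    def log_abs_deriv(self, x):
        return math.log(abs(self.a))

class Sinh:
    """Stand-in invertible nonlinearity y = sinh(x), with inverse asinh."""
    def forward(self, x):
        return math.sinh(x)
    def inverse(self, y):
        return math.asinh(y)
    def log_abs_deriv(self, x):
        return math.log(math.cosh(x))

def forward_with_logdet(layers, x):
    """Apply the layers in order, accumulating the log |derivative| of each."""
    total = 0.0
    for layer in layers:
        total += layer.log_abs_deriv(x)  # evaluated at the layer's input
        x = layer.forward(x)
    return x, total

def inverse(layers, y):
    """Undo the cascade by inverting the layers in reverse order."""
    for layer in reversed(layers):
        y = layer.inverse(y)
    return y

layers = [Affine(0.5, -0.2), Sinh(), Affine(2.0, 1.0)]
x0 = 0.7
y0, logdet = forward_with_logdet(layers, x0)
x_back = inverse(layers, y0)
print(y0, logdet, x_back)
```

Because every layer maps a value to a value of the same size and has an inverse, the cascade is invertible and its log-determinant is always defined.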

as we said we perform maximum likelihood estimation of the parameters of the transformation

and then instead of using the pdf directly we use the transformation function to map

back

our i-vectors to

let's say gaussian distributed i-vectors

here i have a small example on one dimensional data this is again

the most skewed dimension or most skewed component of our training i-vectors

and on the top left is the original histogram and on the right i plot the transformation

that we estimated

so as you can see from the top left

if we directly use the transformation

to evaluate the log pdf of the

original

i-vectors actually we obtain a pdf which very closely matches the histogram of our

i-vectors

then if we apply the inverse transformation to these data points we obtain what we

see in the bottom figure here

and what

does that show it shows that we managed to obtain a histogram of i-vectors which

very closely matches the gaussian

pdf which is plotted i don't know if it's visible but there is the pdf

of the standard normal gaussian which is pretty much on top of the histogram of

the transformed vectors

in this work

we decided to use a simple selection for our layers in particular we have

one kind of layer which does just an affine transformation that is we can interpret

it just as the weights

of a neural network

and what we call a sinh-arcsinh

layer

which performs the nonlinearity

now the reason we chose this particular kind of nonlinearity is that it has

nice properties for example with a single layer we can already

represent pdfs

of random variables which can be symmetric or skewed and heavy tailed

with just a single layer and

if we add more layers we increase the

modelling capabilities of the approach although this creates some problems of overfitting as i will

say

later
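The exact form of the nonlinearity is not spelled out in the talk. As a sketch only, the code below assumes a sinh-arcsinh style function f(x) = sinh(delta * asinh(x) + eps), one common parameterization in which `delta` controls tail weight and `eps` controls skewness. It reduces to the identity for delta = 1 and eps = 0, is invertible for delta > 0, and has a closed-form derivative for the log-determinant term. The function names and parameter values are assumptions for illustration.

```python
import math

def sas_forward(x, delta, eps):
    """Assumed sinh-arcsinh form: sinh(delta * asinh(x) + eps), with delta > 0."""
    return math.sinh(delta * math.asinh(x) + eps)

def sas_inverse(y, delta, eps):
    """Inverse of sas_forward."""
    return math.sinh((math.asinh(y) - eps) / delta)

def sas_log_abs_deriv(x, delta, eps):
    """log |d/dx sas_forward(x)| = log(delta) + log(cosh(delta * asinh(x) + eps)) - 0.5 * log(1 + x^2)."""
    inner = delta * math.asinh(x) + eps
    return math.log(delta) + math.log(math.cosh(inner)) - 0.5 * math.log1p(x * x)

x0 = 0.7

# With delta = 1 and eps = 0 the layer is the identity.
identity_out = sas_forward(x0, 1.0, 0.0)

# With other values it bends the tails and the symmetry but stays invertible.
DELTA, EPS = 1.5, 0.3   # illustrative values (assumption)
y0 = sas_forward(x0, DELTA, EPS)
round_trip = sas_inverse(y0, DELTA, EPS)
log_deriv = sas_log_abs_deriv(x0, DELTA, EPS)
print(identity_out, y0, round_trip, log_deriv)
```

The identity special case means that data which are already gaussian can be left untouched by such a layer.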

now as i said we use a maximum likelihood criterion to estimate the transformation and

the nice thing

is that we can use a general purpose optimizer to which we provide

the objective function and the gradients and the gradients

can be computed with

an algorithm which resembles quite closely that of back propagation with mean square error of

a neural network

the main difference is that we need to take into account also the contribution of the

log determinants which

increases the complexity of the training but the training time is pretty much the same

as what we would have with a standard neural network
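To make the role of the log-determinant term in the gradient concrete, here is a minimal sketch of maximum likelihood training for the simplest possible case, a one-dimensional affine gaussianizer g(x) = a * x + b. The objective is the average of -0.5 * g(x)^2 + log |a|. Plain gradient ascent is used here instead of a general purpose optimizer, and the synthetic data, learning rate and step count are assumptions for illustration, not the training recipe of the talk.

```python
import math
import random

def fit_affine_gaussianizer(data, steps=2000, lr=0.05):
    """Gradient ascent on the mean of -0.5 * (a*x + b)^2 + log(a), with a = exp(log_a)."""
    log_a, b = 0.0, 0.0   # parameterizing a through log_a keeps a > 0 (invertible map)
    n = len(data)
    for _ in range(steps):
        a = math.exp(log_a)
        # The log-determinant term log(a) = log_a contributes a constant 1.0 to the
        # gradient with respect to log_a; without it the optimum would collapse to a = 0.
        grad_log_a, grad_b = 1.0, 0.0
        for x in data:
            y = a * x + b
            grad_log_a -= y * a * x / n   # from the -0.5 * y^2 term
            grad_b -= y / n
        log_a += lr * grad_log_a
        b += lr * grad_b
    return math.exp(log_a), b

rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(300)]   # synthetic non-standard data

a, b = fit_affine_gaussianizer(data)
transformed = [a * x + b for x in data]
mean_t = sum(transformed) / len(transformed)
second_moment_t = sum(y * y for y in transformed) / len(transformed)
print(a, b, mean_t, second_moment_t)
```

At the maximum likelihood solution the transformed samples have zero mean and unit second moment, which is what the gradient conditions enforce. With more layers the same per-layer derivative terms are accumulated in a backward pass.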

now this is a first set of experiments here we still didn't apply length

normalization or any other kind of

compensation approach so what i'm showing here is what happens when we estimate

this transformation on our

training data and we apply it to transform our i-vectors

as you can see on top on the left are the same histograms of the

squared norm i was presenting before and on the right the squared norms of the

transformed i-vectors

here i'm using a transformation with just one nonlinear layer

now of course as we can see the squared norm is still not exactly what

we would expect from

standard normal distributed samples but

it matches more closely our expectation and more importantly we also somehow

reduce the mismatch between evaluation and development squared norms which means that our i-vectors are

more similar

and this gets reflected in the results on the first and second line you

have plda and

the same plda but trained with the transformed i-vectors

as i said here we are not

using any kind of length norm and we can see that our model allows us to achieve

much better performance compared to standard plda

on the last line however

we can still see that length normalization is compensating for the dataset mismatch

better which allows plda with length normalized i-vectors to perform better than our model

the next part is how can we

incorporate this kind of preprocessing in our model of course we could try to length normalize the

i-vectors but we can do better by

casting this

kind of compensation directly into our model

to this extent

we first need to give a different interpretation of length norm and in particular we

need to see

length normalization as the maximum likelihood solution of a quite simple model

where our i-vectors are not iid anymore in the sense that

we assume that each i-vector is sampled from a different random variable whose distribution

is normal

all these random variables share a parameter which is the

covariance matrix sigma but this covariance matrix is scaled for each i-vector by a scalar

alpha

this is quite similar to a heavy tailed distribution but instead of putting priors

on these terms

we just optimize them by maximum likelihood

now if we perform a two-step optimization where we first estimate sigma assuming that

the alpha terms are one

and then with sigma fixed we estimate the optimal alpha terms we would

end up with something which is

very similar to length norm indeed the optimal alpha

is the norm of the whitened i-vectors divided by the

square root of the dimensionality of the i-vectors
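The second step of this two-step optimization can be checked directly. Assuming the i-vector has already been whitened with the shared covariance, the per-vector model is N(0, alpha^2 I) in D dimensions, with log-likelihood -D log(alpha) - ||x||^2 / (2 alpha^2) up to a constant. Setting the derivative to zero gives alpha = ||x|| / sqrt(D). The sketch below uses a made-up four-dimensional vector and illustrative function names.

```python
import math

def ml_scale(whitened):
    """Maximum likelihood alpha for x ~ N(0, alpha^2 I): ||x|| / sqrt(D)."""
    d = len(whitened)
    return math.sqrt(sum(v * v for v in whitened) / d)

def log_likelihood(whitened, alpha):
    """Log-density of the whitened vector under N(0, alpha^2 I)."""
    d = len(whitened)
    sq_norm = sum(v * v for v in whitened)
    return (-d * math.log(alpha)
            - sq_norm / (2.0 * alpha * alpha)
            - 0.5 * d * math.log(2.0 * math.pi))

x = [0.3, -1.2, 2.5, 0.8]      # hypothetical whitened vector, D = 4
alpha = ml_scale(x)
scaled = [v / alpha for v in x]
scaled_norm = math.sqrt(sum(v * v for v in scaled))

# Dividing by alpha leaves every vector with norm sqrt(D), which is length
# normalization up to a constant factor shared by all vectors.
print(alpha, scaled_norm)
```

Applying the inverse of the scaling therefore reproduces length normalization, apart from the common factor sqrt(D).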

now why is this interesting because this

random variable can be represented as a transformation of a standard normal random variable where the transformation

has a parameter which is i-vector dependent

now if we were to estimate these

parameters using an iterative strategy where we first estimate

sigma and then the alpha and then we

were to apply the inverse transformation we would recover exactly what we're doing right

now with length normalization

so this tells us

how to implement a similar strategy in our model

we introduce what we call a scaling layer which has a

single parameter and this parameter is i-vector dependent so for each i-vector we have to estimate

its maximum likelihood solution

now our transformation is the cascade of this

scaling layer and what we were proposing before so

the

composition of affine and nonlinear layers

there is one comment here

in order to

efficiently train this thing we

still have to resort to an iterative training that is we first estimate

the shared parameters then we fix the shared parameters and optimize the

alphas

and one more thing that we need to take into account is that at test

time

while with the original model we don't need to do anything other than transform the

i-vectors with this model we also need to estimate for each i-vector

the optimal scaling factor

however this

gives us a great improvement as you can see the first lines are the

same i was presenting before

and then the last three lines are plda with length normalization

then our transformation with the alpha scaling with one iteration

of alpha estimation and with three iterations of alpha estimation

and as you can see

the model with three iterations clearly outperforms plda with length norm in all conditions

on the sre ten female dataset

so i guess we get to the conclusions we

investigated here an approach to estimate a transformation which allows modifying our i-vectors

so that they better fit the plda assumptions

so when we apply this transformation we obtain i-vectors which are more gaussian and

we incorporate in the model a

proper way to perform dataset compensation which is similar to length

norm

but is

tuned to the particular layers that we are using in the transformation

this transformation is estimated using a maximum likelihood criterion and the transformation function itself

is implemented using a framework which is very similar to that

of neural networks

as i said with some constraints because we want our layers to be invertible

so that

we can guarantee that the log determinant of our jacobians exists

now this approach allows us to

improve the results in terms of performance on the sre ten

data we also have experiments in the paper that

i don't report here where we show that it also works on nist two

thousand twelve data

there is one caveat as i said before here we are using a single layer

transformation the reason is that this kind of model tends to

overfit quite easily

so our first experiments with more than one nonlinear layer

were not very satisfactory in that they were decreasing the performance

now we are managing to get interesting results

in two ways the first one is changing the kind of nonlinearity

to one which has

some constraints inside the function itself which limit this

overfitting behaviour

and on the other hand we also found some structure where we impose constraints on

the parameters of the transformation which again

reduce the overfitting behaviour and this allows us to train networks which have more layers

although up to now we obtained mixed results in the sense that we managed to

train transformations which behave much better

if we don't

use the scaling term but after we add the scaling term

the

whole

framework in the end more or less converges to the results that were shown

here so we are still working to understand why we have this strange behaviour

where we can

improve the performance of the transformation itself but we cannot improve

when we add the scaling term anymore

so on

i know some questions we have are fine but

how does this compare to just straight gaussianization

okay the

thing is how we would implement gaussianization with one hundred fifty dimensional vectors i mean

would you gaussianize each dimension on its own

well if you gaussianize each dimension on its own we tried

something with this model which if we put cosine transformation or well the function itself

can

produce that kind of organ and by the way when working with one dimensional synthetic

data disk image period when many kind of different usual spot the results already much

worse

so my guess is that it would not be sufficient to independently

gaussianize each dimension on its own

but allows me i'm sorry miss you tried it didn't where's

no i didn't try exactly that i tried the same model i'm presenting here with a

transformation which is applied independently to each component and in my experience when i'm working on

single dimensional data points

it gaussianizes very well

it does not program over fitting we then if i are more like something data

with several kind of is the only reason aspen is that the gas station kernel

right exactly does inverse function it's not approximation to it

no but it makes one like it the spectral that approximation it that's what they

get here doesn't work so i guess is that the approximate the real thing with

the commercialisation would still not work

i don't use the sensitivity

this approach does not come and activation function for d n and

the justification to is shown to them to probably too well as the evaluation is

first of all the original transformation i was you think you know is the last

one which then it can be shown that we can split into several layers but

it has different properties first of all it can represent the identity transformation

so if our data are already gaussian

they are kept like that

then it has some nice properties which can be shown there are some references in

our paper where you can find that

this kind of

single layer can already represent a whole set of distributions which can be

symmetric or skewed and heavy tailed

so the reason we chose this

kind of function for our layer is essentially because it was already shown that it

can model a quite broad family of distributions

well it's all

it's they have to strange question

first the is it possible to the universal parameters and try to understand what that

the characteristics

of you training set

in term of a twisty of in the most

station effect of ten effects

you mean what do you mean i mean

look at you transformation a try to understand this so you the loose enough phone

when the v

the mismatch between o training set the inside the training set you to the presence

said phone from them

okay that's why the s c could be applied separately on different sides

if you have some way to

more the to see what is the difference in your distribution before and after transformation

you can apply the same technique often so on my

as well

transform independently two different sets and see if this represents on the differences or not

what i have here is that

pretty much

it looks like at least if we can see that evaluation and development of two

different sets with different is usually it is somehow able to

partly compensate for that

no transformations that is partly responsible for is because these as

say maybe to have your is it allows to stretch the models which are far

from what we would expect

so in what she can also one of the middle of these used

and the other hand

there you have thing which does this processing is the scaling anyway so that scaling

is very similar to length or is it is two hundred transformation that i'm applying

for this all done blindly

and then i'm learning transformational x i-vectors but i'm estimating at the same time the

transformation into skating

okay that is the part which is in my opinion really responsible for posing due

to mismatch in the basement used in that

then another thing that i cannot is

what is what would be much

better done

we were using is really more that the speaker factors and the channel factors appear

the i for example

the problem is that

already like this it takes

several hours if not days to train the transformation function at test time it's

very fast but training is quite slow and if we move into

using it cannot be lda styles all if we wanted differently the times that i

would really explode that so computational time also this time

because we would need to consider

in cases where the i-vectors are from the same speaker or not and in that

case would grow up

you would have

similarly

something similar to what we have we uncertainty propagation where you have to do that

this time of computation of everything but much worse

okay it's just

in fact because the training needs to be but i want to try to x

exploit this much as possible you parameters and method which is related to the first

one

is it possible somehow to use this approach to

determine if one single i-vector

is in domain or out-of-domain

so you use the two d to detect say okay

my operationally is

probably not really i mean length normalization that is not affect you start with this

but this is not and i

and the problem with this thing is that if i have a really huge mismatch

then it gets amplified by the transformation itself

because the data points i'm transforming are not where they should be so the way the

nonlinear function works

is probably going to increase my mismatch instead of reducing it

so i'll to some point the with respect to still work better than start up

you after some point with this but it does not been worse

mismatches datasets

thanks and disappointed

okay this like the special

I-vector transformation and scaling for PLDA based speaker recognition

Speaker Recognition: i-vector approaches

Sandro Cumani, Pietro Laface