Okay, so: are generative models a better way to model trials for speaker recognition?

As some of you may know, I have been working quite a lot with discriminative models for i-vector classification, and in particular with discriminative models able to directly classify i-vector trials, that is, pairs of i-vectors, as belonging to the same-speaker or different-speaker classes.

These discriminative models were first introduced as a way to discriminatively train PLDA parameters, and later we got an interpretation of them as discriminative training of all the parameters of a second-order Taylor expansion of the log-likelihood ratio.

So I have been working mostly in trial space. Here the idea was to go back from discriminative to generative while remaining in trial space, so the question was whether it would be possible to train a generative model in trial space, and how well it would behave.

It turns out that it is very easy to do in practice, and it works pretty well; I would say more or less on par with the other state-of-the-art models.

So in this talk I will show you how we define this model, which is a very simple model that employs two Gaussian distributions to model trials, and then I will show the relationship of this model with PLDA and with the discriminative PLDA / pairwise SVM approach.

Then I will also show how this model can be very easily extended to handle more complicated distributions; in particular I will work with heavy-tailed distributions, following the work by Kenny on heavy-tailed PLDA.

So, trial space: to define a trial we take two i-vectors, we stack them together, and we get our definition of a trial.
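As a minimal sketch of this construction (my own, hypothetical layout: i-vectors stored as rows of a NumPy array):

```python
import numpy as np

def build_trials(ivectors, speaker_ids):
    """Stack all cross-pairs of i-vectors into trial vectors.

    ivectors:    (N, D) array, one i-vector per row
    speaker_ids: length-N array of speaker labels
    Returns the (N*N, 2*D) trial matrix and a boolean same-speaker mask.
    """
    n = len(ivectors)
    idx_a, idx_b = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    idx_a, idx_b = idx_a.ravel(), idx_b.ravel()
    trials = np.hstack([ivectors[idx_a], ivectors[idx_b]])   # stacked pairs
    same = np.asarray(speaker_ids)[idx_a] == np.asarray(speaker_ids)[idx_b]
    return trials, same
```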

Here I have a couple of pictures showing what would happen if we were working with one-dimensional i-vectors. On the left I have one-dimensional i-vectors, which are the black dots; on the right, taking all cross-pairs of i-vectors, we can see that there is a well-defined region where trials built from i-vectors of the same speaker lie, and it is quite well separated from the region where the trials built from different speakers lie.

So with the discriminative training we tried to discriminatively train separation surfaces between these regions; now I am going to try to build a generative model that describes these two sets of points.

The easiest generative model we can think of: we have a two-class problem, a binary problem, so we can assume that the trials can be modeled by Gaussian distributions. We would have one Gaussian distribution describing the trials which belong to the same-speaker class and one for the trials which belong to the different-speaker class; each of them would have its own parameters, and for symmetry reasons we will assume that the mean of the two distributions is the same.
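In formulas (my notation, not from the slides): a trial φ stacks the two i-vectors, both classes share the mean, and the score is the log-likelihood ratio of the two Gaussians:

$$
\varphi = \begin{bmatrix}\eta_1\\ \eta_2\end{bmatrix},\qquad
\varphi\mid\mathcal{H}_s \sim \mathcal{N}(\mu,\Sigma_s),\qquad
\varphi\mid\mathcal{H}_d \sim \mathcal{N}(\mu,\Sigma_d),
$$

$$
s(\varphi)=\log\mathcal{N}(\varphi;\mu,\Sigma_s)-\log\mathcal{N}(\varphi;\mu,\Sigma_d).
$$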

Reasoning about the symmetry of the trial: if we take a pair of i-vectors we can stack them in two ways, enrollment first and test second or vice versa, but we do not want to give any particular order to the two vectors, so we want a generative model which treats both versions of the trial in the same way. This imposes some constraints on the covariance matrices, described here: these two diagonal blocks have to be the same, as well as these two off-diagonal blocks, and the same holds for the other distribution. In practice, when working with all pairs from a single i-vector dataset, we do not even need to impose this constraint, because it arises naturally during training.
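Written out, the swap-invariance constraint forces a block-symmetric structure on both covariances and a repeated mean (again my notation):

$$
\mu=\begin{bmatrix} m\\ m\end{bmatrix},\qquad
\Sigma_s=\begin{bmatrix} A_s & B_s\\ B_s & A_s\end{bmatrix},\qquad
\Sigma_d=\begin{bmatrix} A_d & B_d\\ B_d & A_d\end{bmatrix}.
$$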

So how can we train these models? We did just the simplest thing one can think of: maximum likelihood, assuming that the i-vector trials are independent. Of course the trials are not independent, because they are all the pairs we can build from a single i-vector set; however, in practice this does not really affect the results, even though the assumption is quite inaccurate.
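A minimal sketch of this maximum-likelihood training and of the resulting scoring, under that independence assumption (hypothetical names, building on the build_trials sketch above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_two_gaussian(trials, same):
    """ML estimates: shared mean plus one covariance per trial class."""
    mu = trials.mean(axis=0)            # by symmetry this is the shared mean
    cs = trials[same] - mu
    cd = trials[~same] - mu
    sigma_s = cs.T @ cs / len(cs)       # same-speaker covariance
    sigma_d = cd.T @ cd / len(cd)       # different-speaker covariance
    return mu, sigma_s, sigma_d

def two_gaussian_llr(trials, mu, sigma_s, sigma_d):
    """Log-likelihood ratio: same-speaker vs different-speaker hypothesis."""
    return (multivariate_normal.logpdf(trials, mu, sigma_s)
            - multivariate_normal.logpdf(trials, mu, sigma_d))
```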

This is a representation of what would happen if we were working in a one-dimensional space, assuming that the mean of the two distributions is zero, which is essentially what we would recover if we centered the i-vectors. We end up with a log-likelihood ratio which is just the ratio between two Gaussian distributions, that is, a quadratic function in the i-vector trial space.

You can see plots for two different synthetic one-dimensional i-vector sets, showing the level sets of the log-likelihood ratio as a function of the trial, and you can notice that essentially quadratic surfaces separate the same-speaker region, which is this diagonal, from the rest of the points.
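Expanding the log-likelihood ratio of the two shared-mean Gaussians makes the quadratic separation surfaces explicit:

$$
s(\varphi)=\tfrac{1}{2}(\varphi-\mu)^{\top}\left(\Sigma_d^{-1}-\Sigma_s^{-1}\right)(\varphi-\mu)
+\tfrac{1}{2}\log\frac{|\Sigma_d|}{|\Sigma_s|}.
$$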

So this works nicely, and I will show you the results in a moment, but first I want to show the relationship between this model and the other state-of-the-art approaches, like PLDA and discriminative PLDA.

This is the classical PLDA approach, in the simplified version where the full-rank channel factors are merged together with the residual noise and we have a subspace for the speaker.

If we take this model and try to jointly model the distribution of a pair of i-vectors, we can consider separately the case where the two i-vectors are from the same speaker and the case where they are from different speakers. In the first case the speaker latent variable is shared, so we have only one speaker variable and we get this expression for the trial; in the case of a different-speaker trial we have a different speaker latent variable for each of the two i-vectors.

Now, with standard PLDA all these variables are Gaussian, so we can integrate over the speaker latent variables, and if we do we end up with distributions for same-speaker pairs and for different-speaker pairs which are again Gaussian and have this form: again, the two distributions share the mean, and the two covariance matrices have a structure very similar to what I was showing before.
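As a concrete check (my notation; assuming the simplified PLDA model η = m + V y + ε with speaker factor y ∼ N(0, I) and full-covariance residual ε ∼ N(0, Λ)), integrating out the speaker factors gives

$$
\varphi\mid\mathcal{H}_s \sim \mathcal{N}\!\left(\begin{bmatrix} m\\ m\end{bmatrix},
\begin{bmatrix} VV^{\top}+\Lambda & VV^{\top}\\ VV^{\top} & VV^{\top}+\Lambda\end{bmatrix}\right),\qquad
\varphi\mid\mathcal{H}_d \sim \mathcal{N}\!\left(\begin{bmatrix} m\\ m\end{bmatrix},
\begin{bmatrix} VV^{\top}+\Lambda & 0\\ 0 & VV^{\top}+\Lambda\end{bmatrix}\right),
$$

that is, a shared mean and block-symmetric covariances of exactly the form the two-Gaussian model assumes.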

So what PLDA is telling us here is that it is estimating a model which is coherent with our two-Gaussian model assumptions, and it essentially differs from our model only in the objective function that is optimized: PLDA optimizes the i-vector likelihood, while our two-Gaussian model optimizes the trial likelihood.

Again, for the one-dimensional example, when we compute the log-likelihood ratio we end up with separation surfaces very similar to those of our two-Gaussian model in this one-dimensional i-vector space, and we will see that this is also reflected in the real i-vector space, in the sense that the two models perform pretty much the same.

Moving to the relationship with the discriminative approach: this is the scoring function we used for the pairwise SVM, that is, the scoring function used to compute the SVM loss. This scoring function is formally equivalent to the log-likelihood ratio we have seen for our two-Gaussian model, and of course it is also equivalent to the PLDA scoring function, since it was originally derived from that approach.

So we can think of the SVM as a way to discriminatively train this matrix which, if we interpret it in the two-Gaussian model, is nothing other than the difference between the precision matrices of the two distributions. Again we have a model with the same kind of separation surfaces, and again the only difference is the objective function we are optimizing.
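In the same notation as the expanded log-likelihood ratio above, the pairwise scoring function has the form

$$
s(\varphi)=\varphi^{\top}W\varphi + c^{\top}\varphi + k,\qquad
W \;\leftrightarrow\; \tfrac{1}{2}\left(\Sigma_d^{-1}-\Sigma_s^{-1}\right),
$$

where the SVM learns W, c and k discriminatively, while the generative model obtains them from the maximum-likelihood covariance estimates.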

Now some results for this first part. This was done on the NIST 2010 telephone condition, and I am essentially comparing PLDA with this two-Gaussian model. The first line is PLDA without dimensionality reduction, which is also known as the two-covariance model; essentially it means that I am taking a full-rank speaker space. In both cases I am doing length normalization. These two lines are the results of PLDA with a full-rank speaker space and of the two-Gaussian model trained by maximum likelihood in trial space, and as you can see they perform pretty much the same. Of course the two-covariance model is faster to train, while at test time the two models have the same computational requirements.

The problem is when we move to PLDA with a low-rank speaker subspace; in this case I used a 120-dimensional speaker subspace, while the i-vectors were 400-dimensional. We cannot directly apply this kind of dimensionality reduction to the two-Gaussian model, so we replaced it with a dimensionality reduction done by LDA projection, and that is good enough: here we have PLDA with the reduced speaker subspace and the two-covariance model where the dimensionality reduction is done by LDA, and they perform, I would say, the same. Then, in this reduced 120-dimensional i-vector space, we trained our Gaussian model on trials, and again it performs pretty much the same as the PLDA model.
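A minimal sketch of that LDA-based reduction (assuming scikit-learn and the hypothetical arrays from the earlier sketches; it requires more than 120 training speakers):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# ivectors: (N, 400) array, speaker_ids: length-N speaker labels
lda = LinearDiscriminantAnalysis(n_components=120)
ivectors_red = lda.fit_transform(ivectors, speaker_ids)      # (N, 120)

# trials are then built and the two-Gaussian model trained in the reduced space
trials, same = build_trials(ivectors_red, speaker_ids)
mu, sigma_s, sigma_d = train_two_gaussian(trials, same)
```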

For comparison, these are the results we had with the discriminative model; the difference with respect to all these models is that the discriminative model did not require length normalization.

So this means we can build a generative model in trial space: it is actually very easy to do, and it works very well. Let us see if we can make things a little more complicated, and how hard training and testing then become. To complicate things, we did something similar to what Kenny did with heavy-tailed PLDA: we replaced the Gaussian distributions with t distributions and looked at what happens. It turns out that training can still be done with an EM algorithm, although it is not that fast; it becomes more or less as computationally expensive as the discriminative approach.

But the good thing is that at test time we can use closed-form integration, and our log-likelihood ratio becomes simply the ratio between two Student's t distributions. So at testing time this model is as fast as PLDA or the two-Gaussian model I have shown before.
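A minimal sketch of that scoring step, assuming SciPy's multivariate_t (SciPy 1.6 or later) and already-estimated parameters (the means, scale matrices and degrees of freedom here are placeholders):

```python
from scipy.stats import multivariate_t

def heavy_tailed_llr(trials, mu, scale_s, df_s, scale_d, df_d):
    """LLR as the log-ratio of two multivariate Student's t densities."""
    return (multivariate_t.logpdf(trials, loc=mu, shape=scale_s, df=df_s)
            - multivariate_t.logpdf(trials, loc=mu, shape=scale_d, df=df_d))
```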

I should also say that, as with heavy-tailed PLDA, we do not need length normalization when we use these heavy-tailed distributions. Of course the separation surfaces get slightly more complex, since they are no longer quadratic, but we get this kind of shape.

As for the results, we managed to get more or less the same results as the Gaussian model, but without length normalization, which is, I would say, in line with the findings about heavy-tailed PLDA.

Again, what is different with respect to heavy-tailed PLDA is that this model is more expensive in training, but in testing it is as fast as all the others.

So, to summarize: we can use a very simple Gaussian classifier in trial space which can be trained very easily; despite the incorrect assumption we make about trial independence, it still works very well; and it turns out that this model is quite easy to extend to handle more complicated distributions.

While with PLDA, for example, just moving to heavy-tailed distributions makes the model very difficult to train and to test, here we can use, for instance, Student's t distributions with almost no hassle.

From here we hope to find better ways to model the trial distribution in trial space which will still allow a fast solution for scoring, without incurring too big problems in training.

And that is all, thanks.

The first question?

What were the degrees of freedom in that case?

Yes, I do not remember exactly, but it was something like five or six, maybe, something like that.

I remember we had a different value in the original version, but that one had a bug, which was then fixed.

Was this telephone speech rather than microphone?

Yes, just telephone. Well, I did try something on microphone data; it was slightly worse than PLDA, but not that different. Anyway, I have not tried the heavy-tailed version on it yet; I think it might run into problems without length normalization. That was my expectation, but I did not really test it; maybe I will run it on the microphone data.

I have a comment, which may be related to that. I also used an EM algorithm to estimate the heavy-tailed parameters; for example, in the paper I presented on Monday I was using a t distribution in score space, with an EM algorithm to estimate the parameters. I found that when I generated synthetic data, where I knew what the degrees of freedom were, and then tried to recover them with the EM algorithm, it was very frustrating: I just could not recover the same degrees of freedom. Then I switched from the EM algorithm to direct optimization of the likelihood, I think with BFGS, and that was much better at recovering the degrees of freedom.
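A small synthetic illustration of that point (hypothetical names; univariate case): generate t-distributed data with known degrees of freedom and recover them by direct quasi-Newton optimization of the likelihood rather than EM:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
data = stats.t.rvs(df=5.0, size=10_000, random_state=rng)   # true dof = 5

def neg_log_lik(log_df):
    # parameterize by log(dof) so the optimizer stays in the positive range
    return -np.sum(stats.t.logpdf(data, df=np.exp(log_df)))

res = optimize.minimize(neg_log_lik, x0=np.log(10.0), method="BFGS")
print("recovered dof:", np.exp(res.x[0]))
```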

Okay. For these synthetic experiments I was generating the data with the heavy-tailed distribution, and I was getting more or less the same estimates. But I had a similar problem when I was trying to do something similar to what you did for calibration, with non-Gaussian, skewed distributions and that kind of thing, and I realized that EM there was not that good; I was doing it numerically and that was working. So maybe I was just lucky with the t distributions.

I think that when the data is really heavy-tailed it works, but when it is not, it does not; so probably when the degrees of freedom are low you can recover them, but if they are around ten or twenty you cannot recover them anymore.

One more question?

Let's thank the speaker again.