All right, I'm going to talk about multiclass discriminative training of i-vector language recognition this morning. I'm Alan McCree from Johns Hopkins University, and I'd like to acknowledge some interesting discussions during this work with my current colleague Daniel, my previous colleagues Doug, Elliot, and Pedro, and more recently with Niko.

So, as an introduction: as you all know, and I think we touched on this in a discussion this morning, language ID using i-vectors is the state-of-the-art approach.

What I want to talk about are some particular aspects of it. It is typically done as a two-stage process: even after we've got the i-vectors, first we build a classifier, and then we separately build a backend which does the calibration, and perhaps fusion as well. So I want to talk about two aspects that are a little different from that.

First, what if we try to have one system that does the discrimination, the classification, and the calibration all at once using discriminative training? Nobody ever said we have to use two systems back to back; what if we do it all together? And secondly, I want to talk about an open-set extension to what is usually a closed-set language recognition task.

So in the talk I will start with a description of the Gaussian model in the i-vector space. It's something many of you have seen before, but I need to talk about some particular aspects of it in order to get into the details here. I'll also talk about how that relates to the open-set case, and in that context go into some of the Bayesian machinery that we use in speaker recognition, how it could or couldn't be relevant in language recognition, and what the differences are. Then I will talk about the two key things here: the discriminative training that I'm using, which is based on MMI, and then how I build the out-of-set model.

So, as a signal processing guy, I like to think of this as an additive Gaussian noise model; in signal processing this is one of the most basic things we see. In this context, what we're saying is that the observed i-vector was generated from a language, so it should look like that language's mean vector, but it's corrupted by additive Gaussian noise, which we typically call a channel, for lack of a better word.

From a pattern recognition point of view, this model says we have an unknown mean for each of our classes, and we have a channel which is Gaussian and looks the same for all of the classes. That means our classifier is a shared-covariance Gaussian model: each language model is described by its mean, and the shared covariance is the channel, or within-class, covariance.

So to build a language recognition system we need a training process and a scoring process. Training means we need to learn this shared within-class covariance and, for each language, what its mean looks like. Testing is simply this Gaussian scoring.
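As a concrete illustration, here is a minimal sketch of that shared-covariance Gaussian scoring in Python; the function name and array layout are my own choices, not from the paper.

```python
import numpy as np

def gaussian_log_likelihoods(x, means, W):
    """Log-likelihood of i-vector x under each language's Gaussian.

    means: (n_languages, dim) per-language means
    W:     (dim, dim) shared within-class (channel) covariance
    """
    dim = len(x)
    W_inv = np.linalg.inv(W)
    _, logdet = np.linalg.slogdet(W)
    diffs = x - means                                  # (n_languages, dim)
    mahal = np.einsum('ld,de,le->l', diffs, W_inv, diffs)
    return -0.5 * (mahal + logdet + dim * np.log(2.0 * np.pi))
```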

And I guess, unlike some people in this room, I'm not particularly uncomfortable with closed-set detection. That gives you a sort of funny-looking form of Bayes' rule: for the target hypothesis, if it is this class, then that's just the likelihood of this class, which is easy. But the non-target hypothesis means it's one of the other classes, and then you need some implicit prior over the distribution of the other classes, which for LREs is flat by design, so you can use a flat prior given that it is not the target.
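That funny-looking Bayes' rule is easy to write down; here is a sketch, reusing the gaussian_log_likelihoods helper above (again, my own naming):

```python
import numpy as np
from scipy.special import logsumexp

def detection_llrs(log_liks):
    """Closed-set detection log-likelihood ratios.

    Target: the class itself. Non-target: a flat mixture over the
    remaining classes, i.e. the average of their likelihoods.
    """
    n = len(log_liks)
    llrs = np.empty(n)
    for i in range(n):
        others = np.delete(log_liks, i)
        non_target = logsumexp(others) - np.log(n - 1)
        llrs[i] = log_liks[i] - non_target
    return llrs
```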

So the key question for building a language model is: how do we estimate the mean? Estimating the mean of a Gaussian is not one of the more complicated things in statistics, but there are of course multiple ways to do it. The simplest thing is to just take the sample mean, the maximum likelihood estimate, and that's mainly what I'm going to end up using in this work. But I want to emphasize that there are other things you could do, and in speaker recognition we do not do that; we do something more complicated.

The next, more sophisticated, option is MAP adaptation, which we all know from GMM-UBMs and Doug's work. You can do that in this context as well; it's a very simple formula. It requires, however, a second covariance matrix, which we call the across-class covariance: the prior distribution over what all models could look like, in this case the distribution the means are drawn from.

And then, finally, instead of taking a point estimate you can go to a Bayesian approach, where you don't actually estimate the mean of each class; you estimate the posterior distribution of the mean of each class given the training data for that class. In that case you keep that posterior distribution, and then you score with what's called the predictive distribution, which is a bigger, fatter Gaussian: it includes the within-class covariance, but it also has an additional term, which is the uncertainty about the mean given how much data you have seen for that particular class.
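The three estimators fit in a few lines; here is a sketch of the standard conjugate-Gaussian formulas, assuming a zero-mean prior on the language means (the function and its interface are mine, not the paper's):

```python
import numpy as np

def estimate_language_model(X, W, B, mode="ml"):
    """Estimate a language mean and the covariance to score with.

    X: (n, dim) training i-vectors for one language
    W: within-class covariance; B: across-class covariance
       (prior on means, assumed zero-mean here)
    """
    n = X.shape[0]
    xbar = X.mean(axis=0)
    if mode == "ml":
        return xbar, W                      # sample mean, point estimate
    # Conjugate posterior of the mean: precision = B^-1 + n W^-1
    W_inv = np.linalg.inv(W)
    post_cov = np.linalg.inv(np.linalg.inv(B) + n * W_inv)
    post_mean = post_cov @ (n * W_inv @ xbar)
    if mode == "map":
        return post_mean, W                 # point estimate at posterior mode
    # Bayesian: predictive covariance fattened by posterior uncertainty
    return post_mean, W + post_cov
```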

One little trick that I only learned recently, though I wish I'd learned it a lot sooner, was developed many years ago; I have a reference in the paper, and it's really handy for all these kinds of systems. Everybody knows you can whiten one covariance matrix: you can apply a linear transform to the data that sets the covariance to the identity. It turns out you can do it for two covariances at once, and since we have two, this is really helpful.

I have the formulas in the paper; it's actually not very hard. You end up with a linear transform where the within-class covariance is the identity, which we're used to (WCCN, for example, accomplishes that), but the across-class covariance is also diagonal, sorted so that the most important dimensions come first. It's a beautiful global transformation: it means you can do linear discriminant analysis and dimension reduction easily in this space, just by keeping the most interesting dimensions.
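This simultaneous diagonalization can be computed as a generalized symmetric eigenproblem; here is a sketch (my own formulation of the trick, not the paper's exact derivation):

```python
import numpy as np
from scipy.linalg import eigh

def joint_diagonalizer(W, B):
    """Transform T with T W T' = I and T B T' diagonal, sorted so the
    dimensions with the largest across-class variance come first.
    """
    # Generalized eigenproblem B v = lambda W v; scipy normalizes the
    # eigenvectors so that V' W V = I and returns ascending eigenvalues.
    eigvals, V = eigh(B, W)
    order = np.argsort(eigvals)[::-1]
    T = V[:, order].T
    return T, eigvals[order]
```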

It's also a reminder that when you say you do LDA in your system, you should be a little careful, because there are a number of ways to formulate LDA. They all give the same subspace, but they don't give the same transformation within that subspace, because that's not part of the criterion. This transform gives that same subspace, but it's not the same linear transformation.

So I'm going to show some experiments here. I'll start with some simple ones and move on to the discriminative training next. We're using acoustic i-vectors; I think it may have been mentioned here already that the main differences in a LID system are that you need shifted delta cepstra and you need vocal tract length normalization, which you might not do for speaker.

I'm going to present LRE11 because it's the most recent LRE, but as I kind of hinted, I'm not going to use pair detection, because I'm not a big fan of pair detection. So the overall metric is C_avg, but you get similar performance rankings with pair detection as well. Within LRE you build your own training and dev sets; these are the Lincoln training data sets, and the i-vectors are centered to be zero mean.

So, for generative Gaussian models, I mentioned that you can do ML, and I mentioned these other things: ML, MAP, and Bayesian. I have a nice slide here with three systems, but they are actually not those three things, so you have to pay attention while I describe what each one is.

For ML, what I'm doing here is: there is no backend, there is just Bayes' rule applied, because that's the formula I showed you for the generative Gaussian model. These numbers, for people who do LREs, are not very good numbers, but this is what happens straight out of the generative model. What I'm showing is actual C_avg and min C_avg.

Actual C_avg means you made hard decisions on the detection task. So the ML system is the baseline.

This one is the Bayesian system, where you make the Bayesian estimate of the mean. In the end you don't actually have the same covariance for every class, because they have different counts, and that gives different predictive uncertainties. But in fact the numbers are very similar, because in language recognition you have many instances per class, so it almost degenerates to the same thing. The reason I didn't show MAP is that it falls in between those two, and there's not much space in between those two, so it's not very interesting.

This last one is kind of interesting in that it's not right, but it actually works better from a calibration point of view; well, when I say calibration, I mean that it works better through Bayes' rule. What I've done here is what we typically do in speaker recognition, where you use the correct MAP formula but pretend there's only one cut, instead of keeping the correct count of the number of cuts. In terms of the predictive distribution, that gives you a greater uncertainty and a wider covariance, and it so happens that this actually works a little better in this case.

But once you put a backend into the system, which is what everybody usually shows, these differences really disappear, so I'm going to use ML systems for the rest of the discriminative training work. As I said, these numbers are not very good; they're about three times as bad as the state of the art.

What is usually done is an additionally trained backend. The simplest one, which I think John showed, is the scalar multiclass thing from FoCal that was coded long ago; that's logistic regression. You can do a fuller logistic regression with a matrix instead of a scalar; you can put a Gaussian backend in front of logistic regression, which is something we tried before; or you can use a discriminatively trained Gaussian as the backend, which is something we were doing at Lincoln for quite a while. These systems all work much better, and pretty similarly to each other.

You can also build the classifier itself discriminatively. One of the more common choices is an SVM, one versus rest. That still doesn't solve the final task, but it can help; and if you do one-versus-rest logistic regression you still need a backend. Or, as has been done recently, you can do multiclass training of the classifier itself followed by a multiclass backend.

But what I want to talk about is trying to do everything together: one training of the multiclass system that won't need its own separate backend, ready to apply Bayes' rule straight out.

MMI is not commonly used in backends, but in our field it is a very common thing in the GMM world and in speech recognition work. The criterion, if you're not familiar with it, is another name for cross-entropy, which is the same metric that logistic regression uses. It is a multiclass, are-your-probabilities-correct kind of metric, and it is a closed-set discriminative training of classes against each other.

The update equations, if you haven't seen them, are kind of cool, and they're kind of different; it's a bit of a weird derivation compared to the gradient descent everybody is used to. It can be interpreted as a gradient descent with a kind of magical step size, but it's quite effective. And the way it's always done in speech recognition is: since you're applying this to a Gaussian system, you start with an ML version of the Gaussian and then discriminatively update it, so to speak. That makes the convergence much easier, and it gives a natural regularization, because you're starting from something that is already a reasonable solution. In fact, the simplest form of regularization is just not letting it run very long, which is also a lot cheaper; and it also gives you something to tie back to, with a penalty function that says don't be too different from the ML solution. So regularization is a straightforward thing to do in MMI.
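To make the criterion concrete, here is a plain gradient sketch of the MMI (multiclass cross-entropy) objective for the class means, assuming the identity within-class covariance you get after the diagonalizing transform. It is not the extended-Baum-Welch-style update with the "magical step size" referred to above; it just shows what is being optimized.

```python
import numpy as np
from scipy.special import logsumexp

def mmi_update_means(X, y, means, step=0.1):
    """One gradient-ascent step on sum_i log P(y_i | x_i) for the
    class means, with identity within-class covariance.
    X: (n, dim) i-vectors; y: (n,) integer class labels.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    log_lik = -0.5 * d2
    post = np.exp(log_lik - logsumexp(log_lik, axis=1, keepdims=True))
    resid = np.eye(means.shape[0])[y] - post        # (n, n_classes)
    # d/d mu_c of the objective: sum_i resid[i, c] * (x_i - mu_c)
    grad = resid.T @ X - resid.sum(axis=0)[:, None] * means
    return means + step * grad / n
```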

And this diagonalizing transformation that I was talking about is really helpful here, because then we can discriminatively update just these diagonal covariances instead of full covariances. So we have fewer parameters than a full-matrix logistic regression, but more parameters than the scalar logistic regression.

So now these are pretty much state-of-the-art numbers; remember, the previous numbers were way up here, essentially. This is the ML Gaussian followed by an MMI Gaussian backend in the score space, which was kind of our default way of doing things when I was at Lincoln. This fourth score is kind of a disappointment: that's what happens if you take the training set, discriminatively train with MMI, and don't have a backend. It is in fact considerably better than the ML system, its equivalent, which I started with, but it is nowhere near where we want to be, obviously.

So, why not? One quirk of LRE, which I think is more data-dependent than realistic, is that the evaluation data actually looks different from the training set, and this was done only on the training set; it's not using any dev set at all. The most obvious mismatch is that the dev set and the test set are all approximately thirty seconds, while the training set is whatever size the conversations happened to be. So I took the training set and truncated everything to thirty seconds instead of using the entire cut. Throwing away data in that way turned out to be very helpful, because the training is now a much better match to what the test data looks like.

But that wasn't everything I wanted. So then I took the thirty-second training set, concatenated it together with the dev set, which is also a thirty-second set, and used the entire set at once for training the system. That in fact works as well as, and slightly better than, the two separate stages of a system followed by a discriminatively trained backend.

So I looked at a number of different permutations of this MMI system; anybody who's done GMM MMI knows you can train this, that, or the other parameter, and various combinations. The simplest thing to do is the means only, and that is fairly effective. You can also train the mean and the within-class covariance; of course, in the closed-set system the across-class covariance doesn't come into play, it's only the within-class covariance that matters. One thing I found kind of interesting is, instead of training the entire covariance matrix, to train a scale factor that scales the covariance; that's a little simpler system with fewer parameters. And you can also play with a sequential system; in particular, I found it interesting to do the scale factor first and then the means. In principle these would all reach the same solution in the end, but when you only do a limited number of iterations, the starting point and the sequence do affect what you get.
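Here is what the scale-factor variant looks like as the same kind of gradient sketch, training a single log scale on the (post-transform, identity) within-class covariance; again my own illustrative formulation, not the paper's exact update:

```python
import numpy as np
from scipy.special import logsumexp

def mmi_update_logscale(X, y, means, log_s, step=0.1):
    """One gradient-ascent step on the MMI criterion for a scalar
    covariance scale: Sigma = exp(log_s) * I after the transform.
    """
    n, dim = X.shape
    s = np.exp(log_s)
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    log_lik = -0.5 * (d2 / s + dim * np.log(s))
    post = np.exp(log_lik - logsumexp(log_lik, axis=1, keepdims=True))
    resid = np.eye(means.shape[0])[y] - post
    # d log_lik / d log_s = 0.5 * (d2 / s - dim)
    grad = (resid * 0.5 * (d2 / s - dim)).sum() / n
    return log_s + step * grad
```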

So again, these are the same sorts of plots; this is what happens with no backend at all, purely the discriminatively trained classifier itself. If you do means only, your actual C_avg is not terribly good, but your min C_avg is pretty close, so that is an indication of miscalibration. What calibration means in a multiclass detection task is kind of controversial, but one thing that I think I can say comfortably is: whenever you see this gap, it means that you're not calibrated. The absence of a gap doesn't necessarily mean that you are calibrated, because Bayes' rule is more complicated than that, but a gap like this means it is clearly not calibrated.

Once we do something to the variance, it changes: this one trains the mean and the entire variance; this one trains the mean and the scale factor of the variance at the same time; and this one is the two-stage process of the scale factor of the variance followed by the mean. All of those work much better. So in order to get calibration you need to actually adjust the covariance matrix, which kind of makes sense; you need a scale factor or something. And once you fine-tune the numbers, as we typically do when we're actually working on these kinds of tasks, you can see that the two-stage process is the best one, and it is better than our old two-step approach of a separate system followed by a backend.

Okay, so that's the discriminative training part. The other thing I want to talk about is the out-of-set problem that was mentioned in a question earlier, because oftentimes we're interested in tasks where there could be another language that is not one of the closed set.

The nice thing about the two-covariance mathematics that we've been using for speaker recognition is that it has built into it a model for what out-of-set is supposed to be. I already mentioned essentially that if you have a Gaussian distribution of what all models look like, then an out-of-set language is a randomly drawn language from that pool, and that's represented by the Gaussian distribution.

Then at test time you again have an even bigger Gaussian, because the uncertainty is both the channel plus which language it was. So now the out-of-set class is also a Gaussian, but it has the bigger covariance, while all the others have the shared covariance, which is smaller; so you no longer have a linear system when you make a comparison.

This is the most general formula, for when you have an open-set problem that is both out-of-set and closed-set; this is how you would combine them. This term is what I had before, the sort of Bayes' rule competition among all the other closed-set classes, and this is the new distribution, the out-of-set distribution. If you want a pure out-of-set problem, which is what I'm going to talk about here, you just set the probability of out-of-set to one, but in fact you could make a mixed distribution as well.
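Putting the pieces together, here is a sketch of that general open-set scoring, reusing the helpers above; the zero-mean prior and the p_oos mixing weight are my notational choices:

```python
import numpy as np
from scipy.special import logsumexp

def open_set_llr(x, i, means, W, B, p_oos=1.0):
    """Detection LLR for class i with an open-set non-target
    hypothesis: a mixture of the other closed-set classes (flat
    prior) and an out-of-set class modeled as N(0, W + B), i.e. a
    random language from the prior seen through the channel.
    p_oos = 1 gives the pure out-of-set problem.
    """
    log_liks = gaussian_log_likelihoods(x, means, W)
    oos = gaussian_log_likelihoods(x, np.zeros((1, len(x))), W + B)[0]
    parts = [np.log(p_oos) + oos]
    if p_oos < 1.0:
        others = np.delete(log_liks, i)
        closed = logsumexp(others) - np.log(len(others))
        parts.append(np.log(1.0 - p_oos) + closed)
    return log_liks[i] - logsumexp(parts)
```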

Okay, so for the out-of-set case, let me just touch on what I have. If I were to use the Bayesian numerator for each class, as I mentioned before, together with this denominator, then I would have what I like to call Bayesian speaker comparison, which there's a paper about. It is the same answer as PLDA or the two-covariance model, and I'd like to emphasize that: they're set up differently, so the numerator and denominator are different in the two formulations, but the ratio is the same thing, because it's the same model, and the same correct answer.

For the formalism I'm talking about here, I find the philosophy much easier to understand in this context. Daniel and I have spent a lot of time on this, and I can only see it from this perspective. In this terminology, we say that we have a model for each class and that the covariances are hyperparameters; in the other terminology, you like to say that there is no model and that the parameters of the system are the covariance matrices. Again, it's the same system and the same answer, just a different perspective. But when we're talking about closed-set and ML models, I know how to say that in this context, and I don't know so well how to say it in the PLDA one.

So, discriminative training of the out-of-set model that I described: as I've said, I now have this MMI hammer in my toolbox, and this is just one more covariance that I can train; I've got an across-class mean and covariance. The ML out-of-set system just takes the sample covariance matrices for all of these. But I can do an MMI update of this out-of-set class as well. The simplest way for me to do it is to take the closed-set system I already presented, freeze the closed-set models, and then separately update the out-of-set model given the closed-set models. I can do that by scoring one versus rest instead of scoring with Bayes' rule, doing a round robin on the same training set. The advantage of this is that I can actually build a system without ever having any out-of-class data. I'd probably do better if I really did have out-of-class data, but in this case I don't, and I can still build a perfectly legitimate system.

So, for the performance of this system, what I've done here is score this LRE, even though there is no out-of-set data, without Bayes' rule, so the system is not allowed to know what the other classes were; that simulates an open-set scoring function.

For the ML version of this, the actual C_avg is off the chart; it's like the kind of bad numbers I started with. With MMI training of the closed-set system but the ML version of the across-class covariance, it is in fact already a lot better, so whatever is happening in the closed-set discriminative training is helping the open-set scoring as well. But explicitly retraining the out-of-set covariance matrix with the same mechanism, the scale factor and then the mean, in fact performs pretty reasonably: a system which is not obviously uncalibrated, with pretty reasonable performance. The closed-set scoring performance is still down here, but this has gotten a lot better, and it's perfectly feasible.

So, the two contributions here were, first, the single-system concept: we don't have to do system design and then a backend; we can discriminatively train the system to already be calibrated. And second, we can model out-of-set using the same mathematics that we have in speaker recognition, but a simpler version, because we don't need to be Bayesian in this case, and it can also be discriminatively updated so that we can be reasonably calibrated for the open-set task as well. So, thanks.

It was very nice to see that you unified those two parts of the system; I wish we could do that in speaker recognition. My question is about your maximum likelihood across-class covariance: you've got twenty-four languages to work with in a six-hundred-dimensional i-vector space, so how did you estimate or assign that parameter?

It is the sample covariance, so everything here was done with dimension reduction at the front, down to twenty-three dimensions. I'm sorry, I should have said: the setup already specified that there would be twenty-three dimensions, and anything that has a prior is limited to twenty-three dimensions. In this case I just took the sample covariance matrix; if you regularized it somehow, you could make it appear to be bigger, to be full size.

Okay, so those formulas you showed with the covariances, that all happens in twenty-three-dimensional space?

Yes.

So in this case you're doing LDA, and then, to my thinking, isn't that the same as doing a Gaussian backend and another calibration? LDA and a regression backend, that's basically what this evaluates to.

Well, here the sample covariances were computed in the full space, but the across-class covariance is only rank twenty-three.

So you take the six-hundred-dimensional within-class and map it down to twenty-three?

Yes.

So if you do LDA and then a Gaussian backend, it's the same subspace as LDA? Whether you take the output of LDA in twenty-three dimensions, or you take the Gaussians and get twenty-four scores, it's almost the same thing; so you're still doing two steps.

It's still just two steps, but in my view the steps are the ML estimation, which in this case forces you to be twenty-three-dimensional, and then the discriminative update of those equations.

But LDA with a Gaussian backend is really very close to that.

Well, the way we would have done a system before would be LDA, then a Gaussian in that space, and then MMI training in the score space, on the likelihood ratios of the first stage. This is MMI training in the i-vector space directly. But these are not very complicated mathematics; the two are pretty closely related, yes.

So when you did the joint diagonalization, you then work with diagonal covariance matrices; but you're also updating the covariance matrices in training, so is that diagonalization still valid then? I mean, you do the one static projection, so what does it mean when you then force things to stay diagonal?

The entire thing can be mapped back, by undoing the diagonalization, into a full covariance. So in some sense you are still updating a full covariance, but only in a constrained way. The matrix is still full size, but the number of parameters that you discriminatively update is not the full set.

So, if I remember correctly, you're doing closed-set with twenty-three or twenty-four languages, is that correct? Twenty-four languages, right. So is it possible, and I don't want to change your problem, but if you were to look at a subset, say you pick twelve and treat the others as completely out-of-set data, so you do the training on only a portion, since we don't have real out-of-set data? Do you have some sense of how strong your solution would be if you didn't have access to those similar-sounding languages that you want to reject?

I think it's an interesting thought that you could more extensively test this out-of-set hypothesis by doing a hold-one-out or something and round-robin it, and I think that's an interesting idea, but I haven't done it.