first very shortly about the topic

we are working with probabilistic linear discriminant analysis and it has

previously been improved by discriminative training

previous studies used loss functions that essentially focus on a very broad

range of applications so in this work we are trying to

train the plda in a way that it becomes

more suitable for

a narrow range of applications

and we observe a small improvement in the minimum detection cost by doing so

so as a background

when we use a speaker verification system we would like to minimize the expected

cost

from our decisions

and this is very much reflected in the detection cost of the

nist evaluations

so we have a cost for false rejection and false alarm and also a

prior which we can say together constitute the operating point of our system

and which of course depend on the application

so the target here is to build an application specific system that is optimal for

one or several

operating points rather than all of them

it is more specifics of the same

and this kind of idea has already been explored for score calibration

in the interspeech paper mentioned there

however with score calibration we can reduce the gap between actual

detection cost and minimum detection cost

but we cannot reduce the minimum detection cost

by applying this kind of idea to some earlier stage of the speaker verification system we

could hope to reduce also the minimum detection cost
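The difference between actual and minimum detection cost can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the talk: the function names, the toy scores and the unnormalized cost are assumptions, and the cost follows the usual weighted sum of miss and false alarm rates.

```python
import math


def detection_cost(p_miss, p_fa, p_tar, c_fr=1.0, c_fa=1.0):
    """Expected cost of the decisions at one operating point."""
    return c_fr * p_tar * p_miss + c_fa * (1.0 - p_tar) * p_fa


def error_rates(tar_scores, non_scores, threshold):
    """Miss and false alarm rates when we accept scores >= threshold."""
    p_miss = sum(s < threshold for s in tar_scores) / len(tar_scores)
    p_fa = sum(s >= threshold for s in non_scores) / len(non_scores)
    return p_miss, p_fa


def actual_dcf(tar_scores, non_scores, p_tar, c_fr=1.0, c_fa=1.0):
    """Cost at the bayes threshold, which trusts the scores as llrs."""
    threshold = math.log(c_fa * (1.0 - p_tar)) - math.log(c_fr * p_tar)
    p_miss, p_fa = error_rates(tar_scores, non_scores, threshold)
    return detection_cost(p_miss, p_fa, p_tar, c_fr, c_fa)


def min_dcf(tar_scores, non_scores, p_tar, c_fr=1.0, c_fa=1.0):
    """Cost at the best threshold, chosen with knowledge of the labels."""
    thresholds = sorted(set(tar_scores) | set(non_scores)) + [float("inf")]
    return min(
        detection_cost(*error_rates(tar_scores, non_scores, t), p_tar, c_fr, c_fa)
        for t in thresholds
    )


# toy scores (assumed): they separate fairly well but are shifted upwards
tar = [7.0, 8.0, 9.0]
non = [4.0, 5.0, 7.5]
print(actual_dcf(tar, non, p_tar=0.5), min_dcf(tar, non, p_tar=0.5))

# subtracting an offset is a very simple calibration
tar_cal = [s - 7.6 for s in tar]
non_cal = [s - 7.6 for s in non]
print(actual_dcf(tar_cal, non_cal, p_tar=0.5), min_dcf(tar_cal, non_cal, p_tar=0.5))
```

In this toy case the offset brings the actual cost down to the minimum cost, while the minimum cost itself does not move, which is the limitation of calibration described above.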

so we will apply it to

discriminative plda training

we use a method that has previously been developed for

discriminative plda training

and the only thing we need to know here is that the

log-likelihood ratio score of the plda model is

given by this kind of formula here

and we can apply some discriminative training criteria to this
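The formula on the slide is not part of the transcript. As an assumption, the sketch below uses the quadratic form that is common in the discriminative plda literature, where the score of a pair of i-vectors is a symmetric quadratic function of the two vectors. The parameter names `Lam`, `Gam`, `c` and `k` are hypothetical.

```python
def quad(a, M, b):
    """Bilinear form a^T M b for plain python lists."""
    return sum(a[i] * M[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def plda_llr_score(x1, x2, Lam, Gam, c, k):
    """Score of the trial (x1, x2), symmetric in x1 and x2 by construction."""
    cross = quad(x1, Lam, x2) + quad(x2, Lam, x1)
    within = quad(x1, Gam, x1) + quad(x2, Gam, x2)
    linear = sum(ci * (a + b) for ci, a, b in zip(c, x1, x2))
    return cross + within + linear + k


# toy parameters (assumed), two dimensional i-vectors
Lam = [[1.0, 0.0], [0.0, 1.0]]
Gam = [[-0.5, 0.0], [0.0, -0.5]]
c = [0.0, 0.0]
k = 0.0

same = plda_llr_score([1.0, 0.0], [1.0, 0.0], Lam, Gam, c, k)
opposite = plda_llr_score([1.0, 0.0], [-1.0, 0.0], Lam, Gam, c, k)
print(same, opposite)
```

Because the score is linear in `Lam`, `Gam`, `c` and `k`, these parameters can be stacked into one vector and trained with a discriminative criterion.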



well only you should be in of the i-vectors

what i should point out is that we basically take all possible pairs of

i-vectors in the training database and minimize some loss function l possibly with some

weights and we also apply some

regularization term
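The training objective described here can be sketched as follows. The scoring function, the loss, the weights and the regularizer are passed in as arguments, and the concrete choices in the example (`toy_score`, unit weights, a small quadratic penalty) are hypothetical stand-ins rather than the setup of the talk.

```python
import itertools
import math


def make_trials(ivectors, speaker_ids):
    """All unordered pairs of i-vectors, labelled +1 for same speaker, else -1."""
    trials = []
    for i, j in itertools.combinations(range(len(ivectors)), 2):
        label = 1 if speaker_ids[i] == speaker_ids[j] else -1
        trials.append((ivectors[i], ivectors[j], label))
    return trials


def objective(theta, trials, score_fn, loss_fn, weight_fn, reg_fn):
    """Weighted sum of per-trial losses plus a regularization term."""
    data_term = sum(
        weight_fn(label) * loss_fn(score_fn(theta, x1, x2), label)
        for x1, x2, label in trials
    )
    return data_term + reg_fn(theta)


# hypothetical stand-ins, only to make the sketch executable
def toy_score(theta, x1, x2):
    return theta * x1 * x2  # not the plda score, just a scalar toy


def logistic_loss(score, label):
    return math.log1p(math.exp(-label * score))


trials = make_trials([1.0, 2.0, -1.0], ["spk_a", "spk_a", "spk_b"])
value = objective(
    0.0, trials, toy_score, logistic_loss,
    weight_fn=lambda label: 1.0,
    reg_fn=lambda theta: 0.1 * theta ** 2,
)
print(len(trials), value)
```

Note that the number of trials grows quadratically with the number of i-vectors, and that trials sharing an i-vector are not independent of each other.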

but this

have been


when we

talk about how we should target a system to be

suitable for certain operating points we need to consider

the part we have here the weights beta which

depend on the trial

in essence they will be different for target and non-target trials

and we also have a loss function and to say it very simply the

weights beta will decide which operating point we are targeting whereas the choice of loss function will

decide how much emphasis we put on surrounding operating points


well just a bit shortly about the weights beta

as probably several of you know when we have some application with

these three parameters the probability of target trial and the two costs we can rewrite it

so that

we have an equivalent application which will have a

loss

in the training or evaluation that is proportional to that of the

first application so

we can as well consider this

equivalent kind of application and minimize that

and as such we make sure that

our system will be optimal also

for the operating point we are looking at

so essentially we need to rescale

every trial so that we get the

percentage of target trials of the

evaluation database we consider

instead of that of

the training database
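The rewriting of the three parameters into one equivalent prior, and the rescaling of the trials, can be sketched like this. The effective prior is the standard one. The normalization of the weights (each class sums to its share of the prior) is an assumption, since the exact formula on the slide is not in the transcript, and the numbers are only illustrative.

```python
def effective_prior(p_tar, c_fr, c_fa):
    """Prior of the equivalent application with both costs equal to one."""
    return p_tar * c_fr / (p_tar * c_fr + (1.0 - p_tar) * c_fa)


def trial_weights(n_tar, n_non, p_eff):
    """Per-trial weights so that target trials carry total weight p_eff."""
    beta_tar = p_eff / n_tar
    beta_non = (1.0 - p_eff) / n_non
    return beta_tar, beta_non


# illustrative numbers (assumed), not the operating point of the talk
p_eff = effective_prior(p_tar=0.01, c_fr=10.0, c_fa=1.0)
beta_tar, beta_non = trial_weights(n_tar=100, n_non=10000, p_eff=p_eff)
print(p_eff, beta_tar, beta_non)
```

With these weights the share of target trials seen by the loss is `p_eff`, no matter how many target and non-target trials the training database happens to contain.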

so regarding the choice of loss function

previous studies of discriminative plda training used the logistic regression loss or the

svm hinge loss

and the logistic regression loss which is essentially the same as the cllr loss is

just like the eer an application independent

evaluation metric so it could be suitable as a loss function if we want to

target a very broad range of applications

what we want to consider here is to

see if by targeting a more narrow range of operating points

we can get better performance for such operating points


well the loss

that i think would correspond exactly to the detection cost would be the

zero one loss

and we will also consider a loss

function which is a little bit broader than the

zero one loss but a bit more narrow than the logistic regression loss which is the brier loss

and to

explain why this is the case i can refer to the interspeech paper which is

very interesting


i'm showing here a picture of how these different loss functions look

and the blue one would be the logistic regression loss which is

convex but

this could also be sensitive to outliers because

maybe i should say this would be by the way how the losses look

for a target trial for example

for the zero one loss we have

some cost here and then if we pass the threshold which is here we have

no cost

but basically this one can be very large for some

point in our database so

our system may be very much adjusted to one outlier

the green one

is the brier loss and we have the zero one loss here as i said

with a couple of approximations that we will later use

we use this sigmoid approximation

in order to do optimization which includes the parameter alpha that makes it

more and more similar to the zero one loss when you increase it and we

have that for

alpha equal to one ten and hundred
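The loss functions discussed here can be written down compactly as functions of a margin, which is positive when a trial falls on the correct side of the threshold. This is a sketch under assumptions: the exact definitions on the slides are not in the transcript, so the forms below are the usual textbook ones, and `margin`, `alpha` and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def margin(score, label, threshold):
    """Positive when the trial is on the correct side of the threshold."""
    return label * (score - threshold)


def zero_one_loss(m):
    return 1.0 if m < 0 else 0.0


def sigmoid_loss(m, alpha):
    """Smooth stand-in for the zero one loss, sharper for larger alpha."""
    return sigmoid(-alpha * m)


def logistic_loss(m):
    """log(1 + exp(-m)), convex but unbounded for very wrong trials."""
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def brier_loss(m):
    """Squared error of the posterior implied by the margin, bounded by one."""
    return sigmoid(-m) ** 2


for alpha in (1.0, 10.0, 100.0):
    print(alpha, sigmoid_loss(-0.5, alpha), sigmoid_loss(0.5, alpha))
```

A badly misclassified trial with a margin of minus fifty costs about fifty under the logistic loss but at most one under the other three, which is the outlier sensitivity mentioned above.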


there are a couple of problems though the real zero one loss is not differentiable

so that's why we use this sigmoid function

also the brier loss and the sigmoid loss are non-convex so we

take an approach here where we kind of gradually increase the non-convexity and

for the sigmoid loss it means we start from the logistic regression model

we also tried from the ml model but it's better to start from the logistic

regression model

and then increase alpha gradually and there are other papers doing that for

other applications

we do something similar for the brier loss where we start from the logistic regression

model and then train with the sigmoid loss before the brier loss
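The idea of gradually increasing the non-convexity can be sketched on a toy problem. Everything below is an assumption made for illustration: a one dimensional linear score instead of plda, a plain gradient descent with backtracking instead of the optimizer used in the work, and the alpha schedule one, ten, hundred.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(m):
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def mean_loss(theta, data, loss):
    """Mean loss of the linear score a * x + b over (x, label) pairs."""
    a, b = theta
    return sum(loss(label * (a * x + b)) for x, label in data) / len(data)


def numeric_grad(f, theta, eps=1e-6):
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def descend(f, theta, steps=200):
    """Gradient descent with backtracking, so f never increases."""
    theta = list(theta)
    for _ in range(steps):
        grad = numeric_grad(f, theta)
        step = 1.0
        while step > 1e-10:
            candidate = [t - step * g for t, g in zip(theta, grad)]
            if f(candidate) < f(theta):
                theta = candidate
                break
            step *= 0.5
        else:
            return theta  # no improving step found, stop early
    return theta


def train_with_continuation(data, alphas=(1.0, 10.0, 100.0)):
    # stage 0: the convex logistic regression loss gives the starting point
    theta = descend(lambda th: mean_loss(th, data, logistic_loss), [1.0, 0.0])
    # later stages: sigmoid loss with growing alpha, each one warm started
    for alpha in alphas:
        theta = descend(
            lambda th: mean_loss(th, data, lambda m: sigmoid(-alpha * m)), theta
        )
    return theta


data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
a, b = train_with_continuation(data)
print(a, b)
```

Each stage only has to make a small correction to the previous solution, which is the motivation for this kind of schedule, although a local minimum can still be reached.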



regarding the experiments we use the main telephone trials and

we use

a couple of different databases and we used as development set

sre06 which we are using to set

the regularization parameters

and then we use sre08 which is intended for testing

and these are kind of standard datasets for plda training

and here are the number of i-vectors and speakers with or without including

sre06 because we use that as development set but sometimes we included it in

the training set after we had decided on the parameters to get a little bit

better performance

and we conducted four experiments

okay i should also say that we target the operating point mentioned here which

has been the standard operating point in several nist evaluations

so the four experiments one is just considering a couple of different normalization and regularization

techniques because we were not sure about what is the best although it's not really

related to the topic of this paper

in the second experiment we just compare the different loss functions

and then we analyze also the effect of calibration and finally we tried to

investigate a little bit

whether the choice of beta according to the formula i had before is actually suitable or

not

so for regularization there are two options that are popular i guess

in this kind of

topics so we can do regularization towards zero which would be

the most common and regularization towards ml and by ml i mean normal generative training

because logistic regression is also in that sense

a maximum likelihood approach

and we compare also whitening with the within class covariance with just whitening with

the full covariance the total covariance

and we found that in terms of mindcf and eer

using just the

total covariance and regularization towards maximum likelihood leads to better

performance so we use that

in the remaining experiments
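The two regularization options differ only in the point the parameters are pulled towards. The sketch below assumes a squared distance penalty and uses a toy quadratic data term so that the minimizer has a closed form. The vectors `theta_ml` and `theta_data` are hypothetical values.

```python
def l2_penalty(theta, center, lam):
    """lam times the squared distance between the parameters and a center."""
    return lam * sum((t - c) ** 2 for t, c in zip(theta, center))


def regularized_minimizer(data_optimum, center, lam):
    """Minimizer of (theta - data_optimum)^2 + lam * (theta - center)^2 per dimension."""
    return [(d + lam * c) / (1.0 + lam) for d, c in zip(data_optimum, center)]


theta_ml = [2.0, -1.0]    # hypothetical generatively trained parameters
theta_data = [3.0, 0.0]   # hypothetical optimum of the data term alone

towards_zero = regularized_minimizer(theta_data, [0.0, 0.0], lam=1.0)
towards_ml = regularized_minimizer(theta_data, theta_ml, lam=1.0)
print(towards_zero, towards_ml)
```

With regularization towards the ml parameters, a weak or uninformative data term leaves the model close to the generatively trained one, whereas regularization towards zero shrinks it away from that model.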

so comparing loss functions

first we should say that the discriminative training schemes had better actual

detection cost than the standard maximum likelihood training but that is kind of expected because

they at the same time do calibration

however not great calibration

which we will discuss

later on

but apart from calibration the maximum likelihood model is very competitive

but we can see some improvement by

these the application specific loss function compared to logistic regression minimum detection cost any all

three and

for sre silly there is no such that


the standard maximum likelihood model had a bit worse calibration but

since we can

take care of that by doing calibration

in order to do a fair comparison we will

also consider that here

so to do it we need to use some portion of

the training data for that so we tried here you see fifty seventy five ninety

or ninety five percent of the

training data for plda training and the rest for calibration

and we use the cllr loss here which is essentially the same as

logistic regression

and use the operating point that we are targeting

and in these experiments sre06 is not included

so the result looks like this and the first thing to say is that

applying calibration to the ml model

will give better results than discriminative training without calibration

the second thing is that the

discriminative training here also

benefited from calibration which must be explained by

the fact that we are using regularization

and also maybe overall we can say that seventy five percent of the data for training

and

using the rest for calibration was the optimal

and also

we notice that the logistic regression

performs quite bad for the very small amount of training data that is the fifty

percent

whereas the brier loss and the zero one loss

perform better

but this is probably because

of those two loss functions


if i can go back to this

figure here

for example the zero one loss

would not make so much use of the data that has a

score like this

whereas the logistic regression would

that means that the logistic regression loss will use more of the data

so what happened here

i think is that

since we do regularization towards the ml model

the zero one loss simply doesn't

change the model so much so it stays at the ml model when

we use very little data

so also the

choice of beta is

optimal only assuming that the

trials in the database are independent which is of course not the case because we have

made up the training data by

pairing all the i-vectors

and also of course it also assumes that the

training database and evaluation database have kind of similar properties which probably is

also not really the case

so the optimal beta could be different from

this according to the formula

so this formula looks a bit strange but basically we want to

check a couple of different values for that which means for the

effective prior

so we just try different values and we make some kind of parametrization which makes sure

that

when this parameter gamma is zero point five

we use the standard

effective prior according to the formula and when gamma is equal to one

we will use

an effective prior of one which means we give all weight to the target trials

and when it's zero we will use

an effective prior of zero which makes sure that we put all weight on the non-target trials

we used the brier loss in this experiment

and also did not include

sre06
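The formula on the slide is not in the transcript, so the following is only one hypothetical parametrization with the three properties described: gamma of one half keeps the effective prior from the formula, gamma of one gives a prior of one, and gamma of zero gives a prior of zero. The name `swept_prior` is made up for this sketch.

```python
def swept_prior(p_eff, gamma):
    """Hypothetical sweep of the effective prior, assuming 0 < p_eff < 1."""
    num = gamma * p_eff
    return num / (num + (1.0 - gamma) * (1.0 - p_eff))


p_eff = 0.1  # illustrative value (assumed)
for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(gamma, swept_prior(p_eff, gamma))
```

The swept prior would then replace the effective prior when the weights beta for target and non-target trials are computed.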

so this figure is a little bit interesting i think

and it seems first like it's much more important for the actual detection cost

than for the minimum detection cost but remember also here we didn't apply calibration

it is clear that the best choice is not

the one that was calculated according to the formula

that's very interesting and maybe an

area that should be more explored

and i should probably have said also that

here it actually goes up a little bit which is very noticeable and i'm

not sure why

because that is actually at

the

effective prior according to the formula which we used in the other experiments

but anyway we can see that pattern for regularization towards the ml model

this is relaxation ones


a very interesting

thing is that it seems that the minimum detection cost actually goes down a

little bit here

which is the cases where we give

all weight to target trials or we give all weight to non-target trials

and i think the reason for this result is that

this should really not work but

because we do regularization towards the ml model it just stays very close to

the ml model and for such kind of a system that would actually be something

so i also included

the results for regularization towards zero where we can see that this is not

the case

so in conclusion

we can see that this sometimes can improve the performance but quite often there is not

so much difference

and we tried different optimization strategies

and

what i should say about that is that starting from the

logistic regression model is important so the starting point is important but this kind of

gradually increasing

the non-convexity was not so

effective actually but i didn't discuss the details about it

so the optimization is something to consider and also

since it seems that the weight beta has

some kind of importance

what we should do is we probably should consider a better estimate of it

maybe something that depends on other factors than just whether it's a target or

non-target trial




also since the discriminatively

trained models

needed calibration i think it could be interesting to

make the regularization towards a

parameter vector where we have built in the calibration parameters so we do calibration of

the ml model and then kind of

put the parameters

from the calibration into the regularization so we actually do

regularization towards something that's calibrated

okay so

or something

might be opposed to

what optimiser did you use

we used the bfgs algorithm

a little attention

okay so

you mentioned there were some issues with the non-convexity of your objective so

in the work that i presented

this morning

i also had issues with non-convexity

and bfgs was a problem

because basically

bfgs forms a rough approximation to the inverse of the hessian right and

if you do bfgs probably that hessian matrix is going to

be positive definite

so

bfgs can't see

the non-convexity right and it can't do anything about it

so i think that's a good point and we should consider some

better optimization algorithm

we could confirm that we have reduced the value of the objective function

quite significantly but maybe

we could have done much better in that aspect by using something more suitable i

think so

in my case there was a simple solution i could calculate the full hessian and

invert that without problems because i had very few parameters and i could do an eigenvalue

analysis and then go down the steepest negative eigenvectors to get out of

the non-convex regions right but for you

the dimensionality is very high also

you could perhaps do some other things but it's more tricky i'm afraid

thank you
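The eigenvalue analysis suggested in this question can be sketched for a problem with only two parameters, where the eigen decomposition of the hessian has a closed form. This is an illustration of the general idea, not the questioner's code. The toy function, the step size and the names are assumptions.

```python
import math


def min_eig_direction(h11, h12, h22):
    """Smallest eigenvalue and unit eigenvector of [[h11, h12], [h12, h22]]."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    lam = mean - radius
    if h12 != 0.0:
        vx, vy = h12, lam - h11
    elif h11 <= h22:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)


def escape_step(point, grad, hessian, step=0.1):
    """Move along the most negative curvature direction, if there is one."""
    lam, (vx, vy) = min_eig_direction(*hessian)
    if lam >= 0.0:
        return point  # locally convex, nothing to escape from
    # pick the sign that does not go uphill with respect to the gradient
    sign = -1.0 if grad[0] * vx + grad[1] * vy > 0.0 else 1.0
    return (point[0] + sign * step * vx, point[1] + sign * step * vy)


def f(x, y):
    return x * x - y * y  # toy saddle, the gradient is zero at the origin


new_point = escape_step((0.0, 0.0), grad=(0.0, 0.0), hessian=(2.0, 0.0, -2.0))
print(new_point, f(*new_point))
```

A gradient based method sees no descent direction at the saddle point, while the negative eigenvalue reveals one. With many parameters the full hessian becomes expensive to form and decompose, which is the difficulty raised in the question.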

which does


okay well basically because we are not doing calibration we are doing discriminative training of

the plda