0:00:15 This is very briefly what the topic is about: we are working with probabilistic linear discriminant analysis (PLDA), which has previously been improved by discriminative training.
0:00:30 Previous studies use loss functions that essentially focus on a very broad range of applications, so in this work we try to train the PLDA in a way that makes it
0:00:42 more suitable for a narrow range of applications,
0:00:47 and we observe a small improvement in the minimum detection cost by doing so.
0:00:55 As background: when we use a speaker verification system we would like to minimize the expected cost of our decisions,
0:01:06 and this is very much reflected in the detection cost function that is used.
0:01:12 So we have a cost for false rejection, a cost for false alarm, and also a prior, which together we can say constitute the operating point of our system,
0:01:22 and which of course depends on the application.
0:01:25 The target here is to build an application-specific system that is optimal for one or several operating points rather than overall;
0:01:37 it is, in other words, more specific to the application.
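As a small sketch of the quantity just described (the standard detection cost; the example values in the test are illustrative, not the ones from the talk):

```python
def detection_cost(p_miss, p_fa, p_tar, c_miss, c_fa):
    """Expected cost of decisions at the operating point (p_tar, c_miss, c_fa):
    the false-rejection and false-alarm rates weighted by their costs and
    by the prior probability of a target trial."""
    return p_tar * c_miss * p_miss + (1.0 - p_tar) * c_fa * p_fa
```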
0:01:41 This idea has already been explored for score calibration, in the Interspeech paper mentioned here.
0:01:51 However, with score calibration we can reduce the gap between the actual detection cost and the minimum detection cost,
0:02:00 but we cannot reduce the minimum detection cost itself.
0:02:05 By applying these ideas at an earlier stage of the speaker verification system we could hope to reduce the minimum detection cost as well,
0:02:14 so we will apply them to discriminative PLDA training.
0:02:25 We use a method that has previously been developed for discriminative PLDA training.
0:02:33 The key point here is that the log-likelihood-ratio score of the PLDA model is given by this kind of formula, a quadratic function of the two i-vectors,
0:02:47 and we can apply some discriminative training criterion to it.
0:03:00 That is, we basically take all possible pairs of i-vectors in the training database and minimize some loss function L over their scores,
0:03:12 possibly with a regularization term added.
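The setup above can be sketched as follows. This is an illustrative reading, not the paper's code: the quadratic-form score uses parameter names (Lam, Gam, c, k) and shapes of my choosing, and the objective simply sums a weighted loss over precomputed trial scores plus an L2 regularizer.

```python
import numpy as np

def plda_llr(x1, x2, Lam, Gam, c, k):
    """PLDA log-likelihood-ratio score for one i-vector pair,
    written as a quadratic function of the two vectors."""
    return (x1 @ Lam @ x2 + x2 @ Lam @ x1
            + x1 @ Gam @ x1 + x2 @ Gam @ x2
            + (x1 + x2) @ c + k)

def pairwise_objective(scores, is_target, weights, loss, theta, theta_reg, lam):
    """Weighted loss over all trials plus an L2 penalty toward theta_reg."""
    sign = np.where(is_target, 1.0, -1.0)   # +1 for target, -1 for non-target
    data_term = np.sum(weights * loss(sign * np.asarray(scores)))
    return data_term + lam * np.sum((theta - theta_reg) ** 2)
```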
0:03:23 But this alone is not enough:
0:03:30 when we talk about how to target the system at certain operating points, we need to consider two things.
0:03:42 One is the weights beta we have here, which differ per trial;
0:03:48 in essence they will be different for target and non-target trials.
0:03:52 The other is the loss function. Put very simply,
0:03:57 the weights decide which operating point we are targeting, whereas the choice of loss function decides how much emphasis we put on the surrounding operating points.
0:04:18 Just a bit more about the weights beta.
0:04:24 As probably several of you know, for an application given by the three parameters, the probability of a target trial and the two costs, we can rewrite the cost:
0:04:36 there is an equivalent cost-(1,1) application, with a single effective prior, whose loss in training or evaluation is proportional to the original one,
0:04:46 so we can just as well consider this equivalent application
0:04:51 and minimize for that.
0:04:56 In that way we make sure that our system will also be good at
0:05:05 the operating points we are looking at.
0:05:08 So essentially we need to rescale the weight of
0:05:16 every trial so that the effective
0:05:21 proportion of target trials in the
0:05:25 training database matches the effective prior of the application we consider,
0:05:29 because the evaluation trials are compared against
0:05:33 what the model saw in training.
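A minimal sketch of this reweighting, using the standard effective-prior formula; the exact weighting scheme in the paper may differ in details:

```python
import numpy as np

def effective_prior(p_tar, c_miss, c_fa):
    """Equivalent prior of the cost-(1,1) application."""
    return p_tar * c_miss / (p_tar * c_miss + (1.0 - p_tar) * c_fa)

def trial_weights(is_target, p_eff):
    """Per-trial weights such that target trials carry total weight p_eff
    and non-target trials carry total weight 1 - p_eff."""
    is_target = np.asarray(is_target, dtype=bool)
    n_tar, n_non = is_target.sum(), (~is_target).sum()
    return np.where(is_target, p_eff / n_tar, (1.0 - p_eff) / n_non)
```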
0:05:38 Regarding the choice of loss function:
0:05:42 previous studies on discriminative PLDA training use the logistic regression loss or the SVM hinge loss.
0:05:49 The logistic regression loss, which is essentially the same as the Cllr loss, is justified as an application-independent
0:05:59 evaluation metric, so it could be suitable as a loss function if we want to
0:06:03 target a very broad range of applications.
0:06:07 What we want to examine here is whether,
0:06:10 by targeting a more narrow range of applications, of operating points,
0:06:16 we can get better performance for such operating points.
0:06:22 The loss
0:06:24 that corresponds exactly to the detection cost would be the
0:06:29 zero-one loss,
0:06:31 and we will also consider one loss
0:06:35 function that is a little bit broader than the
0:06:38 zero-one loss but a bit more narrow than the logistic regression loss, which is the hinge loss.
0:06:49 For an explanation of why this is the case, I can refer to the Interspeech paper, which is very interesting.
0:07:01 Here I am showing a picture of how these different losses look.
0:07:06 The blue one is the logistic regression loss, which is
0:07:11 convex,
0:07:16 but it can also be sensitive to outliers, because
0:07:21 for some
0:07:22 unusual trial, say a target trial that happens to score
0:07:26 far on the wrong side, for example,
0:07:29 the loss can be very large, whereas for the zero-one loss we have
0:07:33 some fixed cost here, and once we pass the threshold, which is here, there is
0:07:37 no cost.
0:07:38 So basically the logistic regression loss can be very large for some
0:07:43 strange points in our database, and
0:07:51 our system may become very much adjusted to
0:07:56 them.
0:07:57 We target the hinge loss and the zero-one loss here, as I said,
0:08:02 with a couple of approximations that we will later use.
0:08:07 For the zero-one loss we use this sigmoid approximation
0:08:10 in order to do optimization; it includes a parameter alpha that makes it
0:08:16 more and more similar to the zero-one loss when you increase it, and here we
0:08:20 show it for
0:08:24 alpha equal to one, ten, and a hundred.
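The losses being compared can be sketched as functions of a sign-adjusted score t (positive t meaning the decision margin is on the correct side); the exact scaling used in the paper may differ:

```python
import numpy as np

def logistic_loss(t):
    """Logistic-regression / Cllr-style loss: convex, uses every trial."""
    return np.log1p(np.exp(-np.asarray(t)))

def hinge_loss(t):
    """SVM hinge loss: between logistic and zero-one in emphasis."""
    return np.maximum(0.0, 1.0 - np.asarray(t))

def zero_one_loss(t):
    """Exact mis-decision indicator, matching the detection cost."""
    return (np.asarray(t) < 0).astype(float)

def sigmoid_loss(t, alpha):
    """Differentiable surrogate approaching zero-one as alpha grows."""
    return 1.0 / (1.0 + np.exp(alpha * np.asarray(t)))
```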
0:08:39 There are a couple of problems, though. The real zero-one loss is not differentiable,
0:08:43 so we use this sigmoid function instead.
0:08:50 The sigmoid loss, like the real zero-one loss, is also non-convex, so we
0:08:55 take an approach where we gradually increase the non-convexity:
0:09:02 for the sigmoid loss it means we start from the logistic regression model —
0:09:06 we also tried starting from the ML model, but it is better to start from the logistic
0:09:10 regression model —
0:09:12 and then increase alpha gradually; there are other papers doing that for
0:09:17 other applications.
0:09:20 We do something similar for the hinge loss, where we start from the logistic regression
0:09:25 model and then train with the hinge loss.
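The annealing idea can be illustrated with a toy example. Everything below is made up for the demonstration (the data, the 1-D calibration-style model a*s + b, and the use of SciPy's BFGS); it only shows the warm-start scheme the talk describes: solve the convex logistic-regression problem first, then re-optimize the sigmoid loss with increasing steepness alpha, warm-starting each stage from the previous solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Toy scores: targets around +1, non-targets around -1.
s = np.concatenate([rng.normal(1, 1, 200), rng.normal(-1, 1, 200)])
y = np.concatenate([np.ones(200), -np.ones(200)])   # +1 target, -1 non-target

def obj(theta, alpha=None):
    t = y * (theta[0] * s + theta[1])        # sign-adjusted scores
    if alpha is None:                        # convex logistic-regression loss
        return np.mean(np.log1p(np.exp(-t)))
    return np.mean(1.0 / (1.0 + np.exp(alpha * t)))   # sigmoid loss

theta = minimize(obj, x0=[1.0, 0.0], method="BFGS").x   # warm start
for alpha in (1.0, 10.0, 100.0):             # gradually sharpen toward zero-one
    theta = minimize(obj, x0=theta, args=(alpha,), method="BFGS").x
```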
0:09:39 Regarding the experiments: we use the main telephone trials,
0:09:47 and a couple of different databases. As development set we use
0:09:51 SRE06, which we use in
0:09:54 the usual way,
0:09:56 and then SRE08 and SRE10 are intended for testing,
0:10:02 plus the standard datasets for PLDA training.
0:10:09 Here you see the number of i-vectors and speakers with and without including SRE06:
0:10:14 we first used it as the development set, but sometimes we included it in
0:10:19 the training set, after we had decided on the parameters, to get a little bit
0:10:23 better performance.
0:10:26 And we conducted four experiments.
0:10:30 I should also say that we target the operating point mentioned here, which
0:10:35 has been the standard operating point in several NIST evaluations.
0:10:42 Briefly, the four experiments: the first just considers a couple of different normalization and regularization
0:10:47 techniques, because we were not sure about what is best, although it is not really
0:10:52 related to the topic of this paper.
0:10:56 In the second experiment we compare the different loss functions that I mentioned,
0:11:01 and then we analyze the effect of calibration. Finally, we try to
0:11:06 investigate a little bit
0:11:09 whether the choice of beta according to the formula shown before is actually suitable or not.
0:11:18 For regularization there are two options that are popular, I guess,
0:11:24 in this kind of
0:11:27 topic: we can regularize towards zero, which is the
0:11:32 most common, or regularize towards the ML model, I mean the normal generatively trained model,
0:11:40 which makes sense because logistic regression is also, in that sense, a
0:11:44 maximum likelihood approach.
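The two regularizers being compared can be sketched directly (a minimal reading; the penalty may be applied to a different parameterization in the paper):

```python
import numpy as np

def reg_towards_zero(theta, lam):
    """Standard L2 regularizer, pulling parameters towards zero."""
    return lam * np.sum(np.square(theta))

def reg_towards_ml(theta, theta_ml, lam):
    """Regularizer pulling parameters towards the generatively (ML)
    trained PLDA parameters instead of towards zero."""
    return lam * np.sum(np.square(theta - theta_ml))
```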
0:11:48 We also compare whitening with the within-class covariance against whitening with the
0:11:54 full total covariance.
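A sketch of the two whitening options (Cholesky-based; an illustrative implementation, not necessarily the one used in the paper):

```python
import numpy as np

def whitening_transform(cov):
    """Transform W such that W @ cov @ W.T = I (Cholesky-based)."""
    return np.linalg.inv(np.linalg.cholesky(cov))

def total_covariance(X):
    """Covariance of all i-vectors, rows = vectors."""
    return np.cov(X, rowvar=False)

def within_class_covariance(X, spk):
    """Average covariance around each speaker's mean."""
    Xc = np.vstack([X[spk == s] - X[spk == s].mean(axis=0)
                    for s in np.unique(spk)])
    return Xc.T @ Xc / len(Xc)
```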
0:11:58 We found that, in terms of minDCF and EER,
0:12:04 whitening with the
0:12:06 full total covariance and regularization towards the maximum likelihood model led to better
0:12:12 performance, so we use that in
0:12:14 the remaining experiments.
0:12:19 So, comparing loss functions.
0:12:31 First we should say that the discriminative training schemes give a better actual
0:12:36 detection cost than standard maximum likelihood training, but that is to be expected, since
0:12:43 they do calibration at the same time;
0:12:47 however, it is not great calibration,
0:12:50 as we will discuss
0:12:52 in a moment.
0:12:56 After calibration, the maximum likelihood model is very competitive,
0:13:04 but we can see some improvement from
0:13:08 the application-specific loss functions compared to logistic regression in the minimum detection cost,
0:13:17 although on one of the evaluation sets there is no such gain.
0:13:29 The discriminative schemes beat the standard maximum likelihood model, which has a bit worse calibration; but
0:13:34 since we can
0:13:36 fix that by doing calibration,
0:13:40 in order to make a fair comparison we will
0:13:43 also consider calibration here.
0:13:45 To do it we need to use some portion of
0:13:50 the training data: here you see that we use fifty, seventy-five, ninety,
0:13:54 or ninety-five percent of the
0:13:57 training data for PLDA training and the rest for calibration.
0:14:01 We use the Cllr loss here, which is essentially the same as
0:14:05 logistic regression,
0:14:07 weighted for the operating point that we are targeting.
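One possible reading of this calibration step, as a sketch: learn an affine map a*s + b on the held-out scores by minimizing a prior-weighted logistic-regression (Cllr-style) loss. The optimizer choice and the exact weighting are my assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def train_calibration(scores, is_target, p_eff):
    """Fit an affine score calibration a*s + b by minimizing the
    prior-weighted logistic-regression loss on held-out scores."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    w = np.where(is_target, p_eff / is_target.sum(),
                 (1.0 - p_eff) / (~is_target).sum())
    sign = np.where(is_target, 1.0, -1.0)
    logit = np.log(p_eff / (1.0 - p_eff))    # shift to the operating point
    def loss(ab):
        t = sign * (ab[0] * scores + ab[1] + logit)
        return np.sum(w * np.log1p(np.exp(-t)))
    return minimize(loss, x0=[1.0, 0.0], method="BFGS").x
```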
0:14:11 In these experiments SRE06 is not included.
0:14:15 The results look like this, and the first thing to say is that
0:14:19 applying the calibration model to the maximum likelihood system
0:14:23 gives better results than discriminative training without calibration.
0:14:29 The second thing is that
0:14:32 the discriminative training here also
0:14:35 benefited from calibration, which must be explained by
0:14:39 the fact that we are using regularization.
0:14:52 Overall we can say that using seventy-five percent of the data for discriminative training
0:15:00 and the rest for calibration was optimal.
0:15:09 We also
0:15:13 noticed that logistic regression
0:15:16 performs quite badly with a very small amount of training data, that is, the fifty percent split,
0:15:22 whereas the hinge loss and the zero-one loss
0:15:26 perform better.
0:15:28 This is probably because of the shapes of
0:15:33 those two loss functions.
0:15:38 If I can go back to this
0:15:42 figure here:
0:15:43 the zero-one loss, for example,
0:15:46 does not make much use of a trial with a
0:15:50 score out here,
0:15:51 as the logistic regression loss would, and
0:15:56 that means that the logistic regression loss will use more of the data.
0:16:00 So what happens here,
0:16:05 I think, is that,
0:16:08 since we do regularization towards the ML model,
0:16:11 the zero-one loss mostly leaves the model close to the ML model
0:16:15 and does not change it so much when we reduce the amount of training data.
0:16:26 Also, regarding the
0:16:35 choice of beta:
0:16:40 the formula is optimal only under the assumption that the
0:16:43 trials in the database are all independent, which is not the case here, because we have
0:16:47 made up the training data by
0:16:49 pairing all the i-vectors.
0:16:52 And of course it also assumes that the
0:16:56 training database and the evaluation database have somewhat similar properties, which probably is
0:17:01 also not really the case.
0:17:03 So the optimal beta could be different from
0:17:09 the one given by the formula.
0:17:12 This may look a bit strange, but basically we want to
0:17:15 check a couple of different values for beta, which means for the
0:17:21 effective prior.
0:17:22 So we do a kind of parameter sweep:
0:17:28 we use a parameter gamma such that when gamma is 0.5
0:17:33 we use the standard
0:17:37 effective prior according to the formula; when gamma is equal to one
0:17:41 we use
0:17:43 an effective prior of one, which means all weight goes to the target trials;
0:17:48 and when it is zero
0:17:51 all weight goes to the non-target trials.
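The talk only states the three anchor points of this sweep, not the mapping between them, so the interpolation below is purely my assumption, chosen to match those anchors:

```python
def swept_prior(gamma, p_eff):
    """Effective prior as a function of the sweep parameter gamma:
    gamma=0   -> 0     (all weight on non-target trials),
    gamma=0.5 -> p_eff (the value from the formula),
    gamma=1   -> 1     (all weight on target trials).
    Piecewise-linear interpolation; the talk does not give the exact mapping."""
    if gamma <= 0.5:
        return 2.0 * gamma * p_eff
    return p_eff + (2.0 * gamma - 1.0) * (1.0 - p_eff)
```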
0:17:58 We use the hinge loss in this experiment,
0:18:01 and again do not include
0:18:04 SRE06.
0:18:09 This figure is a little bit interesting, I think.
0:18:14 First, it seems that the choice matters much more for the actual detection cost
0:18:19 than for the minimum detection cost — but remember that here we did not apply calibration.
0:18:29 It is also clear that the best choice is not
0:18:32 the one calculated with the formula we saw before;
0:18:37 that is very interesting, and
0:18:39 an area that should be explored more.
0:18:47 I should probably also have said that
0:18:50 the EER actually goes up a little bit there, which is very noticeable, and I am
0:18:53 not sure why,
0:18:58 because that point is actually
0:19:06 the effective prior according to the formula, which we used in the other experiments.
0:19:11 Anyway, we can see this pattern clearly with regularization towards the ML model;
0:19:16 this plot is with that regularization.
0:19:27 A very interesting
0:19:30 thing is that the minimum detection cost actually goes down a
0:19:35 little bit here,
0:19:36 which are
0:19:39 the cases where we use, for training, only target trials
0:19:45 or only non-target trials — we give all weight to the target trials or all weight to the non-target trials.
0:19:50 I think the explanation of this result is that
0:19:57 it should really not work, but
0:20:00 because we do regularization towards the ML model, the result stays very close to
0:20:05 the ML model, which is already a sensible system.
0:20:09 We have also included
0:20:13 the results for regularization towards zero, where we can see that this is not
0:20:17 the case.
0:20:20 So, in conclusion:
0:20:23 we can see that application-specific loss functions can sometimes improve the performance, but quite often there is not
0:20:28 so much difference.
0:20:30 We tried different optimization strategies,
0:20:35 and what we should say about that is that starting from the
0:20:38 logistic regression model is important — the starting point is important — but the scheme of
0:20:43 gradually increasing the
0:20:47 non-convexity was not so
0:20:50 effective, actually, though I will not discuss the details of it here.
0:20:57 So the optimization is something to consider. Also,
0:21:01 since it seems that the weights beta have
0:21:06 some importance but the formula is not optimal here,
0:21:10 what we probably should do is consider a better estimate of beta,
0:21:14 maybe something that depends on other factors than just whether it is a target or
0:21:19 non-target trial.
0:21:24 Also, the discriminatively
0:21:28 trained models
0:21:32 needed calibration, so I think it could be interesting to
0:21:35 make the regularization target calibrated as well:
0:21:42 we have a parameter vector with the regularization target built in, so we could do calibration of
0:21:47 the ML model and then
0:21:50 put the parameters from the calibration into the regularization target,
0:21:54 so that we actually do
0:21:59 regularization towards something that is calibrated.
0:22:02 Okay, so,
0:22:03 something like that.
0:22:29 [Audience] If I may ask:
0:22:31 what optimizer did you use? [Speaker] We used the BFGS algorithm.
0:22:36 [Audience] Ah, I see.
0:22:38 Okay, so
0:22:40 you mentioned there were some issues with the non-convexity of your objective.
0:22:47 Hidden in the work that I
0:22:50 presented this morning,
0:22:53 I also had issues with non-convexity,
0:22:59 and with BFGS it was a problem,
0:23:02 because basically
0:23:04 BFGS forms a rough approximation to the inverse of the Hessian, right, and
0:23:10 if you do BFGS, that Hessian approximation is going to
0:23:15 be positive definite.
0:23:18 So
0:23:19 BFGS cannot see
0:23:21 the non-convexity, right, and it cannot do anything about it.
0:23:25 [Speaker] Yes, I think that is a good point, and we should consider some
0:23:30 better optimization algorithm.
0:23:32 We can confirm that we reduced the value of the objective function
0:23:36 quite significantly, but maybe
0:23:40 we could have done much better in that aspect by using something more suitable, I
0:23:44 think so.
0:23:45 [Audience] In my case there was a simple solution: I could calculate the full Hessian and
0:23:50 invert it without problems, because I had very few parameters, and I could do an eigenvalue
0:23:56 analysis and then go down the steepest negative eigenvector to get out of
0:24:02 the non-convex regions, right. But for you the dimensionality is
0:24:05 very high.
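The questioner's escape trick can be sketched as follows (a minimal illustration, assuming an exact symmetric Hessian small enough to eigendecompose, which is exactly the low-dimensional setting he describes):

```python
import numpy as np

def escape_direction(hessian):
    """Direction of most negative curvature, if any: the eigenvector of the
    exact (symmetric) Hessian with the most negative eigenvalue. Stepping
    along it leaves a saddle / non-convex region; returns None if the
    Hessian is positive semi-definite."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigenvalues ascending
    return eigvecs[:, 0] if eigvals[0] < 0 else None
```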
0:24:08 You could perhaps do some other things, but it is more of a research question, I am afraid.
0:24:13 [Speaker] Yes. Thank you.
0:24:21 [Audience question inaudible]
0:25:04 [Speaker] Okay, well, basically because we are not doing calibration here; we are doing discriminative training of
0:25:09 the PLDA.