0:00:15 | This is a very short introduction to the topic.

0:00:18 | We are working with probabilistic linear discriminant analysis (PLDA), which has previously been improved by discriminative training.

0:00:30 | Previous studies used loss functions that essentially focus on a very broad range of applications. In this work we try to train the PLDA model in a way that makes it more suitable for a narrow range of applications,

0:00:47 | and we observe a small improvement in the minimum detection cost by doing so.

0:00:55 | As background: when we use a speaker verification system, we would like to minimize the expected cost of our decisions, and this is very much reflected in the detection cost function.

0:01:12 | We have a cost for false rejection, a cost for false alarm, and a prior, which together constitute the operating point of our system and which of course depend on the application.

0:01:25 | So the target here is an application-specific system that is optimal for one or several operating points, rather than one that is more application-independent.
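The operating point just described enters through the standard detection cost function. A minimal sketch of that formula, using the classic NIST telephone-condition parameters as defaults (an illustration, not necessarily the exact values used in this work):

```python
# Detection cost at a given operating point. An operating point is
# (p_target, c_miss, c_fa); this is the standard NIST-style definition.

def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Expected decision cost at error rates (p_miss, p_fa)."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# Example: 5% misses, 2% false alarms at this operating point
cost = detection_cost(0.05, 0.02)  # 10*0.01*0.05 + 1*0.99*0.02 = 0.0248
```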

0:01:41 | This idea has already been explored for score calibration, in the Interspeech paper mentioned here.

0:01:51 | However, with score calibration we can only reduce the gap between the actual detection cost and the minimum detection cost; we cannot reduce the minimum detection cost itself.

0:02:05 | By applying these ideas at an earlier stage of the speaker verification system, we could hopefully reduce the minimum detection cost as well. So we will apply them to discriminative PLDA training.

0:02:25 | We use a method that has previously been developed for discriminative PLDA training.

0:02:33 | The only thing we need here is that the log-likelihood ratio score of the PLDA model is given by the formula shown, and we can apply some discriminative training criterion to its parameters, with the score being a function of the pair of i-vectors.

0:03:00 | What this means is that we basically take all possible pairs of i-vectors in the training database and minimize some loss function L, possibly with some weight, and we also apply a regularization term.
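The training setup described here can be sketched as follows. This is an illustrative reconstruction rather than the paper's code: the quadratic score form is the one commonly used in discriminative PLDA training, and the names (`beta_tar`, `beta_non`, `lam`) are hypothetical.

```python
import numpy as np

def llr_score(x1, x2, Lam, Gam, c, k):
    # Quadratic form of the PLDA log-likelihood-ratio score for a trial
    # consisting of the i-vector pair (x1, x2).
    return (x1 @ Lam @ x2 + x2 @ Lam @ x1
            + x1 @ Gam @ x1 + x2 @ Gam @ x2
            + (x1 + x2) @ c + k)

def objective(X, spk, params, params_ref, loss, beta_tar, beta_non, lam):
    # Weighted loss over all pairs of training i-vectors, plus an L2
    # regularization term pulling the parameters towards a reference
    # model (e.g. the generatively / ML trained one).
    Lam, Gam, c, k = params
    Lam0, Gam0, c0, k0 = params_ref
    total = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            s = llr_score(X[i], X[j], Lam, Gam, c, k)
            if spk[i] == spk[j]:              # target trial
                total += beta_tar * loss(s, +1)
            else:                             # non-target trial
                total += beta_non * loss(s, -1)
    reg = lam * (np.sum((Lam - Lam0) ** 2) + np.sum((Gam - Gam0) ** 2)
                 + np.sum((c - c0) ** 2) + (k - k0) ** 2)
    return total + reg

# Logistic regression loss on the score, with label y in {+1, -1}
def logistic(s, y):
    return np.log1p(np.exp(-y * s))
```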

0:03:23 | When we consider which operating point we should target, that is, how we should tailor the system to be suitable for certain operating points, we need to consider two parts: first, the weights beta, which depend on the trial; in essence they will be different for target and non-target trials.

0:03:52 | Second, we have the loss function. Put very simply, the weights decide which operating point we are targeting, whereas the choice of loss function decides how much emphasis we put on surrounding operating points.

0:04:17 | Just a short note on the choice of beta.

0:04:24 | As probably several of you know, for any application defined by these three parameters, the probability of a target trial and the two costs, we can rewrite them into an equivalent application with a single effective prior whose loss, in training or evaluation, is proportional to that of the original application.

0:04:46 | We can therefore just as well consider this equivalent application and minimize for it, and in this way our system will be suitable for the operating points we are looking at.

0:05:08 | Essentially, we need to rescale every trial to account for the difference between the proportion of target trials implied by the effective prior and the proportion of target trials in the training database.
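The reduction to a single effective prior, and the trial rescaling it implies, might be sketched like this (the exact normalization used in the paper may differ):

```python
def effective_prior(p_target, c_miss, c_fa):
    # Collapse (p_target, c_miss, c_fa) into one effective prior: an
    # application with this prior and unit costs has a cost function
    # proportional to the original one.
    return c_miss * p_target / (c_miss * p_target + c_fa * (1.0 - p_target))

def trial_weights(p_eff, n_target, n_nontarget):
    # Rescale trials so the total weight on target vs. non-target trials
    # matches the effective prior instead of the training-set proportions.
    return p_eff / n_target, (1.0 - p_eff) / n_nontarget

p_eff = effective_prior(0.01, 10.0, 1.0)   # classic NIST point: ~0.0917
```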

0:05:38 | Regarding the choice of loss function: previous studies on discriminative PLDA training used the logistic regression loss or the SVM hinge loss.

0:05:49 | The logistic regression loss, which is essentially the same as the Cllr loss, has been justified as an application-independent evaluation metric, so it should be suitable as a loss function if we want to target a very broad range of applications.

0:06:07 | What we want to see here is whether, by targeting a more narrow range of applications, that is, of operating points, we can get better performance for those operating points.

0:06:22 | The loss that corresponds exactly to the detection cost would be the zero-one loss. We will also consider a loss function that is a little bit broader than the zero-one loss but a bit more narrow than the logistic regression loss, namely the Brier loss.

0:06:47 | For an explanation of why this is the case, I can refer to an Interspeech paper which is very interesting.

0:07:00 | Here I am showing a picture of how these different loss functions look.

0:07:06 | The blue one is the logistic regression loss, which is convex, but it can be sensitive to outliers. The curves here show the loss for a target trial, for example.

0:07:29 | For the zero-one loss we have some cost here, and then once we pass the threshold, which is here, there is no cost;

0:07:38 | the logistic regression loss, by contrast, can be very large for some unusual data point in our database, so our system may become very much adjusted to such points.

0:07:57 | In between them is the Brier loss, and here is the zero-one loss, as I said, with a couple of approximations that we will use later:

0:08:07 | we use this sigmoid approximation in order to do the optimization. It includes a parameter alpha that makes it more and more similar to the zero-one loss as you increase it; here it is shown for alpha equal to one, ten and one hundred.
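A sketch of the three losses being compared, for a trial with score s and label y = +1 (target) or -1 (non-target). The threshold is placed at s = 0 for simplicity, and the Brier loss shown is the usual squared-error-on-posterior form, which is my assumption of the variant meant here:

```python
import numpy as np

def logistic_loss(s, y):
    # Convex, but grows without bound for badly scored trials,
    # which makes it sensitive to outliers.
    return np.log1p(np.exp(-y * s))

def brier_loss(s, y):
    # Squared error on the posterior: bounded, so less outlier-sensitive
    # than the logistic loss, yet broader/smoother than the zero-one loss.
    p = 1.0 / (1.0 + np.exp(-s))
    t = (y + 1) / 2                       # map {-1, +1} -> {0, 1}
    return (p - t) ** 2

def sigmoid_loss(s, y, alpha):
    # Smooth approximation of the zero-one loss; approaches the hard step
    # as alpha grows (alpha = 1, 10, 100 in the talk).
    z = np.clip(alpha * y * s, -500.0, 500.0)
    return 1.0 / (1.0 + np.exp(z))

# A badly scored target trial: the logistic penalty keeps growing,
# while the Brier and (approximate) zero-one penalties stay bounded by 1.
s_bad = -5.0
penalties = (logistic_loss(s_bad, +1), brier_loss(s_bad, +1),
             sigmoid_loss(s_bad, +1, 100))
```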

0:08:37 | There are a couple of problems, though. The real zero-one loss is not differentiable, so instead we use this smooth function.

0:08:50 | Also, both the Brier loss and the sigmoid loss are non-convex, so we use an approach where we gradually increase the non-convexity.

0:09:02 | For the sigmoid loss this means we start from the logistic regression model (we also tried starting from the ML model, but it is better to start from the logistic regression model) and then increase alpha gradually; there is another paper doing this for other applications.

0:09:20 | We do something similar for the Brier loss: we start from the logistic regression model, then train with the Brier loss, and with the sigmoid approximation of the zero-one loss at the end.
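The annealing idea, start from the convex logistic regression solution and then re-optimize while stepping alpha up through 1, 10, 100, can be illustrated on a toy one-parameter problem. The scores, labels, the bias-only "model" and the plain numerical gradient descent below are all made-up simplifications:

```python
import numpy as np

scores = np.array([-2.0, -0.5, 0.3, 1.5])    # raw trial scores (toy data)
labels = np.array([+1.0, -1.0, +1.0, -1.0])  # +1 target / -1 non-target

def loss(b, alpha=None):
    s = scores + b                            # "model" = a single bias b
    if alpha is None:                         # logistic regression loss
        return np.log1p(np.exp(-labels * s)).mean()
    z = np.clip(alpha * labels * s, -500.0, 500.0)
    return (1.0 / (1.0 + np.exp(z))).mean()   # sigmoid (approx. 0-1) loss

def minimize(f, b0, lr=0.05, steps=2000, eps=1e-5):
    # Plain gradient descent with a numerical gradient; enough for a sketch.
    b = b0
    for _ in range(steps):
        g = (f(b + eps) - f(b - eps)) / (2.0 * eps)
        b -= lr * g
    return b

b = minimize(lambda b: loss(b), 0.0)          # start: logistic regression
for alpha in (1, 10, 100):                    # gradually sharpen the loss,
    b = minimize(lambda b, a=alpha: loss(b, a), b)  # warm-starting each time
```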

0:09:37 | Regarding the experiments, we used the male telephone trials and a couple of different databases. As development set we used SRE'06, which is the usual choice,

0:09:56 | and then we used SRE'08 and SRE'10, which are intended for testing. These are fairly standard datasets for PLDA training.

0:10:07 | Here you see the number of i-vectors and speakers with and without SRE'06 included: we first use it as the development set, but sometimes we include it in the training set after we have decided on the parameters, to get a little bit better performance.

0:10:26 | We conducted four experiments.

0:10:30 | I should also say that we target the operating point mentioned here, which has been the standard operating point in several NIST evaluations.

0:10:42 | The first experiment just considers a couple of different normalization and regularization techniques, because we were not sure which is best, although this is not really related to the topic of this paper.

0:10:56 | In the second experiment we compare the different loss functions discussed above, and then we analyze the effect of calibration. Finally, we try to investigate a little whether the choice of beta according to the formula shown before is actually suitable or not.

0:11:18 | For regularization there are two options that are popular, I guess, in this kind of work: we can regularize towards zero, which would be the most common, or regularize towards the ML model, that is, the normal generatively trained model, which makes sense because logistic regression is, in that sense, also a maximum-likelihood approach.

0:11:48 | We also compare whitening with the within-class covariance against whitening with the full total covariance.

0:11:58 | We found that, in terms of minDCF and EER, whitening with the total covariance and regularization towards the ML model lead to better performance, so we use that in the remaining experiments.

0:12:19 | Now, comparing the loss functions.

0:12:31 | First we should say that the discriminative training schemes give better actual detection cost than standard maximum-likelihood training, but that is to be expected, because they do calibration at the same time,

0:12:47 | although not perfect calibration, as we will discuss.

0:12:56 | After calibration, however, the maximum-likelihood model is very competitive.

0:13:04 | Still, we can see some improvement from the application-specific loss functions compared to logistic regression, in minimum detection cost and EER, on SRE'08,

0:13:17 | whereas on SRE'10 there is no such gain.

0:13:24 | The standard maximum-likelihood model has somewhat worse calibration, but since we can take care of that by doing calibration, we will, in order to make a fair comparison, also consider that here.

0:13:45 | To do this we need to use some portion of the training data: here you see that we use fifty, seventy-five, ninety or ninety-five percent of the training data for PLDA training and the rest for calibration.

0:14:01 | We use the Cllr loss here, which is essentially the same as logistic regression, weighted for the operating point that we are targeting.

0:14:11 | In these experiments SRE'06 is not included.
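A sketch of such a calibration step: fit a linear transform s' = a*s + b on held-out scores by minimizing the prior-weighted logistic (Cllr-style) loss at the target operating point. The synthetic data and learning-rate settings below are hypothetical:

```python
import numpy as np

def train_linear_calibration(scores, labels, p_eff, lr=0.01, steps=5000):
    # Fit s' = a*s + b by gradient descent on the prior-weighted logistic
    # loss (essentially the Cllr criterion at the target operating point).
    # labels: +1 for target trials, -1 for non-target trials.
    w = np.where(labels > 0,
                 p_eff / np.sum(labels > 0),
                 (1.0 - p_eff) / np.sum(labels < 0))
    a, b = 1.0, 0.0
    for _ in range(steps):
        z = -labels * (a * scores + b)
        sig = 1.0 / (1.0 + np.exp(-z))        # derivative of log1p(exp(z))
        a -= lr * np.sum(w * sig * -labels * scores)
        b -= lr * np.sum(w * sig * -labels)
    return a, b

# Hypothetical held-out scores: targets near +2, non-targets near -2
rng = np.random.default_rng(0)
cal_scores = np.concatenate([rng.normal(2.0, 1.0, 200),
                             rng.normal(-2.0, 1.0, 200)])
cal_labels = np.concatenate([np.ones(200), -np.ones(200)])
a, b = train_linear_calibration(cal_scores, cal_labels, p_eff=0.0917)
```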

0:14:15 | The results look like this. The first thing to note is that applying the calibration to the ML model gives better results than discriminative training without calibration.

0:14:32 | The second thing is that the discriminative training here also benefited from calibration, which must be explained by the fact that we are using regularization towards the ML model.

0:14:52 | Overall, we can say that using seventy-five percent of the data for discriminative training and the rest for calibration was optimal.

0:15:09 | We also notice that logistic regression performs quite badly with a very small amount of training data, that is, when using fifty percent, whereas the Brier loss and the zero-one loss perform better.

0:15:28 | This is probably because of how those two loss functions behave.

0:15:38 | If I can go back to this figure: the zero-one loss, for example, does not make much use of a data point with a score like this, whereas the logistic regression loss would, which means that the logistic regression loss will use more of the data.

0:16:00 | So what happened here, I think, is that since we do regularization towards the ML model, the zero-one loss simply does not change the model very much away from the ML model when we have little data.

0:16:26 | Also, the choice of beta that we use is optimal only under the assumption that the trials in the database are all independent, which is not the case here, because we have made up the training data by pairing all the i-vectors.

0:16:52 | It also assumes that the training database and the evaluation database have similar properties, which is probably also not really the case.

0:17:03 | So the optimal beta could be different from the one given by the formula.

0:17:12 | This may look a bit strange, but basically we want to check a couple of different values for beta, and thereby for the effective prior.

0:17:22 | We make a kind of parameterization with a parameter gamma such that when gamma is 0.5, we use the standard effective prior calculated according to the formula; when gamma is equal to one, we use an effective prior of one, which means all the weight goes to the target trials;

0:17:48 | and when it is zero, the parameterization makes sure that all the weight goes to the non-target trials.

0:17:58 | We use the Brier loss in this experiment, and again do not include SRE'06.
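One parameterization with the endpoints described here is a piecewise-linear interpolation of the effective prior; the exact parameterization used in the paper may differ:

```python
def gamma_prior(gamma, p_eff):
    # gamma = 0   -> prior 0   (all weight on non-target trials)
    # gamma = 0.5 -> p_eff     (the formula's effective prior)
    # gamma = 1   -> prior 1   (all weight on target trials)
    if gamma <= 0.5:
        return 2.0 * gamma * p_eff
    return p_eff + (2.0 * gamma - 1.0) * (1.0 - p_eff)

# Sweep of the prior used for the trial weights at a few gamma values
sweep = [gamma_prior(g, 0.0917) for g in (0.0, 0.25, 0.5, 0.75, 1.0)]
```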

0:18:09 | This figure is a little bit interesting, I think.

0:18:14 | At first it seems that the choice matters much more for the actual detection cost than for the minimum detection cost, but remember that here we did not apply calibration.

0:18:29 | It is clear that the best choice is not the one calculated with the formula; that is very interesting, and an area that should be explored more.

0:18:47 | I should probably also have said that the curve actually goes up a little bit here, which is very noticeable and I am not sure why, because that point is actually the effective prior according to the formula, which we used in the other experiments.

0:19:11 | Anyway, we see this pattern only with regularization towards the ML model.

0:19:23 | A very interesting thing is that the minimum detection cost actually goes down a little bit here, that is, in the cases where we give all the weight to the target trials, or all the weight to the non-target trials.

0:19:50 | I think the explanation of this result is that, although this should really not work, because we do regularization towards the ML model the resulting model stays very close to the ML model, and for such a system that can actually be something good.

0:20:09 | That is why we also included the results for regularization towards zero, where we can see that this is not the case.

0:20:20 | In conclusion, we can see that this approach sometimes improves the performance, but quite often there is not much difference.

0:20:30 | We tried different optimization strategies. What I should say about that is that the starting point is important (starting from the logistic regression model rather than the ML model), but gradually increasing the non-convexity was actually not so effective, although I will not discuss the details here.

0:20:57 | So the optimization is something to consider. Also, since the weights beta seem to have some importance, and this is not a simple area, we should probably consider better estimates of beta, maybe something that depends on other factors and not just on whether it is a target or non-target trial.

0:21:21 | And since the discriminatively trained models needed calibration, I think it could be interesting to make the regularization target a parameter vector with the calibration built in: we do calibration of the ML model and then put the calibration parameters into the regularization target, so that we actually regularize towards something that is calibrated.

0:22:29 | [Question] Which optimizer did you use?

0:22:31 | We used the BFGS algorithm.

0:22:38 | [Question] You mentioned there were some issues with the non-convexity of your objective. In the work that I presented this morning, I also had issues with non-convexity, and it was a problem,

0:23:02 | because, basically, BFGS forms a rough approximation to the inverse of the Hessian, and if you use BFGS, that approximated matrix is going to be positive definite.

0:23:18 | So BFGS cannot see the non-convexity, and it cannot do anything about it.

0:23:25 | I think that is a good point, and we should consider some better optimization algorithm. We can confirm that we reduced the value of the objective function quite significantly, but maybe we could have done much better in that respect by using something more suitable, I think, yes.

0:23:45 | [Questioner] In my case there was a simple solution: I could calculate the full Hessian and invert it without problems, because I had very few parameters, and I could do an eigenvalue analysis and then go down the steepest negative eigenvector, which takes you out of the non-convex regions. For you the dimensionality is very high, so you could perhaps do some other things, but it is more tricky, I am afraid.

0:24:13 | Thank you.
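The questioner's trick can be illustrated on a toy saddle-point objective; this is a sketch of the general idea, not code from either paper:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 - y**2           # toy objective: convex in x, concave in y

def hessian(p):
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])  # exact Hessian of the toy objective

p = np.zeros(2)                   # a saddle point: the gradient is zero here
eigvals, eigvecs = np.linalg.eigh(hessian(p))
if eigvals[0] < 0:                # negative curvature detected
    p = p + 0.5 * eigvecs[:, 0]   # step along the steepest negative eigenvector
# f(p) is now below the saddle value even though the gradient there was zero,
# a move that BFGS cannot make, since its inverse-Hessian approximation
# stays positive definite.
```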

0:24:21 | [Question, not audible in the recording]

0:25:04 | Okay, well: basically because we are not doing calibration, we are doing discriminative training of the PLDA.

0:25:12 | Okay.