| 0:00:15 | Here is a very brief overview of the topic. | 
|---|
| 0:00:18 | We are working with probabilistic linear discriminant analysis (PLDA), which has | 
|---|
| 0:00:25 | previously been improved by discriminative training. | 
|---|
| 0:00:30 | Previous studies used loss functions that essentially focus on a very broad | 
|---|
| 0:00:34 | range of applications, so in this work we try to | 
|---|
| 0:00:39 | train the PLDA model in a way that makes it | 
|---|
| 0:00:42 | more suitable for a | 
|---|
| 0:00:44 | narrow range of applications, | 
|---|
| 0:00:47 | and we observe a small improvement in the minimum detection cost by doing so | 
|---|
| 0:00:55 | So, as background: | 
|---|
| 0:00:57 | when we use a speaker verification system we would like to minimize the expected | 
|---|
| 0:01:02 | cost | 
|---|
| 0:01:04 | of our decisions, | 
|---|
| 0:01:06 | and this is very much reflected in the detection cost function that is | 
|---|
| 0:01:10 | then used. | 
|---|
| 0:01:12 | So we have a cost for false rejection and a cost for false alarm, and also a | 
|---|
| 0:01:16 | prior, which we can say together constitute the operating point of our system, | 
|---|
| 0:01:22 | and which of course depend on the application. | 
|---|
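For reference, the detection cost behind this discussion is the standard NIST form (the formula itself is only on the slide, not in the transcript), where $P_{\mathrm{miss}}$ and $P_{\mathrm{fa}}$ are the miss and false-alarm rates of the system at its decision threshold:

$$C_{\mathrm{det}} = P_{\mathrm{tar}}\, C_{\mathrm{miss}}\, P_{\mathrm{miss}} \;+\; (1 - P_{\mathrm{tar}})\, C_{\mathrm{fa}}\, P_{\mathrm{fa}}.$$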
| 0:01:25 | So the target here is to yield an application-specific system that is optimal for | 
|---|
| 0:01:31 | one or several | 
|---|
| 0:01:34 | operating points, rather than one that | 
|---|
| 0:01:37 | is reasonably good across all of them. | 
|---|
| 0:01:41 | A similar idea has already been explored for score calibration, | 
|---|
| 0:01:48 | in the Interspeech paper mentioned here. | 
|---|
| 0:01:51 | However, with score calibration we can only reduce the gap between the actual | 
|---|
| 0:01:57 | detection cost and the minimum detection cost; | 
|---|
| 0:02:00 | we cannot reduce the minimum detection cost | 
|---|
| 0:02:04 | itself. | 
|---|
| 0:02:05 | By applying these ideas at an earlier stage of the speaker verification system, we | 
|---|
| 0:02:10 | could hope to reduce the minimum detection cost as well. | 
|---|
| 0:02:14 | So we will apply them to | 
|---|
| 0:02:17 | discriminative PLDA training. | 
|---|
| 0:02:25 | We use a method that has previously been developed for discriminative | 
|---|
| 0:02:31 | PLDA training. | 
|---|
| 0:02:33 | The key point here is that the | 
|---|
| 0:02:37 | log-likelihood ratio score of the PLDA model is | 
|---|
| 0:02:44 | given by the form shown here, | 
|---|
| 0:02:47 | and we can apply a discriminative training criterion to these | 
|---|
| 0:02:52 | parameters | 
|---|
| 0:02:53 | here, | 
|---|
| 0:02:55 | where the score is a quadratic function of the i-vectors. | 
|---|
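The exact expression is on the slide rather than in the transcript, but in the discriminative PLDA literature the log-likelihood ratio for an i-vector pair $(\phi_1, \phi_2)$ is usually written as a quadratic form whose coefficients collect the PLDA parameters, roughly:

$$s(\phi_1, \phi_2) = \phi_1^{\mathsf T}\Lambda\phi_2 + \phi_2^{\mathsf T}\Lambda\phi_1 + \phi_1^{\mathsf T}\Gamma\phi_1 + \phi_2^{\mathsf T}\Gamma\phi_2 + (\phi_1 + \phi_2)^{\mathsf T} c + k,$$

and it is the coefficients $\Lambda$, $\Gamma$, $c$, $k$ (or the underlying PLDA parameters) that discriminative training adjusts.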
| 0:03:00 | What this means is that we basically take all possible pairs of i-vectors | 
|---|
| 0:03:05 | in the training database and minimize some loss function L over all pairs, with some | 
|---|
| 0:03:11 | weights, | 
|---|
| 0:03:12 | and we also apply a | 
|---|
| 0:03:14 | regularization term. | 
|---|
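Putting the pieces just described together, a minimal way to write the training criterion (the exact regularizer used in the paper is only shown on the slide) is:

$$E(\theta) \;=\; \sum_{\text{trials } (i,j)} \beta_{ij}\, \mathcal{L}\!\big(y_{ij}\, s_{ij}(\theta)\big) \;+\; \lambda\, R(\theta),$$

where $y_{ij} = \pm 1$ marks target and non-target trials, $\beta_{ij}$ are the trial weights discussed next, and $R(\theta)$ pulls the parameters either towards zero or towards the generatively (ML) trained model.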
| 0:03:23 | So, | 
|---|
| 0:03:28 | when we consider | 
|---|
| 0:03:30 | which operating point we should | 
|---|
| 0:03:34 | target, that is, how we should make a system | 
|---|
| 0:03:38 | suitable for certain operating points, we need to consider | 
|---|
| 0:03:42 | the weights β that we have here, which are different, well, | 
|---|
| 0:03:46 | which depend on the trial; | 
|---|
| 0:03:48 | in essence, they will be different for target and non-target trials. | 
|---|
| 0:03:52 | We also have the loss function. To put it very simply, the weights | 
|---|
| 0:03:57 | will depend on which operating point we are targeting, whereas the choice of loss function | 
|---|
| 0:04:04 | will decide how much emphasis we put on surrounding operating points. | 
|---|
| 0:04:17 | So, | 
|---|
| 0:04:18 | just a few words about the formula for β. | 
|---|
| 0:04:24 | As probably several of you know, when we are given, for an application, | 
|---|
| 0:04:29 | these three parameters, the probability of a target trial and the two costs, we can rewrite it | 
|---|
| 0:04:34 | so that we have | 
|---|
| 0:04:36 | an equivalent cost function whose | 
|---|
| 0:04:40 | loss | 
|---|
| 0:04:41 | on the training or evaluation data is proportional to that of the | 
|---|
| 0:04:44 | original application. | 
|---|
| 0:04:46 | So we can just as well consider this | 
|---|
| 0:04:51 | equivalent application instead and minimize its cost; | 
|---|
| 0:04:56 | in this way we make sure that | 
|---|
| 0:05:02 | our system will work well | 
|---|
| 0:05:05 | for the operating points we are looking at. | 
|---|
| 0:05:08 | So, essentially, we need to scale | 
|---|
| 0:05:16 | the weight of every trial so that we get the targeted | 
|---|
| 0:05:21 | proportion of target trials, the one we consider for the | 
|---|
| 0:05:24 | evaluation, | 
|---|
| 0:05:29 | since the proportion is different in | 
|---|
| 0:05:33 | the training database. | 
|---|
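The formula for β itself is on the slide; the standard way to collapse the three application parameters into one effective target prior, and then to reweight the training trials so that their effective proportion matches it, is:

$$P_{\mathrm{eff}} = \frac{P_{\mathrm{tar}}\, C_{\mathrm{miss}}}{P_{\mathrm{tar}}\, C_{\mathrm{miss}} + (1 - P_{\mathrm{tar}})\, C_{\mathrm{fa}}}, \qquad \beta_{ij} = \begin{cases} P_{\mathrm{eff}} / N_{\mathrm{tar}} & \text{target trial} \\ (1 - P_{\mathrm{eff}}) / N_{\mathrm{non}} & \text{non-target trial,} \end{cases}$$

with $N_{\mathrm{tar}}$ and $N_{\mathrm{non}}$ the numbers of target and non-target trials in the training set; whether the paper uses exactly this normalization of the weights is an assumption here.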
| 0:05:38 | so regarding the choice of a loss function | 
|---|
| 0:05:42 | previous studies on discriminative PLDA training used the logistic regression loss or the | 
|---|
| 0:05:47 | SVM hinge loss. | 
|---|
| 0:05:49 | The logistic regression loss, which is essentially the same as the Cllr loss, | 
|---|
| 0:05:54 | corresponds to an application-independent | 
|---|
| 0:05:59 | evaluation metric, so it would be suitable as a loss function if we want to | 
|---|
| 0:06:03 | target a very broad range of applications. | 
|---|
| 0:06:07 | What we want to examine here is, | 
|---|
| 0:06:10 | by targeting a narrower range of applications, or operating points, whether | 
|---|
| 0:06:16 | we can get better performance for such operating points. | 
|---|
| 0:06:20 | And, | 
|---|
| 0:06:22 | well, the loss | 
|---|
| 0:06:24 | that corresponds exactly to the detection cost would be the | 
|---|
| 0:06:29 | zero-one loss, | 
|---|
| 0:06:31 | and we will also consider one loss | 
|---|
| 0:06:35 | function which is a little bit broader than the | 
|---|
| 0:06:38 | zero-one loss but a bit narrower than the logistic regression loss, which is the Brier | 
|---|
| 0:06:43 | loss. | 
|---|
| 0:06:44 | And, well, | 
|---|
| 0:06:47 | to | 
|---|
| 0:06:49 | explain why this is the case, I can refer to the Interspeech paper cited here, which is | 
|---|
| 0:06:53 | very interesting. | 
|---|
| 0:07:00 | So, | 
|---|
| 0:07:01 | here I am showing a picture of how these different loss functions look. | 
|---|
| 0:07:06 | The blue one is the logistic regression loss, which is | 
|---|
| 0:07:11 | convex, but that | 
|---|
| 0:07:14 | comes at a cost, | 
|---|
| 0:07:16 | because it can also be sensitive to outliers, | 
|---|
| 0:07:21 | for some | 
|---|
| 0:07:22 | maybe unusual trial. By the way, these curves show the loss | 
|---|
| 0:07:26 | for a target trial, for example. | 
|---|
| 0:07:29 | For the zero-one loss we have | 
|---|
| 0:07:33 | some cost here, and then past the threshold, which is here, there is | 
|---|
| 0:07:37 | no cost, | 
|---|
| 0:07:38 | whereas the logistic regression loss can be very large for some | 
|---|
| 0:07:43 | unusual point in our database, so | 
|---|
| 0:07:51 | our system may become very much adjusted to one | 
|---|
| 0:07:56 | such point. | 
|---|
| 0:07:57 | This is the Brier loss, and here is the zero-one loss, as I said, | 
|---|
| 0:08:02 | together with a couple of approximations that we will use later. | 
|---|
| 0:08:07 | We use this sigmoid approximation | 
|---|
| 0:08:10 | in order to do the optimization; it includes a parameter α that makes it | 
|---|
| 0:08:16 | more and more similar to the zero-one loss when you increase it, and we | 
|---|
| 0:08:20 | show it here for | 
|---|
| 0:08:22 | α equal to, | 
|---|
| 0:08:24 | well, one, ten and a hundred. | 
|---|
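As an illustration of the loss functions discussed here (the logistic regression loss, a prior-weighted Brier loss, and the sigmoid surrogate of the zero-one loss), a minimal sketch in terms of a trial's log-likelihood ratio could look like the following; the exact scalings, offsets and thresholds used in the paper may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_loss(llr, y, p_eff=0.5):
    """Prior-weighted logistic regression (Cllr-style) loss; y = +1 target, -1 non-target."""
    offset = np.log(p_eff / (1.0 - p_eff))
    return np.log1p(np.exp(-y * (llr + offset)))

def brier_loss(llr, y, p_eff=0.5):
    """Brier (squared-error) proper scoring rule on the posterior probability of a target."""
    p_tar = sigmoid(llr + np.log(p_eff / (1.0 - p_eff)))
    return np.where(y > 0, (1.0 - p_tar) ** 2, p_tar ** 2)

def sigmoid_01_loss(llr, y, alpha=10.0, threshold=0.0):
    """Smooth surrogate of the zero-one loss; approaches the hard step as alpha grows."""
    return sigmoid(-alpha * y * (llr - threshold))
```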
| 0:08:37 | So, | 
|---|
| 0:08:39 | there are a couple of problems, though. The real zero-one loss is not differentiable, | 
|---|
| 0:08:43 | which is why we use this sigmoid function. | 
|---|
| 0:08:47 | And | 
|---|
| 0:08:50 | both the Brier loss and the sigmoid loss are non-convex, so we | 
|---|
| 0:08:55 | use an approach here where we gradually increase the non-convexity, and | 
|---|
| 0:09:02 | for the sigmoid loss this means we start from the logistic regression model | 
|---|
| 0:09:06 | (we also tried starting from the ML model, but it is better to start from the logistic | 
|---|
| 0:09:10 | regression model) | 
|---|
| 0:09:12 | and then increase α gradually; there are other papers doing this for | 
|---|
| 0:09:17 | other applications. | 
|---|
| 0:09:20 | We do something similar for the Brier loss, where we start from the logistic regression | 
|---|
| 0:09:25 | model and then train with the Brier loss, moving towards the zero-one loss | 
|---|
| 0:09:30 | at the end. | 
|---|
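A minimal sketch of this warm-started, gradually sharpened optimization might look as follows; the function names and the use of SciPy's L-BFGS are assumptions (a BFGS-type optimizer is only mentioned later, in the question session), and `objective_and_grad` is a hypothetical callback implementing the weighted pairwise criterion plus regularization.

```python
import numpy as np
from scipy.optimize import minimize

def train_annealed(theta_lr, objective_and_grad, alphas=(1.0, 10.0, 100.0)):
    """Start from the logistic-regression solution theta_lr and re-optimize the
    sigmoid surrogate while its sharpness alpha is gradually increased.

    objective_and_grad(theta, alpha) -> (loss_value, gradient) is assumed to
    evaluate the weighted pairwise criterion plus the regularization term."""
    theta = np.asarray(theta_lr, dtype=float)
    for alpha in alphas:
        result = minimize(objective_and_grad, theta, args=(alpha,),
                          jac=True, method="L-BFGS-B")
        theta = result.x  # warm-start the next, sharper stage
    return theta
```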
| 0:09:37 | So, | 
|---|
| 0:09:39 | regarding the experiments, we use the main telephone trials, and | 
|---|
| 0:09:46 | we use | 
|---|
| 0:09:47 | a couple of different databases. As development set we use | 
|---|
| 0:09:51 | SRE 2006, which we use, for example, for choosing | 
|---|
| 0:09:54 | the regularization, | 
|---|
| 0:09:56 | and then we use SRE 2008 and SRE 2010, which are intended for testing, | 
|---|
| 0:10:02 | plus the standard datasets for PLDA training. | 
|---|
| 0:10:07 | And | 
|---|
| 0:10:09 | here you can see the number of i-vectors and speakers, with or without including SRE | 
|---|
| 0:10:14 | 2006, because we first used it as the development set but sometimes included it in | 
|---|
| 0:10:19 | the training set, after we had decided on the parameters, to get a little bit | 
|---|
| 0:10:23 | better performance. | 
|---|
| 0:10:26 | We conducted four experiments. | 
|---|
| 0:10:30 | I should also say that we target the operating point mentioned here, which | 
|---|
| 0:10:35 | has been the standard operating point in several NIST evaluations. | 
|---|
| 0:10:40 | And, | 
|---|
| 0:10:42 | regarding the four experiments: the first just considers a couple of different normalization and regularization | 
|---|
| 0:10:47 | techniques, because we were a bit unsure about what is best, although it is not really | 
|---|
| 0:10:52 | related to the topic of this paper. | 
|---|
| 0:10:56 | In the second experiment we just compare the different loss functions that were | 
|---|
| 0:11:00 | discussed. | 
|---|
| 0:11:01 | Then we also analyze the effect of calibration, and finally we try to | 
|---|
| 0:11:06 | investigate a little bit | 
|---|
| 0:11:08 | whether | 
|---|
| 0:11:09 | the choice of β according to the formula I showed before is actually suitable or | 
|---|
| 0:11:14 | not. | 
|---|
| 0:11:18 | So, for regularization there are two options that are popular, I guess, | 
|---|
| 0:11:24 | in this kind of | 
|---|
| 0:11:27 | work: we can do regularization towards zero, which would be | 
|---|
| 0:11:32 | the most common, or regularization towards ML, and with ML I mean the normal generatively trained model, | 
|---|
| 0:11:40 | because logistic regression is also, in that sense, a | 
|---|
| 0:11:44 | maximum-likelihood approach. | 
|---|
| 0:11:48 | We also compare whitening with the within-class covariance against whitening with the | 
|---|
| 0:11:54 | full, total covariance. | 
|---|
| 0:11:58 | We found that, in terms of minDCF and EER, | 
|---|
| 0:12:04 | whitening with | 
|---|
| 0:12:06 | the total covariance and regularization towards the maximum-likelihood model lead to better | 
|---|
| 0:12:12 | performance, so we use that in | 
|---|
| 0:12:14 | the remaining experiments. | 
|---|
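As a small illustration of the preprocessing option that won this comparison (whitening with the total covariance), a minimal sketch might be the following; the paper's full preprocessing pipeline (for example any length normalization) is not described in the talk.

```python
import numpy as np

def fit_total_cov_whitener(ivectors):
    """Estimate a whitening transform from the total covariance of the training
    i-vectors: subtract the global mean and decorrelate to unit variance."""
    mu = ivectors.mean(axis=0)
    cov = np.cov(ivectors - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # symmetric (ZCA-style) whitening matrix; assumes cov is well conditioned
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return mu, W

# apply to a new i-vector x as: x_whitened = (x - mu) @ W
```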
| 0:12:19 | so comparing loss functions | 
|---|
| 0:12:31 | Well, first we should say that the discriminative training schemes give better actual | 
|---|
| 0:12:36 | detection cost than standard maximum-likelihood training, but that is kind of expected, | 
|---|
| 0:12:42 | because | 
|---|
| 0:12:43 | they do calibration at the same time, | 
|---|
| 0:12:47 | however not perfect calibration, | 
|---|
| 0:12:50 | which we will discuss | 
|---|
| 0:12:52 | in a moment. | 
|---|
| 0:12:54 | And | 
|---|
| 0:12:56 | after calibration, the maximum-likelihood model is very competitive, | 
|---|
| 0:13:04 | but we can see some improvement from | 
|---|
| 0:13:08 | the application-specific loss functions compared to logistic regression in minimum detection cost and | 
|---|
| 0:13:15 | EER on one of the evaluation sets, | 
|---|
| 0:13:17 | while for the other SRE set there is no such gain. | 
|---|
| 0:13:24 | So, | 
|---|
| 0:13:29 | the standard maximum-likelihood model had a bit worse calibration, but | 
|---|
| 0:13:34 | since we can | 
|---|
| 0:13:36 | take care of that by doing calibration, | 
|---|
| 0:13:38 | we will, | 
|---|
| 0:13:40 | in order to have a fair comparison, | 
|---|
| 0:13:43 | also consider that here. | 
|---|
| 0:13:45 | So, to do that, we need to use some portion of | 
|---|
| 0:13:50 | the training data. As you can see here, we tried using fifty, seventy-five, ninety | 
|---|
| 0:13:54 | or ninety-five percent of the | 
|---|
| 0:13:57 | training data for PLDA training and the rest for calibration, | 
|---|
| 0:14:01 | and we use the Cllr loss here, which is essentially the same as | 
|---|
| 0:14:05 | logistic regression, | 
|---|
| 0:14:07 | at the operating point that we are targeting. | 
|---|
| 0:14:11 | In these experiments SRE 2006 is not included. | 
|---|
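A minimal sketch of such a calibration step (an affine transform of the scores trained with a prior-weighted logistic-regression, Cllr-style, loss at the effective prior of the target operating point) might look as follows; the function and variable names are illustrative, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def train_linear_calibration(scores, labels, p_eff):
    """Fit llr = a*score + b on held-out data.
    labels: +1 for target trials, -1 for non-target trials."""
    offset = np.log(p_eff / (1.0 - p_eff))
    w_tar = p_eff / np.sum(labels > 0)
    w_non = (1.0 - p_eff) / np.sum(labels < 0)
    weights = np.where(labels > 0, w_tar, w_non)

    def loss(params):
        a, b = params
        llr = a * scores + b
        # prior-weighted logistic loss, computed stably with logaddexp
        return np.sum(weights * np.logaddexp(0.0, -labels * (llr + offset)))

    a, b = minimize(loss, x0=np.array([1.0, 0.0]), method="Nelder-Mead").x
    return a, b
```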
| 0:14:15 | The results look like this, and the first thing to say is that | 
|---|
| 0:14:19 | applying the calibration model | 
|---|
| 0:14:23 | gives better results than discriminative training without calibration. | 
|---|
| 0:14:29 | The second thing is that | 
|---|
| 0:14:32 | the discriminative training here also | 
|---|
| 0:14:35 | benefited from calibration, which must be explained by | 
|---|
| 0:14:39 | the fact that we are using a regularization | 
|---|
| 0:14:43 | term. | 
|---|
| 0:14:52 | Overall, we can say that using seventy-five percent of the data for discriminative training | 
|---|
| 0:14:58 | and the rest for | 
|---|
| 0:15:00 | calibration was optimal. | 
|---|
| 0:15:09 | We also | 
|---|
| 0:15:13 | notice that the logistic regression loss | 
|---|
| 0:15:16 | performs quite badly with a very small amount of training data, that is, when using fifty | 
|---|
| 0:15:21 | percent, | 
|---|
| 0:15:22 | whereas the Brier loss and the zero-one loss | 
|---|
| 0:15:26 | perform better. | 
|---|
| 0:15:28 | But this is probably because | 
|---|
| 0:15:33 | those two loss functions, | 
|---|
| 0:15:35 | as you can see | 
|---|
| 0:15:38 | if I go back to this | 
|---|
| 0:15:42 | figure here, | 
|---|
| 0:15:43 | for example the zero-one loss, | 
|---|
| 0:15:46 | do not make so much use of a data point with a | 
|---|
| 0:15:50 | score like this, | 
|---|
| 0:15:51 | as the logistic regression loss would, | 
|---|
| 0:15:56 | which means that the logistic regression loss will use more of the data. | 
|---|
| 0:16:00 | So what happens here, | 
|---|
| 0:16:05 | I think, is that, | 
|---|
| 0:16:08 | since we do regularization towards the ML model, | 
|---|
| 0:16:11 | the zero-one loss simply does not | 
|---|
| 0:16:15 | change the model so much, so it stays close to the ML model, which | 
|---|
| 0:16:20 | apparently is a good thing when we use only a small part of the data. | 
|---|
| 0:16:26 | So, also, the | 
|---|
| 0:16:35 | choice of β | 
|---|
| 0:16:37 | we use | 
|---|
| 0:16:40 | is optimal only under the assumption that the | 
|---|
| 0:16:43 | trials in the database are all independent, which is still not the case, because we have | 
|---|
| 0:16:47 | made up the training data by | 
|---|
| 0:16:49 | pairing all the i-vectors. | 
|---|
| 0:16:52 | And also, of course, it assumes that the | 
|---|
| 0:16:56 | training database and the evaluation database have somewhat similar properties, which probably is | 
|---|
| 0:17:01 | also not really the case. | 
|---|
| 0:17:03 | So the optimal β could be different from | 
|---|
| 0:17:09 | the one given by the formula. | 
|---|
| 0:17:12 | So, this might look a bit strange, but basically we want to | 
|---|
| 0:17:15 | check a couple of different values for β, which means different values for the | 
|---|
| 0:17:21 | effective prior. | 
|---|
| 0:17:22 | So we make a kind of parameterization which makes sure | 
|---|
| 0:17:27 | that, | 
|---|
| 0:17:28 | with a parameter γ, when γ is zero point five | 
|---|
| 0:17:33 | we use the standard | 
|---|
| 0:17:36 | effective prior, | 
|---|
| 0:17:37 | calculated according to the formula; when γ is equal to one | 
|---|
| 0:17:41 | we will use | 
|---|
| 0:17:43 | an effective prior of one, which means we give all weight to the target trials; | 
|---|
| 0:17:48 | and when it is zero, | 
|---|
| 0:17:51 | the parameterization makes sure that we give all weight to the non-target trials. | 
|---|
| 0:17:58 | We use the Brier loss in this experiment, | 
|---|
| 0:18:01 | and we do not include | 
|---|
| 0:18:04 | SRE 2006. | 
|---|
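The exact parameterization is only on the slide; one simple mapping that has the stated endpoints (γ = 0 puts all weight on non-targets, γ = 0.5 gives the effective prior from the formula, γ = 1 puts all weight on targets) would be a piecewise-linear interpolation, though the paper may use a different one:

$$\tilde{P}_{\mathrm{eff}}(\gamma) = \begin{cases} 2\gamma\, P_{\mathrm{eff}} & 0 \le \gamma \le 0.5 \\ P_{\mathrm{eff}} + (2\gamma - 1)\,(1 - P_{\mathrm{eff}}) & 0.5 < \gamma \le 1. \end{cases}$$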
| 0:18:09 | So, this figure is a little bit interesting, I think. | 
|---|
| 0:18:14 | At first it seems like the choice matters much more for the actual detection cost | 
|---|
| 0:18:19 | than for the minimum detection cost, but remember that here we did not apply calibration, by the | 
|---|
| 0:18:24 | way. | 
|---|
| 0:18:25 | And | 
|---|
| 0:18:29 | it is clear that the best choice is not | 
|---|
| 0:18:32 | the one that was calculated with the formula shown before. | 
|---|
| 0:18:37 | That is very interesting, and maybe an | 
|---|
| 0:18:39 | area that should be explored more. | 
|---|
| 0:18:47 | I should probably also have said that | 
|---|
| 0:18:50 | the curve actually goes up a little bit here, which is quite noticeable, and I am | 
|---|
| 0:18:53 | not sure why, | 
|---|
| 0:18:58 | because that point is actually | 
|---|
| 0:19:01 | the, | 
|---|
| 0:19:03 | right, | 
|---|
| 0:19:05 | the prior, | 
|---|
| 0:19:06 | the effective prior according to the formula, which we used in the other experiments. | 
|---|
| 0:19:11 | But anyway, we can see that pattern mainly with regularization towards the ML model, | 
|---|
| 0:19:16 | which is the regularization we use. | 
|---|
| 0:19:23 | And a | 
|---|
| 0:19:27 | very interesting | 
|---|
| 0:19:30 | thing is that it seems that the minimum detection cost actually goes down a | 
|---|
| 0:19:35 | little bit here, | 
|---|
| 0:19:36 | which corresponds to | 
|---|
| 0:19:39 | the cases where, for the training data, we give | 
|---|
| 0:19:45 | all weight to target trials or all weight to non-target trials. | 
|---|
| 0:19:50 | I think the explanation for this result is that, | 
|---|
| 0:19:55 | well, | 
|---|
| 0:19:57 | this should really not work, but | 
|---|
| 0:20:00 | because we do regularization towards the ML model, we just end up very close to | 
|---|
| 0:20:05 | the ML model, and for such a system that can actually be something | 
|---|
| 0:20:08 | good. | 
|---|
| 0:20:09 | To check this, we also included | 
|---|
| 0:20:13 | the results for regularization towards zero, where we can see that this is not | 
|---|
| 0:20:17 | the case. | 
|---|
| 0:20:20 | So, in conclusion: | 
|---|
| 0:20:23 | we can see that sometimes we can improve the performance, but quite often there is not | 
|---|
| 0:20:28 | so much difference. | 
|---|
| 0:20:30 | We tried different optimization strategies, | 
|---|
| 0:20:33 | and | 
|---|
| 0:20:35 | what I should say about that is that starting from the, well, from the | 
|---|
| 0:20:38 | logistic regression model is important, so the starting point is important, but this kind of | 
|---|
| 0:20:43 | gradually increasing the | 
|---|
| 0:20:47 | non-convexity was not so | 
|---|
| 0:20:50 | effective, actually, though I will not discuss the details of that here. | 
|---|
| 0:20:57 | So the optimization is something to consider, and also, | 
|---|
| 0:21:01 | since it seems that the weights β have | 
|---|
| 0:21:06 | some kind of importance, and this is not a simple area, | 
|---|
| 0:21:10 | what we should probably do is consider a better estimate of them, | 
|---|
| 0:21:14 | maybe something that depends on other factors than just whether it is a target or | 
|---|
| 0:21:19 | non-target trial. | 
|---|
| 0:21:21 | And | 
|---|
| 0:21:22 | since | 
|---|
| 0:21:23 | also | 
|---|
| 0:21:24 | the discriminative training | 
|---|
| 0:21:28 | criteria, or rather the models trained with them, | 
|---|
| 0:21:32 | needed calibration, I think it could be interesting to | 
|---|
| 0:21:35 | make the regularization towards | 
|---|
| 0:21:42 | a parameter vector where we have built in the calibration parameters: we do calibration of | 
|---|
| 0:21:47 | the ML model and then, kind of, | 
|---|
| 0:21:50 | put in the parameters | 
|---|
| 0:21:54 | from the calibration into the regularization target, so that we actually do | 
|---|
| 0:21:59 | regularization towards something that is calibrated. | 
|---|
| 0:22:02 | Okay, so, | 
|---|
| 0:22:03 | something like that. | 
|---|
| 0:22:29 | May I ask: | 
|---|
| 0:22:31 | what optimizer did you use? We used the BFGS algorithm. | 
|---|
| 0:22:36 | Okay. | 
|---|
| 0:22:38 | So, | 
|---|
| 0:22:40 | you mentioned there were some issues with the non-convexity of your objective, so, | 
|---|
| 0:22:47 | as in the work that I presented, you know, | 
|---|
| 0:22:50 | this morning, | 
|---|
| 0:22:53 | I also had issues with non-convexity, | 
|---|
| 0:22:58 | and | 
|---|
| 0:22:59 | BFGS was a problem there, | 
|---|
| 0:23:02 | of course, basically because | 
|---|
| 0:23:04 | BFGS forms a rough approximation to the inverse of the Hessian, right, and | 
|---|
| 0:23:10 | if you do BFGS properly, that Hessian approximation is going to | 
|---|
| 0:23:15 | be positive definite. | 
|---|
| 0:23:18 | So | 
|---|
| 0:23:19 | BFGS cannot see | 
|---|
| 0:23:21 | the non-convexity, right, and then it cannot do anything about it. | 
|---|
| 0:23:25 | Yes, I think that is a good point, and we should consider some | 
|---|
| 0:23:30 | better optimization algorithm. | 
|---|
| 0:23:32 | We can confirm that we have reduced the value of the objective function | 
|---|
| 0:23:36 | quite significantly, but maybe | 
|---|
| 0:23:39 | we, | 
|---|
| 0:23:40 | well, we could have done much better in that respect by using something more suitable, I | 
|---|
| 0:23:44 | think, yes. | 
|---|
| 0:23:45 | In my case there was a simple solution: I could calculate the full Hessian and | 
|---|
| 0:23:50 | invert it without problems, because I had very few parameters, and I could do an eigenvalue | 
|---|
| 0:23:56 | analysis and then go down the steepest negative eigenvectors to get out of | 
|---|
| 0:24:02 | the non-convex regions, right. But for you the dimensionality is | 
|---|
| 0:24:05 | very high, so | 
|---|
| 0:24:08 | you could perhaps do some other things, but it is more tricky, I am afraid. | 
|---|
| 0:24:13 | but thank you | 
|---|
| 0:24:21 | which does | 
|---|
| 0:24:31 | i | 
|---|
| 0:25:04 | Okay, well, basically because we are not doing calibration, we are doing discriminative training of | 
|---|
| 0:25:09 | the PLDA model. | 
|---|
| 0:25:12 | okay | 
|---|