0:00:00 | okay |

0:00:01 | hi everyone |

0:00:04 | i work at the |

0:00:06 | Brno University of Technology, and i will be giving this tutorial |

0:00:11 | about |

0:00:13 | end-to-end speaker verification |

0:00:20 | so the topics |

0:00:22 | to discuss in this tutorial: we'll start with some background and the definition of end-to-end |

0:00:28 | training |

0:00:30 | and then discuss some alternative training procedures which are often used |

0:00:36 | and then talk about the motivation for end-to-end training |

0:00:40 | and continue with some difficulties of end-to-end training |

0:00:45 | and then |

0:00:46 | we will review some |

0:00:50 | existing work on end-to-end speaker recognition, though not in very great |

0:00:55 | detail |

0:00:57 | and then we will wrap up with some summary, outlook, and |

0:01:01 | questions |

0:01:03 | i would also like to give some acknowledgements and thanks to my colleagues from BUT |

0:01:10 | with whom i have |

0:01:11 | discussed these topics a lot |

0:01:15 | so let's start with recognition |

0:01:19 | this is |

0:01:21 | a kind of typical |

0:01:23 | pattern recognition scenario |

0:01:26 | to summarize, we assume we have some features x and some labels y |

0:01:30 | and we wish to find some function which is parameterized by |

0:01:35 | theta, let's say |

0:01:37 | and which |

0:01:38 | given the |

0:01:41 | features predicts |

0:01:42 | some label, or predicts the label |

0:01:46 | which should be close or equal to the true |

0:01:50 | label |

0:01:53 | to be more precise |

0:01:54 | we would like the prediction to be such that some loss function which compares |

0:02:00 | the |

0:02:00 | predicted label with the true label is as small as possible on unseen data |

0:02:06 | and the loss function, for example if we do, say, classification, can be |

0:02:11 | something that is |

0:02:13 | zero if the predicted label is the same as the true label and one |

0:02:16 | otherwise; this is a kind of error count, the |

0:02:18 | 0-1 loss |

0:02:23 | of course ideally what we want to do is to |

0:02:28 | minimize the expected loss on unseen test data, which we could calculate like this |

0:02:36 | and here we use capital x and y to denote that they are unseen random |

0:02:40 | variables |

0:02:41 | but since we don't know the probability distribution of |

0:02:44 | x and y we cannot do this |

0:02:46 | exactly or explicitly |

0:02:49 | so |

0:02:51 | in the supervised learning problem we have access to some training data, which would be |

0:02:56 | many examples of features and labels; we can compute on the whole set |

0:03:02 | and |

0:03:02 | check the average loss on the training data, and we are trying to minimize |

0:03:07 | that |

0:03:08 | and then we hope that this, |

0:03:11 | this procedure, means that we will also get a low loss on |

0:03:15 | unseen test data |

0:03:18 | and this is called empirical risk minimization |

0:03:21 | and it is expected to work as long as |

0:03:25 | the classifier that we use is not too powerful |

0:03:29 | or |

0:03:30 | to be precise, its VC dimension should be finite, and it |

0:03:34 | also requires that the distribution of the loss |

0:03:37 | needs |

0:03:38 | not to have a heavy tail, but for typical scenarios this |

0:03:42 | empirical risk minimization procedure is expected to work |
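As a concrete illustration of the empirical risk minimization idea just described, here is a small sketch. The one-dimensional threshold classifier and the toy data are invented for illustration; they are not part of the tutorial.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0 if the predicted label equals the true label, 1 otherwise."""
    return (y_pred != y_true).astype(float)

def empirical_risk(theta, x, y):
    """Average training loss of a hypothetical 1-D threshold classifier
    f_theta(x) = +1 if x > theta else -1."""
    y_pred = np.where(x > theta, 1, -1)
    return zero_one_loss(y_pred, y).mean()

# toy training set: class -1 centered at 0, class +1 centered at 2
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.concatenate([-np.ones(100), np.ones(100)])

# "training" = pick the threshold with the lowest average training loss,
# hoping (per ERM) that it also gives a low loss on unseen data
thetas = np.linspace(-2.0, 4.0, 200)
best_theta = thetas[np.argmin([empirical_risk(t, x, y) for t in thetas])]
```

The minimizer of the training loss stands in for the model we would deploy; ERM's promise is only that its training loss tracks the expected loss.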

0:03:49 | so then let's talk about speaker recognition |

0:03:53 | as probably most |

0:03:55 | in the audience here know, we have these three sub-tasks of speaker recognition |

0:04:01 | first, speaker identification |

0:04:04 | which basically is closed-set classification over speakers, so this is a very |

0:04:09 | standard |

0:04:11 | pattern recognition |

0:04:14 | scenario. and then we have speaker verification, where we deal with an |

0:04:17 | open set, as we say |

0:04:19 | so the speakers that we may see in testing |

0:04:22 | are not the same as we have access to in training when building the model |

0:04:27 | and our task is typically to say whether two segments or utterances are from the same |

0:04:32 | speaker or not |

0:04:34 | and then there's also speaker diarization, which is |

0:04:38 | to assign, basically, in a long recording, each time region |

0:04:43 | to a speaker |

0:04:47 | so here i will focus on speaker verification, because the speaker identification task is |

0:04:53 | quite easy, at least conceptually |

0:04:57 | and speaker diarization is harder, and end-to-end approaches are still at a very early stage |

0:05:03 | so although some great |

0:05:05 | work has been done |

0:05:07 | it's maybe too early to focus on that in a tutorial |

0:05:14 | so |

0:05:15 | generally |

0:05:17 | it's |

0:05:19 | preferable |

0:05:20 | if a classifier |

0:05:22 | outputs |

0:05:23 | not a hard prediction, like "it is this class" or "it is not this class", but |

0:05:28 | rather probabilities of the different classes |

0:05:31 | so we would like some |

0:05:34 | classifier that gives us an estimate of the probability of some label given the data |

0:05:39 | in the case of speaker verification we would rather prefer it to output the log-likelihood |

0:05:45 | ratio |

0:05:46 | because from that we can |

0:05:49 | get |

0:05:51 | the probability of a class given the data; the classes here are just target versus |

0:05:55 | non-target |

0:05:57 | and we can |

0:05:59 | do this based on a specified prior probability |

0:06:03 | so it gives us a bit more flexibility in how to use the |

0:06:07 | system |
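To make that last point concrete, here is a minimal sketch of combining a log-likelihood-ratio score with different priors via Bayes' rule; the score and prior values are made-up numbers.

```python
import math

def posterior_target(llr, p_target):
    """P(target | data) from an LLR score and a prior, via Bayes' rule in log-odds form."""
    prior_log_odds = math.log(p_target / (1.0 - p_target))
    return 1.0 / (1.0 + math.exp(-(llr + prior_log_odds)))

# the same score can serve different operating points:
p_even = posterior_target(2.0, p_target=0.5)   # flat prior
p_rare = posterior_target(2.0, p_target=0.01)  # application where targets are rare
```

This is the flexibility being described: a single LLR output can be reused under any specified prior, which a hard decision or a posterior tied to one fixed prior cannot.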

0:06:11 | so |

0:06:13 | let's now talk about end-to-end training |

0:06:16 | and my impression is that it's not completely well defined in the literature |

0:06:23 | but it seems to entail |

0:06:26 | these two |

0:06:29 | aspects |

0:06:30 | first, all parameters of the system |

0:06:34 | should be trained jointly, and that could be anything from feature extraction, to producing some |

0:06:38 | speaker embedding |

0:06:40 | to the backend, the comparison of speaker embeddings, and producing the score |

0:06:46 | the second aspect is that |

0:06:48 | the system should be trained specifically for the |

0:06:51 | intended task, which in our case would be verification |

0:06:58 | one could be even more strict and say that it should match the exact evaluation metric |

0:07:02 | that we are interested in, for example the error rate |

0:07:06 | so |

0:07:07 | in this tutorial i will try to |

0:07:11 | discuss |

0:07:17 | how |

0:07:19 | important |

0:07:20 | these criteria are, or how hard it can be |

0:07:24 | to impose these criteria, or what it means if we don't do it |

0:07:31 | so |

0:07:33 | first |

0:07:33 | let's look at what a |

0:07:37 | typical end-to-end speaker verification architecture |

0:07:41 | would look like |

0:07:43 | as far as i know, this was first attempted for speaker verification in two |

0:07:47 | thousand sixteen |

0:07:49 | in the paper mentioned here |

0:07:53 | and |

0:07:54 | the model works like this: we start with some |

0:07:57 | enrollment utterances, |

0:07:59 | here three, and we have some test utterance |

0:08:02 | all of these go through some embedding-extracting neural network |

0:08:06 | which can use many different architectures |

0:08:09 | this produces embeddings, which are fixed-size |

0:08:12 | so |

0:08:14 | utterance representations |

0:08:16 | one for each utterance, so here three enrollment embeddings and one test embedding |

0:08:22 | and then we will create one enrollment model by some kind of pooling, for |

0:08:26 | example taking the mean |

0:08:28 | of the enrollment embeddings |

0:08:31 | and then we have some similarity measure, and in the end |

0:08:35 | a score comes out that gives |

0:08:38 | the log-likelihood ratio for |

0:08:41 | the hypothesis that this |

0:08:43 | test segment |

0:08:44 | is from the same speaker as these enrollment segments |

0:08:49 | and |

0:08:50 | all of these parts of the model should be |

0:08:55 | trained |

0:08:56 | jointly |
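Structurally, the scoring path just described might be sketched as below. Everything here is a stand-in: the "network" is a single random projection with temporal mean pooling, and the similarity is a plain cosine score rather than a trained LLR backend.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 40))  # hypothetical projection: 40-dim features -> 64-dim embedding

def embed(utterance_feats):
    """Map a variable-length (T, 40) feature sequence to a fixed-size embedding."""
    frame_emb = utterance_feats @ W.T   # per-frame transformation (stands in for a deep net)
    return frame_emb.mean(axis=0)       # temporal pooling -> fixed-size representation

def score(enroll_utts, test_utt):
    """Pool enrollment embeddings into one model, then compare with the test embedding."""
    model = np.mean([embed(u) for u in enroll_utts], axis=0)  # enrollment pooling
    test = embed(test_utt)
    return model @ test / (np.linalg.norm(model) * np.linalg.norm(test))  # cosine similarity

# three enrollment utterances of random length, one test utterance
enroll = [rng.normal(size=(int(rng.integers(50, 100)), 40)) for _ in range(3)]
s = score(enroll, rng.normal(size=(80, 40)))
```

In an actual end-to-end system every stage of this path, from the frame transformation to the final scoring, would carry trainable parameters optimized jointly.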

0:09:02 | to be fair, and maybe of historical interest, we should say that |

0:09:08 | this is |

0:09:10 | not a |

0:09:12 | new idea |

0:09:15 | it existed already in nineteen ninety three, maybe the earliest instance i'm aware of |

0:09:21 | at least |

0:09:22 | one paper at the time was about |

0:09:26 | handwritten signature recognition and another paper was about fingerprint recognition |

0:09:33 | but they used exactly this idea |

0:09:39 | and |

0:09:40 | okay, so we talked about end-to-end |

0:09:43 | training and modeling |

0:09:46 | so what would be the alternative |

0:09:49 | one thing would be |

0:09:51 | generative modeling, so we train a generative model |

0:09:54 | that |

0:09:55 | means a model that can generate the data, both the observations x and the |

0:10:02 | labels y, and |

0:10:08 | it can also give us |

0:10:10 | the probability, or probability density, of such observations |

0:10:16 | we typically train with maximum likelihood, and if the model is correctly specified, for example |

0:10:22 | if the data really comes from a normal distribution and we have assumed that |

0:10:26 | in our model, then |

0:10:29 | with enough training data we will find the correct parameters of the |

0:10:33 | distribution |

0:10:35 | and it may be worth pointing out that |

0:10:37 | the LLRs from such a model are the best |

0:10:40 | we can have |

0:10:43 | so if we have access to the log-likelihood ratios from |

0:10:47 | the model that really generated the data |

0:10:51 | then we can make the optimal decision for classification or verification, and |

0:10:56 | no |

0:10:57 | other |

0:10:58 | classifier can do better |

0:11:04 | the problem with this is that when the |

0:11:07 | model |

0:11:09 | assumptions are not correct, then the parameters we find with maximum likelihood may not be |

0:11:14 | optimal for classification |

0:11:17 | and sometimes maximum likelihood training is also difficult |

0:11:25 | other approaches would be some type of discriminative training; end-to-end training can be |

0:11:30 | seen as one type of discriminative training, but in other discriminative |

0:11:36 | approaches we can try to train the neural network, i.e. the embedding extractor, for speaker |

0:11:41 | identification, which seems to be the most |

0:11:45 | popular approach right now |

0:11:48 | and then we use the output of some intermediate layer as the embedding and train |

0:11:54 | another |

0:11:55 | backend on top of that |

0:11:58 | then there is this school of metric learning, where |

0:12:07 | we kind of train the embedding extractor together with a distance metric, which sometimes can |

0:12:12 | be simple |

0:12:14 | so in principle the embedding and the kind of distance metric or backend are |

0:12:19 | trained jointly |

0:12:21 | but typically not for the speaker verification task |

0:12:24 | so this is kind of end-to-end training according to the first criterion but not |

0:12:28 | according to the second |

0:12:32 | so |

0:12:34 | now |

0:12:36 | we will |

0:12:38 | discuss |

0:12:40 | why end-to-end training would be preferable |

0:12:44 | so |

0:12:45 | we had two things: one is that we should train the modules jointly, and the other |

0:12:48 | thing is that we should train for the |

0:12:50 | intended task |

0:12:52 | so |

0:12:56 | the case for joint training is actually quite obvious, so let's consider a |

0:13:01 | system consisting of two modules a and b, and we have theta a, which |

0:13:05 | are the parameters of module a, and theta b, which are the |

0:13:08 | parameters of module b. what would happen if we just first train module a and then |

0:13:14 | module b |

0:13:15 | it is essentially like doing |

0:13:18 | one iteration of |

0:13:20 | coordinate descent, or block coordinate descent |

0:13:22 | so we train module |

0:13:24 | a |

0:13:25 | and we get here; we train module b, we get here |

0:13:29 | but we will not get further than that, not to the optimum, which would be here |

0:13:34 | so of course we could continue |

0:13:38 | to |

0:13:39 | do a few more iterations |

0:13:40 | and we might end up in the |

0:13:43 | optimum, and this is actually, in principle, equivalent to a joint optimization |

0:13:51 | but when we have a non-convex model, as usual, we may not actually |

0:13:55 | reach the same |

0:13:57 | optimum as if we had trained |

0:14:00 | all the parameters in one go; what happens also depends on which optimizer |

0:14:05 | we use, so |

0:14:06 | in principle |

0:14:08 | this is |

0:14:12 | why joint training would |

0:14:16 | really make sure that you find the optimum |

0:14:19 | of both |

0:14:20 | modules, and that's clearly better than just training |

0:14:25 | first one and then the other |

0:14:28 | so i think there is really no argument here: |

0:14:31 | this part of end-to-end training is justified, |

0:14:36 | the joint training of all modules |
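The block-coordinate-descent picture can be mimicked on a toy convex objective f(a, b) = a^2 + b^2 + a*b, whose joint optimum is at (0, 0); training module a once and then module b once corresponds to a single alternating pass. This example is invented for illustration.

```python
def f(a, b):
    """Toy joint objective over the parameters of two 'modules'."""
    return a * a + b * b + a * b

def coordinate_descent(a, b, n_passes):
    """Alternately minimize f exactly over a (b fixed), then over b (a fixed)."""
    for _ in range(n_passes):
        a = -b / 2.0   # argmin over a with b fixed
        b = -a / 2.0   # argmin over b with a fixed
    return a, b

one_pass = coordinate_descent(1.0, 1.0, 1)   # "train module a, then module b" once
many = coordinate_descent(1.0, 1.0, 50)      # keep alternating
```

One pass ends at (-0.5, 0.25), which is not the optimum; repeated alternation converges to (0, 0). This matches the claim that iterated module-wise training approaches joint optimization for convex problems, while with non-convex models the two need not coincide.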

0:14:42 | the task-specific training, the idea that we should train for |

0:14:48 | the |

0:14:51 | intended task: so if in |

0:14:55 | our application we want to do speaker verification, why should we train for verification |

0:15:00 | and not for identification, for example |

0:15:04 | well |

0:15:05 | first we should say that |

0:15:10 | we have some guarantee that this idea of minimizing the loss on training data |

0:15:14 | will lead to good performance on test data, the empirical risk minimization idea |

0:15:20 | but the guarantee we have there |

0:15:26 | only holds if we are training for |

0:15:30 | the metric that we are interested in, for the task that we are interested in |

0:15:35 | if we |

0:15:36 | train for one task and |

0:15:39 | evaluate |

0:15:40 | on another, then we don't really have any guarantee that |

0:15:45 | we will find the optimal model parameters for that task |

0:15:49 | but one can of course ask: shouldn't it really work anyway, training for |

0:15:54 | identification |

0:15:55 | and using the model for verification, since they are kind of similar tasks |

0:16:00 | it does, as we know |

0:16:02 | but let's just discuss a little bit what could |

0:16:05 | go wrong |

0:16:08 | or why it wouldn't be optimal |

0:16:16 | so here is a kind of toy example |

0:16:20 | we are looking at one-dimensional embeddings, so imagine that these curves |

0:16:25 | are the distributions of one-dimensional embeddings |

0:16:31 | so the embedding space is here, and each of these colors represents the |

0:16:38 | distribution of embeddings for some speaker: blue is one speaker, green is |

0:16:42 | another speaker, and so on |

0:16:46 | of course this is a little bit artificial in that the |

0:16:49 | shapes of the distributions i show are all alike, kind of for simplicity |

0:16:54 | so in this example we assume that the means of the |

0:16:59 | speakers are equidistant, with an equal distance like this |

0:17:06 | so |

0:17:09 | what would be the identification error in this case |

0:17:13 | so whenever we observe an embedding we will assign it to the closest speaker |

0:17:19 | so |

0:17:20 | if we |

0:17:22 | observe an embedding in this region we will assign it to that speaker |

0:17:26 | and if we observe it here |

0:17:28 | we will assign it to this, |

0:17:31 | this |

0:17:32 | green |

0:17:34 | speaker |

0:17:36 | and of course it means that sometimes it will be the blue speaker: |

0:17:43 | something sampled from the blue speaker will land here, but we will assign it |

0:17:47 | to the |

0:17:48 | green |

0:17:48 | speaker's area |

0:17:50 | so we will have some errors in this situation |

0:17:54 | and |

0:17:55 | if we consider only the neighboring speakers, the error rate will be |

0:17:59 | twelve point two percent in this example |

0:18:10 | what would be the verification error rate |

0:18:14 | so |

0:18:15 | if we consider |

0:18:16 | this type of data |

0:18:18 | so |

0:18:19 | we will assume that we |

0:18:21 | have speakers |

0:18:23 | whose means are equidistantly distributed |

0:18:26 | like |

0:18:27 | these stars |

0:18:29 | and |

0:18:30 | now for a target trial we will sample |

0:18:35 | two embeddings from one speaker |

0:18:37 | and see if they are closer to each other than some threshold |

0:18:41 | which here happens to be the optimal threshold for this distribution |

0:18:46 | and if |

0:18:48 | they are farther apart than the threshold, we decide that it |

0:18:54 | is |

0:18:56 | a |

0:18:56 | non-target trial, |

0:18:58 | so we make an error |

0:19:02 | in that case |

0:19:03 | and |

0:19:05 | the other way around for non-target trials |

0:19:08 | so |

0:19:11 | here in this image we can see |

0:19:14 | we would have an error rate of fourteen percent |

0:19:17 | again i'm only actually considering non-target trials from neighboring speakers |

0:19:26 | that's why the error rate is this high |

0:19:33 | so |

0:19:34 | now |

0:19:35 | i'm only changing the distributions a little bit, |

0:19:39 | the within-speaker distributions, so |

0:19:43 | as before |

0:19:45 | the speaker means are at the same distance |

0:19:48 | like this |

0:19:50 | and |

0:19:50 | we have made the within-speaker distribution a little bit more narrow here, and a little |

0:19:55 | bit more broad here |

0:19:57 | the overall within-speaker variance is the same, but with a little bit different |

0:20:01 | shape |

0:20:02 | and we will see that the identification error has increased to thirteen point seven percent |

0:20:09 | whereas the verification error is the same |

0:20:15 | well, |

0:20:16 | in a more extreme situation, we have made |

0:20:19 | the distributions equally peaky, or broad: |

0:20:23 | these two mixtures |

0:20:26 | now the speaker means are all at the same distance |

0:20:31 | like this |

0:20:32 | and the within-speaker variance is, |

0:20:35 | well, the within-speaker variance is also the same as before |

0:20:40 | and here we would actually get |

0:20:42 | zero |

0:20:44 | identification error |

0:20:46 | but we would have worse |

0:20:48 | verification error than in any of the other examples, and that's because |

0:20:53 | if you sample a target trial you will very often have |

0:20:57 | embeddings that are far from each other, and similarly |

0:21:01 | for non-target trials you will very often have embeddings that are close to each other |

0:21:07 | so this |

0:21:08 | example |

0:21:10 | should illustrate that |

0:21:14 | the within-speaker distribution that is optimal for identification is not |

0:21:20 | necessarily the distribution that is optimal for verification |
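The two error types in these toy examples can be estimated by simulation. The spacing, sigma, and threshold below are arbitrary illustrative choices, not the numbers behind the 12.2% and 14% figures on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
spacing, sigma, n = 4.0, 1.5, 100_000

# identification: a sample from the speaker with mean 0, assigned to the
# nearest of the neighboring means at -spacing, 0, +spacing
emb = rng.normal(0.0, sigma, n)
id_error = np.mean(np.abs(emb) > spacing / 2)

# verification: target = two samples of one speaker, non-target = samples of
# two neighboring speakers; accept when the distance is below a threshold
tar_dist = np.abs(rng.normal(0.0, sigma, n) - rng.normal(0.0, sigma, n))
non_dist = np.abs(rng.normal(0.0, sigma, n) - rng.normal(spacing, sigma, n))
threshold = spacing / 2
ver_error = 0.5 * (np.mean(tar_dist > threshold) + np.mean(non_dist <= threshold))
```

Changing the shape of the within-speaker distribution while holding its variance fixed moves these two numbers independently of each other, which is the point of the examples above.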

0:21:27 | okay, so |

0:21:29 | as another example |

0:21:31 | let us consider the triplet loss, which is another popular |

0:21:38 | loss |

0:21:38 | function |

0:21:42 | so it looks like this: in |

0:21:44 | each training example you have |

0:21:48 | an embedding for some speaker, which we call the anchor embedding |

0:21:52 | and then you have an embedding from the same speaker, which we call the positive |

0:21:55 | example, and an embedding from another speaker, which we call the |

0:21:59 | negative example |

0:22:00 | and basically we want the distance between the anchor and the positive example to be |

0:22:06 | small |

0:22:07 | and the distance between the anchor and the negative example |

0:22:11 | to be big |

0:22:14 | so |

0:22:15 | if this distance is bigger than |

0:22:18 | this distance plus some margin |

0:22:20 | then the loss is gonna be zero |
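The triplet loss as just described, in a few lines; the toy embeddings are invented for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin), with Euclidean d."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.5, 0.0])   # positive: same speaker, nearby
n = np.array([3.0, 0.0])   # negative: another speaker, far away
loss_ok = triplet_loss(a, p, n)    # negative is far enough beyond the margin: zero loss
loss_bad = triplet_loss(a, n, p)   # roles swapped: ordering violated, positive loss
```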

0:22:26 | however |

0:22:27 | this is not |

0:22:29 | ideal, not an ideal criterion, for speaker verification, and to show this i have a |

0:22:34 | rather complicated figure here that illustrates |

0:22:40 | three speakers |

0:22:41 | and the embeddings of the three speakers in a |

0:22:45 | two-dimensional space |

0:22:47 | so we have |

0:22:48 | speaker a |

0:22:50 | with embeddings |

0:22:52 | distributed in this area |

0:22:55 | speaker b with embeddings in this area, and speaker c with embeddings in |

0:22:59 | this area |

0:23:01 | and |

0:23:03 | if |

0:23:04 | we are using some anchor from speaker a, the worst case would be |

0:23:08 | to have it here on the border |

0:23:10 | and then the biggest distance to a positive example would be to have it |

0:23:15 | here on the other side |

0:23:17 | and the smallest distance to a negative example would be to take |

0:23:21 | something here |

0:23:23 | so simply we want this |

0:23:27 | distance to the positive example |

0:23:30 | here, plus some margin, to be smaller than the distance from the |

0:23:35 | negative example to the anchor |

0:23:37 | so it's okay |

0:23:38 | in this situation |

0:23:41 | consider then speaker c, which has a |

0:23:46 | wider |

0:23:46 | distribution of data. now if we have an anchor here |

0:23:51 | we need |

0:23:52 | the |

0:23:53 | distance to the next speaker, the closest speaker, to be |

0:23:58 | bigger than the internal distance |

0:24:00 | plus some margin |

0:24:02 | so |

0:24:03 | and that's the case in this figure, so the triplet loss is completely fine with |

0:24:08 | this situation |

0:24:10 | but if we want to |

0:24:12 | do |

0:24:13 | verification on data that is distributed in this way, then we should |

0:24:19 | accept, |

0:24:21 | well, if we want to have good |

0:24:24 | performance on target trials from speaker c |

0:24:27 | we need to accept |

0:24:30 | trials as target trials whenever we have a smaller distance than this, otherwise we will |

0:24:34 | have some errors for target trials of speaker c |

0:24:38 | but this means that if we have a threshold like this, here we will, |

0:24:42 | we will get confusion between |

0:24:45 | speakers a and b |

0:24:47 | so |

0:24:48 | this, |

0:24:49 | again, of course there could be ways to compensate for this in one way or another, but |

0:24:53 | it's just to show that blindly optimizing |

0:24:55 | this |

0:24:57 | metric is not |

0:24:58 | gonna lead to optimal |

0:25:00 | performance for |

0:25:03 | verification |
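The situation in the figure can be reproduced with made-up one-dimensional embeddings: two compact speakers a and b close together, and a speaker c with a wide within-speaker spread. Every triplet satisfies the margin, yet no single distance threshold separates target from non-target trials.

```python
from itertools import product

# hypothetical 1-D embeddings; speaker c has a much wider within-speaker spread
speakers = {"a": [0.0, 1.0], "b": [3.0, 4.0], "c": [20.0, 26.0]}
margin = 0.5

# every possible (anchor, positive, negative) triplet has zero loss
max_loss = 0.0
for spk, embs in speakers.items():
    for anchor, positive in product(embs, embs):
        if anchor == positive:
            continue
        for other, negatives in speakers.items():
            if other == spk:
                continue
            for negative in negatives:
                loss = max(0.0, abs(anchor - positive) - abs(anchor - negative) + margin)
                max_loss = max(max_loss, loss)

# ... yet the widest target distance exceeds the smallest non-target distance,
# so no threshold yields error-free verification
target_d = [abs(e1 - e2) for embs in speakers.values()
            for e1, e2 in product(embs, embs) if e1 != e2]
nontarget_d = [abs(e1 - e2) for s1, s2 in product(speakers, speakers) if s1 != s2
               for e1 in speakers[s1] for e2 in speakers[s2]]
```

Accepting the within-c target pair (distance 6) forces accepting the a-b non-target pairs at distances 2 to 4, which is exactly the confusion described above.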

0:25:06 | so if we try to summarize a little bit the idea of task-specific |

0:25:10 | training |

0:25:12 | minimizing identification error won't necessarily minimize verification error |

0:25:18 | but of course i was showing this on kind of toy examples, and reality |

0:25:22 | is much more complicated |

0:25:24 | we |

0:25:25 | usually don't optimize the classification error but rather the cross-entropy |

0:25:29 | or something like that |

0:25:31 | and we may use some loss to encourage margin |

0:25:36 | between the speaker embeddings |

0:25:39 | and maybe the assumptions that i made about the |

0:25:42 | distributions here are |

0:25:44 | not all that realistic |

0:25:48 | and |

0:25:53 | so it is maybe not completely clear |

0:25:56 | what would happen with new test speakers that were not in the training set, and so |

0:26:00 | on |

0:26:01 | so what i want to say is that this should not be interpreted as |

0:26:05 | some kind of proof that other objectives would fail; maybe they would even be |

0:26:09 | really good |

0:26:11 | but |

0:26:12 | just that it's not really |

0:26:17 | completely justified to use them |

0:26:20 | and this is of course something that ideally should be studied much more |

0:26:24 | in the future |

0:26:27 | but, |

0:26:31 | so, we discussed that end-to-end training has some good motivation |

0:26:39 | but still it's not really the most popular strategy for building speaker recognition systems today |

0:26:46 | at least my impression is that multiclass training is |

0:26:50 | still the most popular |

0:26:52 | and |

0:26:54 | so |

0:26:55 | why is that? well, there are many difficulties with end-to-end training |

0:26:59 | it seems |

0:27:01 | that |

0:27:02 | it is more prone to overfitting |

0:27:05 | we have issues with statistical dependence of the training |

0:27:08 | trials, which we will go into in more detail in |

0:27:12 | a few slides |

0:27:15 | and |

0:27:16 | it is also maybe questionable how the system should be trained |

0:27:21 | when we want to |

0:27:23 | use many enrollment utterances; that will also be mentioned a |

0:27:28 | bit |

0:27:30 | later on |

0:27:30 | so |

0:27:35 | the issue |

0:27:36 | one of the issues with using a cane of verification objective let's call it that |

0:27:41 | when we are comparing draw |

0:27:43 | two utterances and wondered say whether it's the same speaker or not |

0:27:48 | is that |

0:27:51 | the day that |

0:27:52 | we e |

0:27:54 | statistical independence i mean same y |

0:27:57 | well you know minutes about |

0:27:59 | so this, |

0:28:01 | generally this idea of training, of minimizing some training loss, assumes that |

0:28:07 | the training data |

0:28:09 | are independent samples from whatever distribution they come from |

0:28:14 | and this is often the case, i mean, when we have data that has been independently |

0:28:19 | collected |

0:28:21 | but |

0:28:21 | in speaker verification |

0:28:23 | the data point, |

0:28:25 | x, |

0:28:26 | the observation, |

0:28:27 | is |

0:28:28 | a pair of utterances, an enrollment utterance and a test utterance, and the label |

0:28:34 | indicates whether it's a target trial or a non-target trial |

0:28:38 | so for notation i will use |

0:28:41 | y equal one for target trials and y equal minus one for non-target trials |

0:28:46 | the issue here is that |

0:28:49 | typically, at least if we have a limited amount of training data |

0:28:53 | we create |

0:28:54 | many trials |

0:28:56 | from the same speakers and from the same utterances; each of the speakers and utterances |

0:29:01 | is used in many different trials, and then this |

0:29:05 | data is not, |

0:29:06 | these trials, which are the training data, |

0:29:10 | are not |

0:29:12 | statistically independent |

0:29:14 | which is something that the training procedure assumes they are |

0:29:19 | so |

0:29:22 | this can be a problem; exactly how big the problem is, |

0:29:25 | i think, is still something that needs to be investigated more, but let's elaborate a little |

0:29:30 | bit on what happens |

0:29:35 | so |

0:29:38 | here i wrote down the training objective that we would use for a |

0:29:43 | kind of verification loss, when we train the system end-to-end for verification |

0:29:48 | so it looks |

0:29:49 | complicated maybe, but it's not really anything special: it is just the average training loss |

0:29:55 | of |

0:29:56 | target trials here and the average training loss of |

0:30:00 | non-target trials here, and they are weighted with factors, |

0:30:05 | the probability of target trials and the probability of non-target trials, which are |

0:30:10 | parameters that we use to |

0:30:14 | steer the system to fit |

0:30:15 | better the application that we are interested in |

0:30:19 | and again |

0:30:22 | what we hope is that this will minimize the expected loss |

0:30:26 | of |

0:30:28 | target trials and non-target trials |

0:30:32 | weighted with these |

0:30:33 | prior |

0:30:34 | probabilities of target trials and non-target trials |

0:30:38 | on some unseen data |

0:30:40 | this loss function here is often the cross-entropy, but it could be other things |
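A sketch of this weighted objective with the cross-entropy (logistic) loss applied to LLR scores. The synthetic Gaussian scores stand in for real system outputs; the prior-weighted form follows the description above.

```python
import numpy as np

def weighted_verification_loss(tar_scores, non_scores, p_target):
    """p_tar * average target loss + (1 - p_tar) * average non-target loss."""
    tau = np.log(p_target / (1.0 - p_target))         # prior log odds
    tar_loss = np.log1p(np.exp(-(tar_scores + tau)))  # penalizes low target scores
    non_loss = np.log1p(np.exp(non_scores + tau))     # penalizes high non-target scores
    return p_target * tar_loss.mean() + (1.0 - p_target) * non_loss.mean()

rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 1000)    # synthetic LLRs for target trials
non = rng.normal(-2.0, 1.0, 1000)   # synthetic LLRs for non-target trials
loss = weighted_verification_loss(tar, non, p_target=0.5)
```

Changing `p_target` re-weights the two trial types and so steers the trained system toward the intended operating point.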

0:30:49 | so what are the desirable properties of a training objective |

0:30:56 | so |

0:30:57 | here |

0:30:59 | we have |

0:31:00 | r hat, which is the |

0:31:03 | empirical, that is, the training, loss |

0:31:06 | and |

0:31:07 | since the training data |

0:31:09 | can be assumed to be generated from some probability distribution, this r hat is also |

0:31:14 | a random variable |

0:31:18 | and we want it |

0:31:20 | to be close |

0:31:21 | to the |

0:31:23 | expected |

0:31:24 | loss |

0:31:29 | where the expectation is calculated according to the true probability distribution of the data |

0:31:35 | and for every value of |

0:31:37 | theta, because |

0:31:39 | in that case, |

0:31:43 | and |

0:31:46 | if |

0:31:47 | the expected loss is this black line here |

0:31:54 | then, |

0:31:57 | well, |

0:31:59 | let's say we |

0:32:02 | have some training set, the blue one |

0:32:02 | and we check the average loss as a function of theta |

0:32:06 | it may look like this |

0:32:09 | for another training set it may look like this, the red line, and a third one |

0:32:13 | would be |

0:32:14 | the purple one. so the point is that it's a little bit random, and |

0:32:17 | it's not gonna be exactly like the expected loss |

0:32:22 | but ideally it should be close to it, because if we find a theta |

0:32:26 | that minimizes the training loss, for example here in the case of the |

0:32:29 | red training set |

0:32:31 | then |

0:32:32 | we know that it will also be a good value for the |

0:32:38 | expected loss, which means the loss on unseen test data |

0:32:43 | so we want |

0:32:46 | the |

0:32:47 | training loss, |

0:32:48 | as a function of the model parameters, |

0:32:53 | to be close to the expected loss for all values of the |

0:32:57 | parameters |

0:33:02 | so |

0:33:03 | in order to study the effect of |

0:33:08 | statistical dependencies in the training data in this context |

0:33:11 | we |

0:33:12 | write the |

0:33:14 | training objective slightly more generally than before |

0:33:19 | so |

0:33:20 | it is the same as before, except that for each trial |

0:33:23 | we have a weight, beta |

0:33:25 | and if we set beta to one over n then it would be the |

0:33:30 | same as before, but now we consider that we can choose some other values of |

0:33:35 | these |

0:33:37 | trial weights |

0:33:38 | for the training data, |

0:33:39 | the training trials |

0:33:44 | we want |

0:33:45 | the training objective, so the average training loss, to have an expected value which is the |

0:33:52 | same as the expected value |

0:33:56 | of the loss on test data, so it should be an unbiased estimator of |

0:34:03 | the test loss, or the expected loss |

0:34:07 | and we also want it to be good in the sense that it has |

0:34:10 | a small variance |

0:34:18 | well, the expected value of the training loss is just calculated like this, so we |

0:34:23 | end up with the expected value of the loss |

0:34:26 | and this is exactly |

0:34:28 | what we usually denote r |

0:34:30 | so in order for this to be |

0:34:32 | unbiased we simply need the sum of the weights to be one |

0:34:39 | and of course this is the case when we use the standard choice of |

0:34:45 | beta, which is one over n, the number of |

0:34:48 | trials |

0:34:49 | in the training data |

0:34:53 | the variance |

0:34:55 | of this empirical loss |

0:34:58 | is gonna look like this |

0:34:59 | it's the |

0:35:00 | weight vector for all the trials, |

0:35:03 | times a matrix, |

0:35:06 | times the weight vector |

0:35:09 | and this matrix is the covariance matrix of the losses of all trials with the |

0:35:14 | same label; this little t is either one for the target trials or |

0:35:18 | minus one for the non-target trials |

0:35:21 | and one can derive that |

0:35:23 | the optimal |

0:35:24 | choice of |

0:35:26 | betas, the one that minimizes this variance, |

0:35:29 | is |

0:35:29 | gonna look like this |

0:35:36 | so this is what we can call the BLUE training objective

0:35:40 | a best linear

0:35:42 | unbiased estimate

0:35:44 | that is the meaning of BLUE, so this is the best linear unbiased estimate of

0:35:48 | the

0:35:50 | test loss

0:35:51 | using the training data to estimate what

0:35:53 | the test loss would be
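
The construction above can be sketched in a few lines. This is a minimal toy illustration, not code from the talk: given a (here assumed known) covariance matrix `C` of the per-trial losses, the weights minimizing the variance `w' C w` subject to the unbiasedness constraint `sum(w) = 1` are proportional to `C^{-1} 1`.

```python
import numpy as np

def blue_weights(C):
    """Weights w minimizing w' C w subject to sum(w) = 1 (unbiasedness)."""
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C, ones)   # C^{-1} 1
    return w / w.sum()             # normalize so the estimator stays unbiased

# Toy covariance: two correlated trials (e.g. same speaker) plus one
# independent trial.
C = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
w = blue_weights(C)
uniform = np.full(3, 1 / 3)
var_blue = w @ C @ w
var_unif = uniform @ C @ uniform
# BLUE down-weights the correlated pair, so var_blue <= var_unif.
```

With independent trials (`C` diagonal with equal entries) this reduces to the standard `1/n` weighting.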

0:35:59 | some

0:36:00 | details about this: we do not really need the covariances between the losses, only

0:36:05 | the correlations

0:36:07 | because

0:36:08 | we assume the diagonal elements of this covariance matrix are

0:36:12 | equal

0:36:14 | and then it turns out like this

0:36:18 | and in practice we will assume that

0:36:21 | the

0:36:22 | elements of this covariance matrix do not depend on theta, which

0:36:26 | could be questioned

0:36:31 | so

0:36:32 | the objective that we discussed is not really specific to speaker verification; the idea is

0:36:37 | that whenever you have

0:36:39 | dependencies in the training data, you could

0:36:42 | use this idea

0:36:43 | but

0:36:45 | the structure of the covariance matrix

0:36:49 | which describes the covariances of the losses of the training trials

0:36:54 | depends on the specific problem that you are studying

0:36:58 | so now we will look into how to

0:37:01 | create such a matrix for speaker verification

0:37:06 | so here

0:37:07 | we will use

0:37:08 | x

0:37:09 | with subscripts to denote the

0:37:12 | i-th utterance of a given speaker

0:37:16 | and we will assume that the

0:37:19 | correlation coefficients

0:37:21 | depend on what the trials have in common, so for example

0:37:24 | here we have a

0:37:26 | trial of speaker A utterance one against speaker A utterance two, and some loss for

0:37:31 | that, and also a trial of speaker A utterance one against speaker A

0:37:36 | utterance three, and some loss for that

0:37:38 | and they have some correlation

0:37:40 | because

0:37:42 | they involve the same speaker

0:37:45 | so we assume there is a correlation

0:37:48 | coefficient, denoted c

0:37:50 | with a subscript, as listed here

0:37:52 | so in total we have these kinds of situations in verification, if we consider target

0:37:57 | trials

0:38:00 | there you could have the situation that

0:38:02 | well, okay, let's look here

0:38:05 | at

0:38:05 | two target trials which have one utterance in common: this is a target trial

0:38:10 | of speaker A

0:38:11 | and here we have utterance one and utterance two, and here we have utterance one

0:38:15 | and utterance three, so utterance one is used in both

0:38:17 | trials, and there is some correlation between these trials

0:38:21 | here

0:38:22 | there is no common utterance but the speaker is still the same, and this is as

0:38:26 | opposed to this situation where

0:38:28 | you have

0:38:30 | a

0:38:30 | trial of speaker A and a trial of another speaker, and they have nothing in common

0:38:34 | so we assume the correlation is zero

0:38:37 | for such trials

0:38:39 | for the non-target trials you have a more complicated situation, but all possible situations are listed

0:38:46 | here

0:38:47 | for example

0:38:48 | you may have that

0:38:50 | okay

0:38:50 | the two trials have one

0:38:54 | utterance in common

0:38:58 | so we have this utterance in common, and in addition to that

0:39:02 | a speaker in common; that is what I mean with this notation here

0:39:08 | and so on

0:39:14 | and if we have such correlation coefficients, one can derive

0:39:18 | that

0:39:18 | the optimal weight for

0:39:24 | a target trial of a speaker with a given number of utterances

0:39:27 | is going to look like this

0:39:32 | the exact form is maybe not so important, but

0:39:34 | we should note that one can

0:39:37 | derive

0:39:38 | what

0:39:39 | weight to give to this speaker, and it depends on how many utterances

0:39:44 | the speaker has

0:39:47 | for the non-target trials the formula is more complex

0:39:51 | it depends on both of the speakers, A and B, involved in the trial; it

0:39:55 | depends on how many

0:39:56 | utterances each speaker has

0:40:02 | so

0:40:03 | then comes the issue of how to estimate the correlation coefficients: one could look at sample correlations

0:40:09 | from some trained model

0:40:12 | or we could

0:40:14 | learn them somehow

0:40:16 | which we will mention briefly later, or we can just make some assumption and

0:40:21 | tune it; so for example one simple assumption is that

0:40:25 | this correlation coefficient for target trials sharing an utterance is alpha, and this one, which we assume

0:40:30 | should be smaller, is alpha squared

0:40:32 | and then

0:40:35 | tune alpha in this range, and similarly for the non-target trials

0:40:44 | just to get some idea of how we would change the weights for the target

0:40:47 | trials

0:40:48 | well

0:40:49 | for target trials

0:40:51 | we see here that this is the number of utterances of the speaker

0:40:56 | and on the y-axis here we have the corresponding weights

0:41:01 | so

0:41:02 | for different values of these correlations: if the correlation is

0:41:07 | small

0:41:09 | then

0:41:11 | even when we have many utterances, up to twenty here, we will still give a reasonable

0:41:16 | weight to each utterance

0:41:19 | but if the correlation is large

0:41:22 | then we will not give so much weight to

0:41:25 | each utterance when a speaker has many utterances

0:41:29 | which means that the total

0:41:32 | weight for this speaker is not going to increase much even if it has a

0:41:35 | lot of

0:41:36 | utterances
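
The saturation effect described above can be reproduced numerically. The sketch below is illustrative (it assumes the simple parameterization mentioned earlier: correlation `rho` between target-trial losses sharing an utterance, `rho**2` when they only share the speaker) and computes the speaker's total unnormalized weight `1' C^{-1} 1`:

```python
import itertools
import numpy as np

def target_trial_corr(m, rho):
    """Correlation matrix of the losses of all target trials of one speaker
    with m utterances: rho if two trials share an utterance, rho**2 if they
    only share the speaker."""
    trials = list(itertools.combinations(range(m), 2))  # all target pairs
    n = len(trials)
    C = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(set(trials[i]) & set(trials[j]))
            C[i, j] = C[j, i] = rho if shared == 1 else rho**2
    return C

def speaker_total_weight(m, rho):
    """Total (unnormalized) optimal weight 1' C^{-1} 1 of this speaker's
    target trials, before normalizing across all speakers."""
    C = target_trial_corr(m, rho)
    ones = np.ones(len(C))
    return float(ones @ np.linalg.solve(C, ones))

# With a small correlation the speaker's weight keeps growing with more
# utterances; with a large correlation it saturates, as in the plot.
w_small = [speaker_total_weight(m, 0.1) for m in (2, 5, 10)]
w_large = [speaker_total_weight(m, 0.8) for m in (2, 5, 10)]
```

For `rho = 0` this reduces to one unit of weight per trial, i.e. speakers with more utterances get proportionally more weight.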

0:41:45 | and

0:41:48 | in the past I was exploring a little bit how big

0:41:52 | these kinds of correlations really are

0:41:55 | this was on an i-vector system with PLDA, and the scores

0:42:01 | here in the first

0:42:07 | column here

0:42:08 | are from a

0:42:09 | PLDA model trained with the EM algorithm, with the scores of this generative system

0:42:14 | calibrated afterwards

0:42:18 | and the other column here is for discriminatively trained PLDA

0:42:22 | so the main thing to observe here is that we

0:42:25 | do have

0:42:26 | correlations between trials that have, for example, an utterance in common, and so on

0:42:32 | and the correlations can be quite large in some situations

0:42:38 | so this

0:42:40 | problem does seem to exist

0:42:44 | and doing this kind of correlation compensation, again on the

0:42:49 | discriminatively trained

0:42:50 | PLDA

0:42:57 | does help a bit

0:43:05 | so it is something

0:43:08 | to

0:43:11 | possibly take into account

0:43:13 | of course this was for PLDA, where we train the PLDA

0:43:17 | model

0:43:18 | using all the trials

0:43:21 | that can be constructed from the training set, but of course the same

0:43:25 | problem with the dependencies exists also in end-to-end systems

0:43:37 | so

0:43:40 | now some problems that we could encounter if we tried to do this

0:43:45 | well, first, the

0:43:47 | results, or the

0:43:50 | compensation formulas that we derived

0:43:52 | were assuming that

0:43:54 | all trials

0:43:55 | that can be created from the training set are used, equally often, which is the

0:43:58 | case if you train a backend like PLDA

0:44:02 | discriminatively and you use all the trials

0:44:05 | but when

0:44:08 | we train a kind of end-to-end system involving neural networks

0:44:14 | we use mini-batches, so one could achieve this situation by

0:44:20 | making a

0:44:21 | list of trials

0:44:24 | and

0:44:25 | then we just sample trials from it: okay, here is one trial, this speaker

0:44:29 | compared to this one; here is another trial, this speaker compared to that one, and so on

0:44:33 | and this is a

0:44:34 | long list of all trials that can be formed, and then we just

0:44:41 | select some of them into the mini-batch

0:44:44 | the point is, of course, that if we have the speakers like this

0:44:47 | in the mini-batch, and we compare this one with this one

0:44:50 | this one with this one, and so on

0:44:53 | we are not using all the trials that we have

0:44:56 | we are, for example, not comparing this one with this one in the mini-batch

0:45:01 | and that is maybe a bit of a waste, because we are anyway using this deep

0:45:06 | neural network to produce the embeddings, so we could just as well

0:45:12 | use all of them in the scoring part

0:45:15 | as well

0:45:17 | but then

0:45:17 | we will have a little bit different

0:45:20 | balance

0:45:22 | of the trials

0:45:24 | globally, compared to what we had before

0:45:27 | so the formulas that we derived would not be exactly valid in this situation

0:45:33 | so

0:45:34 | the

0:45:36 | question then is: if we do decide that all the segments

0:45:40 | that

0:45:42 | we have in the mini-batch, for which we can extract embeddings,

0:45:44 | should all be used

0:45:48 | in the scoring, how are we going to select

0:45:52 | the data for the mini-batch?

0:45:54 | there can be different strategies here

0:45:57 | we could consider, for example

0:46:00 | strategy A

0:46:03 | we select some speakers

0:46:05 | and then for each speaker we take all the segments that they have

0:46:08 | let's say that this red speaker has

0:46:11 | three segments, and this yellow speaker has

0:46:14 | four segments

0:46:17 | and then

0:46:21 | we can consider all the pairs, so we can have

0:46:26 | segment one of the red speaker scored against segment two, segment one scored against segment

0:46:30 | three, and so on

0:46:33 | we don't use the diagonal, because we don't consider

0:46:39 | trials where a segment is scored against itself

0:46:42 | and the score here is just the same as here

0:46:46 | scoring segment two

0:46:48 | against segment one

0:46:50 | so

0:46:52 | this would be one way; another way, call it strategy B, would be

0:47:00 | to select speakers, but then select just two utterances for each speaker in the mini-batch

0:47:08 | so

0:47:10 | you will have just one target trial for each speaker

0:47:14 | the difference here is that

0:47:16 | we have

0:47:17 | we are going to have

0:47:19 | fewer target trials

0:47:21 | overall in the mini-batch, but more of them will be from different speakers, so

0:47:24 | we will have target trials from more speakers

0:47:28 | typically

0:47:29 | so

0:47:30 | it is

0:47:31 | not exactly clear what would be the right thing, but from some informal experiments

0:47:36 | we have done

0:47:37 | it seems that strategy B is better
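
Strategy B is simple to implement. The following sketch (the function name and data layout are illustrative, not from the talk's toolkit) picks a set of speakers, takes exactly two utterances per speaker, and forms every pairwise trial in the batch, so each speaker contributes one target trial and all cross-speaker pairs are non-target trials:

```python
import itertools
import random

def sample_batch_b(utts_by_spk, n_spk, rng=random):
    """utts_by_spk: dict speaker -> list of utterance ids.
    Returns the batch segments and all pairwise trials with target labels."""
    eligible = [s for s, u in utts_by_spk.items() if len(u) >= 2]
    speakers = rng.sample(eligible, n_spk)
    # two utterances per selected speaker
    batch = [(s, u) for s in speakers for u in rng.sample(utts_by_spk[s], 2)]
    # score every pair of segments in the batch (no segment against itself)
    trials = [((sa, ua), (sb, ub), sa == sb)
              for (sa, ua), (sb, ub) in itertools.combinations(batch, 2)]
    return batch, trials

rng = random.Random(0)
utts = {f"spk{i}": [f"spk{i}-utt{j}" for j in range(5)] for i in range(20)}
batch, trials = sample_batch_b(utts, n_spk=8, rng=rng)
n_target = sum(t for _, _, t in trials)  # one target trial per speaker
```

With 8 speakers the batch has 16 segments, 120 trials in total, and exactly 8 target trials.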

0:47:46 | then again, the formulas that we derived before for how to weight the trials were not completely

0:47:51 | valid here: they were not derived under the assumption that we are doing it like this

0:47:55 | so they do not

0:47:56 | hold

0:48:00 | and they need to be modified a bit, and I will come to that

0:48:03 | in a minute

0:48:07 | the second problem that can occur in end-to-end training is that

0:48:12 | irrespective of these issues

0:48:17 | we do want

0:48:19 | to have

0:48:20 | a system that can deal with multi-session enrollment

0:48:26 | of course, multi-session trials can be

0:48:30 | handled with an end-to-end system, as we discussed in the initial

0:48:34 | slides

0:48:36 | by having some pooling over the enrollment utterances

0:48:40 | but how to create the training data is again a little bit

0:48:46 | complicated

0:48:47 | because

0:48:48 | already in the case of single-session trials we had a complicated situation, with many

0:48:54 | different kinds of dependencies that can occur, and so on; in the multi-session case

0:48:59 | it is going to be even more

0:49:01 | complicated, because you can have situations like

0:49:04 | these

0:49:06 | two trials

0:49:08 | for example, these two could be the enrollment and this is the test, and another

0:49:12 | trial where

0:49:13 | these two are the enrollment

0:49:15 | and

0:49:15 | this is the test; then you have one utterance in common here

0:49:19 | or we can have a more extreme situation where both enrollment utterances

0:49:24 | in the two trials are the same, but the test utterance is different

0:49:27 | so the number of possible dependencies that can occur is way larger

0:49:32 | and I think it is

0:49:33 | very difficult to derive some kind of formula for how the trials should be weighted

0:49:41 | so to deal both with the fact that we are using mini-batches

0:49:46 | and with multi-session trials, and to estimate proper trial weights

0:49:52 | one strategy could be to learn them; this is not something

0:49:56 | I have tried, I just think it is

0:49:57 | something that maybe should be tried

0:49:59 | well

0:50:01 | so we can define

0:50:02 | a training loss

0:50:04 | again as an average of losses over the training data, with some weights

0:50:09 | and we also define a development loss

0:50:14 | which is an average of the losses over

0:50:16 | another set, the development set

0:50:22 | and these weights here should depend only on the number of utterances of the speaker

0:50:31 | or speakers involved in the trial

0:50:35 | then one can imagine some scheme like this

0:50:39 | we send both training and development data through the neural

0:50:44 | network, and we get some

0:50:47 | training loss and some

0:50:49 | development loss

0:50:53 | as usual, we estimate the

0:50:56 | gradient; here we take the gradient with respect to the model parameters

0:51:03 | of the training loss

0:51:05 | and

0:51:06 | this

0:51:06 | gradient is now a function of the trial weights

0:51:11 | and we can update

0:51:13 | the model parameters, keeping in mind that the new parameter value is a function

0:51:18 | of the

0:51:21 | trial weights

0:51:23 | the training trial weights

0:51:25 | and then

0:51:27 | we can

0:51:28 | on the development set

0:51:30 | calculate

0:51:31 | the gradient

0:51:33 | with respect to these training trial weights

0:51:36 | and then

0:51:37 | use this to update

0:51:40 | the training trial weights
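
The scheme can be sketched as a bilevel update. This is a toy illustration of the idea only (the per-trial losses are stand-in quadratics, and the model is one scalar parameter; the talk proposes the scheme but reports no implementation): one inner SGD step on the weighted training loss, then a chain-rule gradient step on the development loss with respect to the trial weights.

```python
import numpy as np

def step(theta, w, t_train, t_dev, lr_theta=0.1, lr_w=0.5):
    """One inner model update followed by one outer trial-weight update.
    Per-trial losses are l_i(theta) = (theta - t_i)**2, a toy stand-in."""
    grads = 2.0 * (theta - t_train)             # d l_i / d theta per trial
    theta_new = theta - lr_theta * (w @ grads)  # inner step: theta_new depends on w
    g_dev = 2.0 * (theta_new - t_dev)           # d L_dev / d theta_new
    grad_w = -lr_theta * grads * g_dev          # chain rule: d L_dev / d w_i
    w = np.clip(w - lr_w * grad_w, 0.0, None)   # keep weights non-negative
    return theta_new, w / w.sum()               # renormalize: keep the loss unbiased

t_train = np.array([0.0, 0.0, 4.0])  # two "trials" pull theta to 0, one to 4
t_dev = 0.0                          # the dev data says theta should be near 0
theta, w = 1.0, np.full(3, 1 / 3)
for _ in range(50):
    theta, w = step(theta, w, t_train, t_dev)
# the weight of the "bad" trial shrinks, and theta follows the dev objective
```

In a real system `theta` would be the network parameters and both gradients would come from backpropagation; the structure of the two updates is the same.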

0:51:46 | a second

0:51:47 | thing

0:51:49 | to explore

0:51:51 | or like a final note on this

0:51:57 | statistical dependency issue, is that

0:52:00 | we just

0:52:02 | discussed some ideas for balancing the training data, the training trials, for better optimization

0:52:08 | but for example in the case when all speakers have the same

0:52:12 | number of utterances

0:52:14 | this rebalancing has no effect

0:52:17 | still, of course, the dependencies are there, so one would think: shouldn't we

0:52:20 | do something more than just rebalance the training data?

0:52:24 | and one possibility that I think would be worth

0:52:28 | trying

0:52:29 | is to

0:52:34 | assume the following

0:52:35 | that

0:52:37 | the covariance of

0:52:39 | let's say the scores of

0:52:42 | two trials of speaker A

0:52:45 | which have

0:52:45 | one utterance

0:52:47 | in common should be bigger than

0:52:49 | the covariance between two trials

0:52:51 | of this

0:52:52 | speaker which have

0:52:54 | no utterances in common

0:52:56 | which should be bigger than the covariance between

0:53:01 | two

0:53:02 | target trials of different speakers; this one should actually be zero

0:53:06 | so one could consider regularizing the model to behave in that way

0:53:14 | so now

0:53:17 | after discussing the issues with

0:53:21 | end-to-end training

0:53:23 | I will briefly mention some of the

0:53:29 | papers

0:53:32 | on end-to-end

0:53:33 | training, and this should not be considered a kind of literature review or

0:53:38 | a description of the best architectures or anything like that

0:53:42 | it is

0:53:43 | more

0:53:45 | just a few selected papers that illustrate some points

0:53:53 | some of which give good take-away messages about end-to-end training

0:53:59 | so this paper, called "end-to-end text-dependent speaker verification", as far as I know, was

0:54:04 | the first paper on end-to-end training in speaker verification

0:54:09 | and it uses a network like this, or some architecture like this: the features go into

0:54:15 | a neural network, and in the end

0:54:21 | this network is going to say

0:54:24 | is it the same

0:54:26 | speaker or not

0:54:28 | the important thing here is that

0:54:32 | the input is of fixed size

0:54:36 | so the input to the neural network is the feature dimension times the number of

0:54:41 | frames

0:54:45 | that is, the duration

0:54:48 | and there was no temporal pooling, which is

0:54:52 | done in many other situations

0:54:55 | and this is suitable

0:54:56 | when

0:54:58 | you do text-dependent speaker verification, as they did in this paper

0:55:02 | because this means that

0:55:05 | the network is kind of aware of the word and phoneme order

0:55:10 | and

0:55:11 | I would say that the main conclusion from this paper is that

0:55:15 | the verification loss was better than the identification loss

0:55:19 | especially when you have big amounts of training data; for small amounts of training

0:55:24 | data

0:55:25 | there was not as big a difference

0:55:28 | and one can also say that t-norm could

0:55:32 | to a large extent make

0:55:35 | the models trained with these two losses more similar

0:55:42 | but I would still say that this kind of suggests the verification loss is beneficial

0:55:48 | if you have large amounts of training data

0:55:55 | so this is another paper

0:55:59 | that was doing

0:56:01 | text-independent speaker verification, and here

0:56:05 | different from the other one, they do have a temporal pooling layer

0:56:11 | so

0:56:12 | that would kind of remove the dependence on word order in the input

0:56:17 | to some extent at least, and it is maybe a more suitable architecture for text-

0:56:22 | independent speaker verification

0:56:25 | and this was compared to an i-vector PLDA baseline, down here, and it was found

0:56:30 | that a really large amount of training data is needed even to beat something like an

0:56:34 | i-vector

0:56:36 | PLDA system

0:56:44 | and this is

0:56:46 | a study that we did

0:56:52 | also on text-independent speaker recognition, or verification

0:56:58 | but trained on a smaller amount of data, and to make it work we instead constrained

0:57:04 | this neural network here, this big end-to-end system, to behave

0:57:08 | something like

0:57:10 | an i-vector and PLDA baseline, so we constrained it not to be too

0:57:16 | different from the

0:57:18 | i-vector PLDA baseline

0:57:21 | and

0:57:23 | we found there that training all the blocks jointly with the verification loss was improving

0:57:33 | as can be seen here

0:57:36 | but

0:57:36 | a little bit regrettably, we did not separate out

0:57:40 | clearly whether that improvement came from the fact that we were doing joint training

0:57:45 | or the fact that we were

0:57:50 | using the verification loss

0:57:55 | another interesting thing here is that

0:57:59 | we found that

0:58:00 | training with the verification loss requires very large batches

0:58:05 | and this was an experiment done only on the

0:58:09 | scoring part, on discriminatively trained PLDA

0:58:12 | so if we train this kind of PLDA with

0:58:16 | L-BFGS, using full batches

0:58:19 | that is

0:58:21 | not a mini-batch

0:58:24 | training scheme

0:58:26 | we achieve some

0:58:27 | loss

0:58:28 | like this on the development set

0:58:31 | this dashed

0:58:33 | blue line

0:58:34 | whereas if we train with Adam, with mini-batches of different sizes

0:58:39 | up to five thousand

0:58:41 | we see that we need really big batches to actually

0:58:45 | get close to the L-BFGS-

0:58:48 | trained model, which was trained on full batches

0:58:50 | so that kind of suggests that you really need to have many trials

0:58:55 | within the mini-batch in order for

0:58:59 | training these kinds of

0:59:02 | systems with a verification loss to work, which is a bit of a problem and maybe a

0:59:06 | challenge to deal with

0:59:07 | in the future

0:59:12 | this is a more recent paper, and the interesting point of this paper was that

0:59:17 | they trained the whole system

0:59:20 | all the way from the waveform, instead of from features as the others

0:59:27 | did

0:59:29 | but I

0:59:31 | couldn't

0:59:33 | tell completely whether the improvement came from the fact that they were

0:59:37 | training from the waveform, or if it was because of

0:59:41 | the choice of architecture and so on

0:59:45 | but it is interesting that

0:59:48 | systems going

0:59:49 | all the way from the waveform to the end

0:59:53 | can work well

0:59:58 | and this is a paper

1:00:00 | from this year's

1:00:02 | Interspeech; it is interesting because

1:00:08 | it is one of the more recent studies that really proposed, or showed, good

1:00:13 | performance from using the verification loss

1:00:17 | here it was joint

1:00:20 | multitask training, so they were training using both the identification loss

1:00:24 | and the verification loss

1:00:28 | and that is actually something I have tried before without any

1:00:32 | benefit from it, but one thing they did here was to

1:00:36 | start with a large weight for the identification loss and gradually

1:00:40 | increase the weight for the verification loss, and I think this is interesting and maybe

1:00:47 | actually the right way to go

1:00:49 | I'm curious about it
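
The schedule idea above is easy to sketch. The linear ramp and its endpoints below are illustrative choices, not the cited paper's actual schedule:

```python
def loss_weights(epoch, n_epochs, w_verif_final=0.9):
    """Return (w_identification, w_verification) for a given epoch:
    start dominated by the identification loss, then linearly shift
    weight onto the verification loss."""
    frac = min(max(epoch / max(n_epochs - 1, 1), 0.0), 1.0)
    w_verif = frac * w_verif_final
    return 1.0 - w_verif, w_verif

# at each training step the combined objective would then be:
#   loss = w_id * identification_loss + w_ver * verification_loss
w_id, w_ver = loss_weights(epoch=0, n_epochs=10)   # (1.0, 0.0) at the start
```

Whether a linear ramp, a step change, or some other curve works best would have to be tuned experimentally.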

1:00:54 | so

1:00:55 | now comes a little summary of this talk

1:00:59 | we discussed the motivation for end-to-end

1:01:04 | training

1:01:05 | and

1:01:06 | we said that it has some good motivation

1:01:09 | and

1:01:10 | we showed, or

1:01:13 | rather referred to, some

1:01:16 | experimental results, also from other papers

1:01:19 | which show that it seems to work quite well for the text-dependent task with large amounts

1:01:24 | of training data

1:01:27 | in such a case it is probably preferable to preserve the temporal structure, to avoid

1:01:33 | the temporal pooling

1:01:35 | in text-independent benchmarks one would need strong regularization or a mixed

1:01:42 | training objective in order to benefit from

1:01:45 | end-to-end training, and typically we would want to do some temporal pooling there

1:01:54 | one could guess that end-to-end training would be the preferable choice in scenarios where we

1:02:00 | have many training speakers with few utterances each, where we have less of the statistical dependency

1:02:05 | problem

1:02:09 | something that to me seems to be an open question, and which it would be

1:02:14 | great if someone explored

1:02:17 | is

1:02:18 | okay

1:02:19 | why is it so difficult to train an end-to-end system, especially for the text-independent

1:02:24 | task?

1:02:25 | is it because of overfitting, or training convergence, or this dependency issue we discussed?

1:02:32 | it is not really clear, I would say

1:02:34 | and

1:02:36 | a practical question is how to adapt such systems, because with the more blockwise systems we

1:02:43 | would often adapt only the backend

1:02:45 | or could we train the system in a way that we don't need adaptation?

1:02:53 | and also, how could we input some human knowledge about speech into this training, if we

1:02:58 | need it

1:03:00 | something we know about the data distribution, or the number of phonemes, or

1:03:04 | whatever

1:03:07 | and we discussed that maybe

1:03:12 | training a model for speaker identification is not ideal for speaker verification, but is there

1:03:18 | some way to

1:03:21 | find embeddings that are good for all these tasks?

1:03:27 | another interesting question is

1:03:32 | how well

1:03:34 | the LLRs that come from

1:03:36 | end-to-end

1:03:38 | architectures

1:03:39 | can actually approximate the true LLR

1:03:44 | so in other words, what kind of

1:03:49 | distributions can be

1:03:51 | arbitrarily accurately simulated, or modeled, by these architectures

1:03:57 | this is not completely clear to me

1:04:00 | okay, so

1:04:02 | thank you for your attention

1:04:05 | and goodbye

1:04:10 | hello; now I will present the hands-on session for the

1:04:19 | end-to-end speaker verification tutorial

1:04:26 | if this format does not work well, I apologize

1:04:31 | I have not really recorded a talk this way before, so let's see

1:04:40 | I want to talk about

1:04:43 | two things: first

1:04:44 | I will go through the code that I use in

1:04:48 | most of my experiments

1:04:51 | and

1:04:53 | after that I will show a few tricks to solve various

1:04:58 | implementation issues

1:05:02 | that I have used

1:05:06 | okay, so

1:05:10 | first

1:05:11 | the code for the end-to-end system; this is code that I started working

1:05:17 | on, I think in 2016

1:05:23 | it was initially written in Theano, which is by now considered obsolete

1:05:30 | and the idea is that it is

1:05:32 | time to switch to TensorFlow or PyTorch or something

1:05:38 | else

1:05:41 | the link to the repository is here

1:05:44 | and most of the stuff in this repository is, nowadays, mostly

1:05:52 | for multiclass training; I mostly use the multiclass training, maybe in

1:05:59 | combination with other stuff

1:06:01 | but

1:06:03 | there is not much in it

1:06:05 | that uses

1:06:07 | pure end-to-end training with the verification loss

1:06:11 | the papers that we wrote about this are actually based on older code,

1:06:15 | and I thought there was not so much point in

1:06:19 | maintaining that anymore

1:06:23 | but I do have one script here that uses the verification loss in

1:06:29 | combination with the identification loss, so that is the script we will look at

1:06:37 | and generally

1:06:41 | well, let's see; first I am trying to point out things in this code that

1:06:46 | I think have worked well, and also mention

1:06:51 | what I would

1:06:52 | really do differently today

1:06:54 | to maybe give some guidance

1:06:57 | well, at least I can say it has served me well over the past years as

1:07:02 | some

1:07:03 | small toolkit for speaker verification

1:07:09 | I should note that I didn't see any benefit here from adding the verification

1:07:15 | loss to the identification loss

1:07:18 | contrary to the paper I mentioned in the tutorial

1:07:23 | it could be that their quite complicated scheme for changing the balance between the losses

1:07:30 | throughout the training is really needed; this may be something I will look at at some

1:07:37 | point

1:07:41 | and this script, I think, is

1:07:45 | something you can try to run if you want, though I normally run it

1:07:49 | locally and not in the notebook here, so you may need to fiddle

1:07:54 | a little bit with the paths, because

1:07:58 | it is written

1:08:00 | in such a way that it

1:08:02 | runs in my environment

1:08:07 | so some small adjustments might be needed if you actually want to run it here

1:08:16 | so

1:08:18 | I tried, when organising my experiments, to do it in such a way that

1:08:25 | there is one script where everything that is specific to the experiment is set, so that

1:08:32 | includes which data to use, the configuration of the model, and so on

1:08:38 | I never really found

1:08:42 | it efficient or appealing to have

1:08:44 | input arguments to these scripts specifying which data to use and so on, because anyway you

1:08:50 | more or less always have to change something in the script

1:08:56 | for a new experiment, so then you can just copy the script, rename it, and

1:09:02 | so on

1:09:03 | and rerun it

1:09:06 | but other things that are a little bit more stable from experiment to experiment are

1:09:12 | just loaded from this script

1:09:15 | such as the model definitions of different architectures and so on

1:09:25 | so usually I use the suffix underscore to denote Theano shared variables, underscore v for placeholders

1:09:34 | and so on

1:09:36 | the kinds of

1:09:40 | models here are

1:09:43 | similar to Keras models, although maybe a little bit less

1:09:49 | fancy than

1:09:54 | Keras

1:09:58 | I didn't use Keras initially, because when I started with this years ago

1:10:03 | Keras was not flexible enough, or I had quite peculiar ideas that didn't fit

1:10:11 | neatly with it, but nowadays it is definitely flexible enough

1:10:25 | so

1:10:31 | for example, here is the part where the parameters are set

1:10:37 | these are maybe the things one would think should be input arguments, as you remember

1:10:43 | from before

1:10:45 | but since it anyway seems necessary to change things in the script for every experiment, I prefer to have

1:10:51 | them here

1:10:53 | so here is the path to the training data

1:10:56 | how long the shortest and the longest segments we train on are

1:11:03 | some other parameters related to training: the batch size

1:11:08 | the maximum number of epochs

1:11:10 | and

1:11:12 | the number of batches in an epoch; so I don't really define

1:11:18 | an epoch as one pass through the data, but by a fixed number of batches, and

1:11:23 | the reason for that will be clear in a minute

1:11:29 | also patience; probably most of you are familiar with it, but it is worth mentioning:

1:11:34 | how long to train without

1:11:35 | improvement in the validation score

1:11:37 | so the next part of the script is the part for defining how to load

1:11:46 | and prepare data

1:11:48 | and here one important point is the

1:11:53 | way the batches are built: we

1:11:56 | use chunks of features from different utterances, so randomly selected segments

1:12:05 | if you load data from a normal hard disk and randomly select different segments from

1:12:13 | different utterances

1:12:15 | this will be much too slow, i would say

1:12:21 | so often

1:12:23 | you work around this by loading the data in order, one archive at a time, and

1:12:28 | preparing

1:12:29 | many batches

1:12:32 | at once

1:12:33 | that's one way, but on my servers i store the

1:12:40 | data on SSDs, and then it can be loaded as you wish: feature chunks can

1:12:47 | be loaded randomly, the SSD is fast enough for that

1:12:50 | so this is

1:12:53 | good because it allows for much more flexibility in experiments. for example, sometimes

1:12:59 | you may want to load two segments from the same utterance into one

1:13:04 | batch

1:13:06 | for some experiments

1:13:10 | or sometimes you just want to change the duration of the segments

1:13:16 | if

1:13:16 | you

1:13:18 | use archives, then you have to prepare new archives for this

1:13:22 | so i would say that

1:13:24 | using SSDs

1:13:28 | and then just loading features as you go is a

1:13:33 | very good thing, and SSDs are really worth investing in if

1:13:38 | you want to do

1:13:39 | this kind of experiments

1:13:44 | i define some functions, for example to load the training data: given some

1:13:52 | list of files, this one will load the data

1:13:59 | so if you want to load the batches in some specific way,

1:14:05 | you define that here. if you want to do, for example, the thing

1:14:08 | i mentioned, to load two segments from the same utterance into one batch, then you would

1:14:13 | have to change the function here

1:14:15 | so this was quite a

1:14:18 | useful way of organising things, for me at least, in my experiments

1:14:26 | another important thing: this part creates a dictionary of the training

1:14:32 | files, and other dictionaries, for example mapping speaker ids to numbers

1:14:38 | and lists of files, and so on

1:14:43 | and that's

1:14:46 | created here

1:14:51 | and these

1:14:55 | mappings are used to create the minibatches

1:14:59 | a little bit later down here i create a generator for minibatches, and

1:15:04 | it takes this dictionary of

1:15:09 | mappings, the speaker mapping and so on, and i have different generators depending

1:15:15 | on what kind of minibatches i want. for example, you may want

1:15:19 | randomly selected speakers with all their data, or you may want randomly selected speakers

1:15:24 | with, for example, two utterances each, or something like that

1:15:30 | so that is changed by choosing another generator
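The batch generation described here, randomly chosen speakers, a fixed number of utterances each, and random fixed-length chunks cut from each utterance, can be sketched roughly as below. All names (`chunk_batch_generator`, `spk2utt`) are hypothetical, and an in-memory dict of numpy arrays stands in for features stored in files on an SSD.

```python
import numpy as np

def chunk_batch_generator(features, spk2utt, rng, n_speakers=4,
                          utts_per_speaker=2, min_len=200, max_len=400):
    """Yield minibatches of fixed-length chunks cut from random utterances.

    `features` maps utterance id -> (frames, dim) array; in a real setup
    these reads would go to fast storage (e.g. feature files on an SSD).
    """
    speakers = sorted(spk2utt)
    while True:
        # one chunk length per batch, so the chunks can be stacked
        chunk_len = int(rng.integers(min_len, max_len + 1))
        chunks, labels = [], []
        for spk in rng.choice(speakers, n_speakers, replace=False):
            for utt in rng.choice(spk2utt[spk], utts_per_speaker):
                feats = features[utt]
                start = rng.integers(0, feats.shape[0] - chunk_len + 1)
                chunks.append(feats[start:start + chunk_len])
                labels.append(spk)
        yield np.stack(chunks), np.array(labels)
```

Swapping in a different generator (e.g. all utterances of each speaker, or two segments from the same utterance) only changes this one function, which is the flexibility the talk is pointing at.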

1:15:39 | then the next step is to

1:15:42 | set up the model

1:15:44 | and here i'm using

1:15:47 | a TDNN, an x-vector-like architecture

1:15:52 | and i also add a PLDA model

1:15:57 | on top, to score the embeddings from this

1:16:03 | x-vector extractor, so

1:16:09 | that we

1:16:11 | can do the verification

1:16:19 | one minor difference from the Kaldi architecture i should mention is that i found it necessary

1:16:24 | to have some kind of normalization layer after the temporal pooling; batch normalization, or just

1:16:32 | a fixed standardization estimated on the data at the beginning, works fine

1:16:36 | as well

1:16:38 | i guess the reason it is needed here could be because we use a simpler optimizer:

1:16:44 | we use just stochastic gradient descent, as compared to Kaldi, which uses a more

1:16:48 | sophisticated one
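The fixed standardization after pooling mentioned here, statistics estimated once from data and then frozen, might look like this minimal numpy sketch; the class name and interface are assumptions for illustration, not the talk's actual code.

```python
import numpy as np

class DataInitNorm:
    """Mean/variance normalization whose statistics are estimated once
    from data at initialization and then kept fixed; a stand-in for the
    normalization placed after the temporal pooling."""

    def __init__(self, eps=1e-8):
        self.mean = None
        self.std = None
        self.eps = eps

    def initialize(self, batches):
        """Estimate statistics from an iterable of (n, dim) arrays."""
        data = np.concatenate(list(batches), axis=0)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)

    def __call__(self, x):
        return (x - self.mean) / (self.std + self.eps)
```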

1:16:54 | so this part of the code contains the

1:17:00 | definition of the architecture, like number of layers, their sizes,

1:17:07 | activation functions,

1:17:09 | and so on;

1:17:11 | whether we should have normalization of the features, normalization after the pooling,

1:17:18 | and whether these

1:17:22 | normalizations

1:17:25 | are

1:17:27 | updated during training or only estimated on the data when being initialized;

1:17:37 | options for regularization, and so on

1:17:41 | we initialize the model here, and we provide,

1:17:47 | when we do this, an iterator, the generator for

1:17:52 | the training data, and this is used to initialize the normalization layers of the model

1:17:58 | this is something that creates a bit of a mess, and i would probably do it

1:18:04 | differently if i wrote it anew:

1:18:09 | maybe some dummy initialization, and then run a few

1:18:16 | iterations

1:18:18 | before starting the training, just to

1:18:21 | initialize the normalization layers

1:18:30 | then we apply the model to the data, which is in these placeholders

1:18:35 | here

1:18:36 | and

1:18:39 | then

1:18:41 | what comes out will be the embeddings and the classifications

1:18:46 | and

1:18:47 | the

1:18:49 | embeddings in particular we will send to

1:18:55 | the PLDA model

1:18:58 | basically here

1:18:59 | we make some settings for it

1:19:03 | and

1:19:05 | with the probabilistic LDA model we can get the

1:19:09 | scores

1:19:10 | for all pairwise comparisons

1:19:14 | in the batch, and also a loss for that, since we can provide

1:19:18 | labels for it
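As a simplified stand-in for the PLDA scoring of all pairwise trials in a batch, here is a hedged numpy sketch; it uses cosine scoring instead of PLDA, and the function name and interface are hypothetical. The part it does share with the talk's setup is building all-vs-all trial scores and their target/non-target labels from the batch's speaker identities.

```python
import numpy as np

def pairwise_trials(embeddings, speakers):
    """Score all pairwise trials in a batch and build their labels.

    Cosine scoring is used here as a simple stand-in for a PLDA layer;
    trial labels come from comparing speaker identities."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ e.T                                   # all-vs-all scores
    labels = speakers[:, None] == speakers[None, :]    # True = target trial
    iu = np.triu_indices(len(speakers), k=1)           # each pair once
    return scores[iu], labels[iu]
```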

1:19:22 | so the next part is where we define the loss and training functions and so on

1:19:31 | we have the loss as a weighted combination of the classification loss and

1:19:36 | the verification loss,

1:19:38 | which here is binary, with fixed interpolation weights for

1:19:44 | the two terms

1:19:47 | and maybe one important thing here is that these losses are normalized: we

1:19:54 | divide by

1:19:56 | minus the

1:19:58 | log of the probability under random predictions, so by the log of the number of speakers

1:20:05 | in the case of the classification loss

1:20:09 | and the reason to do this is that

1:20:13 | if the model is just initialized with random parameters, the loss

1:20:18 | will be approximately one

1:20:21 | and we do the same thing for the verification loss

1:20:27 | so

1:20:28 | this means that all these losses are scaled in a similar way, and it

1:20:33 | becomes easier to choose how to interpolate between them
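The loss normalization just described can be written down directly: each cross-entropy term is divided by its expected value under random predictions, so a freshly initialized model scores roughly 1.0 on both terms. Function names and the single interpolation weight `w` are illustrative assumptions.

```python
import numpy as np

def normalized_losses(class_ce, n_speakers, verif_ce, target_prior=0.5):
    """Scale each loss by its value under random predictions, so that a
    freshly initialized model gives roughly 1.0 for both terms."""
    class_norm = class_ce / np.log(n_speakers)          # chance-level CE
    prior_entropy = -(target_prior * np.log(target_prior)
                      + (1 - target_prior) * np.log(1 - target_prior))
    verif_norm = verif_ce / prior_entropy               # chance-level binary CE
    return class_norm, verif_norm

def combined_loss(class_norm, verif_norm, w=0.5):
    """Interpolate the two normalized losses with a single weight."""
    return w * class_norm + (1 - w) * verif_norm
```

Because both normalized terms sit near 1.0 at initialization, the interpolation weight has a comparable effect on each, which is the point made in the talk.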

1:20:40 | at the end of this part of the script we define a training function, which takes

1:20:45 | the data for a batch and does one update of

1:20:50 | the model

1:20:53 | the next part is for

1:20:59 | defining functions for setting parameters and getting parameters of the model

1:21:04 | and

1:21:06 | defining

1:21:07 | a function to

1:21:10 | check some kind of validation loss after each epoch

1:21:21 | so this part is just for setting parameters and getting parameters

1:21:26 | and

1:21:28 | maybe not so important

1:21:31 | you can find the

1:21:34 | function for checking the validation loss here

1:21:39 | finally, the training combines these in a

1:21:41 | function here, which takes the

1:21:45 | function for

1:21:47 | checking the validation loss, takes many other parameters

1:21:51 | and things that we defined,

1:21:55 | okay,

1:22:03 | for example the function for training, and so on

1:22:07 | so the way we train here is basically:

1:22:12 | we iterate for an epoch, which was defined as

1:22:16 | a fixed number of batches

1:22:20 | and this is because we don't really have epochs here: we just keep picking

1:22:24 | random segments

1:22:26 | so there is

1:22:29 | not really a clear idea of what one pass over the

1:22:33 | data is

1:22:35 | but anyway

1:22:38 | we do the training, and if one epoch doesn't improve the validation loss, we

1:22:45 | will

1:22:46 | try a few more times, up to the patience number of times, and if it still doesn't

1:22:51 | improve, then we

1:22:54 | reset the parameters to the best ones so far and halve the learning rate. okay,

1:23:00 | i don't know if this is the best

1:23:02 | scheme, but it has worked well enough for me
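The schedule just described, patience, rewind to the best parameters, halve the learning rate, can be sketched as follows. All the callables (`train_one_epoch`, `validation_loss`, `get_params`, `set_params`) and default values are assumptions of this sketch, not the talk's actual code.

```python
def train_with_patience(train_one_epoch, validation_loss, get_params,
                        set_params, lr=0.1, patience=3, max_epochs=50,
                        min_lr=1e-5):
    """If the validation loss does not improve for `patience` epochs,
    reset to the best parameters and halve the learning rate."""
    best_loss = validation_loss()
    best_params = get_params()
    bad_epochs = 0
    for _ in range(max_epochs):
        train_one_epoch(lr)            # one "epoch" = fixed number of batches
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_params, bad_epochs = loss, get_params(), 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                set_params(best_params)  # rewind to the best model so far
                lr /= 2
                bad_epochs = 0
                if lr < min_lr:
                    break
    set_params(best_params)
    return best_loss
```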

1:23:13 | so

1:23:14 | that was the whole pipeline

1:23:16 | and

1:23:18 | before going on i would like to mention a few tricks,

1:23:23 | not very complicated things,

1:23:25 | but they were maybe slightly difficult for me to figure out

1:23:30 | and

1:23:31 | they are related to back propagation and the things i wanted to modify there

1:23:41 | so let's first briefly review the back propagation algorithm

1:23:48 | basically,

1:23:52 | you know that a neural network is just

1:23:55 | a series of affine transformations, each followed by a nonlinearity, then again an affine transformation, and again a nonlinearity,

1:24:01 | and so on

1:24:03 | so to the input we apply an affine transformation,

1:24:09 | whose result is z here, and then we apply some nonlinearity and

1:24:14 | obtain the activation a, and we do that over and over

1:24:18 | until we get the final

1:24:22 | output, and then we have some cost function

1:24:26 | on that, for example cross entropy

1:24:29 | and if we use function composition notation, the ring here basically means

1:24:34 | that the composition of g and h is:

1:24:39 | apply h on the data and then g on the result. then we know that we can

1:24:43 | write the whole neural network as

1:24:46 | applying the first affine transformation to the input,

1:24:50 | then the first nonlinearity,

1:24:54 | all the way

1:24:55 | to the output

1:24:57 | it can be written like this

1:24:59 | and it is also easy to write the

1:25:04 | gradient of the

1:25:06 | loss with respect to the input using the chain rule, like this

1:25:11 | so it's just

1:25:13 | basically: the derivative of c with respect to the input is just

1:25:19 | a chain like this, the derivative of c with respect to the last a, times

1:25:23 | the derivative of that a with respect to the last z, and so on

1:25:25 | and i have these

1:25:27 | funny brackets here just to denote that these are

1:25:32 | Jacobians, so the multivariate chain rule looks the

1:25:37 | same as the scalar one, just that we need to use Jacobians and matrix products

1:25:43 | instead of normal derivatives

1:25:48 | so first of all,

1:25:55 | the

1:25:56 | derivative of c with respect to a:

1:26:01 | this Jacobian is easy to write because c is a scalar and a is a vector, so

1:26:07 | it has elements like these here

1:26:12 | the derivative of a with respect to z

1:26:15 | is just going to be a diagonal matrix, and that is because f is applied

1:26:20 | element-wise

1:26:22 | and the last one, the derivative of z with respect to a,

1:26:26 | is an interesting one:

1:26:28 | if we look at this thing here we will see, maybe after a little bit of thought, that

1:26:33 | this is just the weight matrix

1:26:36 | so then back propagation is:

1:26:39 | okay, we start by calculating the

1:26:42 | derivative of c

1:26:45 | with respect to the last z

1:26:48 | and that's just these two factors

1:26:51 | and then

1:26:52 | we can

1:26:54 | continue:

1:26:56 | to get the derivative of c with respect to some earlier z_i, we just take what

1:27:02 | we already have and multiply it with, for example, these two factors; then we get the

1:27:06 | next one, and so on

1:27:08 | so it's

1:27:10 | a recursive process like that, and at the end we get the derivative of the loss with

1:27:16 | respect to the input. but what we want is of course the derivatives with

1:27:19 | respect to the model parameters, which are the

1:27:22 | biases and the weights

1:27:24 | which we have

1:27:26 | here and here

1:27:29 | those are given by these expressions here

1:27:33 | so

1:27:34 | for the biases it is just this

1:27:38 | second factor down here

1:27:40 | for the weights we multiply that with the corresponding part of the

1:27:49 | activation of the previous layer; so if we are interested in the derivative with

1:27:55 | respect to one particular weight,

1:28:00 | we multiply with the corresponding element of the activation

1:28:07 | okay, so that was a quick refresher on back propagation

1:28:13 | and here are also two really good references for this if you want to go

1:28:19 | further into it
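The review above can be condensed into a small numpy sketch: a two-layer network where the backward pass applies exactly the chain of factors described (delta for the bias gradient, delta times the previous activation for the weight gradient). The function names and the squared-error cost are assumptions for illustration; the gradient can be checked against finite differences.

```python
import numpy as np

def forward(x, params):
    """Two affine layers with a ReLU in between; returns the intermediates
    (the z's and a's the talk refers to) needed by backprop."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0.0)
    z2 = W2 @ a1 + b2
    return z1, a1, z2

def loss_and_grads(x, y, params):
    """Squared-error loss and its gradients via the chain rule."""
    W1, b1, W2, b2 = params
    z1, a1, z2 = forward(x, params)
    loss = 0.5 * np.sum((z2 - y) ** 2)
    d_z2 = z2 - y                       # dC/dz2 (also the bias-2 gradient)
    d_W2 = np.outer(d_z2, a1)           # dC/dW2 = delta times activation
    d_a1 = W2.T @ d_z2                  # propagate back through the weights
    d_z1 = d_a1 * (z1 > 0)              # diagonal Jacobian of the ReLU
    d_W1 = np.outer(d_z1, x)
    return loss, (d_W1, d_z1, d_W2, d_z2)   # bias grads are d_z1, d_z2
```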

1:28:24 | now that we have

1:28:27 | reviewed this,

1:28:28 | i will describe,

1:28:30 | well, a few different issues that i ran into that required some

1:28:36 | little bit of thinking in relation to this

1:28:40 | the first thing is that you see here that in order to calculate the derivatives with respect to the

1:28:47 | weights, you need the outputs of each layer, the a's here

1:28:53 | and so that means that we need to keep all of those in memory from the

1:28:58 | forward pass until we do the backward pass, and if you

1:29:03 | have big batches with many utterances, or also long utterances, this can become too much

1:29:10 | it can go up to many gigabytes, for example for recordings of several minutes

1:29:17 | or larger batches

1:29:20 | so

1:29:23 | both

1:29:24 | Theano and TensorFlow have a built-in way of getting around this

1:29:31 | and that is that when you

1:29:37 | loop over the data,

1:29:39 | you have the option, in the case of TensorFlow as well as in the case of

1:29:43 | Theano, you have the option to discard the

1:29:48 | intermediate outputs from the forward pass, and in the backward pass they

1:29:52 | will then be recalculated when you need them, so you basically just have the

1:29:59 | activations in memory for one loop iteration at a time

1:30:04 | TensorFlow, i have to say, does the same thing a little bit better, because instead

1:30:09 | of discarding the outputs it can swap them to the CPU memory, which is generally bigger

1:30:15 | than the GPU memory

1:30:17 | so

1:30:19 | in that case,

1:30:21 | to use this, we

1:30:25 | loop over the inputs up until the pooling layer, and after the pooling layer we

1:30:31 | put all these

1:30:33 | outputs together, so that we now have a kind of

1:30:40 | tensor with all the embeddings stored

1:30:44 | and then that can be processed normally

1:30:50 | and then you would just calculate the loss and ask for the gradients, or at

1:30:55 | least that is one way to do it; you have to think carefully here

1:31:02 | this of course also has the advantage that we can have segments with different durations

1:31:08 | in one batch, which

1:31:11 | would otherwise be a bit complicated or maybe not even possible

1:31:18 | i'm not showing the code for this

1:31:21 | because it sits among so many other things that it makes it very difficult

1:31:27 | to see what's going on

1:31:30 | i have it in the

1:31:32 | hands-on

1:31:34 | scripts

1:31:36 | but i was hoping to write some small example, and i didn't manage

1:31:42 | to do it in time
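A minimal numpy sketch of the recomputation idea is possible, though: keep only the pooled embeddings from the forward pass, and regenerate each utterance's frame-level activations one at a time during the backward pass. The toy model (one ReLU layer, mean pooling, a quadratic loss) and all names are assumptions for illustration, not the talk's code; looping per utterance also allows different durations in one batch.

```python
import numpy as np

def utt_embedding(X, W):
    """Frame-level ReLU layer followed by temporal mean pooling."""
    h = np.maximum(X @ W, 0.0)
    return h.mean(axis=0), h

def loss_and_grad_checkpointed(utts, W):
    """Backprop with recomputation: the forward pass keeps only the pooled
    embeddings, and the frame-level activations are recomputed one
    utterance at a time in the backward pass."""
    embeddings = np.stack([utt_embedding(X, W)[0] for X in utts])  # h discarded
    loss = 0.5 * np.sum(embeddings ** 2)
    d_emb = embeddings                       # dC/de_u for this toy loss
    d_W = np.zeros_like(W)
    for X, g in zip(utts, d_emb):            # recompute h, utterance by utterance
        h = np.maximum(X @ W, 0.0)
        d_h = np.tile(g / X.shape[0], (X.shape[0], 1)) * (h > 0)
        d_W += X.T @ d_h
    return loss, d_W
```

Peak memory now holds one utterance's activations at a time, at the cost of a second forward pass, which is the trade-off the talk describes for Theano's and TensorFlow's looping constructs.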

1:31:47 | okay, so that's one trick

1:31:50 | a second

1:31:53 | trick is related to parallelization

1:32:03 | so

1:32:06 | suppose that we have some architecture like this: first the feature part, then we

1:32:11 | do the pooling,

1:32:13 | and then we have some processing of the embeddings, and finally the scoring

1:32:19 | now if we want to

1:32:22 | do parallelization, well, normally, if we are training with some multiclass loss,

1:32:27 | it isn't really a problem, because we just distribute the data to different

1:32:31 | workers, each of them calculates some gradients, and we can average the gradients

1:32:35 | or we can average the updated models

1:32:39 | but in this case, since the scoring part, when we use the verification loss,

1:32:45 | in the scoring we would like to have a comparison of all

1:32:50 | possible trials

1:32:51 | so we need to do the

1:32:53 | time delay and pooling things on the individual workers, then send all the embeddings

1:32:59 | to the master, which does the scoring

1:33:03 | then we do back propagation down to the embeddings, and then we send those

1:33:10 | derivatives to each worker

1:33:13 | and they can continue the

1:33:17 | back propagation

1:33:20 | the thing is, this is not exactly the normal situation: normally, when you

1:33:24 | have

1:33:26 | calculated the loss here, then you

1:33:30 | back-propagate down to what is known as the input, which here is just the embeddings,

1:33:37 | so you basically have the derivative of the loss with respect to the embeddings,

1:33:42 | and the question is how to use that to

1:33:47 | continue the back propagation on

1:33:50 | the individual nodes

1:33:56 | one simple trick to do this is to define on the worker a new loss

1:34:01 | like this here. so i define a new loss which is just

1:34:05 | the derivative

1:34:08 | of the cost

1:34:11 | with respect to the embedding elements, which is what we have received

1:34:16 | from the master, held

1:34:18 | constant, times the corresponding embedding elements, summed like this

1:34:23 | and if we now

1:34:26 | optimize this loss we will get

1:34:28 | what we want, because

1:34:31 | let's consider the derivative of this loss with respect to some

1:34:36 | model parameter of the neural network

1:34:39 | okay, we apply the chain rule here,

1:34:41 | just taking the derivative of this;

1:34:45 | here is something that depends on this

1:34:48 | parameter, so we write this, and this is actually exactly the

1:34:53 | term

1:34:54 | of the derivative that we are

1:34:57 | looking for, so

1:35:01 | the derivative

1:35:03 | of this loss with respect to a model parameter will be exactly the same as that

1:35:08 | of the loss that we are interested in

1:35:10 | it is possible that some newer toolkits have

1:35:15 | ways to actually just do this directly, without such a trick, i'm not sure, but

1:35:21 | this was a way to achieve this
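The surrogate-loss trick can be demonstrated numerically. In this toy sketch (a linear extractor and a quadratic master-side cost, both assumptions for illustration), finite differences show that the gradient of the surrogate loss, the received derivative dotted with the embedding, matches the gradient of the true loss with respect to a weight.

```python
import numpy as np

def embedding(W, x):
    """Worker-side part of the network (a toy linear extractor)."""
    return W @ x

def cost(e):
    """Master-side scoring loss on the collected embeddings (toy)."""
    return np.sum(e ** 2)

def surrogate(W, x, d_e):
    """Surrogate loss run on the worker: dot product of the embedding with
    the gradient d_e = dC/de received from the master (held constant).
    Backpropagating it yields the same parameter gradients as the true loss."""
    return d_e @ embedding(W, x)

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
d_e = 2 * embedding(W, x)            # dC/de for this toy cost

# finite-difference gradients of both losses w.r.t. one weight
eps = 1e-6
def fd(f):
    Wp = W.copy()
    Wp[1, 2] += eps
    return (f(Wp) - f(W)) / eps

g_true = fd(lambda Wm: cost(embedding(Wm, x)))
g_surr = fd(lambda Wm: surrogate(Wm, x, d_e))
```

Since `d_e` is a constant on the worker, ordinary autodiff of the surrogate reproduces the chain-rule factor that was cut off at the embeddings, which is exactly the argument in the talk.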

1:35:25 | the

1:35:29 | final trick

1:35:30 | is

1:35:34 | related to

1:35:36 | something we could call repairing saturated ReLU units

1:35:42 | so

1:35:45 | ReLU is a common activation function. let us remember, we have an affine transformation

1:35:51 | followed by an

1:35:54 | activation function, and if it is the ReLU, one problem it has is the

1:36:01 | following:

1:36:03 | whenever the input is below zero, the output of this ReLU will

1:36:09 | be zero, so if all inputs are

1:36:14 | below zero, then this ReLU is basically never outputting anything, and it is

1:36:20 | in effect

1:36:22 | useless

1:36:23 | and there is also the opposite problem: if the input is always above zero,

1:36:27 | then the ReLU is just a linear unit, so we get a weaker model

1:36:34 | if the inputs, on the other hand, happen

1:36:37 | to

1:36:40 | be sometimes

1:36:42 | positive and sometimes negative, then the ReLU is a proper rectified linear unit,

1:36:48 | nonlinear, and

1:36:51 | the network is doing something interesting

1:36:53 | so the way this is usually handled is that the toolkit checks if a ReLU unit has a problem

1:37:00 | like this, and in that case

1:37:03 | it will add

1:37:04 | some small noise

1:37:07 | to, let's say,

1:37:09 | the derivative of c with respect to z

1:37:14 | a problem with doing this in some of the standard neural network toolkits is that we

1:37:19 | don't really, we can't really, we don't have an easy way to manipulate this term

1:37:24 | here,

1:37:26 | which is used in the back propagation

1:37:28 | so instead we will have to manipulate the derivatives with respect to model parameters directly

1:37:35 | and

1:37:38 | seeing

1:37:41 | how

1:37:43 | these derivatives look,

1:37:46 | if we want noise on the deltas, then

1:37:49 | the derivative with respect to the bias is easy: since

1:37:52 | the delta appears there in a single place, we can just

1:37:56 | add the noise to this

1:37:58 | derivative that we normally get from the toolkit

1:38:02 | and similarly for the weights, it is just that we also need to multiply the noise

1:38:08 | with the activations, because that is how that derivative is calculated

1:38:13 | so these were some small tricks, and maybe as a summary i can say

1:38:18 | that it is quite helpful, when you work with neural networks, to

1:38:23 | have the back propagation algorithm fresh in mind, so that you know what's going on

1:38:29 | and then you can easily do small fixes like these

1:38:34 | so that's

1:38:35 | all from the hands-on session. thank you for your attention, and bye