0:00:01hi everyone [inaudible]
0:00:04and i work at
0:00:06brno university of technology and omilia and i will be giving this tutorial on
0:00:13end-to-end speaker verification
0:00:20so the topics
0:00:22to discuss in this tutorial are we'll start with some background and definition of
0:00:28end-to-end training
0:00:30and then discuss some alternative training procedures which are often used
0:00:36and then talk about the motivation for end-to-end training
0:00:40and continue with some difficulties of end-to-end training
0:00:45and then
0:00:46talk about we'll be reviewing some
0:00:50existing work on end-to-end speaker recognition but not in great detail
0:00:57and then we will wrap up with some summary and outlook
0:01:03i would also like to give some acknowledgement and thanks to my colleagues from but
0:01:07and omilia
0:01:10with whom i have
0:01:11discussed these topics a lot
0:01:15so let's start with recognition
0:01:19this is
0:01:21a kind of typical
0:01:23pattern recognition scenario and
0:01:26in supervised learning we assume we have some features x and some labels y
0:01:30and we wish to find some function which is parameterized by
0:01:35theta let's say
0:01:37and which
0:01:38given the
0:01:41features predicts
0:01:42some label or predicted label
0:01:46which should be close or equal to the true label
0:01:53to be more precise
0:01:54we would like the prediction to be such that some loss function which compares the
0:02:00predicted label with the true label is as small as possible on unseen data
0:02:06and the loss function for example if we do classification it can be
0:02:11something that is
0:02:13zero if the predicted label is the same as the true label and one
0:02:16otherwise so this is a kind of error rate
0:02:18in that case
0:02:23of course ideally what we want to do is to
0:02:28minimize the expected loss on unseen test data which we could calculate like this
0:02:36and here we use capital x and y to denote that they are unseen random variables
0:02:41but since we don't know the probability distribution of
0:02:44x and y we cannot do this
0:02:46exactly or explicitly
0:02:51in the supervised learning problem we have access to some training data which would be
0:02:56many examples of features and labels we can call it the training set
0:03:02we check the average loss on the training data and we try to minimize that
0:03:08and then we hope that
0:03:11this procedure here means that we will also get a low loss on
0:03:15unseen test data
0:03:18and this is called empirical risk minimisation
0:03:21and it is expected to work if
0:03:25the classifier that we use is not too powerful
0:03:30to be precise the vc dimension should be finite and it
0:03:34also requires that the distribution of the loss
0:03:38does not have heavy tails but for typical scenarios
0:03:42empirical risk minimisation is expected to work
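A minimal sketch of empirical risk minimisation with the zero-one loss described above. The threshold classifier, the four training points and the grid of candidate parameters are all hypothetical and chosen only for illustration; they are not from the talk.

```python
def zero_one_loss(y_pred, y_true):
    """Return 0 if the predicted label equals the true label, 1 otherwise."""
    return 0.0 if y_pred == y_true else 1.0


def empirical_risk(predict, theta, features, labels):
    """Average loss of predict(x, theta) over the training examples."""
    losses = [zero_one_loss(predict(x, theta), y) for x, y in zip(features, labels)]
    return sum(losses) / len(losses)


def threshold_classifier(x, theta):
    """Hypothetical one-parameter classifier: label 1 when x exceeds theta."""
    return 1 if x > theta else 0


# Toy training set (assumed values).
features = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]

# Minimise the average training loss by a simple grid search over theta.
candidates = [i / 10 for i in range(11)]
risks = [empirical_risk(threshold_classifier, t, features, labels) for t in candidates]
best_theta = candidates[risks.index(min(risks))]
```

The hope expressed in the talk is that a parameter with low average training loss also has low expected loss on unseen data; the sketch only computes the training side of that statement.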
0:03:49so then let's talk about speaker recognition
0:03:53as probably most people
0:03:55in the audience here know we have these three subtasks of speaker recognition
0:04:01it's speaker identification
0:04:04which basically is to classify among a closed set of speakers so this is a very
0:04:11standard pattern recognition
0:04:14scenario and then we have speaker verification where we deal with an
0:04:17open set as we say
0:04:19so the speakers that we may see in testing
0:04:22are not the same as we have access to in training when building the model
0:04:27and our task is typically to say whether two segments or utterances are from the same
0:04:32speaker or not
0:04:34and then there's also speaker diarization which is
0:04:38to assign basically in a long recording each time unit
0:04:43to a speaker
0:04:47so here i will focus on speaker verification because the speaker identification task is
0:04:53quite easy at least conceptually
0:04:57and speaker diarization is hard and end-to-end approaches are still at a very early stage
0:05:03although some great
0:05:05work has been done
0:05:07it's maybe too early to focus on that in a tutorial
0:05:20typically we would like a classifier
0:05:22to output
0:05:23not a hard prediction like it's this class or this class but
0:05:28rather probabilities of the different classes
0:05:31so we would like some
0:05:34classifier that gives an estimate of the probability of some label given the data
0:05:39in the case of speaker verification we would rather prefer it to output the log-likelihood ratio
0:05:46because from that we can
0:05:51get the probability of a class given the data where the classes here are just target or non-target
0:05:57but we can
0:05:59do this based on a specified prior probability
0:06:03so it gives a bit more flexibility in how to use this
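A short sketch of why a log-likelihood ratio is convenient: by Bayes' rule in log-odds form, the posterior probability of the target class is obtained by adding the prior log-odds to the LLR. The function name and the example priors are assumptions made for illustration.

```python
import math


def posterior_target(llr, p_target):
    """P(target | data) from a log-likelihood ratio and a prior P(target).

    Uses Bayes' rule in log-odds form:
    posterior log-odds = LLR + prior log-odds.
    """
    prior_log_odds = math.log(p_target) - math.log(1.0 - p_target)
    posterior_log_odds = llr + prior_log_odds
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))


# The same score can be reused under different application priors.
p_balanced = posterior_target(llr=2.0, p_target=0.5)
p_rare_target = posterior_target(llr=2.0, p_target=0.01)
```

The same score gives a lower posterior when target trials are assumed to be rare, which is the flexibility mentioned in the talk.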
0:06:13now let's talk about end-to-end training
0:06:16and my impression is that it's not completely well defined in the literature
0:06:23but it seems to involve
0:06:26these two aspects
0:06:30first all parameters of the system
0:06:34should be trained jointly and that could be anything from feature extraction to producing some
0:06:38speaker embedding
0:06:40to the backend the comparison of speaker embeddings and producing the score
0:06:46a second aspect is that
0:06:48the system should be trained specifically for the
0:06:51intended task which in our case would be verification
0:06:58one could go even more strict and say that it should match the exact evaluation metric
0:07:02that we are interested in for example in right
0:07:07in this tutorial i will try to
0:07:20discuss how important these criteria are or what it can mean
0:07:24to impose these criteria or what it means if we don't do it
0:07:33let's look at what a
0:07:37typical end-to-end speaker verification architecture
0:07:41would look like
0:07:43as far as i know this was first attempted for speaker verification in two
0:07:47thousand sixteen
0:07:49in the paper mentioned here the model
0:07:54so we start with some
0:07:57enrollment utterances
0:07:59here it's three and we have some test utterance
0:08:02all of these go through some embedding-extracting neural network
0:08:06there are many different architectures there
0:08:09it produces embeddings which are fixed-size
0:08:14utterance representations
0:08:16one for each utterance so three enrollment embeddings and one test embedding
0:08:22and then we will create one enrollment model by some kind of pooling for
0:08:26example taking the mean
0:08:28of the enrollment embeddings
0:08:31and then we have some similarity measure and in the end
0:08:35a score comes out that says
0:08:38the log-likelihood ratio for
0:08:41the hypothesis that this
0:08:43test segment
0:08:44is from the same speaker as these enrollment segments
0:08:50all of these parts of the model should be trained jointly
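A rough sketch of the scoring path described above: enrollment embeddings are pooled by their mean and compared with the test embedding. The talk only says "some similarity measure", so the cosine similarity and the `scale` and `offset` parameters (standing in for trainable calibration that would map the similarity to an LLR-like score) are assumptions, and the embeddings are hypothetical vectors rather than outputs of a real network.

```python
import math


def mean_pool(embeddings):
    """Pool several fixed-size embeddings into one enrollment model (their mean)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verification_score(enroll_embeddings, test_embedding, scale=1.0, offset=0.0):
    """Score a trial: pooled enrollment model versus test embedding.

    scale and offset are hypothetical parameters; in an end-to-end system
    they would be trained together with the embedding extractor.
    """
    model = mean_pool(enroll_embeddings)
    return scale * cosine_similarity(model, test_embedding) + offset


# Three hypothetical enrollment embeddings and two test embeddings.
enroll = [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]]
same_direction = verification_score(enroll, [2.0, 0.0])
orthogonal = verification_score(enroll, [0.0, 3.0])
```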
0:09:02to be fair and maybe for historical interest we should say that
0:09:08this is
0:09:10not a
0:09:12new idea
0:09:15we had it already in nineteen ninety three maybe that's the earliest i'm aware of
0:09:21at least
0:09:22and one paper at the time was about
0:09:26handwritten signature recognition and another paper was about fingerprint recognition
0:09:33but they used exactly this idea
0:09:40okay so we talked about end-to-end
0:09:43training and modeling
0:09:46so what would be the alternative
0:09:49one thing would be
0:09:51generative modeling so we train a generative model
0:09:55meaning a model that can generate the data both the observations x and
0:10:02labels y and it can
0:10:08it can also give us
0:10:10the probability or probability density for such observations
0:10:16we typically train with maximum likelihood and if the model is correctly specified for example
0:10:22if the data really comes from a normal distribution and we have assumed that
0:10:26in our model then
0:10:29with enough training data we will find the correct parameters but the
0:10:33that is no
0:10:35and it's maybe worth pointing out that
0:10:37the llrs from such a model are the best
0:10:40we can have
0:10:43so if we have access to the log-likelihood ratios from
0:10:47the model that really generated the data
0:10:51then we can make optimal decisions for classification or verification no other
0:10:58classifier would do better
0:11:04the problem with this is that when the
0:11:07model
0:11:09assumptions are not correct then the parameters we find with maximum likelihood may not be
0:11:14optimal for classification
0:11:17and sometimes maximum likelihood training is also difficult
0:11:25other approaches would be some type of discriminative training so end-to-end training can be
0:11:30seen as one type of discriminative training but with other discriminative
0:11:36approaches we can for example train the neural network the embedding extractor for speaker
0:11:41identification which seems to be the most
0:11:45popular approach right now
0:11:48and then we will use the output of some intermediate layer as an embedding and train
0:11:54another
0:11:55backend on top of that
0:11:58then there is of course metric learning which
0:12:07means we kind of train the embedding extractor together with a distance metric which sometimes can
0:12:12be simple
0:12:14so in principle the embedding extractor and the distance metric or backend are
0:12:19trained jointly
0:12:21but typically not for the speaker verification task
0:12:24so this is kind of end-to-end training according to the first criterion but not
0:12:28according to the second
0:12:36now we will
0:12:38discuss
0:12:40why end-to-end training would be preferable
0:12:45we had two things one is that we should train modules jointly and the other
0:12:48thing is that we should train for the
0:12:50intended task
0:12:56in the case of joint training it is actually quite obvious let's consider a
0:13:01system consisting of two modules a and b and we have theta a which
0:13:05is the parameters of module a and theta b which is the
0:13:08parameters of module b if we just first train module a and then
0:13:14module b
0:13:15it is essentially like doing
0:13:18one iteration of
0:13:20coordinate descent or block coordinate descent
0:13:22so we train module a
0:13:25and we get here we train module b we get here
0:13:29but we will not get further than that not to the optimum which would be here
0:13:34so of course we could continue
0:13:39do a few more iterations
0:13:40and we might end up in the
0:13:43optimum and this is actually kind of in principle equivalent to joint optimization
0:13:51when we have a kind of non-convex model we may not actually
0:13:55get the same
0:13:57optimum as if we optimized
0:14:00all the parameters in one go it would also depend on which optimizer
0:14:05we used so
0:14:06in principle
0:14:08this is
0:14:12why joint training would
0:14:16really make sure that you find the optimum
0:14:19of both
0:14:20modules and that's clearly better than just training
0:14:25first one and then the other one
0:14:28so i think there is no real argument here
0:14:31this part of end-to-end training is justified
0:14:36the joint training of all modules
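A toy illustration of the block coordinate descent argument, using a hypothetical convex quadratic loss in two scalar parameters standing in for theta a and theta b. The loss function and starting point are assumptions made for this sketch; its joint optimum is at a = b = 1.

```python
def loss(a, b):
    """Hypothetical joint loss of two modules with scalar parameters a and b."""
    return a * a + b * b + a * b - 3 * a - 3 * b


def best_a(b):
    """Minimiser of the loss over a with b held fixed (from d loss / d a = 0)."""
    return (3 - b) / 2


def best_b(a):
    """Minimiser of the loss over b with a held fixed (from d loss / d b = 0)."""
    return (3 - a) / 2


def block_coordinate_descent(a, b, iterations):
    """Alternately train module a, then module b, for a number of iterations."""
    for _ in range(iterations):
        a = best_a(b)
        b = best_b(a)
    return a, b


# Training module a once and then module b once is a single iteration.
one_pass = block_coordinate_descent(0.0, 0.0, iterations=1)
# Repeating the alternation approaches the joint optimum for this convex loss.
many_passes = block_coordinate_descent(0.0, 0.0, iterations=50)
```

For this convex example the repeated alternation reaches the joint optimum, while a single pass stops short of it. As noted in the talk, for a non-convex model the alternation and a joint optimizer need not end at the same optimum.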
0:14:42the task-specific training the idea that we should train for
0:14:51the intended task so if
0:14:55in our application we want to do speaker verification then we should train for verification
0:15:00and not for identification for example
0:15:05first we should say that
0:15:10we have some guarantee that this idea of minimizing loss on training data
0:15:14will lead to good performance on test data the empirical risk minimisation idea
0:15:20and the only guarantee we have there
0:15:26only holds if we are training for
0:15:30the metric that we are interested in the task we are interested in
0:15:35if we
0:15:36train for one task and
0:15:39evaluate
0:15:40on another task we don't really have any guarantee that
0:15:45we find the optimal model parameters for this task
0:15:49but one can of course ask shouldn't it really work anyway training for identification
0:15:55and using the model for verification because they are kind of similar tasks
0:16:00it does as we know
0:16:02but let's just discuss a little bit what could
0:16:05go wrong
0:16:08or why it wouldn't be optimal
0:16:16so here is a kind of toy example
0:16:20we are looking at one-dimensional embeddings so we imagine that these have been
0:16:25or rather the distributions of one-dimensional embeddings
0:16:31so the embedding space is here and each of these colours represents the
0:16:38distribution of embeddings for some speaker so blue is one speaker green is
0:16:42another speaker and so on
0:16:46of course this is a little bit weird
0:16:49shape of the distributions i chose it like that kind of for simplicity
0:16:54so in this kind of example we assume that the means of the
0:16:59speakers are equidistant at equal distance like this
0:17:09what would be the identification error in this case
0:17:13so whenever we observe an embedding we will assign it to the closest speaker
0:17:20if we
0:17:22observe an embedding in this region we will assign it to the blue speaker
0:17:26if we observe it here
0:17:28we will assign it to this
0:17:31this
0:17:32green speaker
0:17:36and of course it means that sometimes it will be the blue speaker
0:17:43something sampled from the blue speaker will be here but we will assign it to
0:17:47the
0:17:48green speaker
0:17:50so we will have some error in this situation
0:17:55if we consider only the neighboring speakers the error rate will be
0:17:59twelve point two percent in this example
0:18:10what would be the verification error rate
0:18:15if we consider
0:18:16this type of data
0:18:19we will assume that we
0:18:21have speakers
0:18:23which are equidistantly distributed
0:18:26like
0:18:27these stars
0:18:30now for a target trial we will sample
0:18:35two embeddings from one speaker
0:18:37and see if they are closer to each other than some threshold
0:18:41which happens to be the optimal threshold for this situation
0:18:46and if
0:18:48they are further apart than that threshold we have an error
0:18:54thank you
0:18:56but i
0:18:58if available
0:19:02the case
0:19:05and for nontarget trials
0:19:11here in this image we could see
0:19:14it would have an error rate of fourteen percent
0:19:17again i'm only actually considering that the non-target trials are from neighboring speakers
0:19:26that's why the error rate is high
0:19:35now i'm only changing the distributions a little bit
0:19:39the within-speaker distributions
0:19:43as before
0:19:45the speaker means are at the same distance
0:19:48like this
0:19:50we have made them a little bit more narrow here the within-speaker distribution a little
0:19:55bit more broad here
0:19:57the overall variance the within-speaker variance is the same but the shape is a little bit different
0:20:02and we will see that the identification error has increased to thirteen point seven percent
0:20:09whereas the verification error is that there
0:20:16more extreme situation we have made them
0:20:19the distributions equally sake or broad
0:20:23do those two mixtures
0:20:26now again the speaker means are all at the same distance
0:20:31like this
0:20:32but the within-speaker variance is
0:20:35well the within-speaker variance is also the same as before
0:20:40and here we would actually get a low
0:20:44identification error
0:20:46but we will have worse
0:20:48verification error than in any of the other examples and it's because
0:20:53if you sample a target trial you will very often have
0:20:57embeddings that are far from each other and similarly
0:21:01for non-target trials you will very often have embeddings that are close to each other
0:21:07so this
0:21:10should illustrate that
0:21:14the within speaker distribution that is optimal for identification is not the same is not
0:21:20necessarily the distribution that is optimal for verification
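A small Monte Carlo sketch in the spirit of the toy examples above. It does not reproduce the distributions or the error rates from the slides. It assumes one-dimensional embeddings, speaker means one unit apart, and a hypothetical two-peak within-speaker distribution with peaks at plus or minus 0.45 around the mean. The verification error is taken as the average of the miss rate and the false-accept rate at the best threshold on a grid, with non-target trials drawn from neighbouring speakers only, as in the talk.

```python
import random

SPACING = 1.0  # assumed distance between neighbouring speaker means


def sample_bimodal(rng, offset=0.45):
    """Within-speaker deviation from the mean: two peaks near the class borders."""
    return offset if rng.random() < 0.5 else -offset


def identification_error(sample, rng, n=5000):
    """Fraction of embeddings that are closer to a neighbouring mean than their own."""
    errors = sum(abs(sample(rng)) > SPACING / 2 for _ in range(n))
    return errors / n


def verification_error(sample, rng, threshold, n=5000):
    """Average of miss rate and false-accept rate for a distance threshold."""
    # Target trials: two embeddings of the same speaker, so the means cancel.
    misses = sum(abs(sample(rng) - sample(rng)) > threshold for _ in range(n))
    # Non-target trials: embeddings of two neighbouring speakers.
    false_accepts = sum(
        abs(SPACING + sample(rng) - sample(rng)) <= threshold for _ in range(n)
    )
    return 0.5 * (misses / n + false_accepts / n)


rng = random.Random(0)
id_err = identification_error(sample_bimodal, rng)
# Pick the best threshold on a coarse grid between 0.05 and 1.95.
ver_err = min(verification_error(sample_bimodal, rng, t / 20) for t in range(1, 40))
```

Under these assumptions every embedding stays on its own side of the class border, so identification makes no errors, yet no distance threshold separates target from non-target trials perfectly.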
0:21:27okay so
0:21:29as another example
0:21:31let us consider triplet loss which is another popular
0:21:40criterion
0:21:42so it looks like this
0:21:44for each training example you have
0:21:48an embedding for some speaker which we call the anchor embedding
0:21:52and then you have an embedding from the same speaker which we call the positive
0:21:55example and an embedding from another speaker which we call the
0:21:59negative example
0:22:00and basically we want the distance between the anchor and the positive example to be small
0:22:07and the distance between the anchor and the negative example
0:22:11to be big
0:22:15if this distance is bigger than
0:22:18this plus a margin
0:22:20then this loss is gonna be zero
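A minimal sketch of the triplet loss as described above, in its common hinge form with plain (not squared) Euclidean distances. The choice of distance and the margin value are assumptions; the talk does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss.

    Zero when the anchor-negative distance exceeds the anchor-positive
    distance by at least the margin, positive otherwise.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Negative far enough away: the loss is zero.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0], margin=0.5)
# Negative as close as the positive: the loss equals the margin.
violated = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], margin=0.5)
```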
0:22:27this is not
0:22:29an ideal criterion for speaker verification and to show this i have a
0:22:34rather complicated figure here that illustrates
0:22:40three speakers
0:22:41and the embeddings of three speakers in a
0:22:45two-dimensional space
0:22:47so we have
0:22:48speaker a
0:22:50with embeddings
0:22:52distributed in this area
0:22:55speaker b with embeddings in this area and speaker c with embeddings in
0:22:59this area
0:23:04if we are using some anchor from speaker a the worst case would be
0:23:08to have it here on the border
0:23:10and then the biggest distance to a positive example would be to have it
0:23:15here on the other side
0:23:17and the smallest distance to a negative example would be to take
0:23:21something here
0:23:23so simply we want this
0:23:27distance to the positive example
0:23:30here plus some margin to be smaller than the distance from the
0:23:35negative example to the anchor
0:23:37so it's okay
0:23:38in this situation
0:23:41consider then speaker c which has a bigger
0:23:46distribution of data now if we have an anchor here
0:23:51we need the
0:23:53distance to the next speaker the closest speaker to be
0:23:58bigger than the internal distance
0:24:00plus some margin
0:24:03and that's the case in this figure so the triplet loss is completely fine with
0:24:08this situation
0:24:10but if we want to
0:24:12do
0:24:13verification on data that is distributed in this way then we should
0:24:21well if we want to have good
0:24:24performance on target trials from speaker c
0:24:27we need to accept
0:24:30trials as target trials whenever we have a smaller distance than this otherwise we will
0:24:34have some errors for target trials of speaker c
0:24:38but this means that if we have a threshold like this here we will have
0:24:42confusion between
0:24:45speaker a and b
0:24:49again of course there could be ways to compensate for this in one way or another but
0:24:53it's just to show that this kind of
0:24:57metric is not
0:24:58gonna lead to optimal
0:25:00performance for verification
0:25:06so if we try to summarise a little bit about the idea of task-specific training
0:25:12minimizing identification error doesn't necessarily minimize verification error
0:25:18but of course i was showing this on kind of toy examples and the reality
0:25:22is much more complicated
0:25:25we usually don't optimize classification error but rather the cross entropy
0:25:29or something like that
0:25:31and we may use some loss to encourage margin
0:25:36between the speaker embeddings
0:25:39and maybe the assumptions that i made about the
0:25:42distributions here are
0:25:44not completely realistic
0:25:53so it's maybe not completely clear
0:25:56what would happen with new test speakers that were not in the training set
0:26:01so one thing to say is that this should not be interpreted as
0:26:05some kind of proof that other objectives would fail maybe they would even be
0:26:09really good
0:26:12it's just to say that it's not really
0:26:17completely justified to use them
0:26:20and this is of course something that ideally should be studied much more
0:26:24in future
0:26:31and so we discussed that end-to-end training has some good motivation
0:26:39but still it's not really the most popular strategy for building speaker recognition systems today
0:26:46at least my impression is that multiclass training is
0:26:50still the most popular
0:26:55why is that well there are many difficulties with end-to-end training
0:26:59it seems
0:27:01to be
0:27:02more prone to overfitting
0:27:05we have issues with statistical dependence of training
0:27:08trials which we will go more into detail on
0:27:12on the next slides
0:27:16it's also maybe questionable how the system should be trained
0:27:21when we want to
0:27:23use many enrollment utterances this will also be mentioned a bit
0:27:28but one
0:27:35the issue
0:27:36one of the issues with using a cane of verification objective let's call it that
0:27:41when we are comparing draw
0:27:43two utterances and wondered say whether it's the same speaker or not
0:27:48is that
0:27:51the day that
0:27:52we e
0:27:54statistical independence i mean same y
0:27:57well you know minutes about
0:27:59so this is
0:28:01generally this idea of minimizing some training loss assumes that
0:28:07the training data
0:28:09are independent samples from whatever distribution they come from
0:28:14and this is often the case i mean we have data that has been independently collected
0:28:21but in speaker verification
0:28:23the data
0:28:28is a pair of an enrollment utterance and a test utterance and the label is
0:28:34indicating whether it's a target trial or a non-target trial
0:28:38so for notation i will use
0:28:41y equal one for target trials and y equal minus one for nontarget trials
0:28:46the issue here is that
0:28:49typically at least if we have a limited amount of training data
0:28:53we create
0:28:54many trials
0:28:56from the same speaker and the same utterance so each of the speakers and utterances
0:29:01is used in many different trials and then this
0:29:05data is not
0:29:06these trials which are the training data
0:29:10are not
0:29:12statistically independent
0:29:14which is something that the training procedure assumes they are
0:29:22this can be a problem exactly how big the problem is
0:29:25i think is still something that needs to be investigated more but let's elaborate a little
0:29:30bit about what happens
0:29:38here i wrote just the training objective that we would use for a
0:29:43kind of verification loss when we train the system for verification
0:29:48so it looks
0:29:49complicated maybe but it's not really anything special it's just the average training loss of
0:29:56target trials here and the average training loss of
0:30:00non-target trials here and they are weighted with a factor
0:30:05probability of target trials and probability of non-target trials which are
0:30:10some parameters that we use to
0:30:14steer the system to fit
0:30:15better for the application that we are interested in
0:30:19and again
0:30:22what we hope is that this would minimize the expected loss of
0:30:28target trials and non-target trials
0:30:32weighted with these
0:30:34probabilities of target trials and non-target trials
0:30:38on some unseen data
0:30:40this loss function here is often the cross entropy but could be other things
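A sketch of the training objective described above: the average loss over target trials and the average loss over non-target trials, weighted by an assumed target prior. The exact formula on the slide is not visible in the transcript, so the per-trial loss used here (a plain logistic loss on a log-odds score) and all names are assumptions; other losses can be passed in via `loss`.

```python
import math


def logistic_loss(score, label):
    """Binary cross-entropy on a log-odds score; label is +1 (target) or -1."""
    return math.log1p(math.exp(-label * score))


def verification_objective(scores, labels, p_target=0.5, loss=logistic_loss):
    """Prior-weighted sum of the average target and average non-target loss."""
    target = [loss(s, 1) for s, y in zip(scores, labels) if y == 1]
    nontarget = [loss(s, -1) for s, y in zip(scores, labels) if y == -1]
    return (p_target * sum(target) / len(target)
            + (1.0 - p_target) * sum(nontarget) / len(nontarget))


# Hypothetical scores: well-separated trials give a lower objective.
good = verification_objective([5.0, -5.0], [1, -1])
bad = verification_objective([-5.0, 5.0], [1, -1])
```

Because each class is averaged separately, changing `p_target` shifts the emphasis between misses and false accepts without changing the training trials themselves.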
0:30:49so what are the desirable properties of a training objective
0:30:59we have
0:31:00r hat which is the
0:31:03empirical risk or training loss
0:31:07since the training data
0:31:09can be assumed to be generated from some probability distribution this r hat is also
0:31:14a random variable
0:31:18and we want this
0:31:20to be close
0:31:21to the
0:31:23expected loss
0:31:29where the expectation is calculated according to the true probability distribution of the data
0:31:35and for every value of
0:31:37theta because
0:31:39in that case
0:31:47the expected loss is this black line here
0:31:57well let's say
0:31:59we have some training set the blue one
0:32:02and we check the average loss as a function of theta
0:32:06it may look like this
0:32:09for another training set it may look like this the red line and the third one
0:32:13would be
0:32:14the purple one so the point is that it's a little bit random and
0:32:17it's not gonna be exactly like the expected loss
0:32:22but ideally it should be close to this one because if we find a theta
0:32:26that minimizes the training loss for example here in the case of the
0:32:29red training set
0:32:32we know that okay it will be also a good value for the
0:32:38expected loss which means the loss on unseen test data
0:32:43so we want the
0:32:47training loss
0:32:48as a function of the model parameters
0:32:53to be close to the expected loss for all values of theta
0:33:03in order to study the effect of
0:33:08statistical dependences in the training data in this context
0:33:12we write the
0:33:14training objective slightly more generally than before
0:33:20it's the same as before but just that for each trial
0:33:23we have a weight beta
0:33:25and if we set beta to one over n then it would be the
0:33:30same as before but now we consider that we can choose some other value of
0:33:37the weights
0:33:38in the training data
0:33:39training trials
0:33:44we want
0:33:45the training objective so the average training loss to have an expected value which is the
0:33:52same as the expected value
0:33:56of the loss on test data so it should be an unbiased estimator of
0:34:03the test loss or the expected loss
0:34:07and we also want it to be good in the sense that it has
0:34:10a small variance
0:34:18well the expected value of the training loss is just calculated like this so we
0:34:23end up with the expected value of the loss
0:34:26and this is exactly r
0:34:28what we usually denote r
0:34:30so in order for this to be
0:34:32unbiased we simply want the sum of the weights to be one
0:34:39and of course this would be the case when we use the standard choice of
0:34:45beta which is one over n the number of trials
0:34:49in the training data
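A compact restatement of the unbiasedness argument above. The symbols are notation assumed here rather than copied from the slides: $\ell_i(\theta)$ is the loss of training trial $i$, $N$ is the number of trials, and every trial loss is assumed to have the same expectation $R(\theta)$, as within one class of trials.

```latex
\hat{R}(\theta) = \sum_{i=1}^{N} \beta_i\,\ell_i(\theta),
\qquad
\mathbb{E}\big[\hat{R}(\theta)\big]
  = \sum_{i=1}^{N} \beta_i\,\mathbb{E}\big[\ell_i(\theta)\big]
  = R(\theta)\sum_{i=1}^{N} \beta_i
```

So the weighted training loss is an unbiased estimate of $R(\theta)$ whenever the weights sum to one, which the standard choice $\beta_i = 1/N$ satisfies.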
0:34:53the variance
0:34:55of this empirical loss
0:34:58is gonna look like this
0:34:59it's the
0:35:00weight vector or for all the trials
0:35:03and so on the matrix
0:35:06times the weight vector
0:35:09and this matrix is the covariance matrix for the loss of all trials with the
0:35:14with this little t so that easy the one for the target trials or
0:35:18minus one for the non-target trials
0:35:21and one could derive that
0:35:23the optimal
0:35:24choice of
0:35:26he does that would minimize this variance
0:35:29and i look like this
0:35:36so this is what we can call them you training objective
0:35:40a best linear
0:35:42unbiased estimate
0:35:44that's the meaning of you so this is the best linear unbiased estimate of
0:35:50test loss
0:35:51using the training data to estimate what
0:35:53well the test loss would be
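A sketch of the best linear unbiased estimate weights. The formula on the slide is not visible in the transcript, so this uses the standard result for minimising the variance beta^T C beta subject to the weights summing to one, namely beta proportional to C^{-1} 1. The covariance matrix below is hypothetical, and the dependence on the trial label t mentioned in the talk is ignored here.

```python
import numpy as np


def blue_weights(cov):
    """Weights minimising beta^T cov beta subject to sum(beta) == 1."""
    ones = np.ones(cov.shape[0])
    unnormalised = np.linalg.solve(cov, ones)  # cov^{-1} 1 without an explicit inverse
    return unnormalised / unnormalised.sum()


def estimator_variance(beta, cov):
    """Variance of the weighted training loss for a given weight vector."""
    return float(beta @ cov @ beta)


# Hypothetical covariance of three trial losses: the first two trials are
# strongly correlated, the third is independent of both.
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

beta_blue = blue_weights(cov)
beta_uniform = np.full(3, 1.0 / 3.0)
var_blue = estimator_variance(beta_blue, cov)
var_uniform = estimator_variance(beta_uniform, cov)
```

In this example the two correlated trials each receive less weight than the independent one, and the resulting variance is lower than with uniform weights.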
0:36:00a detail about this is that we don't really need the covariance between the losses but
0:36:05rather the correlation
0:36:08if we assume the diagonal elements in the covariance matrix are equal
0:36:14then it turns out like this
0:36:18and in practice we would assume that the
0:36:22elements in this covariance matrix do not depend on theta which
0:36:26could be questioned
0:36:32the objective that we discussed is not really specific to speaker verification
0:36:37whenever you have
0:36:39dependences in the training data you could
0:36:42use this idea
0:36:43but
0:36:45the structure of the covariance matrix
0:36:49which describes the covariances of the losses of the training data
0:36:54that depends on the specific problem that you're studying
0:36:58so now we will look into how to
0:37:01create such a matrix for speaker verification
0:37:06so here
0:37:07we will use
0:37:09x i to denote the
0:37:12i-th utterance of speaker x
0:37:16so we will assume that the
0:37:19correlation coefficients
0:37:21depend on what the trials have in common so for example
0:37:24here we have a
0:37:26trial of speaker a utterance one and speaker a utterance two and some loss of
0:37:31that and then also speaker a utterance one and speaker a
0:37:36utterance three and some loss of that
0:37:38and they have some correlation
0:37:40because
0:37:42they involve the same speaker
0:37:45so we assume there is a correlation
0:37:48coefficient denoted c
0:37:50at least eight here
0:37:52so in total we have these kinds of situations in verification if we consider target trials
0:38:00there you could have the situation that
0:38:02well okay let's look here
0:38:05two target trials which have one utterance in common this is a target trial
0:38:10of speaker a
0:38:11and here we have utterance one and utterance two and here you have utterance one and
0:38:15utterance three so as utterance one is used in both
0:38:17trials there is some correlation between these trials
0:38:22here there is no common utterance but the speaker is still the same and this is as
0:38:26opposed to this situation where
0:38:28you have a
0:38:30trial of speaker a and a trial of speaker b they have nothing in common
0:38:34so we assume here the correlation is zero
0:38:37for such trials
0:38:39for the non-target trials you have a more complicated situation but all possible situations are listed here
0:38:47for example
0:38:48you may have that
0:38:50you have one
0:38:54utterance in common
0:38:58so we have this utterance in common and in addition to that
0:39:02this speaker is in common that's what i mean with this notation here
0:39:08and so on
0:39:14and if we have such
0:39:18correlation coefficients we can derive the optimal weights for
0:39:24a speaker with this many utterances
0:39:27it's gonna look like this
0:39:32the exact form is maybe not so important but
0:39:34we should note that one could
0:39:37derive
0:39:38how to
0:39:39give weight to a speaker and it depends on how many utterances
0:39:44the speaker has
0:39:47for the non-target trials the formula is more complex
0:39:51it would depend on if the trial involves speaker a and speaker b it
0:39:55depends on how many
0:39:56utterances each speaker has
0:40:03then comes the issue of how to estimate the correlation coefficients one could look at the empirical correlation
0:40:09of some trained model
0:40:12or we could
0:40:14learn them somehow
0:40:16which we will mention briefly later or we can just make some assumption and
0:40:21tune it so for example one simple assumption is to set
0:40:25this correlation coefficient of target trials to alpha and this one which we assume
0:40:30should be smaller to alpha squared
0:40:32and then
0:40:35tune alpha in some range and similarly for the non-target trials
0:40:44just to get some idea of how this would change the weights
0:40:49for target trials
0:40:51we see here that this is the number of utterances for the speaker
0:40:56on the y-axis here we have the corresponding weights
0:41:02and for different values of these correlations so if the correlation is
0:41:07small
0:41:11even when we have many utterances up to twenty here we will still give a reasonable
0:41:16weight to each utterance
0:41:19but if the correlation is large
0:41:22then we will not give so much weight to
0:41:25each utterance when a speaker has many utterances
0:41:29which means that the total
0:41:32weight for this speaker is not gonna increase much even if it has a
0:41:35lot of utterances
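A sketch of how the weight of one speaker's target trials could behave as the number of utterances grows. It does not reproduce the closed-form expression from the slides. It assumes unit-variance losses, zero correlation between trials of different speakers, correlation alpha between two target trials that share an utterance, and alpha squared between two target trials of the same speaker with no shared utterance. Under these assumptions every row of the speaker's correlation block has the same sum, so the unnormalised best linear unbiased weight of each trial is one over that row sum.

```python
def target_trial_weights(n_utts, alpha):
    """Unnormalised per-trial and total weights for one speaker's target trials.

    The weights still have to be normalised over all speakers to sum to one.
    """
    n_trials = n_utts * (n_utts - 1) // 2      # all utterance pairs of the speaker
    n_sharing = 2 * (n_utts - 2)               # other trials sharing one utterance
    n_disjoint = n_trials - 1 - n_sharing      # same speaker, no common utterance
    row_sum = 1.0 + n_sharing * alpha + n_disjoint * alpha ** 2
    per_trial = 1.0 / row_sum
    return per_trial, n_trials * per_trial


# Total (unnormalised) weight of a speaker with 20 utterances, i.e. 190 trials.
_, total_small_corr = target_trial_weights(20, alpha=0.01)
_, total_large_corr = target_trial_weights(20, alpha=0.5)
```

With a small correlation the speaker's total weight grows almost in proportion to the number of trials. With a large correlation it stays within a small factor of the weight of a speaker that contributes a single trial, which matches the qualitative behaviour described in the talk.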
0:41:48in the past i was exploring a little bit how large
0:41:52these kinds of correlations really are
0:41:55this was on an i-vector system with PLDA scores
0:42:01here in the first
0:42:05in this
0:42:07column here
0:42:08it's a
0:42:09PLDA model trained with the EM algorithm, and then the scores were calibrated with
0:42:14affine calibration
0:42:18and the other column here is for discriminatively trained PLDA
0:42:22so the main thing to observe here is that we
0:42:25do have
0:42:26correlations between trials that have for example an utterance in common and so on
0:42:32and the correlations can be quite large in some situations
0:42:38so these
0:42:40problems seem to exist
0:42:44and doing this kind of correlation compensation, and this is again on the
0:42:49discriminative
0:42:50PLDA
0:42:57it does help a bit
0:43:05so it's something
0:43:11to possibly take into account
0:43:13of course this was for discriminative PLDA, where we train a
0:43:17PLDA model
0:43:18using all the trials
0:43:21that can be constructed from the training set, but of course the same
0:43:25problem with the dependence exists also in end-to-end systems
0:43:40now some problems that we could encounter if we tried to do this
0:43:45well first, the
0:43:47results or the
0:43:50compensation formulas that we derived
0:43:52were assuming that
0:43:54all trials
0:43:55that can be created from the training set are used equally often, which is the
0:43:58case if you train a backend like PLDA
0:44:02discriminatively and you use all the trials
0:44:05when we
0:44:08train a kind of end-to-end system involving neural networks
0:44:14we use mini-batches. so one could achieve this situation by
0:44:20making a
0:44:21list of trials
0:44:25and then we just sample trials from it. okay, here is a trial, this speaker
0:44:29compared to this one, another trial is this speaker compared to this one, and so on
0:44:33and this is a
0:44:34long list of all trials that can be formed, and then we just
0:44:41select some of them into the mini-batch
0:44:44the point is of course that if we have these speakers like this
0:44:47in the mini-batch and we compare this one with this one
0:44:50this one with this one and so on
0:44:53we are not using all the trials that we have
0:44:56we are for example not comparing this one with this one in the mini-batch
0:45:01and that's maybe a bit of a waste, because we are anyway using this deep
0:45:06neural network to produce the embeddings, and so once we have
0:45:12produced the embeddings we can just as well use all of them in the scoring part
0:45:15as well
0:45:17but then
0:45:17we will have a little bit different
0:45:22balance of the trials
0:45:24globally compared to what we had before
0:45:27so the formulas that we derived wouldn't be exactly valid in this situation
0:45:36the question then is, if we do decide that all the segments
0:45:42that we extract embeddings for
0:45:44that we have in the mini-batch, if we want to use all of them
0:45:48in the scoring, how are we going to select
0:45:52the data for the mini-batch
0:45:54there can be different strategies here
0:45:57we could consider for example
0:46:00strategy A
0:46:03select some speakers
0:46:05and then for each speaker we take all the segments that they have
0:46:08let's say that this red speaker has
0:46:11three segments and this yellow speaker has
0:46:14four segments
0:46:17and then
0:46:21we can consider all trials, so we can have
0:46:26segment one of the red speaker scored against segment two, segment one scored against segment
0:46:30three, and so on
0:46:33we don't use the diagonal because we don't consider
0:46:39segments scored against themselves
0:46:42and the score here is just the same as here
0:46:46scoring segment two
0:46:48against segment one
0:46:52this would be one way. another way would be strategy B
0:47:00select speakers, but then just select two utterances for each speaker in the mini-batch
0:47:10you will have just one target trial for each speaker
0:47:14the difference here is that
0:47:16we have
0:47:17we are going to have
0:47:19fewer target trials
0:47:21overall in the mini-batch, but all of them will be from different speakers and
0:47:24we will have target trials from more speakers
0:47:31it's not exactly clear what would be the right thing, but some little bit informal experiments
0:47:36we have done
0:47:37suggest that strategy B is better
0:47:46then again the formulas that we derived before for how to weight trials are not completely
0:47:51valid, they were not derived under the assumption that we are doing it like this
0:47:55so they are not
0:48:00exact and they need to be modified a bit, and i will come to that
0:48:03in a minute
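The two mini-batch strategies can be sketched as follows. This is a minimal illustration and not the code used in the talk: the mapping `spk2segs` (speaker to list of segment ids) and all function names are hypothetical. It also shows how all pairwise trials are formed from the segments in the batch, skipping the diagonal and keeping only one of the two symmetric scores.

```python
import random
from itertools import combinations


def sample_strategy_a(spk2segs, n_speakers, rng):
    """Strategy A: pick speakers, then take every segment of each one."""
    speakers = rng.sample(sorted(spk2segs), n_speakers)
    return [(spk, seg) for spk in speakers for seg in spk2segs[spk]]


def sample_strategy_b(spk2segs, n_speakers, rng, per_speaker=2):
    """Strategy B: pick speakers, then a fixed number of segments from each."""
    eligible = sorted(s for s, segs in spk2segs.items()
                      if len(segs) >= per_speaker)
    speakers = rng.sample(eligible, n_speakers)
    return [(spk, seg) for spk in speakers
            for seg in rng.sample(spk2segs[spk], per_speaker)]


def all_trials(batch):
    """All unordered segment pairs in the batch, without self-comparisons.

    Returns (segment_a, segment_b, is_target) tuples. Only one of (a, b)
    and (b, a) is kept because the score is assumed to be symmetric.
    """
    return [(seg_a, seg_b, spk_a == spk_b)
            for (spk_a, seg_a), (spk_b, seg_b) in combinations(batch, 2)]


if __name__ == "__main__":
    toy = {"red": ["r1", "r2", "r3"], "yellow": ["y1", "y2", "y3", "y4"]}
    rng = random.Random(0)
    for name, sampler in (("A", sample_strategy_a), ("B", sample_strategy_b)):
        trials = all_trials(sampler(toy, 2, rng))
        n_target = sum(is_target for _, _, is_target in trials)
        print(f"strategy {name}: {len(trials)} trials, {n_target} target")
```

For the red and yellow speakers from the example, strategy A uses all seven segments, while strategy B uses two per speaker and therefore yields exactly one target trial per speaker.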
0:48:07the second problem that can occur in end-to-end training
0:48:12with respect to these issues is that
0:48:17we do want
0:48:19to use
0:48:20well, we do want to have a system that can deal with multi-session enrollment
0:48:24and
0:48:26of course multi-session trials can be incorporated
0:48:30or can be handled with end-to-end systems, as we discussed initially
0:48:36by having some pooling over the enrollment utterances
0:48:40but how to create the training data is again a little bit difficult
0:48:48already in the case of single-session trials we had a complicated situation with many
0:48:54different kinds of dependencies that can occur and so on, and in the multi-session case
0:48:59it's going to be even more
0:49:01complicated because you can have situations like
0:49:08for example these two could be the enrollment and this is the test, and another
0:49:12trial where
0:49:13these two are the enrollment
0:49:15and this is the test, then you have one utterance in common here
0:49:19or you can have a more extreme situation where both enrollment utterances
0:49:24in two trials are the same but the test utterance is different
0:49:27so the number of possible dependencies that can occur is way more complex
0:49:32and i think it's
0:49:33very difficult to derive some kind of formula for how the trials should be weighted
0:49:41so to deal both with the fact that we're using mini-batches
0:49:46and with the multi-session trials, and to estimate proper trial weights
0:49:52for that, maybe one strategy can be to learn them. this is not something
0:49:56i tried, i just think it's
0:49:57something that maybe should be tried
0:50:01so we can define
0:50:02a training loss
0:50:04again as an average of losses over the training data with some weights
0:50:09and we also use a development loss
0:50:14which is an average over
0:50:16another set, the average of losses over the development set
0:50:22and these weights here should depend only on the number of utterances of the speaker
0:50:31or speakers involved in the trial
0:50:35then one can imagine some scheme like this
0:50:39we send both training and development data through the neural
0:50:44network and we get some
0:50:47training loss and some
0:50:49development loss
0:50:53as usual we estimate the
0:50:56gradient, here we take the gradient with respect to the model parameters
0:51:03for the training loss
0:51:05and the
0:51:06gradient is now a function of the trial weights
0:51:11and we can update
0:51:13the model parameters, still keeping in mind that their new values are a function
0:51:18of the
0:51:21trial weights
0:51:23the training trial weights
0:51:25and then
0:51:27we can calculate
0:51:28on the development set
0:51:31the gradient
0:51:33with respect to these training weights
0:51:36and then
0:51:37use this to update
0:51:40the training trial weights
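The scheme above was presented as an untested idea, so the following is only a toy sketch of its mechanics under strong simplifying assumptions: a one-parameter logistic model stands in for the neural network, there is one free weight per training trial (the talk suggests tying weights to the number of utterances of the speakers involved), and all names are hypothetical. It shows how the development loss after one weighted update can be differentiated with respect to the training weights.

```python
import math


def loss_and_grad(theta, x, y):
    """Logistic loss log(1 + exp(-y*theta*x)) and its derivative in theta."""
    margin = y * theta * x
    loss = math.log1p(math.exp(-margin))
    grad = -y * x / (1.0 + math.exp(margin))
    return loss, grad


def dev_loss(theta, dev):
    """Unweighted average loss on the development trials."""
    return sum(loss_and_grad(theta, x, y)[0] for x, y in dev) / len(dev)


def updated_theta(theta, weights, train, lr):
    """One weighted gradient step; the result is a function of the weights."""
    step = sum(w * loss_and_grad(theta, x, y)[1]
               for w, (x, y) in zip(weights, train))
    return theta - lr * step


def weight_gradients(theta, weights, train, dev, lr):
    """Derivative of the post-update development loss w.r.t. each weight."""
    theta_new = updated_theta(theta, weights, train, lr)
    dev_grad = sum(loss_and_grad(theta_new, x, y)[1] for x, y in dev) / len(dev)
    # theta_new = theta - lr * sum_i w_i * g_i, so d(theta_new)/d(w_i) = -lr * g_i
    return [-lr * loss_and_grad(theta, x, y)[1] * dev_grad for x, y in train]


def update_weights(weights, grads, lr_w):
    """Gradient step on the trial weights, kept non-negative."""
    return [max(0.0, w - lr_w * g) for w, g in zip(weights, grads)]
```

In a real system the same chain rule would be applied by automatic differentiation through the parameter update rather than by hand.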
0:51:46a second thing
0:51:49to explore
0:51:51or like a final note on this
0:51:57statistical dependence issue is that
0:52:00we just
0:52:02discussed some ideas for balancing the training trials for better optimization
0:52:08but for example in the case when all speakers have the same
0:52:12number of utterances
0:52:14this rebalancing has no effect
0:52:17still of course the dependencies are there, so one would think shouldn't we
0:52:20do something more than just rebalance the training data
0:52:24and one possibility that i think would be worth exploring
0:52:29is to
0:52:34assume the following
0:52:37the covariance of
0:52:39the scores of two
0:52:42trials of a speaker
0:52:45which have
0:52:45one utterance
0:52:47in common should be bigger than
0:52:49the covariance between two trials
0:52:51of this
0:52:52speaker which have
0:52:54no utterance in common
0:52:56which should be bigger than the covariance between
0:53:02target trials of different speakers, this should be zero actually
0:53:06so one could consider to regularize the model to behave in that way
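The assumed ordering can be written compactly. The notation here is introduced only for illustration and does not come from the slides: $s(u,v)$ is the score of the trial between utterances $u$ and $v$, the utterances $a_1,\dots,a_4$ belong to speaker A, and $b_1, b_2$ belong to a different speaker B.

```latex
\operatorname{Cov}\bigl(s(a_1,a_2),\, s(a_1,a_3)\bigr)
\;\ge\;
\operatorname{Cov}\bigl(s(a_1,a_2),\, s(a_3,a_4)\bigr)
\;\ge\;
\operatorname{Cov}\bigl(s(a_1,a_2),\, s(b_1,b_2)\bigr) = 0
```

A regularizer along these lines would penalize a model whose training scores violate this ordering.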
0:53:14so now
0:53:17after discussing the issues with
0:53:21end-to-end training
0:53:23i will briefly mention some of the
0:53:27papers
0:53:29or some papers
0:53:32on end-to-end
0:53:33training, and this should not be considered as a kind of literature review or as
0:53:38describing the best architectures or anything like that
0:53:42it is
0:53:45just a few selected papers that illustrate some points
0:53:53some of which give some good take-away messages about end-to-end training
0:53:59so this paper called end-to-end text-dependent speaker verification was, as far as i know,
0:54:04the first paper on end-to-end training in speaker verification
0:54:09and it uses a network like this, or some architecture like this. the features go in
0:54:14through
0:54:15a neural network and in the end
0:54:21this network is going to say
0:54:24is it the same
0:54:26speaker or not
0:54:28the important thing here is that
0:54:32the input is of fixed size
0:54:36so the input to the neural network has the feature dimension times the number of
0:54:45frames, the duration that is
0:54:48and there was no temporal pooling, which is
0:54:52done in many other situations
0:54:55and this is suitable
0:54:58when you do text-dependent speaker verification as they did in this paper
0:55:02because this means that
0:55:05the network is kind of aware of the word and phoneme order
0:55:11i would say that the main conclusion from this paper is that
0:55:15the verification loss was better than the identification loss
0:55:19especially when you have big amounts of training data. for small amounts of training
0:55:24data there was
0:55:25not as big a difference
0:55:28and one can also say that t-norm could
0:55:32to a large extent make these two
0:55:35losses, or the models trained with these two losses, more similar
0:55:42but i would still say that this kind of suggests that the verification loss is beneficial
0:55:48if you have large amounts of training data
0:55:55so this is another paper
0:55:59where they were doing
0:56:01text-independent speaker verification, and here
0:56:05the difference from the other is that they do have a temporal pooling layer
0:56:12that would kind of remove the dependence on the order of the input
0:56:17to some extent at least, and is maybe a more suitable architecture for text-
0:56:22independent speaker verification
0:56:25and this was compared to an i-vector PLDA baseline, and it was found
0:56:30that a really large amount of training data is needed even to beat something like a
0:56:36PLDA system
0:56:44and this is
0:56:46some study that we did, and
0:56:51it was
0:56:52also again text-independent speaker recognition or verification
0:56:58but trained on a smaller amount of data, and to make it work we instead constrained
0:57:04this neural network here, this big end-to-end system, to behave
0:57:08something like an
0:57:10i-vector and PLDA baseline. so we constrained it not to be too
0:57:16different from the
0:57:18i-vector PLDA baseline
0:57:23we found there that training the model blocks jointly with the verification loss was improving
0:57:33as can be seen here
0:57:36a little bit regrettably, we didn't separate
0:57:40clearly whether that improvement came from the fact that we were doing joint training
0:57:45or the fact that we were
0:57:50using the verification loss
0:57:55another interesting thing here is that
0:57:59we found that
0:58:00training with the verification loss requires very large batches
0:58:05and this was an experiment done only on the
0:58:09scoring part, the discriminative PLDA
0:58:12so if we train the discriminative PLDA with
0:58:16BFGS using full batches
0:58:21so not a mini-batch
0:58:24training scheme
0:58:26we achieve some loss
0:58:28like this on the development set
0:58:31in this dashed
0:58:33blue line
0:58:34whereas if we train with adam with mini-batches of different sizes, from ten
0:58:39up to five thousand
0:58:41we see that we need really big batches to actually
0:58:45get close to the BFGS-
0:58:48trained model, which was trained on full batches
0:58:50so that kind of suggests that you really need to have many trials
0:58:55within the mini-batch for
0:58:59training this kind of
0:59:02system with a verification loss, which is a bit of a problem and maybe a
0:59:06challenge to deal with
0:59:07in future
0:59:12this is a more recent paper, and the interesting point of this paper was that
0:59:17they did train the whole system
0:59:20all the way from the waveform instead of from features as the other papers
0:59:29but
0:59:31i couldn't
0:59:33understand completely whether the improvement came from the fact that they were
0:59:37training from the waveform or if it was because of
0:59:41the choice of architecture and so on
0:59:45but it's interesting that
0:59:48systems going
0:59:49all the way from the waveform to the end
0:59:53can work well
0:59:58and this is a paper
1:00:00from this year's
1:00:02interspeech. it's interesting because
1:00:08it's one of the more recent studies that really showed some good
1:00:13performance of using the verification loss
1:00:17here it was a joint
1:00:20or multi-task training, so they were training using both the identification loss
1:00:24and the verification loss
1:00:28and that's actually something i have tried too, without getting any
1:00:32benefit from it, but one thing they did here was to
1:00:36start with a large weight for the identification loss and then gradually
1:00:40increase the weight for the verification loss. i think this is interesting and maybe
1:00:47actually the right way to go
1:00:49i'm curious about it
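The idea of starting with most of the weight on the identification loss and gradually moving it to the verification loss can be sketched like this. The linear shape of the schedule, its end points, and the function names are assumptions made for illustration; the schedule actually used in the paper is not given in the talk.

```python
def loss_weights(step, total_steps, ver_start=0.0, ver_end=1.0):
    """Return (w_id, w_ver) for the current step of a linear schedule.

    The verification weight moves linearly from ver_start to ver_end and
    the identification weight is whatever remains, so the two sum to one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    w_ver = ver_start + frac * (ver_end - ver_start)
    return 1.0 - w_ver, w_ver


def combined_loss(id_loss, ver_loss, step, total_steps):
    """Interpolate between the two loss values according to the schedule."""
    w_id, w_ver = loss_weights(step, total_steps)
    return w_id * id_loss + w_ver * ver_loss
```

Interpolating like this is easiest to reason about when both losses are on a comparable scale.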
1:00:55now comes just a little bit of summary of this talk
1:00:59we discussed the motivation for end-to-end training
1:01:06we said that it has some good motivation
1:01:10we showed
1:01:13or we referred to some
1:01:16experimental results, ours and also others'
1:01:19which show that it seems to work quite well for text-dependent tasks with large amounts
1:01:24of training data
1:01:27in such a case it's probably preferable to preserve the temporal structure, to avoid
1:01:33the temporal pooling
1:01:35in text-independent benchmarks one would need strong regularization or a mixed
1:01:42training objective in order to benefit from
1:01:45end-to-end training, and typically we would want to do some temporal pooling there
1:01:54one could guess that end-to-end training would be the preferable choice in scenarios where we
1:02:00have many training speakers with few utterances each, where we have less of the statistical dependence issue
1:02:09this is something that to me seems to be an open question, and which would be
1:02:14great if someone would explore
1:02:19it is difficult actually to train end-to-end systems, especially for the text-independent
1:02:24task
1:02:25this is because of overfitting, training convergence, and this dependency issue we discussed
1:02:32it's not really clear, i would say
1:02:36a practical question is how to adapt such systems, because in these more blockwise systems we
1:02:43would often adapt the backend
1:02:45or could we train the system in a way that we don't need adaptation
1:02:53and also how could we input some human knowledge about speech into this training, if
1:02:58we need it
1:03:00something we know about the data distribution or number of phonemes or so
1:03:07and we discussed that maybe
1:03:12training a model for speaker identification is not ideal for speaker verification, but is there
1:03:18some way to
1:03:21find embeddings that are good for all these tasks
1:03:27another interesting question is
1:03:32how well
1:03:34the LLRs that come from
1:03:36end-to-end architectures
1:03:39actually could simulate the true LLR
1:03:44so in other words what kind of
1:03:49distributions could be
1:03:51arbitrarily accurately simulated or modeled by these architectures
1:03:57that is not completely clear either
1:04:00okay so
1:04:02thank you for your attention
1:04:05bye bye
1:04:10hello this is johan and now i will present the hands-on session for the
1:04:19end-to-end speaker verification tutorial
1:04:26e these informal do not work well i don't know book
1:04:31well i'm not really run cold war used rate it let's see
1:04:37i mean "'cause"
1:04:40what i want to talk about is
1:04:43two things. first
1:04:44i will go through the code that i am using for
1:04:48most of my experiments
1:04:53after that i will mention a few tricks to solve various
1:04:58implementation issues
1:05:02that i have used
1:05:06okay so
1:05:11the code for the end-to-end system. so this is code that i started to work
1:05:17on during my forestalled a but from to those in sixteen the person time t
1:05:23initially it was in theano but now it is in tensorflow
1:05:30and i guess it is
1:05:32time to switch to an updated tensorflow or to something like pytorch
1:05:41the link to the repository is here
1:05:44and most stuff in this repository is no and is more most states there are
1:05:52for multiclass training, well mostly i use multiclass training, maybe in
1:05:59combination with other stuff
1:06:01but the
1:06:03there is not so much
1:06:05that uses
1:06:07end-to-end training with the verification loss
1:06:11the papers were actually based on old code
1:06:15in theano, and i think there is not so much point to
1:06:19maintain that any more
1:06:23but i do have one script here that uses the verification loss in
1:06:29combination with the identification loss, so that's the script we will look at
1:06:37and generally
1:06:41i will first try to point out things in this code that
1:06:46i think worked well, and also mention
1:06:51what i would
1:06:52do differently
1:06:54to maybe give so
1:06:57well at least i can say from like stressful as good an allpass time
1:07:03small toolkit for speaker verification
1:07:09i note that i didn't see any benefit here from adding the verification
1:07:15loss to the identification loss
1:07:18contrary to the paper i mentioned in the tutorial
1:07:23it could be that this quite complicated scheme for changing the balance between the losses
1:07:30throughout the training is really needed, this may be something i should look at some time
1:07:41and this screen i think units
1:07:45you want to try to instances where only you know little normal way
1:07:49the in the local but you don't want running in the not here unique feel
1:07:54a little bit with the intention because
1:07:58right in
1:08:00cantonese in such a way that it's
1:08:02here three but
1:08:07some small adjustment might be needed if you actually want to run it here
1:08:18i tried, when organising my experiments, to do it in the way that
1:08:25there is one script where everything that is specific to the experiment is set, so that
1:08:32includes which data to use and the configuration of the model and so on
1:08:38i found it
1:08:42inefficient to have
1:08:44input arguments to the script for which features to use and so on, because anyway you
1:08:50would always have to change something in the script
1:08:56for a new experiments are then you can just as long routine often a and
1:09:02so on
1:09:03a wrestler
1:09:06but other things that are a little bit more fixed from experiment to experiment are
1:09:12just loaded from this script
1:09:15such as models, different architectures and so on
1:09:25so usually i use this underscore for denoting tensorflow variables, underscore v for placeholders
1:09:34and so on
1:09:36the kind of
1:09:40models are
1:09:43similar to keras models, but maybe a little bit less
1:09:49fancy
1:09:58i didn't use keras initially because when i started with this years ago
1:10:03keras was not flexible enough
1:10:11but now it is definitely flexible enough
1:10:31for example here is this is five where features are things that
1:10:37things maybe some one would think is that are those on a you all remember
1:10:43about a
1:10:45since it is anyway necessary to change things in this script for every experiment, i prefer to
1:10:51set things here
1:10:53so here is the training data
1:10:56how long the shortest and longest segments that we train on are
1:11:03some other parameters related to training: batch size
1:11:08maximum number of epochs
1:11:12number of batches in an epoch, so i don't really define
1:11:18an epoch as one pass over the data but define it as a certain number of batches
1:11:23i will explain it in a minute
1:11:29also patience, probably most of you are familiar with it
1:11:34you train or
1:11:35what it is score
1:11:37so the next part of the script is the part for defining how to load
1:11:46and prepare data
1:11:48and here one important point is that
1:11:53in the batches we will
1:11:56use chunks of features from different utterances, so randomly selected segments
1:12:05if you load data from a normal hard disk and randomly select different segments from
1:12:13different utterances
1:12:15this will be too slow
1:12:21so often
1:12:23you can would meeting it is time varying or case at a time or can
1:12:28compare a
1:12:29many lashes
1:12:33so what i do in all my experiments is to keep the
1:12:40data on an SSD, and then it can be loaded as you wish, feature chunks can
1:12:47be loaded randomly fast enough for that
1:12:50so this is
1:12:53good because it allows for much more flexibility in experiments, for example sometimes
1:12:59you may want to load two segments from the same utterance
1:13:06for some experiments
1:13:10or sometimes you just want to change the duration of the segments
1:13:18if you use archives then you have to prepare new archives for this
1:13:22so i would say that
1:13:24using an SSD
1:13:28and then just loading features as you go is
1:13:33a very good thing, and an SSD is really a good thing to invest in if
1:13:38you want to
1:13:39do this kind of experiments
1:13:44i define some functions for example low fee training process given some
1:13:52given and some list of finals this one we load the data and that could
1:13:59so if you want remotes parcels these batteries specifically as long again
1:14:05define that here, but if you want to do for example the thing
1:14:08i mentioned, to load two segments from the same utterance, then you would
1:14:13have to change the function here
1:14:15so this was quite a
1:14:18useful way of organising it, for me at least, in my experiments
1:14:26i also another important thing in this for easter creates on dictionary a religious train
1:14:32is sixty four conversation other missionaries of for example a closest eager not be
1:14:38and thus to fine off a thing and the law
1:14:43and that's
1:14:46created here
1:14:51and he's
1:14:55mappings are used to create mini-batches
1:14:59and a little bit later down here i create a generator for mini-batches, and
1:15:04it takes this dictionary of
1:15:09mappings, utterance to speaker mapping and so on, and i have different generators depending
1:15:15on what kind of mini-batches i want. for example you may want
1:15:19randomly selected speakers and all their data, or you may want randomly selected speakers
1:15:24and for example two utterances each or something like that
1:15:30so that's easy to change by changing the generator
1:15:39then the next step is to
1:15:42set up the model
1:15:44and here i'm using
1:15:47a TDNN architecture like the x-vector one
1:15:52and i also have a PLDA model
1:15:57on top to score the embeddings from this
1:16:03or text editors still called
1:16:09we should
1:16:11do kind of verification
1:16:19one minor difference from the kaldi architecture is that i found it necessary
1:16:24to have some kind of normalization layer after the temporal pooling, batch norm or just
1:16:32a fixed normalization estimated on the data in the beginning works fine
1:16:36as well
1:16:38i guess the reason this is needed here could be that we use a simpler optimizer
1:16:44we use just stochastic gradient descent, as compared to kaldi which uses a more
1:16:48sophisticated one
1:16:54so in this conan columns
1:17:00the definition of the architecture, like number of layers, their sizes
1:17:07activation functions
1:17:09and so on
1:17:11whether we should have normalization of the features, normalization after the pooling
1:17:18and whether they these
1:17:27or you don't face of the data being initial last
1:17:37options for regularization and so on
1:17:41we initialize the model here and we provide
1:17:47when we do this the iterator or the generator for the
1:17:52training data, and this is used to initialize the normalisation layers of the model
1:17:58this is something that creates a bit of a mess and i probably would do it
1:18:04differently if i were writing it anew
1:18:09maybe some knowingly initialization and that's around a few
1:18:18before starting the training you just
1:18:21initialize the layers, the normalisation layers
1:18:30if we apply this model to the data which is in these placeholders
1:18:41what comes out will be the embeddings and the classifications
1:18:47and
1:18:49the embeddings, in this particular case, we will send to
1:18:55the PLDA model
1:18:58basically here
1:18:59we make some settings for the
1:19:05probabilistic LDA model, we can get the scores
1:19:10for all pairwise comparisons
1:19:14in the batch and also the loss for that if we provide
1:19:18labels for it
1:19:22so the next part is to define loss and train functions and so on
1:19:31we have the loss as a weighted combination of the identification loss and
1:19:36the verification loss
1:19:38well here's in binary and their average fits weights in the original one point five
1:19:44and still one seventy five respectively
1:19:47and maybe one important thing here is that these losses are normalized
1:19:54by
1:19:58the log of the number of speakers, in the case of the classification loss
1:20:05i mean the loss of a random classification, a random guess
1:20:09and the reason to do this is
1:20:13if the model is just initialized and makes just random predictions, then the loss
1:20:18will be one or approximately one
1:20:21and we do the same thing for the verification loss
1:20:28this means that all these losses are scaled in a similar way and it
1:20:33becomes easier to choose how to interpolate between them
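The normalization described here can be sketched as below. This is an illustration and not the code from the repository: function names are hypothetical, and the verification loss is normalized by log 2, which assumes that an uninformed system outputs a same-speaker probability of 0.5 (the talk only says that the verification loss is treated in the same way).

```python
import math


def normalized_id_loss(probs, labels):
    """Average multiclass cross-entropy divided by log(number of classes).

    probs holds one list of class probabilities per example. A model that
    predicts the uniform distribution gets a normalized loss of one.
    """
    n_classes = len(probs[0])
    ce = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    return ce / math.log(n_classes)


def normalized_ver_loss(p_same, is_target):
    """Average binary cross-entropy divided by log(2).

    p_same holds the predicted same-speaker probability per trial. A model
    that always outputs 0.5 gets a normalized loss of one.
    """
    bce = -sum(math.log(p if t else 1.0 - p)
               for p, t in zip(p_same, is_target)) / len(is_target)
    return bce / math.log(2.0)


def total_loss(id_loss, ver_loss, w_id, w_ver):
    """Weighted combination of the two normalized losses."""
    return w_id * id_loss + w_ver * ver_loss
```

Because both terms are close to one for an untrained model, the interpolation weights directly express the relative importance of the two objectives at the start of training.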
1:20:40and at the end of this part of the script we define a training function which takes
1:20:45the data for a batch and does one update of
1:20:50the model
1:20:53the next part is for
1:20:59defining functions for setting parameters and getting parameters of the model
1:21:07and a function to
1:21:10check some kind of validation loss after each epoch
1:21:21so this part is just for setting parameters and getting parameters
1:21:28maybe not so important
1:21:31and we define a
1:21:34function for checking the validation loss here
1:21:39finally the training is done by this
1:21:41function here, which takes the
1:21:45function for
1:21:47checking the validation loss, and takes many other parameters
1:21:51and things that we defined
1:22:03for example the function for training and so on
1:22:07so the way we train here is basically
1:22:12to train for an epoch, which was defined as
1:22:16a certain number of batches
1:22:20and this is because we don't really have epochs, we just continuously
1:22:24draw random segments
1:22:26this is
1:22:29so there is no really clear idea of what
1:22:33an epoch is
1:22:35but anyway
1:22:38we do training, and if it doesn't improve the validation loss we
1:22:46try a few more times, up to patience number of times, and if it still fails
1:22:51to improve then we will
1:22:54reset the parameters to the best ones and reduce the learning rate
1:23:00i don't know if this is the best
1:23:02scheme but it has worked well enough for me
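A training loop with patience of the kind described can be sketched as follows. All names and default values are hypothetical and the details (for example the factor by which the learning rate is reduced) are assumptions; the callables stand in for the training, validation, and parameter get/set functions mentioned in the talk.

```python
def train_with_patience(train_one_epoch, validation_loss, get_params,
                        set_params, lr, patience=3, lr_factor=0.5,
                        min_lr=1e-5, max_epochs=100):
    """Train epoch by epoch, backing off when validation stops improving.

    An "epoch" is whatever train_one_epoch(lr) does, for example a fixed
    number of mini-batches. After `patience` epochs in a row without a new
    best validation loss, the best parameters are restored and the learning
    rate is multiplied by lr_factor.
    """
    best_loss = validation_loss()
    best_params = get_params()
    fails = 0
    for _ in range(max_epochs):
        train_one_epoch(lr)
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_params, fails = loss, get_params(), 0
            continue
        fails += 1
        if fails >= patience:
            set_params(best_params)  # go back to the best model so far
            lr *= lr_factor
            fails = 0
            if lr < min_lr:
                break
    set_params(best_params)
    return best_loss, lr
```

`get_params` is expected to return a copy of the parameters, since the loop keeps it as a snapshot while training continues.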
1:23:14yes for the whole piece
1:23:18going on, i would like to mention a few tricks
1:23:23not very complicated things
1:23:25but they were maybe slightly difficult for me to figure out
1:23:31they are related to backpropagation and the things i wanted to modify there
1:23:41so let's just first briefly review the backpropagation algorithm
1:23:52we know that the neural network is just
1:23:55some series of affine transformations, each followed by a nonlinearity, then again an affine transformation and again a nonlinearity
1:24:01and so on
1:24:03so for some input we apply an affine transformation
1:24:09that gives z here, and then we apply some nonlinearity
1:24:14that gives a, and we do that over and over
1:24:18until we get the final
1:24:22output, and then we have some cost function
1:24:26on that, for example cross entropy
1:24:29and if we use function composition notation, this ring here basically means that
1:24:34the composition of g and h is just like
1:24:39applying h on the data and then g, then we know that we can
1:24:43write the whole neural network as
1:24:46applying the first affine transformation on the input
1:24:50next the first nonlinearity
1:24:54all the way
1:24:55to the output
1:24:57it can be written like this
1:24:59and it is also easy to write the
1:25:04gradient of the
1:25:06loss with respect to the input using the chain rule
1:25:11so it's just
1:25:13basically the derivative of C with respect to the input is just
1:25:19a chain like this, the derivative of C with respect to a times the derivative of a with respect to
1:25:23z and so on
1:25:25so i have this
1:25:27funny thing with brackets here just to denote that these are
1:25:32jacobians, so the multivariate chain rule looks the
1:25:37same as the scalar one, just that we need to use jacobians instead of
1:25:43normal derivatives
1:25:48so forceful
1:25:56relative lc with respect to a
1:26:01this is i criterion is really right because it's a vector so
1:26:07when all these elements like these here
1:26:12the derivative of a with respect to z
1:26:15is just going to be a diagonal matrix, because f is a function that is applied
1:26:20element-wise
1:26:22and the other one, the derivative of
1:26:26z with respect to a
1:26:28if we look at this thing here we will see, maybe after a little bit of thought, that
1:26:33this is just the weight matrix
1:26:36so then back propagation is
1:26:39okay we start by calculating the
1:26:42derivative of c
1:26:45with respect to a i
1:26:48and that's just these two
1:26:51and then
1:26:52we can
1:26:54continue with
1:26:56getting the derivative of c with respect to some other a i by just taking the
1:27:02one that we have and multiplying, for example, with these two, then we get the
1:27:06next one and so on
1:27:08so it's
1:27:10a recursive process like that, so that gives us the derivative of the loss with
1:27:16respect to the input, but what we want is of course the derivative with respect to the
1:27:19model parameters, which is the
1:27:22biases and the weights
1:27:24which we have
1:27:26here and here
1:27:29those are given by these expressions here
1:27:34for the biases it is just this
1:27:38the second one here
1:27:40for the weights we multiply with the corresponding part of the weight matrix
1:27:49sorry, with the corresponding part of the
1:27:55activation a, here we are interested in the derivative with respect to this so
1:28:00we need to multiply with the corresponding part of this
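As a rough illustration of the recursion just described, here is a minimal NumPy sketch. It is not the speaker's code: the ReLU activation, the squared-error cost, and all function names are assumptions chosen to keep the example small.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Return activations a[0..L] (a[0] is the input) and pre-activations z[0..L-1]."""
    a, z = [x], []
    for W, b in zip(weights, biases):
        z.append(W @ a[-1] + b)      # affine transformation
        a.append(relu(z[-1]))        # element-wise activation
    return a, z

def cost(a_last, target):
    # Assumed cost for this sketch: squared error on the last activation.
    return 0.5 * np.sum((a_last - target) ** 2)

def backward(a, z, weights, target):
    """Backpropagate dC/da from the output down to every weight matrix and bias."""
    dC_da = a[-1] - target                       # derivative of the cost w.r.t. the last activation
    grads_W, grads_b = [], []
    for i in reversed(range(len(weights))):
        dC_dz = dC_da * (z[i] > 0)               # diagonal Jacobian of the element-wise ReLU
        grads_b.append(dC_dz)                    # for the bias it is just dC/dz
        grads_W.append(np.outer(dC_dz, a[i]))    # for the weights, multiply with the layer input a
        dC_da = weights[i].T @ dC_dz             # Jacobian of z w.r.t. the previous a is the weight matrix
    return grads_W[::-1], grads_b[::-1]

# Small usage example with random data.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
x, target = rng.standard_normal(3), rng.standard_normal(2)
a, z = forward(x, weights, biases)
grads_W, grads_b = backward(a, z, weights, target)
```

Note that `backward` needs every `a[i]` from the forward pass, which is exactly the memory issue discussed next.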
1:28:07okay so no i'm talking about when we are fresh test
1:28:13and here are also two really good references for this if you want to go
1:28:19further into it
1:28:24now that we have
1:28:27mentioned this
1:28:28i would say
1:28:30there are a few different issues that i ran into that require some
1:28:36little bit of thinking in relation to this
1:28:40the first thing is that you see here that in order to calculate the derivatives with respect to the
1:28:47weights you need the outputs of each layer, the a here
1:28:53and so that means that we need to keep all of those in memory from the
1:28:58forward pass, we need to keep them in memory until we do the backward pass, and if you
1:29:03have big batches, many utterances, and also long utterances, this can become too much
1:29:10it can go up to many gigabytes for utterances of several minutes for example
1:29:17or for larger batches
1:29:24there is one sensible way of getting around this
1:29:31and that is that you
1:29:37loop over the data
1:29:39then you have the option, in the case of
1:29:43theano, you have the option to discard the
1:29:48intermediate outputs from the forward pass, and then in the backward pass you
1:29:52will recalculate them when you need them, so you basically just have them
1:29:59in memory for one utterance at a time
1:30:04tensorflow has the same thing but a little bit better, because there you have the option
1:30:09not to discard them but to move them to the cpu memory, which is generally bigger
1:30:15than the gpu memory
1:30:19so in that case
1:30:21in order to use this we
1:30:25loop over the inputs up until the pooling layer, and after the pooling layer we
1:30:31put all these
1:30:33embeddings together so that we now have a kind of
1:30:40tensor with all the embeddings stored
1:30:44and then that can be processed normally
1:30:50and then you would just calculate the loss and ask for the gradient, though you
1:30:55have to think carefully about this
1:31:02this of course also has the advantage that we can have segments of different durations
1:31:08which would otherwise make things
1:31:11a bit complicated or maybe not even possible
1:31:18i'm not showing the code
1:31:21because it does so many other things as well, which makes it very difficult
1:31:27to see what's going on
1:31:30i have it does not seventeen
1:31:36i was hoping to write some small example but i didn't manage
1:31:42to do it in time
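The loop-and-recompute idea can be sketched in plain NumPy. This is only an illustration under assumptions, not the speaker's code: a single frame-level ReLU layer followed by mean pooling stands in for the network, the pairwise logistic loss is made up for the example, and a real toolkit would do the recomputation through its own loop and checkpointing facilities.

```python
import numpy as np

def embed(X, W, b, keep=False):
    """Frame-level ReLU layer followed by mean pooling over frames.
    X has shape (frames, feat_dim); the number of frames may differ per utterance."""
    Z = X @ W.T + b
    e = np.maximum(Z, 0.0).mean(axis=0)
    return (e, Z) if keep else e        # Z is only handed back when explicitly requested

def pair_loss(E, labels):
    """Assumed logistic loss over all pairs of embeddings, with its gradient dC/dE."""
    C, G = 0.0, np.zeros_like(E)
    for i in range(len(E)):
        for j in range(i + 1, len(E)):
            y = 1.0 if labels[i] == labels[j] else -1.0
            s = E[i] @ E[j]
            C += np.log1p(np.exp(-y * s))
            d = -y / (1.0 + np.exp(y * s))   # dC/ds for this pair
            G[i] += d * E[j]
            G[j] += d * E[i]
    return C, G

def loss_and_grads(utts, labels, W, b):
    # Pass 1: loop over the utterances and keep only the embeddings.
    E = np.stack([embed(X, W, b) for X in utts])
    C, G = pair_loss(E, labels)
    # Pass 2: recompute the frame-level outputs one utterance at a time.
    gW, gb = np.zeros_like(W), np.zeros_like(b)
    for X, g in zip(utts, G):
        _, Z = embed(X, W, b, keep=True)
        dZ = (Z > 0) * g / len(X)       # back through mean pooling and the ReLU
        gW += dZ.T @ X
        gb += dZ.sum(axis=0)
    return C, gW, gb

# Utterances of different durations are handled without padding.
rng = np.random.default_rng(0)
utts = [rng.standard_normal((T, 5)) for T in (7, 11, 5, 9)]
labels = [0, 0, 1, 1]
W, b = 0.5 * rng.standard_normal((4, 5)), np.zeros(4)
C, gW, gb = loss_and_grads(utts, labels, W, b)
```

At any point only one utterance's frame-level outputs are held in memory, at the price of running the frame-level part twice.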
1:31:47okay so that's one trick
1:31:50a second
1:31:53trick is related to parallelization
1:32:06suppose that we have some architecture like this, we have features as input and then we
1:32:11have pooling
1:32:13and then we have some processing of the embeddings and finally scoring
1:32:19now if we want to
1:32:22well normally if we want to do parallelization when we are training with some multiclass loss
1:32:27it is not really a problem, because we just distribute the data on different
1:32:31workers, each of them calculates some gradients, and we can average the gradients
1:32:35or we can average the updated models
1:32:39but in this case, since we have this scoring part when we use the verification loss
1:32:45in the scoring we would like to have a comparison of all trials, all
1:32:50possible trials
1:32:51so we need to compute
1:32:53the embeddings on the individual workers, then send all the embeddings
1:32:59to the master where we do the scoring
1:33:03now we do back propagation up to the embeddings and then we send those
1:33:10derivatives to each worker
1:33:13and they can continue the
1:33:17the back propagation
1:33:20the thing is, this is not exactly the same as the normal case when you
1:33:26calculate the loss here and then you do
1:33:30back propagation all the way to the inputs, which is just a matter of asking for the derivatives
1:33:37here you basically have the derivative of the loss with respect to the embeddings
1:33:42and have to use that to
1:33:47continue the back propagation on
1:33:50the individual nodes
1:33:56one simple trick to do this is to define a new loss
1:34:01like this here, so i define a new loss which is just
1:34:05this, the derivative of
1:34:08c, the cost
1:34:11with respect to the embedding elements, which is what we have received
1:34:16from the master node
1:34:18times the embedding, or just like a dot product like this
1:34:23and if we now
1:34:26optimize this loss we will get
1:34:28what we want, because
1:34:31let's consider the derivative of this loss with respect to some
1:34:36model parameter of the neural network
1:34:39okay we apply the derivative here
1:34:41just take this derivative in here
1:34:45the only thing here that depends on
1:34:48theta is the embedding, so we arrive at this here, and this is exactly the
1:34:54derivative that we are
1:34:57looking for, so
1:35:01the derivative
1:35:03of this loss with respect to a model parameter will be exactly the same as for
1:35:08the loss that we are interested in
1:35:10it is possible that some newer toolkits have
1:35:15a way to just do this without using some trick, i'm not sure, but
1:35:21this was a way to achieve this
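The equivalence claimed for this surrogate loss can be checked numerically. The sketch below is an assumption-laden illustration, not the speaker's setup: the one-layer tanh extractor, the all-pairs squared cost on the "master", and every name in it are hypothetical, and finite differences stand in for a toolkit's automatic differentiation.

```python
import numpy as np

def embeddings(theta, X):
    """Hypothetical embedding extractor that runs on a worker."""
    return np.tanh(X @ theta)

def master_cost_and_grad(E, T):
    """Hypothetical all-pairs cost C = 0.5 * ||E E^T - T||^2 and its gradient dC/dE
    (valid because the target matrix T is symmetric)."""
    R = E @ E.T - T
    return 0.5 * np.sum(R ** 2), 2.0 * R @ E

def numerical_grad(f, theta, eps=1e-6):
    """Central finite differences, standing in for the toolkit's autodiff."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        g[idx] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta = 0.3 * rng.standard_normal((5, 3))
X1, X2 = rng.standard_normal((2, 5)), rng.standard_normal((4, 5))  # data on two workers
labels = np.array([0, 0, 1, 1, 2, 2])
T = (labels[:, None] == labels[None, :]).astype(float)  # 1 for same-speaker trials, else 0

# Workers send their embeddings to the master, which scores all trials.
E = np.vstack([embeddings(theta, X1), embeddings(theta, X2)])
C, G = master_cost_and_grad(E, T)
G1, G2 = G[:2], G[2:]   # derivatives sent back to each worker, treated as constants

# Each worker differentiates its own surrogate loss sum(G_k * E_k(theta)).
grad_surrogate = (numerical_grad(lambda th: np.sum(G1 * embeddings(th, X1)), theta)
                  + numerical_grad(lambda th: np.sum(G2 * embeddings(th, X2)), theta))

# Reference: differentiate the real cost through all embeddings at once.
grad_true = numerical_grad(
    lambda th: master_cost_and_grad(
        np.vstack([embeddings(th, X1), embeddings(th, X2)]), T)[0],
    theta)
```

The two gradients agree up to finite-difference error, provided the received derivatives are held fixed while each worker differentiates its surrogate.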
1:35:29final trick
1:35:34related to
1:35:36something that kaldi does, repairing saturated relu units
1:35:45relu is the activation function, so let us remember we have an affine transformation
1:35:51followed by an
1:35:54activation function, and if it is the relu, the problem is the
1:36:01following
1:36:03whenever the input is below zero this relu will, well whenever the
1:36:09input is below zero the relu will output zero, so if all inputs are
1:36:14below zero then this relu is basically never outputting anything, it is a dead unit
1:36:23and there is also the opposite problem, if the input is always above zero
1:36:27then the relu is just a linear unit, so we really want
1:36:34the inputs to
1:36:40be sometimes
1:36:42positive and sometimes negative, then the relu is behaving
1:36:48nonlinearly and
1:36:51the network is doing something interesting
1:36:53so how kaldi handles this is that it checks if a relu unit has a problem
1:37:00like this and in that case
1:37:03they will add
1:37:04some little offset
1:37:07to this
1:37:09to the derivative of c with respect to z
1:37:14a problem with doing this in some of the standard neural network toolkits is that we
1:37:19don't really have an easy way to manipulate this thing
1:37:26which is used in the back propagation
1:37:28so we will have to manipulate the derivatives with respect to the model parameters directly
1:37:43from these relations
1:37:46that we have from before
1:37:49the derivative with respect to b is just
1:37:52the derivative with respect to z, so this is achieved in a simple way, you can just
1:37:56add the offset to this
1:37:58derivative here that we usually get from the toolkit
1:38:02and similarly for the weights, just that we also need to multiply the
1:38:08offset with a, because that's how the derivative is calculated
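A minimal sketch of this repair step, applied directly to the parameter gradients, might look as follows. The thresholds, the offset size, the sign convention (plain gradient descent), and the function name are all assumptions made for the example; they are not taken from kaldi or from the speaker's code.

```python
import numpy as np

def repair_relu_grads(gW, gb, Z, A, low=0.05, high=0.95, offset=1e-3):
    """Add the effect of a small offset on dC/dz to the parameter gradients.

    Z:  (batch, units) pre-activations of the ReLU layer
    A:  (batch, inputs) inputs to the layer, i.e. the previous activations
    gW: (units, inputs) and gb: (units,) gradients summed over the batch,
        as they would come out of the toolkit.
    """
    active = (Z > 0).mean(axis=0)        # fraction of the batch where each unit is active
    delta = np.zeros(Z.shape[1])
    delta[active < low] = -offset        # rarely active: descent will push z up
    delta[active > high] = offset        # almost always active: descent will push z down
    gb_new = gb + len(Z) * delta                     # dC/db is the sum of dC/dz over the batch
    gW_new = gW + np.outer(delta, A.sum(axis=0))     # for the weights the offset is multiplied with a
    return gW_new, gb_new

# Usage with random data: unit 0 never fires, unit 1 always fires.
rng = np.random.default_rng(0)
Z = rng.standard_normal((20, 3))
Z[:, 0] = -np.abs(Z[:, 0]) - 0.1
Z[:, 1] = np.abs(Z[:, 1]) + 0.1
A = rng.standard_normal((20, 4))
gW, gb = np.zeros((3, 4)), np.zeros(3)
gW_repaired, gb_repaired = repair_relu_grads(gW, gb, Z, A)
```

The result is the same as if the offset had been added to dC/dz inside back propagation, without needing access to that intermediate quantity.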
1:38:13so these were some small tricks, and maybe as a summary i can say
1:38:18that it is quite helpful when you work with neural networks to
1:38:23understand back propagation properly so that you know what's going on
1:38:29and then you can easily do small fixes like this
1:38:34so that's
1:38:35all from the hands-on session, thank you for your attention and bye