| 0:00:15 | all right, i'll talk about multi-class discriminative training of i-vector language recognition this morning | 
|---|
| 0:00:21 | i'm alan mccree from johns hopkins university | 
|---|
| 0:00:23 | and i'd like to acknowledge some interesting discussions during this work with | 
|---|
| 0:00:27 | my current colleague daniel, my previous colleagues doug and elliot | 
|---|
| 0:00:31 | doug and pedro, sorry, and more recently with niko | 
|---|
| 0:00:37 | so | 
|---|
| 0:00:39 | as an introduction | 
|---|
| 0:00:41 | you guys know i think we had one discussion this morning that | 
|---|
| 0:00:45 | language id using i-vectors is a state-of-the-art system | 
|---|
| 0:00:49 | what i wanna talk about is some particular aspects of it. it's typically done as | 
|---|
| 0:00:54 | a two-stage process where, even after we've | 
|---|
| 0:00:58 | got the i-vectors, first we build a classifier | 
|---|
| 0:01:01 | and then we separately build a backend which does the calibration and perhaps fusion as | 
|---|
| 0:01:06 | well. so i wanna talk about two aspects that are a little different from that | 
|---|
| 0:01:10 | first i wanna talk about what if we try to have one system that does | 
|---|
| 0:01:14 | the discrimination | 
|---|
| 0:01:15 | the classification and the calibration all at once using discriminative training. nobody ever said we have | 
|---|
| 0:01:21 | to use two systems back to back; why not do it all together | 
|---|
| 0:01:24 | and then secondly i wanna talk about an open-set extension to what is | 
|---|
| 0:01:28 | usually a closed-set language recognition task | 
|---|
| 0:01:34 | so in the talk i will start with a description of the gaussian model in | 
|---|
| 0:01:38 | the i-vector space. it's something that many of you have seen before, but i need to talk | 
|---|
| 0:01:42 | about some particular aspects of it in order to get into the details here | 
|---|
| 0:01:47 | i'll also talk about how that relates to the open-set case, and in that case i'll | 
|---|
| 0:01:50 | go into some of the bayesian stuff that we do in speaker recognition and how | 
|---|
| 0:01:54 | that could or couldn't be relevant in language recognition, what the differences are | 
|---|
| 0:01:59 | then i will talk about the two key things here, which are the discriminative training | 
|---|
| 0:02:03 | that i'm using in particular, which is based on mmi, and then i'll talk about | 
|---|
| 0:02:07 | how i do the out-of-set model | 
|---|
| 0:02:12 | so as a signal processing guy i like to think of | 
|---|
| 0:02:15 | this as an additive gaussian noise model. in signal processing this is one of the | 
|---|
| 0:02:19 | most basic things that we see | 
|---|
| 0:02:21 | so | 
|---|
| 0:02:22 | in this context what we're talking about is that the observed i-vector you see | 
|---|
| 0:02:26 | was generated from a language, so it should look like the language mean vector, but | 
|---|
| 0:02:31 | it's corrupted by additive gaussian noise | 
|---|
| 0:02:34 | which we typically call a channel for lack of a better word | 
|---|
| 0:02:38 | so in this model, from a pattern recognition point of view, we have an unknown | 
|---|
| 0:02:43 | mean for each of our classes | 
|---|
| 0:02:45 | we have a channel which is gaussian and looks the same for all of the classes | 
|---|
| 0:02:49 | that means that our classifier is a shared covariance gaussian model | 
|---|
| 0:02:54 | and each language model is described by its mean | 
|---|
| 0:02:58 | and that shared covariance is the channel or within-class covariance | 
|---|
| 0:03:06 | so to build a language recognition system we need a training process and a scoring | 
|---|
| 0:03:11 | process | 
|---|
| 0:03:12 | training means we need to learn this shared within class covariance and then for each | 
|---|
| 0:03:16 | language we need to learn what its mean looks like | 
|---|
| 0:03:19 | and testing again is this gaussian scoring | 
|---|
| 0:03:22 | and i guess unlike some people in this room i'm not particularly uncomfortable with closed-set | 
|---|
| 0:03:27 | detection | 
|---|
| 0:03:28 | and that gives you a sort of funny-looking form of bayes rule: the target, if | 
|---|
| 0:03:33 | it is this class then that's just the likelihood of this class | 
|---|
| 0:03:37 | that's easy | 
|---|
| 0:03:38 | but the non-target means that it's one of the other classes and then you need | 
|---|
| 0:03:41 | some implicit prior on the distribution of the other classes | 
|---|
| 0:03:45 | which, for the way these are designed, you can use a flat prior | 
|---|
| 0:03:49 | given that it is not the target | 
|---|
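As a rough illustration of the shared-covariance gaussian model and the closed-set bayes rule just described, here is a minimal numpy sketch; the function and variable names are illustrative, not from the talk, and a flat prior over the competing languages is assumed:

```python
import numpy as np

def train_ml(ivectors, labels, n_classes):
    """ML training: per-language sample means plus one shared within-class covariance."""
    dim = ivectors.shape[1]
    means = np.zeros((n_classes, dim))
    W = np.zeros((dim, dim))
    for c in range(n_classes):
        x = ivectors[labels == c]
        means[c] = x.mean(axis=0)
        W += (x - means[c]).T @ (x - means[c])
    return means, W / len(ivectors)

def class_loglik(x, means, W):
    """log N(x; mean_c, W) for every class (shared covariance)."""
    Winv = np.linalg.inv(W)
    _, logdet = np.linalg.slogdet(W)
    diffs = means - x                                   # (n_classes, dim)
    quad = np.einsum('cd,de,ce->c', diffs, Winv, diffs)
    return -0.5 * (quad + logdet + len(x) * np.log(2 * np.pi))

def closed_set_llr(x, means, W):
    """per-language detection LLR: target likelihood vs. a flat-prior mix of the other classes."""
    ll = class_loglik(x, means, W)
    llr = np.empty_like(ll)
    for c in range(len(ll)):
        others = np.delete(ll, c)
        # denominator: average likelihood of the competing classes (flat prior given non-target)
        llr[c] = ll[c] - (np.logaddexp.reduce(others) - np.log(len(others)))
    return llr
```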
| 0:03:55 | so the key question then for building a language model is how do we estimate | 
|---|
| 0:04:00 | the mean. estimating the mean of a gaussian is not one of the most complicated | 
|---|
| 0:04:04 | things in statistics, but there are multiple ways to do it. of course the simplest | 
|---|
| 0:04:08 | thing to do is just take the sample mean, maximum likelihood | 
|---|
| 0:04:11 | and that's mainly what i'm gonna end up using here in this work but i | 
|---|
| 0:04:14 | wanna emphasise there are other things you could do, and in speaker recognition we do | 
|---|
| 0:04:19 | not do that, we do something more complicated | 
|---|
| 0:04:22 | the next, more sophisticated thing is map adaptation, which we all know from gmm | 
|---|
| 0:04:26 | ubms and doug's work | 
|---|
| 0:04:28 | but you can do that in this context as well. it's a very simple formula that | 
|---|
| 0:04:33 | requires, however, that you have a second covariance matrix, which we can call the across-class | 
|---|
| 0:04:37 | covariance which is the prior distribution of what all models could look like | 
|---|
| 0:04:42 | or in this case the distribution that the means are drawn from | 
|---|
| 0:04:47 | and then finally from there you can go, instead of taking a point estimate, you can | 
|---|
| 0:04:51 | go to a bayesian approach where you don't actually estimate the mean for each class | 
|---|
| 0:04:56 | you estimate the posterior distribution of the mean of each class given the training data | 
|---|
| 0:05:00 | for that class | 
|---|
| 0:05:02 | and in that case | 
|---|
| 0:05:04 | you keep that posterior distribution and then you do the scoring with what's called | 
|---|
| 0:05:08 | the predictive distribution which is | 
|---|
| 0:05:10 | a bigger gaussian, a fatter gaussian: it includes the within-class covariance | 
|---|
| 0:05:15 | but also has an additional term which is the uncertainty from how much data you | 
|---|
| 0:05:18 | showed me for that particular class | 
|---|
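A hedged sketch of the three estimates mentioned, the ML sample mean, MAP adaptation, and the bayesian posterior with its predictive covariance, under the two-covariance model with within-class covariance W and across-class covariance B; the zero prior mean and all names are assumptions for illustration:

```python
import numpy as np

def estimate_language_mean(x, W, B):
    """x: (n, dim) training i-vectors of one language; W within-class, B across-class covariance."""
    n = x.shape[0]
    xbar = x.mean(axis=0)

    mu_ml = xbar                                        # ML: just the sample mean

    # posterior of the mean under m ~ N(0, B) and x_i | m ~ N(m, W)
    Winv = np.linalg.inv(W)
    Sigma_post = np.linalg.inv(np.linalg.inv(B) + n * Winv)
    mu_map = Sigma_post @ (n * Winv @ xbar)             # MAP point estimate (shrunk toward the prior)

    # predictive covariance for scoring a test i-vector: the "fatter" gaussian that
    # adds the remaining uncertainty about the mean on top of the channel covariance
    Sigma_pred = W + Sigma_post
    return mu_ml, mu_map, Sigma_pred
```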
| 0:05:25 | one little trick that i only learned recently, and wish i'd learned a lot sooner, | 
|---|
| 0:05:28 | it's a | 
|---|
| 0:05:30 | technique developed many years ago, i have a reference in the book, but it's really handy for | 
|---|
| 0:05:33 | all these kinds of systems | 
|---|
| 0:05:36 | everybody knows you can whiten one covariance matrix, transform your data such that the covariance | 
|---|
| 0:05:40 | is identity, by putting a linear transform on the data. but in fact you | 
|---|
| 0:05:45 | can do it for two | 
|---|
| 0:05:46 | and since we have two, this is really helpful | 
|---|
| 0:05:49 | and i have the formulas in the paper, it's actually not very hard | 
|---|
| 0:05:54 | and you end up with a linear transform where within-class is identity, which we're often | 
|---|
| 0:05:58 | used to, wccn for example accomplishes that | 
|---|
| 0:06:01 | but across-class is also diagonal, and it's sorted in order so the most important | 
|---|
| 0:06:05 | dimensions are first | 
|---|
| 0:06:07 | and it's a beautiful global transformation | 
|---|
| 0:06:10 | it means that you can do linear discriminant analysis, you can do dimension reduction, easily | 
|---|
| 0:06:16 | in the space just by picking the most interesting dimensions, the first ones | 
|---|
| 0:06:20 | and it's also a reminder that when you say you do lda in your system | 
|---|
| 0:06:26 | you should be a little careful, because lda | 
|---|
| 0:06:29 | there's a number of ways to formulate lda; they all give the same subspace but | 
|---|
| 0:06:33 | they don't give the same transformation within that subspace | 
|---|
| 0:06:37 | because that's not part of the criterion | 
|---|
| 0:06:40 | and that's true of this too: it gives the same subspace but it's not the same linear | 
|---|
| 0:06:43 | transformation | 
|---|
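The joint diagonalization trick itself is short; this is a sketch of one standard way to compute it (whiten the within-class covariance, then eigendecompose the transformed across-class covariance), with illustrative names and no claim to match the exact formulas in the paper:

```python
import numpy as np

def simultaneous_diagonalization(W, B):
    """return T with T @ W @ T.T = I and T @ B @ T.T diagonal, sorted largest-first."""
    # step 1: whiten the within-class covariance
    d, V = np.linalg.eigh(W)
    whiten = V @ np.diag(1.0 / np.sqrt(d)) @ V.T

    # step 2: rotate so the across-class covariance is diagonal in the whitened space
    e, U = np.linalg.eigh(whiten @ B @ whiten.T)
    order = np.argsort(e)[::-1]                         # most discriminative dimensions first
    T = U[:, order].T @ whiten

    # keeping only the first k rows of T is the LDA-style dimension reduction mentioned above
    return T, e[order]
```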
| 0:06:48 | so i'm gonna show some experiments here. i'll start with some simple ones before | 
|---|
| 0:06:51 | i move on to the discriminative training next | 
|---|
| 0:06:54 | we're using acoustic i-vectors, i think maybe it was mentioned here | 
|---|
| 0:06:58 | the main thing for a lid system is you need to do shifted delta | 
|---|
| 0:07:01 | cepstra and you need to do vocal tract length normalisation, which you might not do for speaker | 
|---|
| 0:07:07 | i'm gonna present lre eleven "'cause" it's the most recent lre, but as i kind | 
|---|
| 0:07:12 | of hinted i'm not gonna use pair detection "'cause" i'm not a big fan of | 
|---|
| 0:07:15 | pair detection | 
|---|
| 0:07:18 | so i'm using the overall metric c average | 
|---|
| 0:07:21 | but you get similar performance rankings | 
|---|
| 0:07:24 | when you do pair | 
|---|
| 0:07:25 | detection as well | 
|---|
| 0:07:27 | and | 
|---|
| 0:07:30 | within lre | 
|---|
| 0:07:31 | you build your own train and dev sets; these are lincoln's training data sets | 
|---|
| 0:07:35 | that are currently | 
|---|
| 0:07:36 | zero mean | 
|---|
| 0:07:40 | so i mentioned, for the generative gaussian models, that you can do ml | 
|---|
| 0:07:45 | and you can do these other things; i mentioned ml, map, and bayesian | 
|---|
| 0:07:48 | i have a nice slide here with just three things, but it's actually not those three | 
|---|
| 0:07:50 | things | 
|---|
| 0:07:52 | so you have to pay attention while i describe what this is | 
|---|
| 0:07:57 | with ml, what i'm doing here is there is no backend, there is | 
|---|
| 0:08:00 | just bayes rule applied "'cause" that's the formula that i showed you for the generative | 
|---|
| 0:08:04 | gaussian model | 
|---|
| 0:08:06 | and these numbers, for people who do lres, these are not very good numbers | 
|---|
| 0:08:09 | but this is what happens straight out of the generative model | 
|---|
| 0:08:13 | and what i'm showing is c average and min c average | 
|---|
| 0:08:17 | so actual means you had a hard detection, hard-decisioning on the detection | 
|---|
| 0:08:23 | so the ml system | 
|---|
| 0:08:24 | is the baseline | 
|---|
| 0:08:27 | this is the bayesian system, where you make the bayesian estimation | 
|---|
| 0:08:31 | of the mean. then you actually in the end don't have the same covariance | 
|---|
| 0:08:34 | for every class "'cause" they had different counts and that gives a different predictive uncertainty | 
|---|
| 0:08:39 | but in fact they're very similar, because in language recognition | 
|---|
| 0:08:42 | you have many instances per class so it almost degenerates to the same thing | 
|---|
| 0:08:47 | the reason i didn't show map is "'cause" it's in between those two and there's | 
|---|
| 0:08:50 | not much space in between those two so it's not a very interesting thing | 
|---|
| 0:08:54 | this last one is kind of interesting in that | 
|---|
| 0:08:59 | it's not right but it actually works better | 
|---|
| 0:09:01 | from a calibration point of view | 
|---|
| 0:09:03 | when i say calibration, i mean that it works better with the bayes | 
|---|
| 0:09:07 | rule | 
|---|
| 0:09:08 | what i've done here is what we typically do in speaker recognition where you use | 
|---|
| 0:09:10 | the wrong map, where you pretend that there's only one cut instead of keeping the | 
|---|
| 0:09:14 | correct count of the number of cuts | 
|---|
| 0:09:16 | and that gives you, in terms of the predictive distribution, a greater | 
|---|
| 0:09:20 | uncertainty and a wider covariance | 
|---|
| 0:09:23 | and it so happens that actually works a little better in this case | 
|---|
| 0:09:29 | but | 
|---|
| 0:09:31 | once you put a backend into the system, which is what everybody's usually | 
|---|
| 0:09:34 | showing with, then these differences really disappear, so i'm gonna use the ml system for | 
|---|
| 0:09:40 | the rest of the discriminative training work | 
|---|
| 0:09:43 | as i said these numbers are not very good, they're about three times as bad as | 
|---|
| 0:09:46 | state-of-the-art | 
|---|
| 0:09:48 | what's usually done is an additionally trained backend. the simplest one, i think | 
|---|
| 0:09:52 | john had it first, was the scalar multiclass thing that was talked about before: | 
|---|
| 0:09:58 | that's logistic regression | 
|---|
| 0:09:59 | you can do a full logistic regression with a matrix instead of with a | 
|---|
| 0:10:02 | scalar, or you can put a gaussian backend in front | 
|---|
| 0:10:09 | of a logistic regression, which is something that we tried, or you can use a | 
|---|
| 0:10:13 | discriminatively trained gaussian as the backend, which is something we were doing at | 
|---|
| 0:10:17 | lincoln for quite a while | 
|---|
| 0:10:19 | and these systems all work much better and pretty similar to each other | 
|---|
| 0:10:24 | you can also build the classifier to be discriminative. one of the more common things | 
|---|
| 0:10:28 | to do is an svm, one versus rest | 
|---|
| 0:10:31 | that still doesn't solve the final task, but it can help | 
|---|
| 0:10:35 | and if you do one-versus-rest logistic regression you also still need a back | 
|---|
| 0:10:39 | end, or you can do what niko has been doing recently, a multiclass | 
|---|
| 0:10:44 | training of the classifier itself followed by a multiclass backend | 
|---|
| 0:10:47 | but what i wanna talk about is trying to do everything together: one training of | 
|---|
| 0:10:51 | the multiclass system that won't need its own separate backend, ready to apply bayes | 
|---|
| 0:10:56 | rule straight out | 
|---|
| 0:10:57 | and | 
|---|
| 0:10:59 | it's not commonly used in backends but in our field mmi is a very common | 
|---|
| 0:11:03 | thing in the gmm world, in the speech recognition world | 
|---|
| 0:11:07 | the criterion if you're not familiar with it | 
|---|
| 0:11:10 | it is another name for the cross entropy which is the same metric that logistic | 
|---|
| 0:11:14 | regression uses | 
|---|
| 0:11:15 | it is a multiclass "are your probabilities correct" kind of metric | 
|---|
| 0:11:21 | and this is a closed-set | 
|---|
| 0:11:24 | discriminative training of classes against each other | 
|---|
| 0:11:28 | the update equations | 
|---|
| 0:11:30 | if you haven't seen them, are kind of cool and they're kind of different; it's | 
|---|
| 0:11:33 | a little bit of a weird derivation compared to the gradient descent that everybody's used to | 
|---|
| 0:11:38 | it can be interpreted like a gradient descent with kind of a magical step | 
|---|
| 0:11:42 | size | 
|---|
| 0:11:44 | but it's quite effective, and the way it's always done in speech recognition is, | 
|---|
| 0:11:49 | since you were doing this to a gaussian system you start with an ml version | 
|---|
| 0:11:52 | of the gaussian and then you discriminatively update it, so to speak | 
|---|
| 0:11:56 | and that makes the convergence much easier | 
|---|
| 0:11:59 | it gives an actual regularisation because you're starting with something that is already a reasonable | 
|---|
| 0:12:03 | solution and in fact the simplest form of regularization is just to not let it | 
|---|
| 0:12:07 | run very long it is also a lot cheaper | 
|---|
| 0:12:10 | and it also gives you something you can tie back to and put a penalty function on | 
|---|
| 0:12:14 | that says don't be too different from the ml solution | 
|---|
| 0:12:17 | so regularization is a straightforward thing to do in mmi | 
|---|
| 0:12:22 | and this diagonal covariance transformation that i was talking about is really helpful here | 
|---|
| 0:12:27 | because | 
|---|
| 0:12:28 | then we can discriminatively update only these diagonal covariances instead of full covariances | 
|---|
| 0:12:33 | so we have fewer parameters than a full-matrix logistic regression but more parameters than the | 
|---|
| 0:12:38 | scalar logistic regression | 
|---|
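To make the criterion concrete, here is a small sketch of the multiclass cross-entropy (the mmi objective) and a plain gradient step on the class means of the ML-initialized model; the talk uses extended-Baum-Welch-style mmi updates rather than this simple gradient form, a diagonal within-class covariance is assumed as after the joint diagonalization, and all names are illustrative:

```python
import numpy as np

def posteriors(X, means, w_diag):
    """class posteriors under a shared diagonal-covariance gaussian model with a flat prior."""
    ll = -0.5 * (((X[:, None, :] - means[None, :, :]) ** 2) / w_diag).sum(-1)
    ll -= ll.max(axis=1, keepdims=True)
    p = np.exp(ll)
    return p / p.sum(axis=1, keepdims=True)

def cross_entropy(X, y, means, w_diag):
    """the mmi criterion: average negative log posterior of the correct class."""
    p = posteriors(X, means, w_diag)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-30))

def gradient_step_on_means(X, y, means, w_diag, lr=0.1):
    """one ascent step on the criterion w.r.t. the class means (means-only training)."""
    p = posteriors(X, means, w_diag)
    onehot = np.eye(means.shape[0])[y]
    resid = onehot - p                                  # (target - posterior) weights
    grad = (resid.T @ X - resid.sum(0)[:, None] * means) / w_diag
    return means + lr * grad / len(X)
```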
| 0:12:45 | so now these are pretty much state-of-the-art numbers. now remember, the previous numbers that i quoted | 
|---|
| 0:12:50 | were up here essentially | 
|---|
| 0:12:54 | so this is the ml gaussian followed by an mmi gaussian backend in the score | 
|---|
| 0:12:59 | space, which is kind of our default way of doing things when i was at | 
|---|
| 0:13:03 | lincoln | 
|---|
| 0:13:05 | this score is kind of a disappointment, which is what you get if you take the | 
|---|
| 0:13:08 | training set and you discriminatively train with mmi and don't have | 
|---|
| 0:13:12 | a backend here | 
|---|
| 0:13:13 | it is in fact | 
|---|
| 0:13:15 | considerably better than the ml system, really its equivalent, which i started with | 
|---|
| 0:13:20 | but it's nowhere near where we wanna be, obviously | 
|---|
| 0:13:23 | so | 
|---|
| 0:13:24 | why not | 
|---|
| 0:13:25 | and | 
|---|
| 0:13:28 | one of the quirks of lre | 
|---|
| 0:13:30 | that | 
|---|
| 0:13:32 | is more dataset-dependent i think than realistic | 
|---|
| 0:13:36 | is that the dev set actually looks different from the training set | 
|---|
| 0:13:39 | so this is only done on the training set it's not using any dev set | 
|---|
| 0:13:43 | at all | 
|---|
| 0:13:44 | the most obvious thing is that the dev set | 
|---|
| 0:13:47 | and the test set are all thirty seconds approximately; the training set is whatever sides | 
|---|
| 0:13:52 | of conversations happen to be, so that's an obvious mismatch. so i took the training set | 
|---|
| 0:13:57 | and truncated everything to be thirty seconds instead of the entire side | 
|---|
| 0:14:00 | throwing away data in that way turned out to be very helpful because it's now | 
|---|
| 0:14:03 | a lot better matched to what the test data looks like | 
|---|
| 0:14:06 | but not everything i wanted, so then i took the thirty-second training set, | 
|---|
| 0:14:10 | concatenated it together with the dev set, which is a thirty-second set, | 
|---|
| 0:14:14 | and used the entire set at once | 
|---|
| 0:14:17 | for training the system, and that in fact works as well as, and slightly | 
|---|
| 0:14:22 | better | 
|---|
| 0:14:22 | than, the two different stages, the system followed by | 
|---|
| 0:14:26 | a discriminatively trained backend | 
|---|
| 0:14:32 | so i looked at a number of different ways, | 
|---|
| 0:14:37 | permutations, of this mmi system. anybody who's done gmm mmi knows | 
|---|
| 0:14:41 | you can | 
|---|
| 0:14:42 | train this, that, or the other, and various things like that | 
|---|
| 0:14:45 | and the simplest thing to do is just to do the means only, and | 
|---|
| 0:14:48 | that is fairly effective, as we'll see in a moment | 
|---|
| 0:14:52 | you can train the mean and the within-class covariance | 
|---|
| 0:14:56 | and of course in the closed-set system the across-class covariance is not coming | 
|---|
| 0:15:00 | into play, it's only the within-class covariance which is having an effect | 
|---|
| 0:15:05 | one thing that i found kind of interesting is, instead of training the entire | 
|---|
| 0:15:08 | covariance matrix, to train a scale factor which scales the covariance. that's a little | 
|---|
| 0:15:14 | bit simpler system with fewer parameters | 
|---|
| 0:15:16 | and you can also play with the sequential system | 
|---|
| 0:15:20 | and in particular i found it interesting to do the scale factor first and then the | 
|---|
| 0:15:24 | means, just in terms of how it converges | 
|---|
| 0:15:29 | it would all give the same solution in the end, but | 
|---|
| 0:15:32 | when you only do a limited number of iterations, the starting point in the sequence | 
|---|
| 0:15:36 | does affect what you get | 
|---|
| 0:15:39 | so | 
|---|
| 0:15:42 | again, these are the same sorts of plots. this is what happens, so this | 
|---|
| 0:15:47 | is now purely no backend, just the discriminatively trained classifier itself. if you do means | 
|---|
| 0:15:51 | only, | 
|---|
| 0:15:53 | your actual c average is not terribly good but your min c average is pretty close | 
|---|
| 0:15:59 | so that is an indication | 
|---|
| 0:16:01 | what does calibration mean in a multiclass detection | 
|---|
| 0:16:05 | task is kind of controversial, but | 
|---|
| 0:16:08 | one thing that i think i can say comfortably is whenever you see this happen | 
|---|
| 0:16:13 | it means that you're not calibrated | 
|---|
| 0:16:15 | the fact that they match doesn't necessarily mean that you are calibrated "'cause" bayes | 
|---|
| 0:16:18 | rule is more complicated than that but | 
|---|
| 0:16:20 | but this means that it is clearly not calibrated | 
|---|
| 0:16:23 | so once we do something to the variance: this is doing the mean and the | 
|---|
| 0:16:26 | entire variance, this is doing the mean and the scale factor at the same | 
|---|
| 0:16:30 | time | 
|---|
| 0:16:31 | and this is doing a two-stage process of the scale factor of | 
|---|
| 0:16:34 | the variance followed by the mean | 
|---|
| 0:16:36 | all of those | 
|---|
| 0:16:37 | work much better. so in order to get calibration you need to actually adjust the | 
|---|
| 0:16:41 | covariance matrix, which kinda makes sense; you need a scale factor or something | 
|---|
| 0:16:45 | and | 
|---|
| 0:16:46 | once you fine-tune the numbers, as we typically do when we're actually working | 
|---|
| 0:16:50 | on these kinds of tasks | 
|---|
| 0:16:52 | you actually see that the two-stage process is the best one | 
|---|
| 0:16:56 | and it is better than | 
|---|
| 0:16:58 | our two-step process that we used to have before, of a separate system followed by | 
|---|
| 0:17:03 | a backend | 
|---|
| 0:17:06 | okay, so that's the discriminative training part. the other thing i want to talk about | 
|---|
| 0:17:09 | is the out-of-set problem that was mentioned in a question earlier | 
|---|
| 0:17:15 | because oftentimes we're interested in something where it could be another language that's not | 
|---|
| 0:17:20 | one of the closed set | 
|---|
| 0:17:23 | the nice thing about our two covariance mathematics that we've been using for speaker recognition | 
|---|
| 0:17:28 | is it has built into it a model for what out of set is | 
|---|
| 0:17:31 | supposed to be | 
|---|
| 0:17:33 | i already mentioned essentially that if you have a gaussian | 
|---|
| 0:17:35 | distribution of what all models look like, then an out-of-set language is a randomly | 
|---|
| 0:17:40 | drawn language from that pool | 
|---|
| 0:17:42 | and that's represented by the gaussian distribution | 
|---|
| 0:17:46 | then at test time | 
|---|
| 0:17:48 | you have an even bigger gaussian because the uncertainty is both the channel plus | 
|---|
| 0:17:55 | which language it was | 
|---|
| 0:17:56 | so now you have | 
|---|
| 0:17:59 | the out of set is also a gaussian, but it has the bigger covariance, and all | 
|---|
| 0:18:03 | the others have a shared covariance which is smaller, so you no longer have | 
|---|
| 0:18:07 | a linear system | 
|---|
| 0:18:09 | when you make a comparison | 
|---|
| 0:18:11 | this is the most general formula when you have | 
|---|
| 0:18:15 | an open-set problem which is both out of set and closed set | 
|---|
| 0:18:19 | this is how you would combine them. this is what i had before, the sort | 
|---|
| 0:18:22 | of bayes rule competition of all the other closed-set classes, and this is the | 
|---|
| 0:18:26 | new distribution, the out-of-set distribution | 
|---|
| 0:18:29 | if you want a pure out-of-set problem, which is what i'm gonna talk about | 
|---|
| 0:18:32 | here, you just take the probability of being out of set to be one, but in | 
|---|
| 0:18:35 | fact you could make a mixed distribution as well | 
|---|
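A minimal sketch of this open-set scoring, assuming the two-covariance model with within-class covariance W, across-class covariance B, and a global mean mu0; the mixing weight p_oos and all names are illustrative, not from the talk:

```python
import numpy as np

def gaussian_loglik(x, mu, cov):
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

def open_set_llr(x, means, W, B, mu0, p_oos=1.0):
    """per-target LLR; p_oos = 1 gives the pure out-of-set denominator, p_oos < 1 the mixed one."""
    ll_closed = np.array([gaussian_loglik(x, m, W) for m in means])
    ll_oos = gaussian_loglik(x, mu0, W + B)             # the "bigger" gaussian for an unseen language
    llr = np.empty(len(means))
    for c in range(len(means)):
        others = np.delete(ll_closed, c)
        ll_in = np.logaddexp.reduce(others) - np.log(len(others))
        denom = np.logaddexp(np.log(p_oos + 1e-30) + ll_oos,
                             np.log(1.0 - p_oos + 1e-30) + ll_in)
        llr[c] = ll_closed[c] - denom
    return llr
```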
| 0:18:38 | okay so i wanna talk about the out of set | 
|---|
| 0:18:42 | just to touch on what i have now | 
|---|
| 0:18:44 | if i were to do the bayesian numerator for each class that i mentioned before | 
|---|
| 0:18:48 | and then this denominator | 
|---|
| 0:18:51 | then i have what we'd like to call | 
|---|
| 0:18:53 | bayesian speaker comparison | 
|---|
| 0:18:55 | which there's an earlier paper about | 
|---|
| 0:18:59 | it is the same answer as p lda or the two covariance model | 
|---|
| 0:19:04 | and i'd like to | 
|---|
| 0:19:06 | emphasise that | 
|---|
| 0:19:07 | they're set up differently so the numerator and denominator are different in these two mathematics | 
|---|
| 0:19:12 | but the ratio is the same thing "'cause" it's the same model and it's the same | 
|---|
| 0:19:16 | correct answer | 
|---|
| 0:19:18 | i think in a formalism like i'm talking about here i find it much easier | 
|---|
| 0:19:21 | to understand it in this context | 
|---|
| 0:19:23 | the philosophy | 
|---|
| 0:19:25 | and | 
|---|
| 0:19:27 | daniel and i have spent a lot of time on this; you can see that only a few | 
|---|
| 0:19:30 | of you guys come at it from this perspective point of view | 
|---|
| 0:19:33 | but in this terminology we say that we have a model for each class and | 
|---|
| 0:19:38 | the covariances are hyperparameters; in that terminology you guys like to say that there | 
|---|
| 0:19:43 | is no model | 
|---|
| 0:19:44 | and the parameters of the system are the covariance matrices. again it's the same | 
|---|
| 0:19:49 | system, it's the same answer from a different perspective. but when we're talking about closed | 
|---|
| 0:19:53 | set and ml models | 
|---|
| 0:19:55 | i know how to say that in this context and i don't know so well | 
|---|
| 0:19:58 | how to say that | 
|---|
| 0:19:59 | in the p lda one | 
|---|
| 0:20:02 | so, discriminative training of the out-of-set model: this is the out-of-set | 
|---|
| 0:20:06 | model that i described, but as i've said, now i have this mmi hammer in my toolbox | 
|---|
| 0:20:10 | and this is just one more covariance that i can train so i've got an | 
|---|
| 0:20:14 | across class mean and covariance | 
|---|
| 0:20:17 | the ml out-of-set system just takes all of these to be the sample | 
|---|
| 0:20:22 | covariance matrices | 
|---|
| 0:20:24 | but i can | 
|---|
| 0:20:26 | do an mmi update of this out-of-set class as well. the simplest way for me | 
|---|
| 0:20:30 | to do this is to take the | 
|---|
| 0:20:32 | closed-set system that i already presented, | 
|---|
| 0:20:36 | freeze the closed-set models, and then separately update the out-of-set model given | 
|---|
| 0:20:40 | the closed set models | 
|---|
| 0:20:42 | i can do that by scoring with one versus rest instead of scoring | 
|---|
| 0:20:47 | with bayes rule, and doing a round robin on the same training set. so | 
|---|
| 0:20:52 | the advantage of this is i can actually build a system without ever actually having | 
|---|
| 0:20:56 | any out-of-class data. i'd probably do better if i really did have out-of-class data, but in | 
|---|
| 0:21:00 | this case i don't and i can build a perfectly legitimate system | 
|---|
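As a hedged stand-in for that discriminative update, this sketch freezes the closed-set models and tunes only a scale factor on the out-of-set covariance with a simple grid search over the one-versus-rest round robin described here (the talk uses the mmi machinery instead of a grid search); it reuses the gaussian_loglik helper from the earlier open-set sketch, and all names are illustrative:

```python
import numpy as np

def tune_oos_scale(X, y, means, W, B, mu0, scales=np.linspace(0.5, 3.0, 26)):
    """pick a scale s for the out-of-set covariance s*(W+B) by round-robin one-vs-rest scoring."""
    best_scale, best_ce = None, np.inf
    for s in scales:
        ce, n = 0.0, 0
        for c in range(len(means)):
            # pretend language c is out of set: its data should prefer N(mu0, s*(W+B))
            # over the remaining closed-set models (flat prior over those)
            others = np.delete(np.arange(len(means)), c)
            for x in X[y == c]:
                ll_in = np.logaddexp.reduce(
                    [gaussian_loglik(x, means[o], W) for o in others]) - np.log(len(others))
                ll_out = gaussian_loglik(x, mu0, s * (W + B))
                p_out = 1.0 / (1.0 + np.exp(ll_in - ll_out))   # posterior of "out of set"
                ce += -np.log(p_out + 1e-30)
                n += 1
        if ce / n < best_ce:
            best_scale, best_ce = s, ce / n
    return best_scale
```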
| 0:21:03 | so | 
|---|
| 0:21:06 | the performance of this system, what i've done here, | 
|---|
| 0:21:10 | is i scored this lre, even though there is no out-of-set data, scoring without | 
|---|
| 0:21:14 | bayes rule, where the system is not allowed to know what the other classes were | 
|---|
| 0:21:21 | and so that's a simulation of an open-set | 
|---|
| 0:21:23 | scoring function | 
|---|
| 0:21:25 | the ml version of this | 
|---|
| 0:21:27 | the actual c average is actually off the chart; it's those kind of bad | 
|---|
| 0:21:31 | numbers that i started with before | 
|---|
| 0:21:33 | the mmi training of the closed set system | 
|---|
| 0:21:37 | and then the | 
|---|
| 0:21:39 | ml version of the across-class covariance, in fact it's already a lot better, so | 
|---|
| 0:21:43 | whatever's happening in the closed set discriminative training is actually helping the | 
|---|
| 0:21:47 | open set scoring as well but explicitly retraining | 
|---|
| 0:21:51 | the out of set covariance matrix | 
|---|
| 0:21:53 | with the same mechanism, ml then scale factor then the mean, | 
|---|
| 0:21:58 | in fact works pretty reasonably | 
|---|
| 0:22:01 | and gives a system which is not obviously uncalibrated | 
|---|
| 0:22:05 | and has pretty reasonable performance | 
|---|
| 0:22:07 | the closed-set scoring performance is still down here, but this has gotten a lot | 
|---|
| 0:22:11 | better and it's perfectly feasible | 
|---|
| 0:22:14 | so | 
|---|
| 0:22:16 | the two contributions here were the single-system concept, that we don't have to do | 
|---|
| 0:22:20 | system design and then a backend, we can discriminatively train the system to already be calibrated | 
|---|
| 0:22:26 | and we can model out of set using the same mathematics that we have in | 
|---|
| 0:22:31 | speaker recognition | 
|---|
| 0:22:33 | but a simpler version "'cause" we don't need to be bayesian in this case | 
|---|
| 0:22:36 | and i think it can also be discriminatively updated so that we can be reasonably | 
|---|
| 0:22:40 | calibrated for the open set | 
|---|
| 0:22:42 | task as well | 
|---|
| 0:23:06 | so thanks, alan | 
|---|
| 0:23:08 | it's very nice to see that you unified those two parts of the system | 
|---|
| 0:23:13 | i wish we could do that in speaker recognition | 
|---|
| 0:23:17 | so my question is your | 
|---|
| 0:23:21 | your maximum likelihood | 
|---|
| 0:23:23 | across-class covariance: so you've got twenty-four languages to work with in a six-hundred- | 
|---|
| 0:23:28 | dimensional | 
|---|
| 0:23:30 | i-vector space, so | 
|---|
| 0:23:31 | how did you estimate or assign that | 
|---|
| 0:23:34 | parameter | 
|---|
| 0:23:36 | it is the sample covariance so everything here was done with the dimension reduction in | 
|---|
| 0:23:41 | the front | 
|---|
| 0:23:42 | to twenty three dimensions | 
|---|
| 0:23:45 | i'm sorry, that's why my illustration | 
|---|
| 0:23:48 | already specified that there would be twenty three dimensions | 
|---|
| 0:23:52 | and anything that has a prior is limited to twenty three dimensions | 
|---|
| 0:23:58 | okay, in this case i just took the sample covariance matrix; if you regularized | 
|---|
| 0:24:02 | it somehow you could make it | 
|---|
| 0:24:05 | appear to be bigger, to be full size | 
|---|
| 0:24:09 | okay so | 
|---|
| 0:24:10 | well those formulas you showed with the covariances that happens in twenty three dimensional space | 
|---|
| 0:24:15 | yes | 
|---|
| 0:24:25 | so in this case you're doing lda and then, maybe i'm mistaken, but isn't that | 
|---|
| 0:24:31 | the same as doing a gaussian backend and then a calibration | 
|---|
| 0:24:36 | well, lda and a gaussian backend, that's roughly what this evaluation was | 
|---|
| 0:24:41 | this was, you compute the sample covariances once in the full space | 
|---|
| 0:24:46 | but | 
|---|
| 0:24:47 | the across class is only rank twenty three | 
|---|
| 0:24:51 | so you take the six-hundred-dimensional within-class and map it down to twenty three | 
|---|
| 0:24:56 | yes | 
|---|
| 0:24:59 | so if you do lda and regression, the gaussian backend is in the same subspace | 
|---|
| 0:25:04 | as lda | 
|---|
| 0:25:06 | yes, if you do the product of lda | 
|---|
| 0:25:09 | and the twenty three, you get the gaussians and you get twenty four scores | 
|---|
| 0:25:13 | it's almost the same thing, so you're still doing two steps | 
|---|
| 0:25:20 | it's still just two steps | 
|---|
| 0:25:22 | in my view are the ml estimation which in this case forces you to be | 
|---|
| 0:25:26 | twenty three dimensional | 
|---|
| 0:25:28 | and then | 
|---|
| 0:25:29 | the update of those equations | 
|---|
| 0:25:32 | but lda and gaussian | 
|---|
| 0:25:34 | backend, that is similar, very close | 
|---|
| 0:25:38 | well, the way we would have done a system before would be lda and | 
|---|
| 0:25:43 | then gaussian in that space and then | 
|---|
| 0:25:47 | mmi training in the score space | 
|---|
| 0:25:51 | the likelihood ratios of the first thing this is mmi training in the i-vector space | 
|---|
| 0:25:55 | directly | 
|---|
| 0:25:57 | but | 
|---|
| 0:25:58 | these are not very complicated mathematics, so things are pretty closely related, yes | 
|---|
| 0:26:05 | so when you did the joint diagonalization | 
|---|
| 0:26:08 | there, and then you | 
|---|
| 0:26:10 | work with diagonal covariance matrices, but then you're also updating the covariance matrices in training | 
|---|
| 0:26:16 | is that diagonalisation still valid then | 
|---|
| 0:26:18 | i mean, you do this one static projection, does that mean then, when you force them | 
|---|
| 0:26:22 | to be diagonal, it's sort of like saying, i mean | 
|---|
| 0:26:25 | the entire thing can be mapped back | 
|---|
| 0:26:28 | by undoing the diagonalisation into a full covariance, so in some sense you are | 
|---|
| 0:26:33 | still updating a full covariance, but you're only updating it in a constrained way | 
|---|
| 0:26:38 | so the matrix is still at full size, but the number of parameters that you | 
|---|
| 0:26:42 | discriminatively update is not the full set | 
|---|
| 0:26:58 | so if i remember correctly, you're doing actually closed set | 
|---|
| 0:27:03 | twenty three or twenty four languages, is that correct? twenty four languages, right, so | 
|---|
| 0:27:08 | is it possible, i mean i don't want to change your problem, but if you were | 
|---|
| 0:27:12 | to look at a subset, so you're gonna pick twelve and take the others | 
|---|
| 0:27:16 | as completely open-set data, so you do the training only on a | 
|---|
| 0:27:20 | portion of them; you don't have actual out-of-set data, but you have held-out data | 
|---|
| 0:27:23 | you'd have some sense of how strong your solution would be | 
|---|
| 0:27:28 | if you didn't have access to those similar-sounding languages that you want to | 
|---|
| 0:27:32 | reject | 
|---|
| 0:27:34 | i think it's an interesting thought that | 
|---|
| 0:27:38 | you could more extensively test this out-of-set hypothesis by doing a hold-one- | 
|---|
| 0:27:43 | out or something and round-robinning that, and i think that is an interesting | 
|---|
| 0:27:47 | idea but i haven't | 
|---|
| 0:27:48 | done it | 
|---|