Alright, the next talk this morning is about multiclass discriminative training of i-vector language recognition, by Alan McCree from Johns Hopkins University.
I'd like to acknowledge some interesting discussions during this work with my current colleague Daniel, my previous colleagues Doug and Elliot, Doug and Pedro, and more recently with Niko.
So, as an introduction: as you know, and I think we had one discussion about this this morning, language ID using i-vectors is the state-of-the-art system. What I want to talk about are some particular aspects of it. It is typically done as a two-stage process: even after we've got the i-vectors, first we build a classifier, and then we separately build a backend which does the calibration and perhaps fusion as well. So I want to talk about two aspects that are a little different from that. First, what if we try to have one system that does the discrimination, the classification, and the calibration at once, using discriminative training? Nobody ever said we have to use two systems back to back; what if we do it all together? And secondly, I want to talk about an open-set extension to what is usually a closed-set language recognition task.
So in the talk I will start with a description of the Gaussian model in the i-vector space. It's something many of you have seen before, but I need to talk about some particular aspects of it in order to get into the details here. I'll also talk about how that relates to the open-set case, and there I'll go into some of the Bayesian machinery that we use in speaker recognition, how it could or couldn't be relevant in language recognition, and what the differences are. Then I will talk about the two key things here: the discriminative training that I'm using, which in particular is based on MMI, and then how I do the out-of-set model.
So, as a signal processing guy, I like to think of this as an additive Gaussian noise model; in signal processing this is one of the most basic things we see. In this context, what we're saying is that the observed i-vector was generated from a language, so it should look like that language's mean vector, but it's corrupted by additive Gaussian noise, which we typically call a channel for lack of a better word. From a pattern recognition point of view, we have an unknown mean for each of our classes, and we have a channel which is Gaussian and looks the same for all of the classes. That means our classifier is a shared-covariance Gaussian model: each language model is described by its mean, and the shared covariance is the channel, or within-class, covariance.
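As a hedged sketch of the model just described, in notation added here rather than taken from the slides: an observed i-vector $x$ from language $\ell$ is

$$ x = \mu_\ell + n, \qquad n \sim \mathcal{N}(0, W), \qquad\text{so}\qquad p(x \mid \ell) = \mathcal{N}(x;\, \mu_\ell,\, W), $$

where $\mu_\ell$ is the language mean and $W$ is the shared within-class (channel) covariance.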
So to build a language recognition system we need a training process and a scoring process. Training means we need to learn this shared within-class covariance, and then for each language we need to learn what its mean looks like. Testing is again this Gaussian scoring. And I guess, unlike some people, I am not particularly uncomfortable with closed-set detection. That gives you a sort of funny-looking form of Bayes' rule: if the target is this class, then the numerator is just the likelihood of this class, and that's easy. But the non-target hypothesis means that it's one of the other classes, so you need some implicit prior over the distribution of the other classes, which, the way the evaluation is designed, can be a flat prior given that it is not the target.
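A hedged reconstruction of the closed-set detection score being described, with a flat prior over the $L-1$ non-target languages (notation added here):

$$ \mathrm{LR}_\ell(x) \;=\; \frac{\mathcal{N}(x;\, \mu_\ell,\, W)}{\dfrac{1}{L-1}\displaystyle\sum_{m \neq \ell} \mathcal{N}(x;\, \mu_m,\, W)}. $$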
So the key question for building a language model is how we estimate the mean. Estimating the mean of a Gaussian is not one of the most complicated things in statistics, but there are multiple ways to do it. The simplest thing is just to take the sample mean, maximum likelihood, and that's mainly what I'm going to end up using in this work, but I want to emphasize that there are other things you could do; in speaker recognition we do not do that, we do something more complicated. The next, more sophisticated, option is MAP adaptation, which we all know from GMM-UBMs and Doug's work. You can do that in this context as well; it's a very simple formula, which requires, however, that you have a second covariance matrix, which we can call the across-class covariance. That is the prior distribution of what all models could look like, in this case the distribution that the means are drawn from. And then finally, instead of taking a point estimate, you can go to a Bayesian approach where you don't actually estimate the mean for each class; you estimate the posterior distribution of the mean of each class given the training data for that class. In that case you keep that posterior distribution, and then you score with what's called the predictive distribution, which is a bigger, fatter Gaussian: it includes the within-class covariance, but it also has an additional term, which is the uncertainty from how little data you have seen for that particular class.
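A minimal numpy sketch of the three mean-estimation options just described (ML sample mean, the Gaussian MAP/posterior mean, and the Bayesian predictive covariance); the function and variable names are mine, not from the paper:

```python
import numpy as np

def class_mean_estimates(X, W, B, mu0):
    """X: (n, d) training i-vectors of one language.
    W: shared within-class (channel) covariance.
    B: across-class covariance (prior on class means), mu0: prior mean."""
    n = X.shape[0]
    xbar = X.mean(axis=0)                          # ML estimate: sample mean

    # Gaussian-Gaussian conjugacy: posterior of the class mean given n samples.
    # Its mode is the MAP point estimate; its covariance is the remaining uncertainty.
    Binv, Winv = np.linalg.inv(B), np.linalg.inv(W)
    post_cov = np.linalg.inv(Binv + n * Winv)
    post_mean = post_cov @ (Binv @ mu0 + n * Winv @ xbar)

    # Bayesian predictive distribution for a test i-vector:
    #   p(x | class data) = N(x; post_mean, W + post_cov)
    pred_cov = W + post_cov
    return xbar, post_mean, pred_cov
```

Pretending n = 1 regardless of the true count, as in the "speaker-recognition style" system mentioned a bit later in the talk, simply makes post_cov, and hence the predictive covariance, wider.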
One little trick that I only learned recently, though I wish I had learned it a lot sooner, was developed many years ago (there's a reference in the paper), and it's really handy for all these kinds of systems. Everybody knows you can whiten with one covariance matrix, that is, apply a linear transform to your data so that that covariance becomes the identity; the fact is you can do it for two covariances simultaneously, and since we have two, this is really helpful. I have the formulas in the paper; it's actually not very hard. You end up with a linear transform where the within-class covariance is the identity, which we're used to (WCCN, for example, accomplishes that), but the across-class covariance is also diagonal, and it's sorted so the most important dimensions come first. It's a nice global transformation. It means that you can do linear discriminant analysis, you can do dimension reduction easily in this space, just by keeping the most interesting dimensions, the first ones. It's also a reminder that when you say you do LDA in your system, you should be a little careful, because there are a number of ways to formulate LDA: they all give the same subspace, but they don't give the same transformation within that subspace, because that's not part of the criterion. And that's the case here: this gives the same subspace, but it's not the same linear transformation.
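A minimal numpy sketch of the simultaneous diagonalization of the two covariances (whiten the within-class covariance, then rotate to diagonalize the across-class covariance); this is a standard construction, and the exact formulas in the paper may differ:

```python
import numpy as np

def joint_diagonalize(W, B):
    """Return T with T @ W @ T.T = I and T @ B @ T.T diagonal,
    sorted so the most important (largest across-class) dimensions come first."""
    evals, evecs = np.linalg.eigh(W)
    P = evecs / np.sqrt(evals)          # whitening: P.T @ W @ P = I
    d, V = np.linalg.eigh(P.T @ B @ P)  # diagonalize across-class in whitened space
    order = np.argsort(d)[::-1]
    T = (P @ V[:, order]).T
    return T, d[order]
```

LDA-style dimension reduction is then just keeping the first k rows of T.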
So I'm going to show some experiments here. I'll start with some simple ones before getting to the discriminative training next. I'm using acoustic i-vectors; I think it was mentioned here already that the main things you do differently in a LID system are shifted delta cepstra and vocal tract length normalization, which you might not do for speaker. I'm going to present LRE11 because it's the most recent LRE, but as I kind of hinted, I'm not going to use pair detection, because I'm not a big fan of pair detection. So the metric is the overall metric, Cavg, but you get similar performance rankings with pair detection as well. And within LRE you build your own train and dev sets; these are Lincoln's training data sets, which are zero-mean.
So, back to the generative Gaussian models: I mentioned that you can do ML and these other things, ML, MAP, and Bayesian. I have a nice slide here with three things, but it's actually not those three things, so you have to pay attention while I describe what each one is. With ML, what I'm doing here is there is no backend; there is just Bayes' rule applied, because that's the formula I showed you, straight out of the generative Gaussian model.
These numbers, for people who do LREs, are not very good numbers, but this is what happens straight out of the generative model. What I'm showing is Cavg and min Cavg; actual Cavg means you make hard decisions on the detections. So the ML system is the baseline. This one is the Bayesian system, where you make the Bayesian estimate of the mean; in the end you then don't actually have the same covariance for every class, because they had different counts, and that gives a different predictive uncertainty. But in fact they are very similar, because in language recognition you have many instances per class, so it almost degenerates to the same thing. The reason I didn't show MAP is that it's in between those two, and there's not much space in between those two, so it's not very interesting.
This last one is kind of interesting in that it's not right, but it actually works better from a calibration point of view; and when I say calibration, I mean that it works better through Bayes' rule. What I've done here is what we typically do in speaker recognition, where you use the right MAP formula but you pretend that there's only one cut instead of keeping the correct count of the number of cuts. In terms of the predictive distribution, that gives you a greater uncertainty and a wider covariance, and it so happens that this actually works a little better in this case.
But once you put a backend into the system, which is what everybody usually shows, these differences really disappear, so I'm going to use the ML system for the rest of the discriminative training work.
As I said, these numbers are not very good; they're about three times as bad as the state of the art. What is usually done is an additionally trained backend. The simplest one, which I think John had, is the scalar multiclass thing that was described before; that's logistic regression. You can do a full logistic regression with a matrix instead of a scalar, you can put a Gaussian backend in front of a logistic regression, which is something that we tried before, or you can use a discriminatively trained Gaussian as the backend, which is something we were doing at Lincoln for quite a while. These systems all work much better, and pretty similarly to each other. You can also build the classifier itself to be discriminative; one of the more common things to do is an SVM, one versus rest. That still doesn't solve the final task, but it can help. And if you do one-versus-rest logistic regression, you also still need a backend. Or, as has been done recently, you can do a multiclass training of the classifier itself followed by a multiclass backend.
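For context, here is a minimal sketch of a full (matrix) multiclass logistic regression backend of the kind mentioned above, trained by plain gradient descent on cross entropy over score vectors; this is only an illustration, not the specific backends used in the talk:

```python
import numpy as np

def train_lr_backend(S, y, n_classes, lr=0.1, iters=200):
    """S: (n, k) raw score vectors, y: (n,) integer class labels.
    Learns an affine map A, b so that softmax(S @ A.T + b) is calibrated."""
    A = np.eye(n_classes, S.shape[1])
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(iters):
        Z = S @ A.T + b
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)         # softmax posteriors
        G = (P - Y) / len(y)                      # gradient of mean cross entropy
        A -= lr * G.T @ S
        b -= lr * G.sum(axis=0)
    return A, b
```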
But what I want to talk about is trying to do everything together: one training of the multiclass system that won't need its own separate backend, ready to apply Bayes' rule straight out. MMI is not commonly used in backends, but in our field it is a very common thing in the GMM world, in speech recognition work. The criterion, if you're not familiar with it, is another name for cross entropy, which is the same metric that logistic regression uses. It is a multiclass "are your probabilities correct" kind of metric, and it is a closed-set discriminative training of the classes against each other.
The update equations, if you haven't seen them, are kind of cool, and they're kind of different: it's a bit of a weird derivation compared to the gradient descent that everybody is used to. It can be interpreted as gradient descent with a kind of magical step size, but it's quite effective. The way it's always done in speech recognition is, since you are applying this to a Gaussian system, you start with an ML version of the Gaussian and then you discriminatively update it, so to speak. That makes the convergence much easier. It gives a natural regularization, because you're starting with something that is already a reasonable solution; in fact the simplest form of regularization is just to not let it run very long, which is also a lot cheaper. It also gives you something you can tie back to: you can put in a penalty function that says don't be too different from the ML solution. So regularization is a straightforward thing to do with MMI. And the diagonalizing covariance transformation that I was talking about is really helpful here, because then we only discriminatively update diagonal covariances instead of full covariances, so we have fewer parameters than full-matrix logistic regression but more parameters than the scalar logistic regression.
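A minimal sketch of the closed-set cross-entropy (MMI) criterion over the Gaussian class models, optimized here by plain gradient ascent on the means with the shared diagonal covariance held fixed. The talk's actual update equations (the "gradient descent with a magical step size") are different, so treat this purely as an illustration of the criterion; all names are mine:

```python
import numpy as np

def mmi_mean_update(X, y, mus, w_diag, lr=0.01, iters=20):
    """X: (n, d) i-vectors, y: (n,) labels, mus: (L, d) class means,
    w_diag: (d,) shared diagonal within-class covariance (kept fixed here)."""
    Y = np.eye(len(mus))[y]
    for _ in range(iters):
        # Gaussian log-likelihoods with shared diagonal covariance (constants dropped)
        ll = -0.5 * (((X[:, None, :] - mus[None]) ** 2) / w_diag).sum(axis=2)
        P = np.exp(ll - ll.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # closed-set posteriors (flat prior)
        # Gradient of the mean log-posterior of the correct class w.r.t. each mean
        R = (Y - P) / len(y)
        grad = R.T @ (X / w_diag) - R.sum(axis=0)[:, None] * (mus / w_diag)
        mus = mus + lr * grad                      # ascend the MMI / cross-entropy objective
    return mus
```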
So now these are pretty much state-of-the-art numbers; remember, the previous numbers I showed were up here, essentially. This one is the ML Gaussian followed by an MMI-trained Gaussian backend in the score space, which was kind of our default way of doing things when I was at Lincoln. This score is kind of a disappointment: it's what you get if you take the training set, discriminatively train with MMI, and don't have a backend. It is in fact considerably better than the ML system, its equivalent that I started with, but it is nowhere near where we want to be, obviously.
So why not? One of the quirks of LRE, which I think is more data-dependent than realistic, is that the test data actually look different from the training set. So far this was trained only on the training set; it's not using any dev set at all. The most obvious thing is that the dev set and the test set are all approximately thirty seconds, while the training set is whatever length the conversation sides happen to be, so that's an obvious mismatch. So I took the training set and truncated everything to thirty seconds instead of the entire conversation side. Throwing away data in that way turned out to be very helpful, because it's now a much better match to what the test data look like. But it's not everything I wanted, so then I took the thirty-second training set, concatenated it with the dev set, which is a thirty-second set, and used the entire set at once for training the system. That in fact works as well as, and slightly better than, the two separate stages of a system followed by a discriminatively trained backend.
So I looked at a number of different permutations of this MMI system. Anybody who's done it for GMM MMI knows you can train this, that, or the other, and various combinations. The simplest thing is just to update the means only, and that is fairly effective. You can train the mean and the within-class covariance; of course, in the closed-set system the across-class covariance doesn't come into play, it's only the within-class covariance that has an effect. One thing I found kind of interesting is, instead of training the entire covariance matrix, to train a scale factor which scales the covariance; that's a slightly simpler system with fewer parameters. And you can also play with a sequential system; in particular I found it interesting to do the scale factor first and then the means. In the end that would give the same solution, but when you only run a limited number of iterations, the starting point in the sequence does affect what you get.
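A small sketch of the covariance scale factor mentioned above: a single trainable scalar alpha that multiplies the shared (diagonalized) within-class covariance in the scoring function, which mainly controls how sharp or flat the class posteriors are; the names are mine, not from the paper:

```python
import numpy as np

def class_loglik(X, mus, w_diag, alpha):
    """Log-likelihoods under N(mu_j, alpha * diag(w_diag)) for each class j
    (the 2*pi constant is dropped)."""
    d = X.shape[1]
    sq = (((X[:, None, :] - mus[None]) ** 2) / (alpha * w_diag)).sum(axis=2)
    return -0.5 * sq - 0.5 * (d * np.log(alpha) + np.log(w_diag).sum())
```

In the sequential recipe from the talk, alpha would be trained first, with the means fixed at their ML values, and the means updated afterwards.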
So, again the same sort of plot. This is what happens with purely no backend, just the discriminatively trained classifier itself. If you do means only, the actual Cavg is not terribly good, but the min Cavg is pretty close. So that is an indication: what calibration means in a multiclass detection task is kind of controversial, but one thing that I think I can say comfortably is that whenever you see this gap, it means that you're not calibrated. The absence of a gap doesn't necessarily mean that you are calibrated, because Bayes' rule is more complicated than that, but a gap like this means it is clearly not calibrated. So once we do something to the variance — this one is training the mean and the entire variance, this one the mean and the scale factor at the same time, and this one a two-stage process of the scale factor of the variance followed by the means — all of those work much better. So in order to get calibration you need to actually adjust the covariance matrix, which kind of makes sense: you need the scale factor or something. And once you fine-tune the numbers, as we typically do when we're actually working on these kinds of tasks, you actually see that the two-stage process is the best one, and it is better than our earlier two-step approach of a separate system followed by a backend.
Okay, so that's the discriminative training part. The other thing I want to talk about is the out-of-set problem that was mentioned in a question earlier, because oftentimes we're interested in situations where the input could be another language that is not one of the closed set. The nice thing about the two-covariance mathematics that we've been using for speaker recognition is that it hands you a model for what out-of-set is supposed to be. I already mentioned that, essentially, if you have a Gaussian distribution of what all models look like, then an out-of-set language is a randomly drawn language from that pool, and that's represented by the across-class Gaussian distribution.
Then at test time you have an even bigger Gaussian, because the uncertainty is both the channel plus which language it was. So now the out-of-set hypothesis is also a Gaussian, but it has the bigger covariance, while all the others share a covariance which is smaller, so you no longer have a linear system when you make a comparison. This is the most general formula, for when you have an open-set problem which is both out-of-set and closed-set, and this is how you would combine them: this part is what I had before, the sort of Bayes-rule composition of all the other closed-set classes, and this is the new distribution, the out-of-set distribution. If you want a pure out-of-set problem, which is what I'm going to talk about here, you just set the probability that a non-target is out-of-set to one, but in fact you could make a mixed distribution as well.
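A hedged reconstruction of the distributions being described, writing $B$ for the across-class covariance, $\mu_0$ for its mean, and $P_{\mathrm{oos}}$ for the prior that a non-target trial is out-of-set (all notation added here):

$$ p(x \mid \text{out-of-set}) = \mathcal{N}(x;\, \mu_0,\, W + B), $$

$$ p(x \mid \text{non-target of } \ell) = P_{\mathrm{oos}}\, \mathcal{N}(x;\, \mu_0,\, W + B) \;+\; \frac{1 - P_{\mathrm{oos}}}{L-1} \sum_{m \neq \ell} \mathcal{N}(x;\, \mu_m,\, W). $$

Setting $P_{\mathrm{oos}} = 1$ gives the pure out-of-set case discussed next.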
Okay, so I want to talk about the out-of-set case and just touch on what I have now. If I were to do the Bayesian numerator for each class that I mentioned before, and then this denominator, then I have what I like to call Bayesian speaker comparison; there's a paper about that. It is the same answer as PLDA or the two-covariance model, and I'd like to emphasize that they're set up differently, so the numerator and denominator are different in these two formulations, but the ratio is the same thing, because it's the same model and it's the same correct answer.
With a formalism like the one I'm talking about here, I find it much easier to understand it in this context, philosophically. Daniel and I have spent a lot of time on this, and I concede that only a few of us come at it from this perspective. In this terminology, we say that we have a model for each class and the covariances are hyperparameters; in the other terminology you like to say that there is no model and the parameters of the system are the covariance matrices. Again, it's the same system and the same answer from a different perspective, but when we're talking about closed sets and ML models, I know how to say that in this context, and I don't know so well how to say it in the PLDA one.
So, discriminative training of the out-of-set model that I described: as I've said, I now have this MMI hammer in my toolbox, and this is just one more covariance that I can train, since I've got an across-class mean and covariance. The ML out-of-set system just takes the sample covariance matrices for all of these. But I can do an MMI update of this out-of-set class as well. The simplest way for me to do that is to take the closed-set system I already presented, freeze the closed-set models, and then separately update the out-of-set model given the closed-set models. I can do that by scoring one-versus-rest instead of scoring with Bayes' rule, doing a round robin on the same training set. The advantage of this is that I can actually build a system without ever having any out-of-set data. I'd probably do better if I really did have out-of-set data, but in this case I don't, and I can still build a perfectly legitimate system.
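A minimal sketch of one plausible reading of the round-robin, one-versus-rest setup: every training i-vector is paired with every frozen closed-set model, is a target trial only for its own language, and otherwise should be won by the out-of-set Gaussian; here only a scale on the out-of-set covariance is learned, by gradient ascent on the binary cross entropy. The talk's actual MMI update differs, so this is illustration only:

```python
import numpy as np

def logpdf_diag(X, mu, var):
    """Diagonal-covariance Gaussian log-density (2*pi constant dropped)."""
    return -0.5 * (((X - mu) ** 2) / var + np.log(var)).sum(axis=1)

def train_oos_scale(X, y, mus, w_diag, b_diag, mu0, lr=0.05, iters=50):
    """Freeze closed-set models (mus, w_diag); learn log-scale s for the
    out-of-set covariance w_diag + exp(s) * b_diag via round-robin trials."""
    s = 0.0
    for _ in range(iters):
        oos_var = w_diag + np.exp(s) * b_diag
        grad = 0.0
        for j, mu in enumerate(mus):                  # one-vs-rest over all models
            llr = logpdf_diag(X, mu, w_diag) - logpdf_diag(X, mu0, oos_var)
            p_class = 1.0 / (1.0 + np.exp(-llr))      # posterior for "in class j"
            t = (y == j).astype(float)                # target only for its own language
            # d(log out-of-set likelihood)/ds through oos_var = w + e^s * b
            dll_ds = 0.5 * ((((X - mu0) ** 2) / oos_var - 1.0) / oos_var
                            * (np.exp(s) * b_diag)).sum(axis=1)
            grad += ((p_class - t) * dll_ds).sum()
        s += lr * grad / (len(X) * len(mus))
    return np.exp(s)
```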
So, for the performance of this system, what I've done here is score this LRE, even though there is no out-of-set data, without Bayes' rule, so that the system is not allowed to know what the other classes were; that simulates an open-set scoring function.
For the ML version of this, the actual Cavg is actually off the chart; those are the kind of bad numbers that I started with. With MMI training of the closed-set system and then the ML version of the across-class covariance, it is in fact already a lot better, so whatever is happening in the closed-set discriminative training is actually helping the open-set scoring as well. But explicitly retraining the out-of-set covariance matrix with the same mechanism, the scale factor and then the mean, in fact behaves pretty reasonably, and you get a system which is not obviously uncalibrated and has pretty reasonable performance. The closed-set scoring performance is still down here, but this has gotten a lot better, and it's perfectly feasible.
So the two contributions here were, first, the single-system concept: we don't have to do system design and then a backend; we can discriminatively train the system to already be calibrated. And second, we can model out-of-set using the same mathematics that we have in speaker recognition, but a simpler version, because we don't need to be Bayesian in this case, and I think it can also be discriminatively updated so that we can be reasonably calibrated for the open-set task as well.
So, thanks, Alan. It's very nice to see that you unified those two parts of the system; I wish we could do that in speaker recognition. My question is about your maximum-likelihood across-class covariance: you've got twenty-four languages to work with, in six-hundred-dimensional i-vectors, so how did you estimate or assign that parameter?
It is the sample covariance; everything here was done with a dimension reduction at the front, to twenty-three dimensions. I'm sorry, that's why the illustration already specified that there would be twenty-three dimensions: anything that has a prior is limited to twenty-three dimensions. In this case I just took the sample covariance matrix; if you regularized it somehow, you could make it appear to be bigger, to be full size.
Okay, so those formulas you showed with the covariances, that happens in twenty-three-dimensional space?
Yes.
So in this case you're doing LDA, and then, if I'm not mistaken, isn't that the same as doing a Gaussian backend and then another calibration?
Well, LDA and a regression backend — in this evaluation, you were computing the sample covariances in the full space, but the across-class is only rank twenty-three, so you take the six-hundred-dimensional within-class and map it to twenty-three?
Yes.
So if you do LDA and regression, or a Gaussian backend in the same subspace as LDA — whether you project with LDA to twenty-three, or you take the Gaussian scores and get twenty-four scores, it's almost the same thing, so you're still doing two steps.
It's still just two steps, in my view: the ML estimation, which in this case forces you to be twenty-three-dimensional, and then the update with those equations.
But LDA plus a Gaussian backend — the similarity there is very close.
Well, the way we would have done a system before would be LDA, then a Gaussian in that space, and then MMI training in the score space, on the likelihood ratios of the first stage. This is MMI training in the i-vector space directly. But these are not very complicated mathematics; the two are pretty closely related, yes.
So when you did the joint diagonalization, you then work with diagonal covariance matrices, but you're also updating those covariance matrices in training — is that diagonalization still valid then?
I mean, you do that as one static projection, so what it means when you force the update to stay diagonal — it's sort of like saying — I mean, the entire thing can be mapped back, by undoing the diagonalization, into a full covariance. So in some sense you are still updating a full covariance; you're just updating it in a constrained way. The matrix is still the full size, but the number of parameters that you discriminatively update is not the full set.
So, if I remember correctly, you're actually doing closed-set with twenty-three or twenty-four languages — is that correct? Twenty-four languages, right. So is it possible — I mean, I don't want to change your problem — but if you were to look at a subset, say pick twelve of them and treat the others as completely out-of-set data, so you do the discriminative training on only a portion, since we don't have actual out-of-set data: do you have some sense of how strong your solution would be if you didn't have access to those similar-sounding languages that you want to reject?
I think it's an interesting thought that you could test this out-of-set hypothesis more extensively by doing a hold-one-out or something and doing a round robin on that. I think that's an interesting idea, but I haven't done it.