0:00:24This whole session is about compact representations for speaker identification, and the title of the first talk is "A Small Footprint i-Vector Extractor".
0:00:39So, to repeat: what I mean by a small footprint i-vector extractor is one that requires only a modest amount of memory.
0:01:00The trouble we have, the basic problem, is that the standard algorithms for extracting i-vectors are quadratic, in both their memory and their computational requirements.
0:01:13What we present is a follow-up on the paper that Ondrej Glembek presented at last year's ICASSP, an approximation showing how i-vectors could be extracted with minimal memory overhead. After discussing that work, what I intend to do is show how that idea can be taken further.
0:01:49The principal motivation for doing this work: it's well known that you can introduce approximations at run time with only minor degradation in recognition performance.
0:02:06However, these approximations generally do cause problems if you want to do training. So, the
0:02:12motivation for aiming at the exact posterior computations was to be able to do training
0:02:20and, in particular, to be able to do training on a very large scale. Traditionally,
0:02:26most work with i-vectors has been done with dimensions of four or six hundred. In other areas of pattern recognition, principal components analyzers of much higher dimension have been constructed, so one purpose of the paper was to be able to run experiments with very high-dimensional i-vector extractors.
0:02:52As it happens, this didn't pay off. But the experiments needed to be done, you
0:02:57know, in any case.
0:03:03Okay, so the point of i-vectors, then, is that they provide a compact representation of an utterance: typically a vector of four or eight hundred dimensions, independently of the length of the utterance. So the time dimension is banished altogether, which greatly simplifies the problem.
0:03:29Essentially, it now becomes a traditional
0:03:33biometric pattern recognition problem without the complication introduced by
0:03:40arbitrary duration. So, many standard techniques apply, and joint factor analysis becomes vastly simpler; so simple that it now has another name: it's called probabilistic linear discriminant analysis.
0:03:56And, of course, the simplicity of this representation has led to fruitful research in other areas, like language recognition, and even speaker diarization, for i-vectors can be extracted from speaker turns as short as just one second.
0:04:20So, the basic idea is there is an implicit assumption that, given an utterance, it
0:04:25can be represented by a Gaussian mixture model.
0:04:30If that GMM were observable, then the problem of extracting the i-vector would simply be a matter of applying a standard probabilistic principal components analysis to the GMM supervector.
0:04:50So, the basic assumption is that the supervectors lie in a low-dimensional subspace; the basis of that space is known as the eigenvoices, and the coordinates of the supervector relative to that basis are the i-vector representation. So, the idea is that the components of the i-vector should represent high-level aspects of the utterance, which are independent of the phonetic content.
0:05:22Because all of this apparatus is built on top of the UBM, the UBM can play the role of modelling the phonetic variability in the utterance, and the i-vector then should capture things like speaker characteristics, the room impulse response, and other global aspects of the recording.
0:05:48So, the problem that arises is that the GMM supervector is not observable. The way to get around the problem is by thinking of the Baum-Welch statistics, typically collected with the universal background model, as summarising a noisy observation of the GMM supervector.
0:06:21From the mathematical point of view, the only difference between this situation and a standard probabilistic principal components analysis is that in the standard situation you get to observe every component of the vector exactly once, and in this situation you observe different parts of the vector different numbers of times. Other than that, there is nothing mysterious in the derivation.
0:06:49So, this is the mathematical model: the supervectors are assumed to be confined to a low-dimensional subspace of the supervector space.
0:07:04The vector y is assumed to have a standard normal prior distribution. Now, the problem is, given the Baum-Welch statistics, to produce a point estimate of y, and that is the i-vector representation of the utterance.
0:07:21You can also write this in terms of the individual components of the GMM;
0:07:28the standard assumption is that the covariance matrix here remains unchanged, it's the same for
0:07:35all utterances.
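As a concrete sketch of the model just described, here is a minimal, purely illustrative version (toy dimensions and random parameters, not figures from the talk): each component mean of the utterance GMM is the UBM mean shifted along the eigenvoice directions by the hidden factor y.

```python
import numpy as np

rng = np.random.default_rng(3)
C, F, R = 8, 4, 6                    # mixture components, feature dim, i-vector dim (toy sizes)
m = rng.standard_normal((C, F))      # UBM component means (the supervector, reshaped)
T = rng.standard_normal((C, F, R))   # eigenvoices, one F x R block per mixture component
y = rng.standard_normal(R)           # hidden factor with a standard normal prior

# Utterance-dependent GMM means: mu_c = m_c + T_c y for each component c
mu = m + np.einsum('cfr,r->cf', T, y)
```

The i-vector problem is the inverse of this sketch: given noisy observations of the mu_c (via Baum-Welch statistics), recover a point estimate of y.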
0:07:39Attempting to make that utterance-dependent seems to lead to insuperable problems in practice; nobody, to my knowledge, has ever made any progress with that problem.
0:07:55One aspect that is common to most implementations is that some of the parameters, namely the mean vectors and the covariance matrices, are copied from the UBM into the probabilistic model.
0:08:15That actually leads to a slight improvement in performance; I'll report some results.
0:08:25The main advantage, though, is that you can simplify the implementation by performing an affine transformation of the parameters, which enables you to take the mean vectors to be zero and the covariance matrices to be the identity, and that enables you to handle UBMs with full covariance matrices in the simplest way.
0:08:51It's well known that using full covariance matrices does help.
0:08:59So, these are the standard equations for extracting the i-vectors, assuming that the model parameters
0:09:07are known.
0:09:09The problem is accumulating this matrix here.
0:09:19Those are the zero order statistics,
0:09:24that are extracted with the UBM.
0:09:29The standard procedure is to precompute the terms here. These matrices here are symmetric matrices, so you only need the upper triangle.
0:09:41The problem, then, from the memory point of view, is that because these are quadratic in the i-vector dimension, you have to pay a heavy price in terms of memory.
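To make the quadratic cost concrete, here is a minimal sketch of the exact posterior computation (toy sizes, random statistics, and statistics assumed already whitened so the UBM covariances are the identity): the per-component matrices that the standard implementation precomputes are R x R, hence quadratic in the i-vector dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, R = 8, 4, 10                   # mixtures, feature dim, i-vector dim (toy sizes)
T = rng.standard_normal((C, F, R))   # eigenvoices (covariances whitened to identity)
N = rng.uniform(1.0, 20.0, size=C)   # zero-order Baum-Welch statistics
f = rng.standard_normal((C, F))      # centered first-order Baum-Welch statistics

# Standard trick: precompute the C symmetric R x R matrices T_c' T_c.
# Storage for these grows as C * R^2 -- the quadratic memory cost at issue.
TtT = np.stack([T[c].T @ T[c] for c in range(C)])

# Exact posterior of y: precision L = I + sum_c N_c T_c' T_c, mean = L^{-1} T' f
L = np.eye(R) + np.tensordot(N, TtT, axes=1)
ivector = np.linalg.solve(L, sum(T[c].T @ f[c] for c in range(C)))
```

The posterior mean `ivector` is the i-vector; the precision `L` is the matrix whose accumulation dominates both memory and computation in the standard approach.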
0:09:57So, those are some typical figures for a fairly standard sort of configuration.
0:10:11These are the standard training algorithms; the only point I wanted to make in putting up this equation here is that both in training and in extracting the i-vector, which was on the previous slide, the principal computation is a matter of calculating the posterior distribution of that factor y.
0:10:39So, that is the problem: calculating the posterior distribution of y, not just the point estimate.
0:10:54So, the contribution of this paper is to use a variational Bayes implementation of the probability model in order to solve this particular computational problem.
0:11:12So, the standard assumption is that the posterior distribution that you're interested in factorizes; in other words, that there is a statistical independence assumption that you can impose.
0:11:31Estimating these terms here is carried out by a standard variational Bayes update procedure, which you can find in the reference, Bishop's book.
0:11:48This notation here means you take the vector y and calculate an expectation over all components other than the particular component that you happen to be interested in when you're updating that particular term.
0:12:09These update rules are guaranteed to increase the variational lower bound, and that's useful.
0:12:20So, this is an iterative method; a single iteration consists of cycling over the components of the i-vector.
0:12:34This is just to explain that the computation is actually brought down to something very simple: the assumptions imply that the factors in the variational factorization are also Gaussian, and to get the normal distributions you just apply these update formulas.
0:12:58And the point about the memory, then, is that, just as in the full posterior calculation, pre-computing these matrices here enables you to speed up the computation. The things you have to pre-compute here are just the diagonal versions of these things, and for that the memory overhead is negligible.
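A minimal sketch of the coordinate-wise update just described (toy data; the posterior precision L is taken as given and nearly diagonal, as it would be in a well-chosen basis). Cycling over the components of y, updating each one with the others held fixed, amounts to Gauss-Seidel iteration on the linear system L y = b, so the point estimate converges to the exact posterior mean:

```python
import numpy as np

rng = np.random.default_rng(1)
R = 10
A = 0.05 * rng.standard_normal((R, R))
L = np.eye(R) + A @ A.T              # toy posterior precision, nearly diagonal
b = rng.standard_normal(R)           # stands in for T' Sigma^{-1} f

# Variational Bayes point estimate: update each component of y in turn,
# holding the current values of the other components fixed.
y = np.zeros(R)
for _ in range(3):                   # a few sweeps typically suffice
    for i in range(R):
        y[i] = (b[i] - L[i] @ y + L[i, i] * y[i]) / L[i, i]

exact = np.linalg.solve(L, b)        # exact posterior mean, for comparison
```

Only the diagonal of L is ever divided by, so the quadratic precomputation of full R x R blocks is not needed; each sweep costs one pass over the components.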
0:13:27So, this is all based on the assumption that we can assume a diagonal posterior covariance matrix, and it's explained in the paper why, even if that assumption turns out to be wrong, the variational Bayes method is guaranteed to find the exact point estimate of the i-vector.
0:14:00See? So, the only error that's introduced here is in the posterior covariance matrix, which is assumed to be diagonal. There's no error in the point estimate of the posterior mean.
0:14:27If you're familiar with numerical linear algebra, the mechanics correspond to something known as the Gauss-Seidel method, whose convergence in this case happens to be guaranteed because of the variational Bayes interpretation.
0:14:45So the method is exact; the only real issue is how efficient it is. That turns out to raise the question of how good the assumption is that the posterior covariance matrix can be treated as diagonal.
0:15:06There are two points to bear in mind here, in order to show why the assumption is reasonable.
0:15:14First is that the i-vector model is not uniquely defined.
0:15:18You can perform a rotation of the i-vector coordinates; provided that you perform a corresponding transformation on the eigenvoices, the model remains unchanged.
0:15:35The prior on the factor y is standard normal, so it continues to be the same distribution under rotation; you have freedom in rotating the basis.
0:15:47The other point, and this was the point that Ondrej Glembek made in his ICASSP paper last year, is that, in general, this is a good approximation to the posterior precision matrix, provided you have sufficient data. So, those w's there are just the mixture weights in the universal background model,
0:16:18and N is the total number of frames; in a scenario like the core condition, for example, it may be large enough that you have sufficiently many frames for this approximation here to be reasonable.
0:16:38If you combine those two things together, you can say that by diagonalizing this sum here, which you form just once, using the mixture weights, you will produce a basis of the i-vector space with respect to which all the posterior precision matrices are approximately diagonal.
0:17:04That's the justification for the diagonal assumption. You have to use a preferred basis in order to do the calculations.
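A sketch of the preferred-basis construction (toy parameters, whitened statistics assumed): form the weighted sum of the per-component matrices once, using the UBM mixture weights, and diagonalize it. Rotating the eigenvoices by the resulting orthogonal matrix leaves the model unchanged, but makes the weighted sum, and hence the typical posterior precision, diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)
C, F, R = 8, 4, 6
T = rng.standard_normal((C, F, R))   # eigenvoices (whitened statistics assumed)
w = rng.dirichlet(np.ones(C))        # UBM mixture weights, summing to one

# Diagonalize sum_c w_c T_c' T_c once; Q is an orthogonal rotation
S = sum(w[c] * T[c].T @ T[c] for c in range(C))
eigvals, Q = np.linalg.eigh(S)
T_pref = np.einsum('cfr,rs->cfs', T, Q)   # rotated eigenvoices

# In the preferred basis the weighted sum is exactly diagonal, so posterior
# precisions of the form I + N * (weighted sum) are approximately diagonal
# whenever the zero-order statistics satisfy N_c ~ N * w_c
S_pref = sum(w[c] * T_pref[c].T @ T_pref[c] for c in range(C))
```

The rotation is computed once, offline; at run time every utterance is processed in this fixed preferred basis.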
0:17:15And using this preferred basis guarantees that the variational Bayes algorithm will converge very quickly. Typically, three iterations are enough, independently of the dimensionality of the i-vector.
0:17:34And that's the basis of my contention that this algorithm's computational requirements are linear in the i-vector dimension. So, the memory overhead is negligible and the computation scales linearly rather than quadratically.
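As a rough back-of-the-envelope check (hypothetical but plausible sizes, not figures quoted in the talk: 2048 mixture components, 800-dimensional i-vectors, single-precision floats), the standard approach stores one symmetric R x R triangle per mixture component, while the diagonal approach stores only the diagonals:

```python
C, R = 2048, 800     # mixture components, i-vector dimension (assumed sizes)
BYTES = 4            # single-precision floats

full_triangles = C * R * (R + 1) // 2 * BYTES   # one upper triangle per component
diagonals_only = C * R * BYTES                  # diagonal entries only

print(f"standard precompute: {full_triangles / 1e9:.2f} GB")
print(f"diagonal precompute: {diagonals_only / 1e6:.2f} MB")
```

With these assumed sizes the full triangles run to a couple of gigabytes, consistent with the memory figures mentioned later in the talk, while the diagonals are a few megabytes.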
0:18:01If you're using this in training, the preferred basis is going to change from one iteration to the next, so you should not overlook that.
0:18:17Whenever you have a variational Bayes method, you have a variational lower bound, which is very similar to an auxiliary function, and which, like the auxiliary function, is guaranteed to increase on successive iterations of your algorithm.
0:18:43So, it's useful to be able to evaluate this, and
0:18:47the formula is given in the paper.
0:18:50It's guaranteed to increase on successive
0:18:52iterations of variational Bayes.
0:18:57It can be used for debugging. In principle, it can be used to monitor convergence, but it actually turns out that the overhead of using it for that purpose slows down the algorithm, so it's not used for that in practice.
0:19:14The point, I think, is that it can be used to monitor convergence when you are training an i-vector extractor with variational Bayes.
0:19:23The point here is that the exact evidence, which is the thing you would normally use to monitor convergence, cannot be used in this particular case: if you're assuming that the posterior is diagonal, then you have to modify the calculation.
0:20:03Okay, so a few examples of questions that I dealt with in the paper. One is: how accurate is the variational Bayes algorithm?
0:20:15To be clear here, there is no issue at run time: you are guaranteed to get the exact point estimate of your i-vector, provided you run enough iterations.
0:20:33The only issue is the approximation you make when you treat the posterior precision, or covariance matrix, as diagonal. And those posterior precisions do enter into the training algorithm, so it's conceivable that using the diagonal assumption on the posterior precisions could affect the way training behaves. So, that's one point that needs to be checked.
0:21:07As I mentioned at the beginning, this is well known, but I think it needed to be tested: if you make the simplifying transformation which allows you to take the mean vectors to be zero and the covariance matrices to be the identity, you're copying some parameters from the UBM into the probabilistic model, and there is a question as to whether that is a plausible thing to do.
0:21:46How efficient is variational Bayes? Obviously, there's going to be some price to be paid. In the standard implementation you can reduce the computational burden at the cost of several gigabytes of memory; here you no longer have the opportunity of using all that memory. So there is a question about efficiency.
0:22:15And finally, there is the issue of training very high-dimensional i-vector extractors. You cannot train very high-dimensional i-vector extractors exactly; the variational Bayes approach does enable you to do it, but there is a price to be paid for doing that.
0:22:42OK, so the testbed was female det two trials, all telephone speech: the extended core condition of the NIST two thousand and ten evaluation. The extended core condition has millions of trials, a very much larger number of trials than the original evaluation protocol.
0:23:13A standard front end and a standard UBM with diagonal covariance matrices, trained on the usual data. In other respects, the classifier was quite standard: I used heavy-tailed PLDA.
0:23:45These were results obtained with the JFA executables, which is the way i-vectors were originally built; they were just produced as a benchmark. There was a problem with the voice activity detection, which explains why the error rates were a little higher than expected. With variational Bayes I actually got marginally better results, which turned out to be due to a more effective choice of covariance matrices: copying the covariance matrices from the UBM, the effect of which is to reduce the otherwise estimated variances.
0:24:42I need to get to the timing figures.
0:24:47My figures for extracting a four-hundred-dimensional i-vector are typically about half a second. Almost all of the time is spent in BLAS routines; accumulating the posterior covariance matrices takes about seventy-five percent of the time. There's an estimate of a quarter of a second, which suggests that compiler optimization may be helpful; everything is going on inside the BLAS routines.
0:25:17For the variational Bayes method, I got an estimate of point nine seconds instead of point five; I fixed the number of iterations at five.
0:25:33The variational Bayes method really comes into its own when you work with higher-dimensional i-vector extractors.
0:25:47One last table: I did try several dimensions, up to sixteen hundred, and got very good run times and performance.
0:26:05Okay, thank you.
0:26:17Well, it depends on the ... on the dimensionality of the i-vector extractor.
0:26:25A couple of gigabytes; it's as big as a large vocabulary continuous speech recognizer. It's not as intelligent, but it's as big.
0:26:37Just the eigenvoices, okay, together with the stuff you have to pre-compute in order to
0:26:44extract the i-vectors efficiently.
0:26:53Yeah, that's what it requires.
0:26:57You still have to store the eigenvoices but, the thing that you might not know is, that's not the big part. The big part is the bunch of triangular matrices that we store in order to calculate i-vectors efficiently, using the standard approach. The point of this was to use variational Bayes to avoid that computation.
0:27:14I'm afraid that we've just used up the time for questions here, so I guess Patrick will get lots of questions offline, and maybe at the end of this session he can answer those questions together with Sandro, who is giving the next talk.