0:00:24 | This whole session should be about compact representations for speaker identification and the first talk

0:00:30 | is... the title of the first talk is A small footprint i-vector extractor |

0:00:39 | So, I repeat, this talk |

0:00:44 | by a small footprint |

0:00:46 | i-vector extractor |

0:00:47 | one that |

0:00:50 | amount to memory |

0:01:00 | the troubles we have |

0:01:01 | The basic problem is that these kinds of algorithms for extracting i-vectors

0:01:06 | are quadratic |

0:01:07 | both of them, both the memory and |

0:01:10 | computational requirements

0:01:13 | we present |

0:01:15 | is a follow-up on |

0:01:18 | paper that Ondrej |

0:01:20 | Glembek presented at last year's ICASSP,

0:01:28 | The approximation on how i-vectors could be extracted

0:01:32 | with minimal memory |

0:01:33 | overhead, but after |

0:01:34 | after discussing that work, I intend to |

0:01:39 | is to show how that idea |

0:01:49 | The principal motivation for doing this work |

0:01:53 | it's well known that |

0:01:56 | you can introduce approximations at run time and

0:01:59 | have only minor degradation in the

0:02:02 | in the recognition performance |

0:02:06 | However, these approximations generally do cause problems if you want to do training. So, the |

0:02:12 | motivation for aiming at the exact posterior computations was to be able to do training |

0:02:20 | and, in particular, to be able to do training on a very large scale. Traditionally, |

0:02:26 | most work with i-vectors has been done with dimensions four or six hundred. In other |

0:02:34 | areas of pattern recognition, principal components analyzers of much higher dimension are constructed, so

0:02:42 | the |

0:02:43 | purpose of the paper was to be able to run experiments with very high-dimensional

0:02:50 | i-vector extractors. |

0:02:52 | As it happens, this didn't pay off. But the experiments needed to be done, you |

0:02:57 | know, in any case. |

0:03:03 | Okay, so the ... the point of i-vectors, then, is that they

0:03:09 | they provide a compact representation of an utterance |

0:03:14 | typically a vector of four or eight hundred dimensions, independently of the length of the utterance.

0:03:19 | So the time dimension is banished altogether, which greatly simplifies the problem.

0:03:29 | Essentially, it now becomes a traditional |

0:03:33 | biometric pattern recognition problem without the complication introduced by |

0:03:40 | arbitrary duration. So, many standard techniques apply and joint factor analysis becomes vastly simpler; so |

0:03:48 | simple that it now has another name: it's called probabilistic linear

0:03:53 | discriminant analysis |

0:03:56 | And, of course, the simplicity of this representation has led to

0:04:01 | fruitful research in other areas, like language recognition, and even speaker diarization, for i-vectors

0:04:11 | can be extracted from speaker turns as short as just one second.

0:04:20 | So, the basic idea is there is an implicit assumption that, given an utterance, it |

0:04:25 | can be represented by a Gaussian mixture model. |

0:04:30 | If the |

0:04:33 | if that GMM were observable, |

0:04:35 | then the problem of extracting the i-vector would simply be a matter of applying a |

0:04:41 | standard probabilistic principal components analysis

0:04:47 | to the GMM supervector |

0:04:50 | So, the basic assumption is that the supervectors lie in a low-dimensional space; the basis

0:04:56 | of that space |

0:04:59 | is known as the eigenvoices and the coordinates of the supervector relative to that basis |

0:05:06 | is the i-vector representation. So, the idea is that the components of the i-vector should |

0:05:15 | represent high level aspects of the utterance, which are independent of the phonetic content. |

0:05:22 | Because all of this apparatus is built

0:05:24 | on top of the UBM, the UBM can play the role of modelling the phonetic variability in the

0:05:31 | utterance |

0:05:32 | and the i-vector then should capture things like speaker characteristics |

0:05:39 | room impulse response |

0:05:40 | and the other global aspects of the |

0:05:45 | utterance |

0:05:48 | So, the problem that arises is that the |

0:05:54 | the GMM supervector is not observable. The way to get around the problem is by |

0:06:01 | thinking of the Baum-Welch statistics,

0:06:04 | which are typically collected with the universal

0:06:07 | background model, as ... summarising a noisy observation of the... of the GMM supervector.

0:06:20 | the |

0:06:21 | From the mathematical point of view, the only difference between this situation and a standard

0:06:28 | probabilistic principal components analysis is that in the standard situation you get to observe every

0:06:36 | component of the vector exactly once, and in this situation you observe different parts of the

0:06:41 | vector a different number of times.

0:06:45 | Other than that, there is nothing mysterious in the derivation.

0:06:49 | So, this is the mathematical model, the supervectors |

0:06:55 | are assumed to be confined to a low-dimensional

0:07:00 | subspace of the supervector space.

0:07:04 | The vector y is assumed to have, a priori, a standard normal distribution. Now, the problem

0:07:10 | is, given Baum-Welch statistics, to produce a point estimate of y, and that is the

0:07:16 | i-vector representation of the utterance.

0:07:21 | You can also write this in terms of the individual

0:07:23 | components of the GMM;

0:07:28 | the standard assumption is that the covariance matrix here remains unchanged, it's the same for |

0:07:35 | all utterances. |

0:07:36 | the |

0:07:39 | Attempting to make that utterance-dependent seems to lead to insuperable problems in practice;

0:07:46 | nobody, to my knowledge, has

0:07:46 | ever made any progress with that problem.
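Editorial note: the slide with the model is not part of the transcript. In symbols, the model just described is usually written as below. The notation is an assumption of this note: s is the utterance supervector, m the UBM mean supervector, T the matrix of eigenvoices, c indexes the C mixture components, and x_t is a frame aligned with component c.

```latex
s = m + T y, \qquad y \sim \mathcal{N}(0, I)
% written per mixture component c = 1, \dots, C:
s_c = m_c + T_c \, y, \qquad x_t \mid (c, y) \sim \mathcal{N}(s_c, \Sigma_c)
% \Sigma_c is shared by all utterances; only y changes from one utterance to the next
```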

0:07:55 | One aspect that is common to most implementations is that |

0:08:02 | some of the parameters, mainly the mean vectors and the covariance matrices |

0:08:07 | are copied, technically, from the UBM, into the

0:08:12 | probabilistic model |

0:08:15 | That, actually, leads to a slight improvement in the performance |

0:08:19 | I'll report some results

0:08:22 | later |

0:08:25 | The main advantage, though, is that you can simplify the implementation by performing an affine

0:08:30 | transformation of the parameters |

0:08:35 | which enables you to take the mean vectors to be zero and covariance matrices |

0:08:39 | to be the identity |

0:08:41 | and that enables you to handle UBMs with full covariance matrices |

0:08:50 | in the simplest way. |

0:08:51 | It's well known that using |

0:08:53 | full covariance matrices

0:08:55 | does help. |
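Editorial note: a minimal sketch of the affine transformation described above, assuming full-covariance UBM components and a Cholesky factor as the whitening transform. The function name and array layout are hypothetical, not taken from the talk.

```python
import numpy as np

def whiten_statistics(N, F, means, covs):
    """Centre and whiten first-order Baum-Welch statistics (illustrative helper).

    N:     (C,)      zero-order statistics (soft frame counts per component)
    F:     (C, D)    first-order statistics (posterior-weighted sums of frames)
    means: (C, D)    UBM mean vectors
    covs:  (C, D, D) UBM full covariance matrices
    """
    F_white = np.empty_like(F, dtype=float)
    for c in range(len(N)):
        chol = np.linalg.cholesky(covs[c])            # covs[c] = chol @ chol.T
        centred = F[c] - N[c] * means[c]              # remove the UBM mean
        F_white[c] = np.linalg.solve(chol, centred)   # apply the inverse Cholesky factor
    return F_white
```

For the model to stay consistent, the block of eigenvoices belonging to each component would have to be multiplied by the same inverse Cholesky factor. After that, the means can be treated as zero and the covariances as the identity.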

0:08:59 | So, these are the standard equations for extracting the i-vectors, assuming that the model parameters |

0:09:07 | are known. |
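Editorial note: the equations on the slide are not part of the transcript. With the mean vectors taken to be zero and the covariance matrices the identity, the standard posterior is usually written as below. The symbols are assumptions of this note: N_c is the zero-order statistic for component c, F_c the centred and whitened first-order statistic, and T_c the block of eigenvoices for component c. The i-vector is the posterior mean.

```latex
\operatorname{Cov}(y \mid \text{utterance})
  = \Bigl( I + \sum_{c=1}^{C} N_c \, T_c^{\top} T_c \Bigr)^{-1},
\qquad
\mathbb{E}[\, y \mid \text{utterance} \,]
  = \operatorname{Cov}(y \mid \text{utterance}) \, \sum_{c=1}^{C} T_c^{\top} F_c
```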

0:09:09 | the matrices be |

0:09:10 | the |

0:09:13 | problem is accumulating this... this matrix here. |

0:09:19 | Those are the zero order statistics, |

0:09:24 | that are extracted with the UBM. |

0:09:29 | The standard procedure is to precompute the terms here These matrices, |

0:09:34 | these here, they're symmetric matrices. |

0:09:37 | So, you only need the |

0:09:41 | upper triangle.

0:09:41 | The problem, then, from the memory point of view, is that... because these are quadratic

0:09:48 | in the i-vector dimension, you have to pay a heavy |

0:09:52 | price in terms of memory. |

0:09:57 | So, those are some |

0:09:58 | typical figures |

0:09:59 | for fairly standard sort of |

0:10:01 | configuration. |
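Editorial note: the figures on the slide are not part of the transcript. The sketch below shows the standard procedure (precompute one symmetric matrix per mixture component, then accumulate and solve) and a helper that counts the memory the upper triangles need. All names and the array layout are hypothetical.

```python
import numpy as np

def precompute_gram_matrices(T):
    """T has shape (C, D, R); returns the C matrices T_c^T T_c, shape (C, R, R)."""
    return np.einsum('cdr,cds->crs', T, T)

def extract_ivector(N, F, T, gram):
    """Exact posterior mean of y from whitened statistics N (C,) and F (C, D)."""
    R = T.shape[2]
    precision = np.eye(R) + np.einsum('c,crs->rs', N, gram)  # I + sum_c N_c T_c^T T_c
    rhs = np.einsum('cdr,cd->r', T, F)                        # sum_c T_c^T F_c
    return np.linalg.solve(precision, rhs)

def triangle_storage_bytes(C, R, bytes_per_value=8):
    """Memory for the upper triangles of C symmetric R x R matrices."""
    return C * (R * (R + 1) // 2) * bytes_per_value
```

As an illustration only, a hypothetical configuration of 2048 components, 400-dimensional i-vectors and 8-byte values gives `triangle_storage_bytes(2048, 400)` = 1,313,996,800 bytes, about 1.3 GB, and doubling the i-vector dimension roughly quadruples that.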

0:10:08 | that is |

0:10:11 | These are the standard training algorithms; the only point I wanted to make in |

0:10:17 | putting up this equation here is that: both in training and in extracting the i-vector, |

0:10:25 | which was |

0:10:26 | the previous slide, the principal |

0:10:30 | computation is a matter of calculating the |

0:10:34 | posterior distribution of that factor y. |

0:10:39 | So, that is the problem, |

0:10:40 | calculating the posterior distribution of y. |

0:10:45 | Not just the point |

0:10:54 | So, the contribution of this paper |

0:10:56 | is to use a variational Bayes implementation of the probability model |

0:11:03 | in order to solve this |

0:11:06 | particular problem of the |

0:11:08 | doing |

0:11:12 | So, the standard assumption is to assume that the |

0:11:18 | posterior distribution that you're interested in factorizes; in other words, that you have a statistical |

0:11:25 | independence assumption

0:11:27 | that you can impose |

0:11:31 | Estimating these terms here is carried out by a standard variational Bayes update procedure, which you

0:11:40 | can |

0:11:41 | find |

0:11:44 | in the reference

0:11:44 | to Bishop's book.

0:11:48 | This notation here means you take the vector y and |

0:11:53 | you |

0:11:55 | calculate an expectation over all components other than the particular component

0:12:01 | that you happen to be interested in when you're updating |

0:12:02 | the particular term. |

0:12:06 | the... |

0:12:09 | These update rules are guaranteed to increase the variational

0:12:12 | lower bound and that's a useful

0:12:17 | property |

0:12:20 | So, this is an iterative method, |

0:12:22 | a single iteration consists of

0:12:27 | cycling over the |

0:12:29 | components of the i-vector. |

0:12:34 | This is just to explain that the computation's actually brought down to something very simple |

0:12:41 | assumptions are |

0:12:42 | Gaussian |

0:12:43 | the |

0:12:46 | factors in the variational factorization are also Gaussian. To get the normals |

0:12:52 | you just |

0:12:57 | expression |

0:12:58 | and the point about the memory then is just as the... in the full posterior |

0:13:05 | calculation. Pre-computing these matrices here enables you to speed up the computation at a constant |

0:13:15 | memory |

0:13:16 | The things you have to pre-compute here are just the diagonal versions of these

0:13:22 | things here and for that the memory overhead is negligible. |
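Editorial note: a minimal sketch of the component-by-component update described above, assuming whitened statistics and a factorized Gaussian posterior. Only the diagonal of the precision matrix is formed, and the full R x R matrix never is. The function name, array layout and default iteration count are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def vb_ivector(N, F, T, n_iter=5):
    """Coordinate-wise variational Bayes estimate of the posterior mean of y.

    N: (C,) zero-order statistics; F: (C, D) whitened first-order statistics;
    T: (C, D, R) eigenvoices. Only the diagonal of the precision is stored.
    """
    C, D, R = T.shape
    rhs = np.einsum('cdr,cd->r', T, F)               # sum_c T_c^T F_c
    diag = 1.0 + np.einsum('c,cdr->r', N, T ** 2)    # diagonal of I + sum_c N_c T_c^T T_c
    y = np.zeros(R)
    Ty = np.zeros((C, D))                            # running value of T_c y for every c
    for _ in range(n_iter):
        for i in range(R):
            # i-th entry of (precision @ y), computed without forming the precision
            Ly_i = y[i] + np.sum(N[:, None] * T[:, :, i] * Ty)
            delta = (rhs[i] - Ly_i) / diag[i]
            y[i] += delta
            Ty += delta * T[:, :, i]                 # keep T_c y up to date
    return y
```

Each sweep over the R components costs on the order of C x D x R operations, so the work per sweep grows linearly with the i-vector dimension. In practice the per-component sums of squares that make up `diag` could be precomputed once, which is the small memory overhead mentioned above.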

0:13:27 | So, this is all based on the assumption that

0:13:32 | we can assume a diagonal posterior covariance matrix.

0:13:40 | So, |

0:13:44 | it's explained in the paper why the variational Bayes method

0:13:49 | , even if that assumption turns out to be wrong, |

0:13:53 | the variational Bayes is guaranteed to find the point estimate of the i-vector |

0:13:57 | exactly. |

0:14:00 | See? So, the only |

0:14:03 | error that's introduced here |

0:14:05 | is in the posterior covariance matrix, |

0:14:07 | it's assumed to be diagonal |

0:14:11 | There's no error |

0:14:13 | in the point estimate of the posterior, thus it's |

0:14:27 | If you're familiar with the numerical |

0:14:28 | the mechanics correspond to something known as the |

0:14:37 | which in this case happens to be guaranteed |

0:14:40 | convergence happens to be guaranteed because

0:14:43 | variational Bayes |

0:14:45 | So the method is exact, the only real issue is how efficient |

0:14:50 | it is. That turns out to raise the question of how

0:14:55 | reasonable is the

0:14:57 | assumption that |

0:14:58 | the covariance matrix can be treated as diagonal. |

0:15:06 | Two points here, to bear in mind. |

0:15:10 | In order to show why the assumptions are |

0:15:12 | reasonable. |

0:15:14 | First is that the i-vector model is not uniquely defined. |

0:15:18 | You can perform a rotation in the i-vector coordinates |

0:15:25 | provided that you perform a corresponding transformation on the |

0:15:29 | eigenvoices, the model remains unchanged. |

0:15:32 | The

0:15:34 | prior

0:15:35 | on the factor

0:15:36 | y continues to be the standard normal distribution.

0:15:42 | You have freedom in |

0:15:43 | rotating the |

0:15:46 | the basis. |

0:15:47 | The other point, this was the point that Ondrej Glembek |

0:15:55 | made in his

0:15:57 | ICASSP paper last year: that, in general, this is a good approximation to the posterior

0:16:04 | precision matrix, |

0:16:06 | provided you have sufficient data. So, those W's there are just the mixture |

0:16:11 | weights in the |

0:16:16 | universal background model

0:16:18 | and talking the number of frames |

0:16:20 | in, for example, |

0:16:21 | in a scenario like the core condition it may be something

0:16:29 | so that you have sufficiently many

0:16:31 | frames that this

0:16:32 | approximation here

0:16:34 | would be reasonable.

0:16:38 | If you combine those two things together, |

0:16:41 | okay? You can say that by diagonalizing this sum here you |

0:16:46 | form this sum just once, using the |

0:16:51 | the mixture weights. Then you will produce a basis of the i-vector space with respect |

0:16:58 | to which all the posterior |

0:17:00 | precision matrices are approximately diagonal. |

0:17:04 | That's the justification for the |

0:17:06 | diagonal assumption. You have to use a preferred |

0:17:09 | basis |

0:17:10 | in order to |

0:17:13 | do the calculations. |

0:17:15 | And using this... using this basis guarantees that the variational Bayes algorithm |

0:17:21 | will converge very quickly. |

0:17:22 | Typically, three iterations are enough, |

0:17:26 | three iterations independently |

0:17:27 | of the |

0:17:29 | rank of the dimensionality |

0:17:30 | of the i-vector. |
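Editorial note: a minimal sketch of how such a preferred basis could be computed, assuming whitened eigenvoices stored as one block per mixture component. The function name and array layout are hypothetical.

```python
import numpy as np

def rotate_to_preferred_basis(T, weights):
    """Rotate eigenvoices so that sum_c w_c T_c^T T_c becomes diagonal.

    T: (C, D, R) eigenvoices; weights: (C,) UBM mixture weights.
    Returns the rotated eigenvoices and the orthogonal matrix Q that was applied.
    """
    weighted_sum = np.einsum('c,cdr,cds->rs', weights, T, T)  # formed just once
    _, Q = np.linalg.eigh(weighted_sum)   # columns of Q are orthonormal eigenvectors
    return T @ Q, Q
```

Because Q is orthogonal, the standard normal prior on y is unchanged, and an i-vector y in the old coordinates corresponds to Q^T y in the new ones, so the model itself is the same.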

0:17:34 | And that's the basis of my contention that this algorithm's

0:17:39 | computational requirements are |

0:17:40 | linear in the |

0:17:47 | So, memory overhead is negligible and |

0:17:49 | the computation |

0:17:52 | scales linearly |

0:17:53 | rather than quadratically.

0:18:01 | If you're using this, |

0:18:04 | the preferred basis is going to change |

0:18:09 | , so you should not overlook that. |

0:18:17 | Whenever you have a variational Bayes method, you have a variational lower bound, which is |

0:18:23 | very similar to an auxiliary function and

0:18:29 | which, like the auxiliary function,

0:18:31 | which is guaranteed to increase |

0:18:33 | on successive iterations of your |

0:18:39 | algorithm.

0:18:43 | So, it's useful to be able to evaluate this, and |

0:18:47 | the formula is given in the paper. |

0:18:50 | It's guaranteed to increase on successive |

0:18:52 | iterations of variational Bayes. |

0:18:54 | So, |

0:18:57 | it's used for debugging. In principle,

0:18:59 | it can be used to monitor convergence, but it actually turns out that the |

0:19:02 | overhead of using it for that purpose |

0:19:06 | slows down the algorithm,

0:19:07 | it's not used for that in practice. |
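Editorial note: a small debugging sketch in the spirit of the remark above. It assumes a Gaussian model with precision matrix `precision` and right-hand side `rhs`, and it tracks only the part of the lower bound that depends on the posterior means. The full formula is in the paper and is not reproduced here, and the names are hypothetical.

```python
import numpy as np

def bound_trace(precision, rhs, n_iter=5):
    """Run coordinate-wise updates and record, after each sweep, the
    mean-dependent part of the bound: rhs @ y - 0.5 * y @ precision @ y."""
    y = np.zeros(len(rhs))
    trace = []
    for _ in range(n_iter):
        for i in range(len(rhs)):
            # exact maximisation of the quadratic along coordinate i
            y[i] += (rhs[i] - precision[i] @ y) / precision[i, i]
        trace.append(rhs @ y - 0.5 * y @ precision @ y)
    return np.array(trace)
```

A recorded value that drops from one sweep to the next, beyond rounding error, would point to a bug in the update code.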

0:19:14 | One thing, I think, is that it can be used to monitor convergence when

0:19:17 | you are training an i-vector extractor with variational Bayes.

0:19:23 | The point here is that the |

0:19:25 | exact evidence, which is |

0:19:28 | the thing you would normally use

0:19:29 | to monitor convergence, |

0:19:34 | cannot be used |

0:19:36 | in this particular case |

0:19:39 | if you're assuming that the posterior |

0:19:44 | is diagonal, then you have to |

0:19:46 | modify the calculation |

0:20:03 | Okay, so a few examples |

0:20:05 | of questions that I dealt with in the paper. |

0:20:10 | One is how accurate is |

0:20:12 | the variational Bayes algorithm?

0:20:15 | To be clear here, there is no issue |

0:20:21 | at run time, you are guaranteed to get the exact |

0:20:21 | point-estimate of your i-vector, |

0:20:25 | provided you |

0:20:27 | monitor iterations |

0:20:33 | The only issue is |

0:20:36 | the resulting approximation, when you treat the posterior precision or covariance matrix as

0:20:44 | diagonal.

0:20:47 | And those posterior precisions

0:20:49 | do enter into the

0:20:50 | training of the model, so it's conceivable that using the

0:20:59 | diagonal assumption on the posterior precisions could affect

0:21:00 | the way training behaves. |

0:21:02 | So, that's one point |

0:21:04 | that needs to be checked. |

0:21:07 | I mentioned at the beginning, |

0:21:09 | this is well known, but |

0:21:11 | I think needed to be tested. |

0:21:14 | If you make the simplifying transformation, which allows you to take the mean vectors to |

0:21:21 | be zero and the covariance

0:21:22 | matrices to be the identity,

0:21:27 | You're copying some parameters |

0:21:29 | from the UBM into the probabilistic model for |

0:21:33 | i-vectors. |

0:21:34 | There is a question as to |

0:21:37 | obviously |

0:21:38 | plausible reason

0:21:46 | How efficient is variational Bayes? |

0:21:49 | Obviously, there's going to be some price to be paid. |

0:21:53 | The standard implementation |

0:21:57 | you can reduce the computational burden

0:21:59 | at the cost of |

0:22:00 | several gigabytes of |

0:22:04 | memory, |

0:22:07 | so you no longer have opportunity |

0:22:08 | of using all that memory. |

0:22:12 | The question about efficiency. |

0:22:15 | And finally, there is an issue of |

0:22:20 | training very high dimensional i-vector extractors. |

0:22:23 | You cannot train very high dimensional |

0:22:24 | i-vector extractors exactly using the standard approach.

0:22:31 | The variational

0:22:32 | Bayes approach does enable you to do it,

0:22:34 | but there is an impairment of doing that. |

0:22:42 | Ok, so the testbed was |

0:22:45 | female det two trials |

0:22:46 | this is a matter of telephone speech, |

0:22:50 | extended core condition of the NIST |

0:22:52 | two thousand and ten |

0:22:55 | evaluation. |

0:22:59 | The extended core condition has a

0:23:03 | very much larger number of trials

0:23:05 | than the original

0:23:07 | evaluation protocol. |

0:23:13 | The standard front end, the standard UBM |

0:23:15 | diagonal covariance matrices, trained on the usual |

0:23:25 | In other respects, the classifier |

0:23:27 | was quite standard |

0:23:30 | I used heavy-tailed PLDA

0:23:45 | These were |

0:23:49 | results obtained with JFA executables, which is

0:23:54 | the way i-vectors were originally |

0:23:56 | built and they were just |

0:24:00 | produced as a benchmark.

0:24:01 | There was a problem, the activity detection explai |

0:24:06 | error rates were a little higher than expected. |

0:24:10 | With variational Bayes I actually got marginally better

0:24:13 | results, which turned out to be a more effective c |

0:24:21 | covariance matrices. Copying the covariance matrices |

0:24:24 | actually |

0:24:27 | effect is to reduce to other estimated variances |

0:24:42 | I need to get to |

0:24:45 | efficiency |

0:24:47 | My figures for extracting a four-hundred-dimensional i-vector

0:24:52 | are typically about half a second. |

0:24:54 | Almost all of the time is spent in |

0:24:55 | BLAS routines; accumulating the posterior covariance matrices

0:24:59 | takes seventy-five percent of the time.

0:25:06 | An estimate of a quarter of a second, which

0:25:07 | suggests that compiler optimization may |

0:25:09 | be helpful, everything is going on inside the clas |

0:25:17 | For the variational Bayes method, I've got an estimate of |

0:25:22 | point nine seconds instead of point five. |

0:25:26 | I've fixed the number of iterations at five. |

0:25:33 | The variational Bayes method really comes into its own

0:25:35 | when you work

0:25:37 | with higher-dimensional

0:25:39 | i-vector extractors. |

0:25:47 | It's one last table. I did try and put several |

0:25:51 | dimensions, up to sixteen |

0:25:52 | hundred and got a very good indurance and performa |

0:26:05 | Okay, thank you. |

0:26:17 | Well, it depends on the ... on the dimensionality of the i-vector extractor. |

0:26:25 | A couple of gigabytes; it's as big as a large vocabulary continuous speech recognizer, it's

0:26:30 | not as intelligent, but it's as big. |

0:26:37 | Just the eigenvoices, okay, together with the stuff you have to pre-compute in order to |

0:26:44 | extract the i-vectors efficiently. |

0:26:53 | Yeah, that what require |

0:26:57 | You still have to store the eigenvoices, but the thing is that, you know, that's

0:27:02 | not the big part. The big part is a bunch of triangular matrices that

0:27:07 | we store in order to calculate i-vectors efficiently, using the standard approach. |

0:27:14 | The point of this was to use variational Bayes to avoid that computation.

0:27:23 | I'm afraid that we've just spent the time for questions here and so I... I |

0:27:27 | guess that Patrick will get lots of questions offline and maybe even at the end |

0:27:31 | of this talk they can answer those questions together with Sandro, who is giving the |

0:27:35 | next talk |