0:00:21So, good afternoon and thank you, Patrick.
0:00:25Well, I am Carlos Vaquero from Agnitio, from Spain, and I'm presenting our work on dataset shift in PLDA-based speaker verification,
0:00:34which is, actually, an analysis of several techniques that can be used in PLDA systems to mitigate the effect of dataset shift, but also an analysis of the limitations that PLDA systems have when dealing with dataset shift.
0:00:56So, dataset shift is the mismatch that may appear between the joint distributions of inputs
0:01:02and outputs
0:01:04for training and testing.
0:01:07Okay? In general, we have three types of dataset shift. The first one is covariate shift, which appears when the distribution of the inputs differs from training to testing. It's the most usual type of dataset shift, since it is related to channel variability, session variability or language mismatch.
0:01:32But there are also other types of dataset shift: for example, prior probability shift, which is related to variations in the operating point; or concept shift, which is related to adversarial environments, which in speaker verification would be spoofing attempts.
0:01:57In this work we're focusing on covariate shift.
0:02:03Covariate shift has been widely studied in speaker verification.
0:02:07There are several techniques developed to compensate for channel/session variability or language mismatch, but most of these techniques work under the assumption that large datasets are available for training.
0:02:27The thing is: what happens in real situations, where we face a completely new and unknown scenario and we don't have data to train these approaches? For example, here we have some results.
0:02:40We are considering a JFA system facing condition one of the NIST SRE 08, which is interview-interview, and we don't use any microphone data, only telephone data, for training the channel compensation. So, we can see that JFA, when not using microphone data, is not much better than classical MAP, which doesn't use any compensation at all.
0:03:09But once we have the microphone data, we get a huge improvement.
0:03:15So the thing is, what can we do in real scenarios that are unknown and unseen?
0:03:22Well, if we don't have any data, it's hard to do anything, but usually we can expect that some small amount of matched data is provided. So, there is something that we could do.
0:03:37We can define some probabilistic framework, so that it is possible to perform an adaptation, even of a model trained on mismatched development data: given some matched data, we can adapt the model parameters so that the system works as soon as possible in this new scenario.
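As one concrete illustration of this idea (not the specific method of the talk), classical relevance-MAP adaptation blends a prior parameter with a small matched sample, so that the prior dominates when matched data is scarce. A minimal sketch, with illustrative names and a toy relevance factor:

```python
import numpy as np

def map_adapt_mean(prior_mean, matched_data, relevance=16.0):
    """Relevance-MAP adaptation of a Gaussian mean: with little matched
    data the prior dominates; with more data the adapted mean moves
    towards the matched-data sample mean."""
    n = len(matched_data)
    sample_mean = matched_data.mean(axis=0)
    return (n * sample_mean + relevance * prior_mean) / (n + relevance)

# Toy usage: a handful of matched vectors nudges the prior mean.
rng = np.random.default_rng(0)
prior = np.zeros(4)
matched = rng.normal(1.0, 0.5, size=(10, 4))
print(map_adapt_mean(prior, matched))
```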
0:04:01But to do this in a natural way, and to derive it easily, we would expect the speaker verification system to be a monolithic system that provides a single probabilistic framework to compute the likelihood of the model parameters given the data.
0:04:27Well, the first approaches to JFA were monolithic, so they provided a single framework in which the whole algorithm worked; within it, it could be possible to define a way to adapt these parameters, given a small amount of data.
0:04:47But current state-of-the-art PLDA systems are modular, so we have several model levels.
0:04:57We start with the first level, the UBM: we train the UBM separately and it provides sufficient statistics. We use these to train the i-vector extractor, a total variability subspace, and then we obtain i-vectors and use them to train the PLDA model. But we use them as features: the PLDA model has no knowledge of how these features were obtained,
0:05:25just the prior distribution they have.
0:05:29So this model has its advantages, because it's very easy to keep improving: we can fix the UBM and work on the total variability matrices, which are fast to train, so we can try many things and improve them. And once the i-vector extractor is fixed, we can work a lot, and very quickly, on the PLDA model, and keep improving it.
0:06:00But, in terms of adapting this model to new situations, it has some problems. Either we work at the highest model level, that is, PLDA, and we adapt the PLDA parameters to face the new situations,
0:06:22or, if we want to work at lower model levels, we will need to retrain the whole system.
0:06:31For example, if we have adapted the UBM, our i-vector extractor is not valid anymore, so we will need to retrain it on the whole data. And this is not feasible in many applications, for example an application that you want to learn online as you get more data in a new situation:
0:06:51you would need to have all the development data available every time you adapt the UBM, and it would take a long time to adapt it for even a small set of recordings. So that's not feasible in many applications.
0:07:14Well, in any case, there are several techniques that we can apply in a PLDA system. The first thing we could do is adapt the UBM and then the subsequent model levels, but we would need to retrain the whole system.
0:07:35We can do it by pooling all the available data, the development data and the matched data, or we could do it by weighting the datasets. But this will not be feasible in many applications.
0:07:47We can also work on the i-vector extractor. One thing that has been done is to train a new total variability matrix on the matched data and stack it with the original total variability matrix.
0:08:07Well, this approach has been shown to work, but usually you need a quite large amount of data to train the matched total variability matrix. And it also requires retraining the PLDA model, so it has some problems.
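As a reference, here is a minimal sketch of the stacking idea just described, assuming two total variability matrices have already been trained; all names and dimensions are illustrative, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
sv_dim, dev_dim, matched_dim = 2048, 400, 100  # illustrative dimensions

# T_dev: total variability matrix trained on the large development set.
# T_matched: a smaller matrix trained only on the matched data.
T_dev = rng.standard_normal((sv_dim, dev_dim))
T_matched = rng.standard_normal((sv_dim, matched_dim))

# Stacking the subspaces column-wise yields a wider extractor whose
# i-vectors have dev_dim + matched_dim dimensions, which is why the
# downstream PLDA model must be retrained on the new, longer i-vectors.
T_stacked = np.hstack([T_dev, T_matched])
print(T_stacked.shape)  # (2048, 500)
```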
0:08:29We can also work on the PLDA model. Here, what we are proposing is simply to use length normalization, but with some sort of i-vector adaptation: centering the i-vectors using the i-vector mean from the matched dataset.
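A minimal sketch of this adaptation, assuming a small pool of matched i-vectors is available (all variable names are illustrative):

```python
import numpy as np

def adapt_and_length_normalize(ivectors, matched_ivectors):
    """Center i-vectors on the matched-data mean, then length-normalize.

    Instead of centering with the development-data mean, the mean is
    estimated from the small matched dataset, so the shifted population
    is moved back around the origin before projection onto the unit
    hypersphere.
    """
    matched_mean = matched_ivectors.mean(axis=0)
    centered = ivectors - matched_mean
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms

# Toy usage with random data standing in for real i-vectors.
rng = np.random.default_rng(0)
matched = rng.standard_normal((50, 400)) + 3.0   # shifted population
test = rng.standard_normal((10, 400)) + 3.0
adapted = adapt_and_length_normalize(test, matched)
print(np.linalg.norm(adapted, axis=1))           # all ones
```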
0:08:59What else is there to say? Here there should be some reference to the work, the study done by Jesus, which is also another approach that could be used to compensate for covariate shift, and to s-norm as another approach.
0:09:20So, these techniques tackle these problems, but they always work on the PLDA model, so the UBM and the i-vector extractor are not modified.
0:09:32To test these techniques, what we do is simulate covariate shift through language mismatch.
0:09:40So we assume that our system has been trained completely on English data, and we will evaluate it on mismatched groups of languages: we will consider Chinese, Hindi-Urdu and Russian. As development data we will use the NIST data from 2004 to 2006, the Switchboard data and the Fisher data.
0:10:08Here we have the number of sessions and speakers that we have for each language: for Chinese we have quite a large amount of data; for Hindi-Urdu, for example, we don't have much development data.
0:10:21We will evaluate these approaches on the NIST SRE 08 telephone-telephone condition. We will consider all-to-all trials.
0:10:32Here we have the number of models and speakers for each language.
0:10:37As the speaker verification system we will consider an i-vector PLDA system: a gender-dependent i-vector extractor of dimension four hundred, and then a gender-dependent PLDA, which is a mixture of two PLDA models, one trained with male data and one trained with female data,
0:10:59with a full covariance matrix for the session component and a speaker subspace of dimension one hundred and twenty.
0:11:06And the results are analyzed in terms of EER and minDCF.
0:11:16So the first thing we do is analyze the effect of covariate shift on the data, and what we have done is to analyze the i-vectors we have for the different languages. We have computed the Mahalanobis distance between the population of English i-vectors and the population of each other language's i-vectors, and we have seen that these distances are very large. So, this means that when we are performing the i-vector length normalization for a language which is different from English, we project it onto a small region of the hypersphere of unit radius. The distribution will not be as expected:
0:12:08all the i-vectors will be concentrated in a small region of the hypersphere.
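A minimal sketch of this kind of analysis, with random data standing in for real i-vector populations; the exact covariance convention is not specified in the talk, so using the English-population covariance here is an assumption:

```python
import numpy as np

def mahalanobis_between_means(ivecs_a, ivecs_b):
    """Mahalanobis distance between the means of two i-vector populations,
    using the covariance of the reference population (here: population A)."""
    mean_a, mean_b = ivecs_a.mean(axis=0), ivecs_b.mean(axis=0)
    cov_a = np.cov(ivecs_a, rowvar=False)
    diff = mean_b - mean_a
    return float(np.sqrt(diff @ np.linalg.solve(cov_a, diff)))

# Toy usage: a shifted population yields a large distance.
rng = np.random.default_rng(0)
english = rng.standard_normal((500, 50))
chinese = rng.standard_normal((500, 50)) + 0.5   # simulated covariate shift
print(mahalanobis_between_means(english, chinese))
```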
0:12:13So this will have an effect on the accuracy; it affects not only the distribution of i-vectors, because we are also missing information in the UBM, but in the end we see that it has an effect on the accuracy of the system, as we can see in this table, where only English data has been used for development: the other languages get worse results than English.
0:12:38It is true that we don't know the accuracy that we would get for these languages if we had enough data to train a complete evaluation system with them. But there's no reason to believe that these languages are harder for a speaker verification system than English, so we could expect to get an accuracy which is somehow similar to English: maybe better, maybe worse, but somehow similar.
0:13:13Well, here we are comparing the minDCF obtained with the proposed techniques for the three groups of languages.
0:13:27So the first column for each language is the baseline, using only English development data. The second column is stacking the two total variability matrices. The third is using i-vector adaptation, the fourth is using s-norm,
0:13:54and the last three columns are combinations of these techniques.
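For reference, a minimal sketch of s-norm (symmetric score normalization) against a cohort of matched data; the raw scores here are random stand-ins for PLDA scores:

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (s-norm).

    enroll_cohort_scores: scores of the enrollment i-vector against a
        cohort (here it would be matched-language i-vectors).
    test_cohort_scores: scores of the test i-vector against the same cohort.
    """
    mu_e, sd_e = enroll_cohort_scores.mean(), enroll_cohort_scores.std()
    mu_t, sd_t = test_cohort_scores.mean(), test_cohort_scores.std()
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)

# Toy usage with random cohort scores standing in for real PLDA scores.
rng = np.random.default_rng(0)
print(s_norm(2.5, rng.standard_normal(200), rng.standard_normal(200)))
```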
0:13:58So, we can see that most of these techniques work, in the sense that they improve the results of the system, but the improvement is quite small:
0:14:12if we wanted to reach an accuracy close to English, which is here, we are still too far, we're still too far.
0:14:24This can be seen also in these DET curves, where we are representing the DET curves obtained for Chinese.
0:14:37We have the DET curve obtained using only English data for development; the blue curve uses matched training data to perform i-vector adaptation; the black curve uses matched Chinese data to perform i-vector adaptation and s-norm.
0:14:56We see that we get a slight improvement, but we are still too far from English,
0:15:03so, far from the results we would like to get.
0:15:10There is also another important effect that covariate shift introduces: in its presence, we will find misalignment in the score distributions.
0:15:22It's something that is widely known, and you can see this effect here in the example we have:
0:15:29we have represented the English and Chinese score distributions, and we can see that the Chinese score distributions are shifted to the right, towards higher scores. This is probably related also to the fact that the i-vectors are concentrated in a small region.
0:15:53So, if we have a little amount of matched data, it's mandatory to use it for calibration.
0:16:02This is something that everybody knows and that we have been doing in all NIST evals: we always calibrate each condition separately. We also use calibration techniques with side information, to which we add the language. But it's important to keep this in mind, because if we only have a little amount of data and we need to use an independent part of the data for calibration, we will not have much data left for adaptation.
0:16:38So, here we are representing the minDCF for our languages, the actual DCF when we use English data for calibration, in red, and the actual DCF when we use matched data.
0:16:53It's mandatory to use matched data for calibration.
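A minimal sketch of calibrating with a small matched set via linear logistic regression; scikit-learn and the LLR offset convention are assumptions here, since the talk does not specify the calibration tool:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy matched calibration set: raw scores with known target/non-target labels.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 300),     # target trials
                         rng.normal(-1.0, 1.0, 3000)])  # non-target trials
labels = np.concatenate([np.ones(300), np.zeros(3000)])

# Linear logistic regression learns a scale and offset so that calibrated
# scores behave like log-likelihood ratios for this condition.
cal = LogisticRegression().fit(scores.reshape(-1, 1), labels)
a, b = cal.coef_[0, 0], cal.intercept_[0]

# Subtract the prior log-odds of the calibration set so the output is an LLR.
prior_logodds = np.log(300 / 3000)
calibrated = a * scores + b - prior_logodds
```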
0:16:58So, as conclusions of this work, we can say that dataset shift is usual in speaker recognition.
0:17:08There are many techniques developed to compensate for it, but most of them need a large amount of data to work properly,
0:17:17and in many real cases little data is provided.
0:17:21So, having monolithic systems would enable us to perform some sort of adaptation,
0:17:29but state-of-the-art techniques tend towards modularity, since development is much easier when we have a modular system such as PLDA.
0:17:38There are techniques that can work with these modular systems, but they obtain only a slight increase in accuracy;
0:17:47there is still a huge gap to close.
0:17:49And finally, it's important to keep in mind that matched data is mandatory for calibration, so if we have a small amount of matched data for adaptation, we will need to use part of it for calibration.
0:18:04So, that's all, thank you very much.
0:18:28You mean, in this work?
0:18:40You mean this work or in the literature?
0:18:44I'm not sure, but you can see that, for example, your i-vectors don't match your expected prior distribution, or, even at lower levels, your statistics or your MFCCs,
0:19:04but yes.
0:19:14But it would be interesting. I think the problem is that, if you want to have a compensation basis, it would be interesting to have at some point a JFA or maybe an eigenchannel-based system that is described as a probabilistic framework that you could adapt, to define some technique. It would be interesting to do it.
0:20:10So you mean using a smaller-dimensional i-vector extractor?
0:20:18okay
0:20:23But in any case, if you adapt your i-vector extractor, you will need to retrain your PLDA system.
0:20:41Yeah, yeah. Have you tried to remove the specific means for the specific channel conditions? For example, with microphone data:
0:20:53to remove the telephone mean from the telephone data and the microphone mean from the microphone data?
0:21:00No, I haven't tried that.
0:21:03Sounds risky. It may work, but it's like assuming that there is no rotation in the i-vectors, just a shift;
0:21:16if there is rotation, it will not work.
0:21:22I don't know
0:21:23It is interesting to try. I've tried that and it was helping.
0:21:27It was helping? Ok, that's interesting.
0:21:43okay
0:21:54Well, especially in those languages where I don't have much matched data yet. Yeah, that might be... I think in most languages it's pretty balanced, but there are some languages... I remember that, for example, Hindi-Urdu had, in the eval, seven speakers. So it is quite unbalanced, but maybe we have a female speaker...
0:22:47well
0:22:51okay
0:22:53Well, not for Chinese, for example. It depends on the language,
0:22:59but I would say that i-vector adaptation is the one that works: it always gives some improvement.
0:23:09It's not much, but still.
0:23:22The matched data. So, these techniques try to use the matched data to improve the accuracy of the system.
0:23:44Not much, I don't think the improvement was significant, if there was improvement at all. Maybe there were even some losses.
0:24:51So you mean that if I get my model speakers from English, it will also help if we perform some of these techniques to adapt to them?
0:25:08okay
0:25:13okay
0:25:35I see that sometimes you can't do anything without the data, because there are certain sources of variability...
0:26:04variability in the first place.
0:26:13A general comment to all of us.
0:26:44Yeah, ok, well, in fact, there are techniques that could provide more robustness: the results presented in a previous talk are based on integrating out the PLDA parameters, so as to account for the uncertainty of these parameters, so they should be more robust to dataset shift. But, you see, the point here is: if you have some amount of data, it's better to use it. But you're right.
0:27:20You are completely right, of course.