0:00:16 | Welcome.

0:00:16 | Exploring similarity and fusion of i-vector and sparse representation based speaker verification systems |

0:00:55 | 'Morning. |

0:00:56 | I'm Haris from Indian Institute of Technology Guwahati and I am here to present our |

0:01:01 | work on exploring the similarity and fusion of i-vector and sparse representation based

0:01:07 | speaker verification systems. |

0:01:12 | As we all know, the i-vector based systems form |

0:01:15 | the current state of the art in speaker verification.

0:01:20 | And also recently |

0:01:22 | some works |

0:01:24 | explored the use of sparse representation |

0:01:28 | for speaker verification and general speaker recognition tasks.

0:01:32 | So, in sparse representation based works, both exemplar dictionary based and learned dictionary based techniques

0:01:41 | have been explored. So, in this work we're trying to find the similarity between these... |

0:01:47 | I mean, the sparse representation based and |

0:01:50 | the i-vector based speaker verification systems. We also proposed a feature-level combination or fusion of |

0:01:57 | these two systems exploiting the advantages from both of the systems. |

0:02:05 | So, again, I will have a short review of the i-vector based system, which we're

0:02:12 | all familiar with. So, the i-vector based speaker verification system can be interpreted as finding the

0:02:20 | compact representation of the high-dimensional supervectors by taking projections onto a matrix called the total variability

0:02:28 | matrix, which is low-rank. And the estimation of the i-vector can

0:02:35 | be performed

0:02:40 | using the given equation. And

0:02:44 | so we find the i-vector representation of training as well as testing utterances, and find

0:02:52 | the similarity between these two using the cosine kernel. So now,

0:02:59 | before going to the basics of sparse representation based speaker verification system, we'll have a |

0:03:04 | look at the fundamentals of sparse representation. Sparse representation will try to represent a

0:03:12 | vector, y, using a dictionary, or matrix, as a linear combination of the columns

0:03:20 | of the dictionary matrix. And we also put the constraint that the number of columns

0:03:27 | used should be small. That's how we find the sparse representation vector, which actually takes only

0:03:35 | a few non-zero values.
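Written out, the sparse coding problem described here takes the standard form (ε and K below are generic tolerance and sparsity parameters, not values from the talk):

```latex
\hat{x} \;=\; \arg\min_{x}\; \|x\|_{0}
\quad \text{subject to} \quad \|y - Dx\|_{2} \le \epsilon,
\qquad \text{or equivalently} \qquad
\hat{x} \;=\; \arg\min_{x}\; \|y - Dx\|_{2}^{2}
\quad \text{subject to} \quad \|x\|_{0} \le K,
```

where D is the dictionary matrix and ‖x‖₀ counts the non-zero entries of the sparse vector x.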

0:03:39 | So, the columns of the dictionary are often named atoms

0:03:44 | by the sparse representation community, and the

0:03:52 | transformation matrix is named the dictionary.

0:03:57 | So, sparse representation applications can be found ranging from compression and de-noising to classification.

0:04:08 | So, it has been well explored in the area of image processing and many other |

0:04:15 | areas of signal processing. So, the basic idea behind sparse representation based classification is that the

0:04:26 | test example for a class can be approximated as a linear combination of training examples.

0:04:33 | So, one of the very first works in this direction was on a

0:04:42 | face recognition task. The dictionary was formed using

0:04:52 | training examples from different classes, and classification was performed by finding the sparse representation of

0:05:01 | the test example over the dictionary created from the training examples. So we

0:05:09 | call these approaches exemplar based, because the dictionaries are created using examples

0:05:17 | from different classes. |

0:05:19 | So, motivated by this work, a few works |

0:05:25 | explored the scheme for speaker identification and then for speaker verification. As we know,

0:05:31 | speaker verification is not a closed-set task, so the

0:05:38 | work used a set of background speakers to create the exemplar dictionary. So,

0:05:46 | the claimed speaker's training example and a set of background utterances,

0:05:54 | in the form of vectors, which can be the supervector representation or i-vector

0:05:59 | representation, form the exemplar dictionary. And the test example is scored by finding the sparse

0:06:07 | representation of the test example over the dictionary. And so, for finding the sparse solution,

0:06:14 | there are many algorithms available for finding the sparse solution. It can

0:06:19 | be l0 minimization based or l1 minimization based. An example of l0 minimization is

0:06:27 | orthogonal matching pursuit. Or we can go for basis pursuit, Lasso; any of

0:06:34 | these algorithms.
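As an illustrative sketch (not the authors' implementation), a minimal orthogonal matching pursuit in Python looks like this:

```python
import numpy as np

def omp(D, y, n_atoms):
    """Greedy orthogonal matching pursuit: approximate y with at most
    n_atoms columns (atoms) of the dictionary D. Assumes D has
    unit-norm columns; a toy sketch, not a production solver."""
    residual = y.astype(float)
    support = []                     # indices of selected atoms
    x = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        # Select the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        x = np.zeros(D.shape[1])
        x[support] = coef
    return x

# Toy check: y is built from atoms 0 and 2, and OMP recovers exactly those.
D = np.eye(4)
y = 3.0 * D[:, 0] + 2.0 * D[:, 2]
x = omp(D, y, n_atoms=2)
print(np.nonzero(x)[0])  # atoms 0 and 2 are selected
```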

0:06:37 | So, given this sparse

0:06:40 | representation vector, how do we do the scoring

0:06:50 | for performing the speaker verification? We take the sum of coefficients in that sparse

0:06:57 | vector which correspond to the claimed speaker's training examples. And the

0:07:06 | ratio between the sum corresponding to the claim and the total of all coefficients in

0:07:13 | the sparse vector is considered as the score for verification. This was, actually, proposed in

0:07:20 | an earlier work. In case you have multiple examples for a

0:07:27 | claimed speaker, we'll take the sum of the coefficients corresponding to them. Otherwise, we'll have only

0:07:32 | one example, so we take, basically, the coefficient corresponding to that particular

0:07:39 | example.
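The ratio-based score described here can be sketched as follows (hypothetical toy numbers; the index set of the claimed speaker's examples is an assumed input):

```python
import numpy as np

def src_verification_score(x, claimed_idx):
    """Exemplar-SRC style score: the fraction of the sparse vector's
    total absolute coefficient mass that falls on the claimed
    speaker's training examples."""
    mass = np.abs(x)
    return float(mass[claimed_idx].sum() / mass.sum())

# Dictionary layout: [claimed-speaker examples | background examples].
x = np.array([0.6, 0.3, 0.0, 0.1, 0.0])   # sparse vector of a test utterance
score = src_verification_score(x, claimed_idx=[0, 1])
print(score)  # ≈ 0.9: most mass falls on the claimed speaker's atoms
```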

0:07:42 | so |

0:07:45 | as an improvement over this work, in our previous work we have proposed the use

0:07:51 | of a learned dictionary for doing the sparse

0:07:57 | representation based speaker verification. Here, actually, similar to the i-vector formulation, we learn a dictionary,

0:08:07 | which is, in our task, D.

0:08:10 | And |

0:08:11 | We use the mean-shifted supervectors as the speaker representation, and the training

0:08:20 | as well as testing examples, that are the corresponding supervectors, are represented over

0:08:28 | the learned dictionary, which is also learned from the supervector representations only. And

0:08:37 | the sparse representation of training and the testing examples are extracted using orthogonal matching pursuit. |

0:08:46 | And similar to the i-vector system, we find the similarity between these two representations using |

0:08:53 | the cosine kernel.
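Both the i-vector system and this system score trials the same way; a sketch of the cosine-kernel scoring (toy vectors, illustrative only):

```python
import numpy as np

def cosine_score(w_train, w_test):
    """Cosine-kernel similarity between two utterance representations
    (i-vectors or sparse vectors alike)."""
    return float(np.dot(w_train, w_test) /
                 (np.linalg.norm(w_train) * np.linalg.norm(w_test)))

w1 = np.array([1.0, 2.0, 3.0])
w2 = np.array([2.0, 4.0, 6.0])   # same direction as w1
w3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to w1
print(cosine_score(w1, w2))  # ≈ 1.0
print(cosine_score(w1, w3))  # 0.0
```

Note how two vectors with disjoint supports score exactly zero; this matters for the false-trial score distribution of sparse representations discussed later in the talk.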

0:08:56 | So, in this work we refer to this system as SRSV.

0:09:05 | So, |

0:09:08 | we have used a couple of methods for learning the dictionary. One was the well known

0:09:14 | KSVD algorithm for learning the dictionary for sparse approximation, and we also used a modified version

0:09:22 | of the KSVD algorithm, which is named the S-KSVD algorithm. This is, actually, a supervised

0:09:30 | version of the KSVD algorithm. In KSVD |

0:09:34 | what we do with the development data

0:09:41 | is an iterative method; as shown in the figure, there are two phases. One

0:09:46 | is the sparse coding phase and the other is the dictionary update phase. In the sparse coding phase,

0:09:52 | we initialize the process with a random dictionary, or one created using randomly chosen

0:10:02 | examples, and we find the sparse representation of the development data over the dictionary. And

0:10:08 | in the next stage, we update the dictionary using singular value decomposition; the columns of the

0:10:12 | dictionary are updated,

0:10:15 | and this is done iteratively to get the optimized dictionary. In the sparse coding stage,

0:10:21 | any of the sparse coding algorithms can be used. In our case we have used OMP.
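The two alternating phases can be sketched as below (sparsity fixed to one atom per signal for brevity; the real setting uses OMP with a higher sparsity level, and all dimensions here are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_code_1atom(D, Y):
    """Sparse coding phase, simplified to one atom per signal:
    each column of Y is approximated by its best-matching atom."""
    corr = D.T @ Y                            # atom/signal correlations
    idx = np.argmax(np.abs(corr), axis=0)     # best atom per signal
    X = np.zeros((D.shape[1], Y.shape[1]))
    cols = np.arange(Y.shape[1])
    X[idx, cols] = corr[idx, cols]            # projection coefficients
    return X

def ksvd(Y, n_atoms, n_iter=10):
    """Minimal K-SVD sketch: alternate sparse coding and SVD-based
    atom-by-atom dictionary updates, starting from randomly chosen
    (normalized) training signals."""
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = sparse_code_1atom(D, Y)           # phase 1: sparse coding
        for k in range(n_atoms):              # phase 2: dictionary update
            users = np.nonzero(X[k])[0]       # signals that use atom k
            if users.size == 0:
                continue
            # Residual of those signals with atom k's contribution removed.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            # Best rank-1 fit of E replaces the atom and its coefficients.
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, users] = S[0] * Vt[0]
    return D, X

Y = rng.normal(size=(8, 50))    # toy development data
D, X = ksvd(Y, n_atoms=4)
print(D.shape)  # (8, 4), with unit-norm atoms
```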

0:10:28 | So, the difference of the S-KSVD algorithm is that it basically uses the class labels of the training

0:10:37 | examples. In KSVD, the basic goal is to minimize

0:10:45 | the representation error subject to the sparse constraint. In S-KSVD, apart from minimizing the

0:10:55 | representation error, we put a constraint on the class separability also. That means,

0:11:04 | at the same time, we have to minimize the representation error as we maximise the separability of the

0:11:10 | representations. So here we use a Fisher criterion

0:11:19 | along with minimizing the representation error. So, this can be considered as

0:11:28 | a discriminative dictionary, something like LDA incorporated into the dictionary learning concept. So, these two

0:11:36 | methods we have used for learning the dictionary. Our experiments are done using the

0:11:42 | NIST 2003 database. Actually, when we were doing this experiment we had access

0:11:47 | to only the 2003 data. We are trying to get results using

0:11:53 | latest databases, so |

0:11:58 | now coming to the work |

0:12:02 | unlike the T matrix learning... I'll go back to the slide...

0:12:10 | this is very much similar to the i-vector formulation. The difference

0:12:16 | lies in extracting the representation, which is sparse here, and in

0:12:25 | the case of i-vector it's a full vector. And also, the way we learn the dictionary:

0:12:30 | in the case of T matrix learning it's something similar to PPCA, whereas in this method,

0:12:40 | the dictionaries are learned with the sparse constraint, or expecting a sparse representation.

0:12:48 | And also, we have used OMP. OMP is a greedy approach for

0:12:53 | minimizing the l0 norm of the vector x. So here we can have either a

0:13:02 | constraint over the representation error, or we can have a constraint over the sparsity. |

0:13:07 | So, in this figure we examine the effect of sparsity on the final results of

0:13:16 | speaker verification. |

0:13:17 | so |

0:13:21 | And there are two sparse representation

0:13:27 | processes involved: one in the dictionary learning phase, and one at the decoding

0:13:32 | or the testing phase. So,

0:13:37 | which sparsity should be used for learning the dictionary, and which sparsity

0:13:41 | should be used

0:13:43 | at the time of testing? So we try to

0:13:47 | find the optimal or best number for this sparsity constraint. |

0:13:54 | Actually, while learning the dictionary, we are taking the sparse representation of the seen data. |

0:14:02 | That means that we initialize the dictionary, then we try to represent the same data

0:14:09 | over the dictionary and keep updating. So, that process is over seen

0:14:14 | data. So we can expect a

0:14:16 | more compact representation compared to doing the sparse representation over unseen data. So here

0:14:23 | we have observed that in the dictionary learning phase, we can use

0:14:29 | very high sparsity, that means a very small number of atoms selected, whereas in the testing

0:14:36 | phase, we should relax the constraint to allow a higher number of atoms, because we are working

0:14:43 | on unseen data; the evaluation data are also different.

0:14:49 | So, in this particular work we have observed that the selection of five atoms while dictionary

0:14:55 | learning and fifty atoms while representation is giving the best

0:15:02 | performance. |

0:15:04 | so |

0:15:06 | I compared the results obtained from the i-vector based system, |

0:15:14 | and the sparse representation based system with the KSVD dictionary. So here are the results.

0:15:23 | In terms of equal error rate, the i-vector system is better by

0:15:31 | one percent compared to the sparse representation based system

0:15:37 | using the KSVD dictionary. So, look at the distribution of scores to see, I mean,

0:15:44 | where exactly

0:15:48 | these two systems differed.

0:15:51 | We can see that this is the true scores and false scores distribution. |

0:15:56 | The red curve is the false score distribution and the blue one is the

0:16:02 | true score distribution.

0:16:05 | This one corresponds to the i-vector system,

0:16:08 | and this to the sparse representations.

0:16:11 | So, here you can see that for the false trials, the scores are peaked at zero

0:16:20 | compared to the i-vector system.

0:16:22 | That's expected for the sparse representation:

0:16:27 | unlike with i-vectors, in the false trial case there is a high

0:16:37 | chance to have orthogonal representations.

0:16:44 | Suppose in the false trials the speakers are different; then the atoms selected by the two different

0:16:50 | speakers can be different. So, the cosine kernel will give zero

0:16:57 | scores in many of the trials. That leads to a distribution like this.

0:17:05 | And the mean of the true trials |

0:17:10 | has shifted towards the right, which is good, |

0:17:13 | but at the same time, the variance of the true trials has increased, which ultimately

0:17:20 | makes the system perform slightly worse than

0:17:23 | the i-vector system.

0:17:24 | So, in fact, the peaky false trials distribution is good,

0:17:32 | and also that shifting towards the right is good, but at the same time, the increase in

0:17:36 | the variance makes the system perform inferior to the i-vector based system.

0:17:43 | So, here |

0:17:46 | there is another work which is trying to do the sparse representation |

0:17:52 | over the T matrix system. |

0:17:54 | So, |

0:17:56 | Here, actually, in the original work, they have used the Lasso algorithm for finding

0:18:08 | the sparse representation, instead of the i-vector, over the T matrix. So, here we have repeated

0:18:16 | this experiment, matching with our previous experiment, using OMP. As, unlike

0:18:24 | Lasso or basis pursuit, OMP has a sparsity constraint that we can set directly, we examined the

0:18:34 | change in the sparsity with the T matrix as the dictionary. So, you can see

0:18:40 | here this blue curve shows the performance, the equal error rates, corresponding to the

0:18:47 | SRSV system with the T matrix

0:18:50 | and the green curve shows the KSVD. |

0:18:54 | And as the third, we have shown the classical i-vector system.

0:19:03 | Here you can see that, as the T matrix is not learned with the sparse

0:19:09 | constraint, with a low number of atoms selected for the representation it gives a really bad performance. Whereas the

0:19:16 | KSVD based system gives a decent or comparable result with the i-vector, taking the same low

0:19:23 | number of atoms selected.

0:19:27 | When we increase the number of columns selected, ultimately it approaches the

0:19:33 | i-vector performance, because the i-vector is also a full representation. So here, as noted before, the

0:19:41 | SRSV system did perform slightly inferior to the

0:19:44 | i-vector system. |

0:19:47 | So, |

0:19:49 | So to conclude: the SRSV system with the T dictionary performs poorly with high sparsity and

0:19:56 | approaches i-vector performance |

0:19:57 | with lower sparsity. |

0:19:59 | And also, with the use of all atoms, the performance of the T-SRSV system matches

0:20:05 | that of i-vector system. |

0:20:08 | Now, we'll try to find the effect of, as mentioned in

0:20:14 | the previous talks also,

0:20:18 | larger dimension i-vectors. So here, basically, we'll go for a larger number of

0:20:25 | columns in the T matrix.

0:20:27 | This shows the effect of the size of the dictionary on the sparse representation

0:20:35 | based systems. |

0:20:36 | And this blue curve shows the performance of the i-vector system for, I mean,

0:20:42 | various sizes of the dictionary. And the green is the SRSV

0:20:50 | and the red is SRSV with the |

0:20:54 | T matrix as the dictionary. |

0:20:57 | Here the number of atoms or columns selected is matching with the KSVD one. And

0:21:05 | the i-vector, of course, is a full representation.

0:21:10 | With that number of atoms selected, the T-SRSV performs, I mean,

0:21:16 | not so well compared to the other two. In the case

0:21:22 | of i-vector, three hundred columns

0:21:28 | or four hundred hardly make any difference, but for the KSVD, the optimal or

0:21:33 | best performance

0:21:36 | came with that number of atoms selected.

0:21:40 | So, we tried to increase the number of atoms selected for representation in this

0:21:51 | case, and we found that when we increased the number,

0:21:54 | for all sizes of the dictionary, it approaches the performance of the i-vector system.

0:22:03 | Still, the i-vector based system is |

0:22:04 | giving the best performance. |

0:22:15 | Now, motivated by the performance of these three systems, basically the KSVD

0:22:26 | dictionary based SRSV, the T dictionary based SRSV and the i-vector, we tried to use the power of these and

0:22:37 | we proposed a fusion of the

0:22:42 | representations. The way we did this: we found the i-vector representation of the supervector using the T

0:22:51 | matrix in the conventional way, and then resynthesized the supervectors; we termed this

0:22:59 | T smoothing. As we know, projection to a lower dimensional space reduces

0:23:05 | dimensions and removes small nuisances

0:23:10 | also.
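The T-smoothing step can be sketched like this (a plain least-squares projection stands in for the full i-vector posterior estimate; dimensions are toy values):

```python
import numpy as np

rng = np.random.default_rng(1)

def t_smooth(s, T):
    """Project a mean-shifted supervector s onto the column space of the
    total-variability matrix T and resynthesize it, discarding whatever
    lies outside that low-dimensional subspace (the 'nuisances')."""
    w, *_ = np.linalg.lstsq(T, s, rcond=None)   # low-dimensional coordinates
    return T @ w                                 # smoothed supervector

T = rng.normal(size=(100, 10))   # toy 100-dim supervectors, rank-10 T
s = rng.normal(size=100)         # a mean-shifted supervector
s_hat = t_smooth(s, T)
print(s_hat.shape)  # (100,): same space, but confined to T's column space
```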

0:23:11 | And so, this helps in classification. So this diagram, or this intensity plot, shows the

0:23:21 | results of a control experiment with twenty five speakers, each

0:23:25 | speaker having five examples, and we find the similarity, with the cosine kernel, between the

0:23:32 | supervectors before this smoothing and after smoothing.

0:23:36 | So, in

0:23:38 | all combinations of cases, the similarity on the cosine kernel shows an improvement. But in the

0:23:46 | case of within-class pairs, the improvement is much better compared to the between-class cases,

0:23:52 | which, of course, will be helping the classification performance. So then, we have

0:23:59 | used these smoothed supervectors for learning the dictionary and for sparse representation.

0:24:16 | We will be coming to the results later.

0:24:20 | Here we can compare the performance |

0:24:24 | of the various methods we have tried.

0:24:30 | First, compare the i-vector with the KSVD dictionary system; as I mentioned, there is

0:24:37 | approximately a one

0:24:41 | percent equal error rate difference between these two.

0:24:43 | Then the T matrix based

0:24:45 | sparse representation with full atoms selected; this is, actually, with all atoms selected, and the

0:24:52 | best performance came with the full

0:24:54 | representation.

0:25:00 | Then another thing, which I already mentioned: by use of the discriminative dictionary, we have

0:25:06 | a huge improvement in performance. Comparing this number against the KSVD, we

0:25:15 | need to consider that an LDA-like discrimination is incorporated into it, and that justifies

0:25:22 | the improvement in the performance. And with smoothing using the T matrix, we recorded approximately

0:25:29 | thirty percent relative improvement in case of both the dictionaries. |

0:25:36 | So, we have

0:25:38 | also done some channel and session variability compensation, using joint factor analysis and LDA and

0:25:45 | WCCN. So, after this

0:25:54 | compensation,

0:25:56 | we have the i-vector with LDA and WCCN as the channel compensation methods.

0:26:08 | Actually, joint factor analysis, we used as a preprocessing of the supervectors before doing sparse |

0:26:15 | representation. |

0:26:17 | Also, a combination of these two,

0:26:20 | the sparse representation with LDA and WCCN, has been tried.

0:26:31 | And also, we have tried to do the

0:26:32 | score level fusion of the best performing systems,

0:26:35 | and this ended up with a

0:26:40 | performance of 0.99 equal error rate on

0:26:42 | the NIST 2003 database.

0:26:46 | To summarize |

0:26:49 | We have highlighted the close similarity between the i-vector and sparse representation based SV system. |

0:26:55 | We have studied the use of total variability matrix as a dictionary with the matching |

0:27:02 | pursuit as the |

0:27:05 | algorithm for sparse representation. |

0:27:08 | We found that, compared to the KSVD dictionary, the T matrix can be used as

0:27:14 | the dictionary with better results,

0:27:17 | but with a high number of atoms selected. Among all the dictionaries, we found that

0:27:22 | the supervised one

0:27:26 | performed much better than the others. And we also proposed a feature

0:27:35 | level fusion of the i-vector and |

0:27:36 | sparse representation based systems. And we found that, among the channel and session compensation methods,

0:27:49 | in the case of sparse representation, joint factor analysis

0:27:53 | based preprocessing worked

0:27:57 | better.

0:28:08 | Time for questions, any questions? |

0:28:15 | So let me ask one... for the sparse representations, you always work with mean supervectors

0:28:22 | from the system? Did you ever try to reconstruct the sufficient statistics, or do you

0:28:27 | obtain supervectors by adaptation? Or do you lose some information before?

0:28:35 | We actually normalized the supervectors within the cov...

0:29:11 | What is the motivation for using |

0:29:16 | sparse representation here? |

0:29:19 | besides the fact that it's a technique which is available?

0:29:54 | They are good speakers, and are representing all s |

0:30:02 | There can be some dimensions which |

0:30:04 | are closer to the particular |

0:30:27 | And just a question... you sure there? |

0:30:30 | You need not to be. |

0:30:32 | The representation |

0:30:48 | sparse representation |

0:31:25 | thank you |

0:32:25 | In my experiments I did not find |

0:32:28 | sparsity |

0:32:52 | Let's thank the speaker again. |