0:00:16 Exploring similarity and fusion of i-vector and sparse representation based speaker verification systems
0:00:56 I'm Haris from the Indian Institute of Technology Guwahati, and I am here to present our
0:01:01 work on exploring the similarity and fusion of i-vector based and sparse representation based
0:01:07 speaker verification systems.
0:01:12 As we all know, i-vector based systems form the
0:01:15 current state of the art in speaker verification.
0:01:20 And also, recently,
0:01:22 some works have
0:01:24 explored the use of sparse representation
0:01:28 for speaker verification and for speaker recognition tasks in general.
0:01:32 In the sparse representation based works, both exemplar dictionary based and learned dictionary based techniques
0:01:41 have been explored. In this work, we try to find the similarity between these two approaches,
0:01:47 that is, the sparse representation based and
0:01:50 the i-vector based speaker verification systems. We also propose a feature-level combination, or fusion, of
0:01:57 these two systems, exploiting the advantages of both.
0:02:05 First, a short review of the i-vector based system, which we are
0:02:12 all familiar with. The i-vector based speaker verification system can be interpreted as finding a
0:02:20 compact representation of the high-dimensional supervector by projecting it onto a low-rank matrix called the total variability
0:02:28 matrix. The estimation of the i-vector can
0:02:35 be performed using the
0:02:40 equation given on the slide.
0:02:44 We find the i-vector representations of the training as well as the testing utterances, and compute
0:02:52 the similarity between the two using the cosine kernel. So now,
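The cosine-kernel scoring just mentioned can be sketched in a few lines of numpy; this is a minimal illustration with made-up toy vectors (real i-vectors are typically a few hundred dimensional):

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine-kernel similarity between two i-vectors."""
    return float(w_enroll @ w_test /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

# Toy i-vectors; real ones are typically 400-600 dimensional.
w_enroll = np.array([1.0, 0.5, -0.2])
w_test = np.array([0.9, 0.6, -0.1])
score = cosine_score(w_enroll, w_test)  # near 1.0 when the vectors align
```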
0:02:59 before going to the basics of the sparse representation based speaker verification system, we'll have a
0:03:04 look at the fundamentals of sparse representation. Sparse representation tries to represent a
0:03:12 vector y over a dictionary, or matrix, as a linear combination of the columns
0:03:20 of that dictionary matrix, with the constraint that the number of columns
0:03:27 used is small. That is how we find the sparse representation vector, which takes only
0:03:35 a few non-zero values.
0:03:39 The columns of the dictionary are usually called atoms
0:03:44 in the sparse representation community, and the
0:03:52 transformation matrix itself is called the dictionary.
0:03:57 Applications of sparse representation range from compression and de-noising to classification.
0:04:08 It has been well explored in the area of image processing and many other
0:04:15 areas of signal processing. The basic idea behind sparse representation based classification is that a
0:04:26 test example from a class can be approximated as a linear combination of training examples.
0:04:33 One of the very first works in this direction was on a
0:04:42 face recognition task. The dictionary was formed using the
0:04:52 training examples from different classes, and classification was performed by finding the sparse representation of
0:05:01 the test example over the dictionary created from the training examples. We
0:05:09 call these approaches exemplar based, because the dictionaries are created using examples
0:05:17 from different classes.
0:05:19 Motivated by this work, a few works
0:05:25 explored the idea for speaker identification and then for speaker verification. As we know,
0:05:31 speaker verification is not a closed-set task, so these
0:05:38 works used a set of background speakers to create the dictionary, an exemplar dictionary. So,
0:05:46 the claimed speaker's training example and a set of background utterances,
0:05:54 in the form of vectors, which can be the supervector representation or the i-vector
0:05:59 representation, form the exemplar dictionary. The test example is scored by finding the sparse
0:06:07 representation of the test example over that dictionary. For finding the sparse solution,
0:06:14 there are many algorithms available. They can
0:06:19 be l0-minimization based or l1-minimization based. An example of l0 minimization is
0:06:27 orthogonal matching pursuit, or we can go for basis pursuit or Lasso; any of
0:06:34 these algorithms.
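Orthogonal matching pursuit, the l0-based greedy solver mentioned here, can be sketched roughly as follows; this is a simplified implementation assuming unit-norm atoms, and the data below is synthetic:

```python
import numpy as np

def omp(D, y, k):
    """Greedy OMP: select up to k atoms of D to approximate y.
    Assumes the columns (atoms) of D have unit norm."""
    support, x = [], np.zeros(D.shape[1])
    residual = y.astype(float).copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        # Re-fit all selected coefficients jointly (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coeffs
        residual = y - D @ x
    return x

rng = np.random.default_rng(1)
D = rng.standard_normal((30, 60))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
x_true = np.zeros(60)
x_true[[5, 22]] = [2.0, -1.5]
y = D @ x_true
x_hat = omp(D, y, k=2)                         # sparse code with at most 2 atoms
```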
0:06:37 So, given the sparse
0:06:40 representation vector over the dictionary D, how do we do the scoring
0:06:50 for performing speaker verification? We take the sum of the coefficients in the sparse
0:06:57 vector which correspond to the claimed speaker's training examples, and the
0:07:06 ratio of this claimed-speaker sum to the total sum of coefficients in
0:07:13 the sparse vector is considered as the verification score. This was proposed in
0:07:20 earlier work. In case we have multiple examples for a
0:07:27 claimed speaker, we take the sum of the coefficients corresponding to all of them. Otherwise, with only
0:07:32 one example, we basically take the single coefficient corresponding to that particular
0:07:39 example, and score it against the background.
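The scoring rule described here, the fraction of the sparse code's weight that falls on the claimed speaker's columns, can be sketched as below; the index layout and the code values are hypothetical:

```python
import numpy as np

def src_score(x, claimed_idx):
    """Ratio of the l1 mass on the claimed speaker's dictionary columns
    to the total l1 mass of the sparse code."""
    return float(np.sum(np.abs(x[claimed_idx])) / (np.sum(np.abs(x)) + 1e-12))

# Hypothetical sparse code: columns 0-1 belong to the claimed speaker,
# the remaining columns to background speakers.
x = np.array([0.8, 0.3, 0.0, 0.05, 0.0, 0.1])
score = src_score(x, claimed_idx=[0, 1])   # high value supports accepting the claim
```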
0:07:45 As an improvement over this work, in our previous work we proposed the use
0:07:51 of a learned dictionary for performing sparse
0:07:57 representation based speaker verification. Here, similar to the i-vector formulation, we learn a dictionary,
0:08:07 which in our task is D.
0:08:11 We use the mean-shifted supervectors as the speaker representation, and the training
0:08:20 as well as the testing examples, that is, the corresponding supervectors, are represented over
0:08:28 the learned dictionary, which is itself learned from the supervector representations. And
0:08:37 the sparse representations of the training and the testing examples are extracted using orthogonal matching pursuit.
0:08:46 Similar to the i-vector system, we find the similarity between these two representations using
0:08:53 the cosine kernel.
0:08:56 In this work, we refer to this system as SRSV.
0:09:08 We have used a couple of methods for learning the dictionary. One was the well-known
0:09:14 K-SVD algorithm for learning a dictionary for sparse approximation, and we also used a modified version
0:09:22 of the K-SVD algorithm, which is named the S-KSVD algorithm. This is a supervised
0:09:30 version of the K-SVD algorithm. In K-SVD,
0:09:34 what we do with the development data
0:09:41 is an iterative method; as shown in the figure, there are two phases: one
0:09:46 is the sparse coding phase and the other is the dictionary update phase. In the sparse coding phase,
0:09:52 we initialize the process with a random dictionary, or one created using randomly chosen
0:10:02 examples, and we find the sparse representation of the development data over the dictionary. In
0:10:08 the next stage, we update the dictionary using singular value decomposition; the columns of the
0:10:12 dictionary are updated,
0:10:15 and this is done iteratively to get the optimized dictionary. In the sparse coding stage,
0:10:21 any of the sparse coding algorithms can be used. In our work we have used OMP.
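The two-phase K-SVD loop just described can be sketched roughly as follows; this is a compact, unoptimized version, and the toy data and all sizes are illustrative only:

```python
import numpy as np

def omp(D, y, k):
    """Minimal OMP sparse coder (unit-norm atoms assumed)."""
    support, x = [], np.zeros(D.shape[1])
    residual = y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coeffs
        residual = y - D @ x
    return x

def ksvd(Y, n_atoms, k, n_iter=5, seed=0):
    """K-SVD sketch: alternate sparse coding and SVD-based atom updates."""
    rng = np.random.default_rng(seed)
    # Initialize the dictionary with randomly chosen training examples.
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].copy()
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # Phase 1: sparse-code the development data over the current dictionary.
        X = np.column_stack([omp(D, Y[:, i], k) for i in range(Y.shape[1])])
        # Phase 2: update each atom (and its coefficients) via a rank-1 SVD
        # of the residual restricted to the signals that use that atom.
        for j in range(n_atoms):
            users = np.nonzero(X[j, :])[0]
            if users.size == 0:
                continue
            X[j, users] = 0.0
            E = Y[:, users] - D @ X[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]
            X[j, users] = s[0] * Vt[0, :]
    return D, X

rng = np.random.default_rng(3)
Y = rng.standard_normal((10, 40))          # 40 development "supervectors" in R^10
D, X = ksvd(Y, n_atoms=8, k=2)
```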
0:10:28 The difference of the S-KSVD algorithm is that it uses the class labels of the training
0:10:37 examples. In K-SVD, the basic goal is to minimize
0:10:45 the representation error subject to a sparsity constraint. In S-KSVD, apart from minimizing the
0:10:55 representation error, we put a constraint on the class separability also. That means,
0:11:04 at the same time, we have to minimize the representation error while maximizing the separability of the
0:11:10 representations. Here we use a Fisher criterion
0:11:19 along with minimizing the representation error. So this can be considered as
0:11:28 a discriminative dictionary, something like LDA incorporated into the dictionary learning concept. These are the two
0:11:36 methods we have used for learning the dictionary, and our experiments are done using the
0:11:42 NIST 2003 database. Actually, when we were doing these experiments we had access
0:11:47 only to the 2003 data. We are trying to get results using
0:11:53 the latest databases.
0:11:58 Now coming to this work:
0:12:02 unlike the T matrix learning ... I'll go back to the slide ...
0:12:10 this is very much similar to the i-vector formulation. The difference
0:12:16 lies in extracting the representation, which is sparse here, whereas in the
0:12:25 case of the i-vector it is a full (dense) vector. And also in the way we learn the dictionary:
0:12:30 T matrix learning is something similar to PPCA, whereas in this method
0:12:40 the dictionary is learned with a sparsity constraint, that is, expecting a sparse representation.
0:12:48 Also, we have used OMP. OMP is a greedy approach for
0:12:53 minimizing the l0 norm of the vector x. Here we can have either a
0:13:02 constraint over the representation error, or we can have a constraint over the sparsity.
0:13:07 In this figure we examine the effect of sparsity on the final results of
0:13:16 speaker verification.
0:13:21 There are two sparse representation
0:13:27 processes involved: one in the dictionary learning phase, and one at the decoding
0:13:32 or testing phase. So the question is
0:13:37 which sparsity should be used while learning the dictionary, since I am enforcing
0:13:41 a sparsity constraint in dictionary learning as well as
0:13:43 at the time of testing. So we try to
0:13:47 find the optimal or best number for this sparsity constraint.
0:13:54 Actually, while learning the dictionary, we are taking the sparse representation of seen data.
0:14:02 That means we initialize the dictionary, then we try to represent the same data
0:14:09 over the dictionary, and we keep updating. So that process operates on seen
0:14:14 data, and we can expect a
0:14:16 more compact representation compared to doing the sparse representation over unseen data. Here
0:14:23 we have observed that in the dictionary learning phase we should use
0:14:29 very high sparsity, that is, a very small number of atoms selected, whereas in the testing
0:14:36 phase we should relax the constraint and allow a higher number of atoms, because
0:14:43 the test data is unseen; the evaluation data is different.
0:14:49 In this particular work we observed that the selection of five atoms during dictionary
0:14:55 learning and fifty atoms during representation gives the best results.
0:15:06 We compared the results obtained from the i-vector based system and the
0:15:14 sparse representation based system with the K-SVD dictionary.
0:15:23 In terms of equal error rate, the i-vector system is better by about
0:15:31 one percent compared to the sparse representation based system
0:15:37 using the K-SVD dictionary. So, let us look at the distribution of scores to see
0:15:44 exactly how
0:15:48 these two systems differ.
0:15:51 We can see here the true-score and false-score distributions.
0:15:56 The red curve is the false-score distribution and the blue one is the true-score distribution.
0:16:05 This one corresponds to the i-vector system,
0:16:08 and this one to the sparse representation system.
0:16:11 You can see that for the sparse representation, the false-trial scores peak at zero
0:16:20 compared to the i-vector system.
0:16:22 That is expected for the sparse representation:
0:16:27 unlike the i-vectors, in the false-trial case there is a high
0:16:37 chance of obtaining orthogonal representations.
0:16:44 Suppose in the false trials the speakers are different; the atoms selected by two different
0:16:50 speakers can then be different, and the cosine kernel will give zero
0:16:57 scores in many of the trials. That leads to a distribution like this.
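The zero-peak effect described here is easy to see: two sparse codes with disjoint supports have zero inner product, so their cosine score is exactly zero. A tiny illustration with made-up codes:

```python
import numpy as np

# Two sparse codes over the same dictionary that select disjoint sets of atoms,
# as often happens for the two different speakers in a false trial:
x_enroll = np.zeros(10)
x_enroll[[1, 4]] = [0.9, 0.5]
x_test = np.zeros(10)
x_test[[2, 7]] = [1.1, -0.3]

# Disjoint supports -> zero inner product -> cosine score exactly 0.
cos = float(x_enroll @ x_test /
            (np.linalg.norm(x_enroll) * np.linalg.norm(x_test)))
```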
0:17:05 And the mean of the true trials
0:17:10 has shifted towards the right, which is good,
0:17:13 but at the same time the variance of the true trials has increased, which ultimately
0:17:20 makes the system perform slightly worse
0:17:23 than the i-vector system.
0:17:24 In fact, this is the trade-off: the peaky false-trial distribution is good,
0:17:32 and the shift towards the right is good, but at the same time the increased
0:17:36 variance makes the system perform inferior to the i-vector based system.
0:17:43 So, here,
0:17:46 there is another work which tries to do the sparse representation
0:17:52 over the T matrix itself.
0:17:56 In the original work, they used the Lasso algorithm for finding
0:18:08 the sparse representation, instead of the i-vector, over the T matrix. Here we have repeated
0:18:16 this experiment, matching our previous experimental setup, using OMP. As OMP, unlike the
0:18:24 Lasso basis pursuit, has an explicit sparsity constraint, we examined the
0:18:34 change in the sparsity with the T matrix as the dictionary. You can see
0:18:40 here that the blue curve shows the equal error rates corresponding to the
0:18:47 SRSV system with the T matrix,
0:18:50 and the green curve shows the K-SVD based system.
0:18:54 For reference, we have also shown the classical i-vector system.
0:19:03 Here you can see that the T matrix is not learned with a sparsity
0:19:09 constraint: with a small number of atoms selected for the representation it gives a really bad performance, whereas the
0:19:16 K-SVD based system gives a decent, comparable result with the i-vector even with a small
0:19:23 number of atoms selected.
0:19:27 When we increase the number of columns selected, the T-SRSV system ultimately approaches the
0:19:33 i-vector performance, because the i-vector is also a full representation. And, as noted before, the
0:19:41 SRSV with the K-SVD dictionary performs slightly inferior to the
0:19:44 i-vector system.
0:19:49 To conclude this result: the SRSV system with the T dictionary performs poorly with high sparsity and
0:19:56 approaches the i-vector performance
0:19:57 with lower sparsity.
0:19:59 And also, with the use of all atoms, the performance of the T-SRSV system matches
0:20:05 that of the i-vector system.
0:20:08 Now we'll try to find the effect of,
0:20:14 as mentioned in the previous talks also,
0:20:18 larger-dimensional i-vectors. Basically, we go for a larger number of
0:20:25 columns in the T matrix.
0:20:27 This figure shows the effect of the size of the dictionary on the sparse representation
0:20:35 based systems.
0:20:36 The blue curve shows the i-vector performance for
0:20:42 various sizes of the dictionary, the green is the SRSV with the K-SVD dictionary,
0:20:50 and the red is the SRSV with the
0:20:54 T matrix as the dictionary.
0:20:57 Here the number of atoms or columns selected is matched with the K-SVD one, and
0:21:05 the i-vector, of course, uses the full representation.
0:21:10 With that number of atoms selected, the T-SRSV
0:21:16 performance is poor compared to the other two. In the case
0:21:22 of the i-vector, three hundred columns
0:21:28 or four hundred hardly make any difference, but for K-SVD the optimal or
0:21:33 best performance comes
0:21:36 with the chosen number of atoms selected.
0:21:40 So, we then tried increasing the number of atoms selected for representation in the
0:21:51 T-SRSV case, and we found that when we increase that number
0:21:54 for larger sizes of the dictionary also, it approaches the performance of the i-vector system.
0:22:03 Still, the i-vector based system
0:22:04 gives the best performance.
0:22:15 Now, motivated by the performance of these three systems, the K-SVD dictionary based SRSV, the T
0:22:26 dictionary based SRSV, and the i-vector, we tried to combine the strengths of these, and
0:22:37 we proposed a fusion of the
0:22:42 representations. The way we did this: we find the i-vector representation of the supervector using the T
0:22:51 matrix in the conventional way, and then resynthesize the supervector; we term this
0:22:59 T-smoothing. We know that projection onto a lower-dimensional space
0:23:05 reduces the dimensionality and removes small nuisances,
0:23:11 and so this helps in classification. This diagram, or intensity plot, shows the
0:23:21 results of a control experiment with twenty-five speakers,
0:23:25 each having five examples, where we find the cosine-kernel similarity between the
0:23:32 supervectors before this smoothing and after smoothing.
0:23:36 In
0:23:38 all combinations the cosine-kernel similarity shows an improvement, but in
0:23:46 the within-class case the improvement is much better compared to the between-class cases,
0:23:52 which of course helps classification performance. So then, we have
0:23:59 used these smoothed supervectors for learning the dictionary and for the sparse representation.
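The T-smoothing step, projecting the centered supervector onto the T subspace and resynthesizing it, can be sketched as below. Note this uses a plain least-squares projection as a simplified stand-in for the full posterior i-vector estimate, and the matrices here are random stand-ins for trained models:

```python
import numpy as np

rng = np.random.default_rng(7)
sv_dim, w_dim = 200, 20                    # toy sizes; real supervectors are far larger
T = rng.standard_normal((sv_dim, w_dim))   # stand-in for a trained T matrix
m0 = rng.standard_normal(sv_dim)           # stand-in for the UBM mean supervector

def t_smooth(m, m0, T):
    """Project m - m0 onto the column space of T, then resynthesize:
    w = argmin ||(m - m0) - T w||^2,  m_hat = m0 + T w."""
    w, *_ = np.linalg.lstsq(T, m - m0, rcond=None)
    return m0 + T @ w

# A supervector with a component inside the T subspace plus nuisance outside it:
m = m0 + T @ rng.standard_normal(w_dim) + 0.1 * rng.standard_normal(sv_dim)
m_smooth = t_smooth(m, m0, T)              # nuisance outside the subspace is removed
```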
0:24:16 Now coming to the results.
0:24:20 Here we can compare the performance of the
0:24:24 various methods we have tried.
0:24:30 First, comparing the i-vector with the K-SVD dictionary system: as I mentioned, there is
0:24:37 approximately one
0:24:41 percent difference in equal error rate between these two.
0:24:43 For the T matrix based
0:24:45 sparse representation with the full set of atoms selected: this is actually the
0:24:52 best performance, and it came with the full,
0:24:54 dense representation.
0:25:00 Then, another thing which I already mentioned: by use of the discriminative dictionary we got
0:25:06 a huge improvement in performance. Comparing this number against K-SVD, we
0:25:15 need to consider that an LDA-like discrimination is incorporated into the learning, which justifies
0:25:22 the improvement in performance. With smoothing using the T matrix, we recorded approximately
0:25:29 thirty percent relative improvement in the case of both dictionaries.
0:25:36 We have
0:25:38 also done some channel and session variability compensation using joint factor analysis, LDA, and
0:25:45 WCCN.
0:25:54 After this compensation,
0:25:56 we have the i-vector with LDA and WCCN as the channel-compensated system.
0:26:08 Joint factor analysis we used as a preprocessing of the supervectors before doing the sparse representation.
0:26:17 Also, a combination of these two,
0:26:20 the sparse representation with LDA and WCCN, has been tried.
0:26:31 We have also tried the
0:26:32 score-level fusion of the best performing systems,
0:26:35 and this ended up with a
0:26:40 performance of 0.99 percent equal error rate on
0:26:42 the NIST 2003 database.
0:26:46 To summarize:
0:26:49 we have highlighted the close similarity between the i-vector and sparse representation based SV systems.
0:26:55 We have studied the use of the total variability matrix as a dictionary, with matching
0:27:02 pursuit as the
0:27:05 algorithm for sparse representation.
0:27:08 We found that, compared to the K-SVD dictionary, the T matrix can be used as
0:27:14 the dictionary with better results,
0:27:17 but with a high number of atoms selected. Among all the dictionaries, we found that
0:27:22 the supervised one
0:27:26 performed much better than the others. And we also proposed a feature-
0:27:35 level fusion of the i-vector and
0:27:36 sparse representation based systems. And we found that, among the channel and session compensation methods,
0:27:49 in the case of sparse representation, joint factor analysis
0:27:53 based preprocessing worked
0:27:57 better.
0:28:08 Time for questions. Any questions?
0:28:15 So let me ask one... For the sparse representations, you always work with the mean supervectors
0:28:22 from the system? Did you ever try to reconstruct the sufficient statistics?
0:28:27 If you obtain supervectors after adaptation, do you lose some information before...
0:28:35 We actually normalized the supervectors within the cov
0:29:11 What is the motivation for using
0:29:16 sparse representation here,
0:29:19 besides the fact that it's a technique which is available?
0:29:54 They are good speakers, and are representing all s
0:30:02 There can be some dimensions which
0:30:04 are closer to the particular
0:30:27 And just a question... are you sure there?
0:30:30 You need not to be.
0:30:32 The representation
0:30:48 sparse representation
0:31:25 Thank you.
0:32:25 In my experiments I did not find
0:32:52 Let's thank the speaker again.