0:00:16 | Welcome.

0:00:16 | Exploring similarity and fusion of i-vector and sparse representation based speaker verification systems |

0:00:55 | 'Morning. |

0:00:56 | I'm Haris from Indian Institute of Technology Guwahati and I am here to present our |

0:01:01 | work on exploring the similarity and fusion of i-vector and sparse representation based

0:01:07 | speaker verification systems. |

0:01:12 | As we all know, the i-vector based systems form |

0:01:15 | the current state of the art in speaker verification.

0:01:20 | And also recently |

0:01:22 | some works |

0:01:24 | explored the use of sparse representation |

0:01:28 | for speaker verification and general speaker recognition tasks.

0:01:32 | So, in sparse representation based works, both exemplar dictionary based and learned dictionary based techniques

0:01:41 | have been explored. So, in this work we're trying to find the similarity between these... |

0:01:47 | I mean, the sparse representation based and |

0:01:50 | the i-vector based speaker verification systems. We also proposed a feature-level combination or fusion of |

0:01:57 | these two systems exploiting the advantages from both of the systems. |

0:02:05 | So, again, I will have a short review of the i-vector based system, which we're

0:02:12 | all familiar with. So, the i-vector based speaker verification system can be interpreted as finding the

0:02:20 | compact representation of the high-dimensional supervectors by taking projections onto a matrix called the total variability

0:02:28 | matrix, which is low-rank. And the estimation of the i-vector can

0:02:35 | be performed

0:02:40 | using the given equation. And

0:02:44 | so we find the i-vector representation of training as well as testing utterances, and find

0:02:52 | the similarity between these two using the cosine kernel. So now,

0:02:59 | before going to the basics of sparse representation based speaker verification system, we'll have a |

0:03:04 | look at the fundamentals of sparse representation. Sparse representation will try to represent a

0:03:12 | vector, y, using a dictionary, or matrix, as a linear combination of the columns

0:03:20 | of the dictionary matrix. And we also put the constraint that the number of columns

0:03:27 | used should be small. That's how we find the sparse representation vector, which actually takes only

0:03:35 | a few non-zero values.
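Written out, the sparse coding problem described here takes the standard form (ε and K below are generic tolerance and sparsity parameters, not values from the talk):

```latex
\hat{x} \;=\; \arg\min_{x}\; \|x\|_{0}
\quad \text{subject to} \quad \|y - Dx\|_{2} \le \epsilon,
\qquad \text{or equivalently} \qquad
\hat{x} \;=\; \arg\min_{x}\; \|y - Dx\|_{2}^{2}
\quad \text{subject to} \quad \|x\|_{0} \le K,
```

where D is the dictionary matrix and ‖x‖₀ counts the non-zero entries of the sparse vector x.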

0:03:39 | So, the columns of the dictionary are often named atoms

0:03:44 | by the sparse representation community, and the

0:03:52 | transformation matrix is named the dictionary.

0:03:57 | So, sparse representation applications can be found ranging from compression and de-noising to classification.

0:04:08 | So, it has been well explored in the area of image processing and many other |

0:04:15 | areas of signal processing. So, the basic idea behind sparse representation based classification is that the

0:04:26 | test example for a class can be approximated as a linear combination of training examples.

0:04:33 | So, one of the very first works in this direction was on a

0:04:42 | face recognition task. The dictionary was formed using

0:04:52 | training examples from different classes, and classification was performed by finding the sparse representation of

0:05:01 | the test example over the dictionary created from the training examples. So we

0:05:09 | call these approaches exemplar based, because the dictionaries are created using examples

0:05:17 | from different classes. |

0:05:19 | So, motivated by this work, a few works |

0:05:25 | explored the scheme for speaker identification and then for speaker verification. As we know,

0:05:31 | speaker verification is not a closed-set task, so the

0:05:38 | work used a set of background speakers to create the exemplar dictionary. So,

0:05:46 | the claimed speaker's training example and a set of background utterances,

0:05:54 | in the form of vectors, which can be the supervector representation or i-vector

0:05:59 | representation, form the exemplar dictionary. And the test example is scored by finding the sparse

0:06:07 | representation of the test example over the dictionary. And so, for finding the sparse solution,

0:06:14 | there are many algorithms available for finding the sparse solution. It can

0:06:19 | be l0 minimization based or l1 minimization based. An example of l0 minimization is

0:06:27 | orthogonal matching pursuit. Or we can go for basis pursuit, Lasso; any of

0:06:34 | these algorithms.
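As an illustrative sketch (not the authors' implementation), a minimal orthogonal matching pursuit in Python looks like this:

```python
import numpy as np

def omp(D, y, n_atoms):
    """Greedy orthogonal matching pursuit: approximate y with at most
    n_atoms columns (atoms) of the dictionary D. Assumes D has
    unit-norm columns; a toy sketch, not a production solver."""
    residual = y.astype(float)
    support = []                     # indices of selected atoms
    x = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        # Select the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        x = np.zeros(D.shape[1])
        x[support] = coef
    return x

# Toy check: y is built from atoms 0 and 2, and OMP recovers exactly those.
D = np.eye(4)
y = 3.0 * D[:, 0] + 2.0 * D[:, 2]
x = omp(D, y, n_atoms=2)
print(np.nonzero(x)[0])  # atoms 0 and 2 are selected
```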

0:06:37 | So, given this sparse

0:06:40 | representation vector, how do we do the scoring

0:06:50 | for performing the speaker verification? We take the sum of coefficients in that sparse

0:06:57 | vector which correspond to the claimed speaker's training examples. And the

0:07:06 | ratio between the sum corresponding to the claim and the total of all coefficients in

0:07:13 | the sparse vector is considered as the score for verification. This was, actually, proposed in

0:07:20 | an earlier work. In case you have multiple examples for a

0:07:27 | claimed speaker, we'll take the sum of the coefficients corresponding to them. Otherwise, we'll have only

0:07:32 | one example, so we take, basically, the coefficient corresponding to that particular

0:07:39 | example.
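The ratio-based score described here can be sketched as follows (hypothetical toy numbers; the index set of the claimed speaker's examples is an assumed input):

```python
import numpy as np

def src_verification_score(x, claimed_idx):
    """Exemplar-SRC style score: the fraction of the sparse vector's
    total absolute coefficient mass that falls on the claimed
    speaker's training examples."""
    mass = np.abs(x)
    return float(mass[claimed_idx].sum() / mass.sum())

# Dictionary layout: [claimed-speaker examples | background examples].
x = np.array([0.6, 0.3, 0.0, 0.1, 0.0])   # sparse vector of a test utterance
score = src_verification_score(x, claimed_idx=[0, 1])
print(score)  # ≈ 0.9: most mass falls on the claimed speaker's atoms
```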

0:07:42 | so |

0:07:45 | as an improvement over this work, in our previous work we have proposed the use

0:07:51 | of a learned dictionary for doing the sparse

0:07:57 | representation based speaker verification. Here, actually, similar to the i-vector formulation, we learn a dictionary,

0:08:07 | which is, in our task, D.

0:08:10 | And |

0:08:11 | We use the mean-shifted supervectors as the speaker representation, and the training

0:08:20 | as well as testing examples, that are the corresponding supervectors, are represented over

0:08:28 | the learned dictionary, which is also learned from the supervector representations only. And

0:08:37 | the sparse representation of training and the testing examples are extracted using orthogonal matching pursuit. |

0:08:46 | And similar to the i-vector system, we find the similarity between these two representations using |

0:08:53 | the cosine kernel.
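Both the i-vector system and this system score trials the same way; a sketch of the cosine-kernel scoring (toy vectors, illustrative only):

```python
import numpy as np

def cosine_score(w_train, w_test):
    """Cosine-kernel similarity between two utterance representations
    (i-vectors or sparse vectors alike)."""
    return float(np.dot(w_train, w_test) /
                 (np.linalg.norm(w_train) * np.linalg.norm(w_test)))

w1 = np.array([1.0, 2.0, 3.0])
w2 = np.array([2.0, 4.0, 6.0])   # same direction as w1
w3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to w1
print(cosine_score(w1, w2))  # ≈ 1.0
print(cosine_score(w1, w3))  # 0.0
```

Note how two vectors with disjoint supports score exactly zero; this matters for the false-trial score distribution of sparse representations discussed later in the talk.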

0:08:56 | So, in this work we refer to this system as SRSV.

0:09:05 | So, |

0:09:08 | we have used a couple of methods for learning the dictionary. One was the well known

0:09:14 | KSVD algorithm for learning the dictionary for sparse approximation, and we also used a modified version

0:09:22 | of the KSVD algorithm, which is named the S-KSVD algorithm. This is, actually, a supervised

0:09:30 | version of the KSVD algorithm. In KSVD |

0:09:34 | what we do with the development data

0:09:41 | is an iterative method; as shown in the figure, there are two phases. One

0:09:46 | is the sparse coding phase and the other is the dictionary update phase. In the sparse coding phase,

0:09:52 | we initialize the process with a random dictionary, or one created using randomly chosen

0:10:02 | examples, and we find the sparse representation of the development data over the dictionary. And

0:10:08 | in the next stage, we update the dictionary using singular value decomposition; the columns of the

0:10:12 | dictionary are updated,

0:10:15 | and this is done iteratively to get the optimized dictionary. In the sparse coding stage,

0:10:21 | any of the sparse coding algorithms can be used. In our case we have used OMP.
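The two alternating phases can be sketched as below (sparsity fixed to one atom per signal for brevity; the real setting uses OMP with a higher sparsity level, and all dimensions here are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_code_1atom(D, Y):
    """Sparse coding phase, simplified to one atom per signal:
    each column of Y is approximated by its best-matching atom."""
    corr = D.T @ Y                            # atom/signal correlations
    idx = np.argmax(np.abs(corr), axis=0)     # best atom per signal
    X = np.zeros((D.shape[1], Y.shape[1]))
    cols = np.arange(Y.shape[1])
    X[idx, cols] = corr[idx, cols]            # projection coefficients
    return X

def ksvd(Y, n_atoms, n_iter=10):
    """Minimal K-SVD sketch: alternate sparse coding and SVD-based
    atom-by-atom dictionary updates, starting from randomly chosen
    (normalized) training signals."""
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = sparse_code_1atom(D, Y)           # phase 1: sparse coding
        for k in range(n_atoms):              # phase 2: dictionary update
            users = np.nonzero(X[k])[0]       # signals that use atom k
            if users.size == 0:
                continue
            # Residual of those signals with atom k's contribution removed.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            # Best rank-1 fit of E replaces the atom and its coefficients.
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, users] = S[0] * Vt[0]
    return D, X

Y = rng.normal(size=(8, 50))    # toy development data
D, X = ksvd(Y, n_atoms=4)
print(D.shape)  # (8, 4), with unit-norm atoms
```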

0:10:28 | So, the difference of the S-KSVD algorithm is that it basically uses the class labels of the training

0:10:37 | examples. In KSVD, the basic goal is to minimize

0:10:45 | the representation error subject to the sparse constraint. In S-KSVD, apart from minimizing the

0:10:55 | representation error, we put a constraint on the class separability also. That means,

0:11:04 | at the same time, we have to minimize the representation error as we maximise the separability of the

0:11:10 | representations. So here we use a Fisher criterion

0:11:19 | along with minimizing the representation error. So, this can be considered as

0:11:28 | a discriminative dictionary, something like LDA incorporated into the dictionary learning concept. So, these two

0:11:36 | methods we have used for learning the dictionary. Our experiments are done using the

0:11:42 | NIST 2003 database. Actually, when we were doing this experiment we had access

0:11:47 | to only the 2003 data. We are trying to get results using

0:11:53 | latest databases, so |

0:11:58 | now coming to the work |

0:12:02 | unlike the T matrix learning... I'll go back to the slide...

0:12:10 | this is very much similar to the i-vector formulation. The difference

0:12:16 | lies in extracting the representation, which is sparse here, and in

0:12:25 | the case of i-vector it's a full vector. And also, the way we learn the dictionary:

0:12:30 | in the case of T matrix learning it's something similar to PPCA, whereas in this method,

0:12:40 | the dictionaries are learned with the sparse constraint, or expecting a sparse representation.

0:12:48 | And also, we have used OMP. OMP is a greedy approach for

0:12:53 | minimizing the l0 norm of the vector x. So here we can have either a

0:13:02 | constraint over the representation error, or we can have a constraint over the sparsity. |

0:13:07 | So, in this figure we examine the effect of sparsity on the final results of

0:13:16 | speaker verification. |

0:13:17 | so |

0:13:21 | And there are two sparse representation

0:13:27 | processes involved: one in the dictionary learning phase, and one at the decoding

0:13:32 | or the testing phase. So,

0:13:37 | which sparsity should be used for learning the dictionary, and which sparsity

0:13:41 | should be used

0:13:43 | at the time of testing? So we try to

0:13:47 | find the optimal or best number for this sparsity constraint. |

0:13:54 | Actually, while learning the dictionary, we are taking the sparse representation of the seen data. |

0:14:02 | That means that we initialize the dictionary, then we try to represent the same data

0:14:09 | over the dictionary and keep updating. So, that process is over seen

0:14:14 | data. So we can expect a

0:14:16 | more compact representation compared to doing the sparse representation over unseen data. So here

0:14:23 | we have observed that in the dictionary learning phase, we can use

0:14:29 | very high sparsity, that means a very small number of atoms selected, whereas in the testing

0:14:36 | phase, we should relax the constraint to allow a higher number of atoms, because we are working

0:14:43 | on unseen data; the evaluation data are also different.

0:14:49 | So, in this particular work we have observed that the selection of five atoms while dictionary

0:14:55 | learning and fifty atoms while representation is giving the best

0:15:02 | performance. |

0:15:04 | so |

0:15:06 | I compared the results obtained from the i-vector based system, |

0:15:14 | and the sparse representation based system with the KSVD dictionary. So here are the results.

0:15:23 | In terms of equal error rate, the i-vector system is better by

0:15:31 | one percent compared to the sparse representation based system

0:15:37 | using the KSVD dictionary. So, look at the distribution of scores to see, I mean,

0:15:44 | where exactly

0:15:48 | these two systems differed.

0:15:51 | We can see that this is the true scores and false scores distribution. |

0:15:56 | The red curve is the false score distribution and the blue one is the

0:16:02 | true score distribution.

0:16:05 | This one corresponds to the i-vector system,

0:16:08 | and this to the sparse representations.

0:16:11 | So, here you can see that for the false trials, the scores are peaked at zero

0:16:20 | compared to the i-vector system.

0:16:22 | That's expected for the sparse representation:

0:16:27 | unlike with i-vectors, in the false trial case there is a high

0:16:37 | chance to have orthogonal representations.

0:16:44 | Suppose in the false trials the speakers are different; then the atoms selected by the two different

0:16:50 | speakers can be different. So, the cosine kernel will give zero

0:16:57 | scores in many of the trials. That leads to a distribution like this.

0:17:05 | And the mean of the true trials |

0:17:10 | has shifted towards the right, which is good, |

0:17:13 | but at the same time, the variance of the true trials has increased, which ultimately

0:17:20 | makes the system perform slightly worse than

0:17:23 | the i-vector system.

0:17:24 | So, in fact, the peaky false trials distribution is good,

0:17:32 | and also that shifting towards the right is good, but at the same time, the increase in

0:17:36 | the variance makes the system perform inferior to the i-vector based system.

0:17:43 | So, here |

0:17:46 | there is another work which is trying to do the sparse representation |

0:17:52 | over the T matrix system. |

0:17:54 | So, |

0:17:56 | Here, actually, in the original work, they have used the Lasso algorithm for finding

0:18:08 | the sparse representation, instead of the i-vector, over the T matrix. So, here we have repeated

0:18:16 | this experiment, matching with our previous experiment, using OMP. As, unlike

0:18:24 | Lasso or basis pursuit, OMP has a sparsity constraint that we can set directly, we examined the

0:18:34 | change in the sparsity with the T matrix as the dictionary. So, you can see

0:18:40 | here this blue curve shows the performance, the equal error rates, corresponding to the

0:18:47 | SRSV system with the T matrix

0:18:50 | and the green curve shows the KSVD. |

0:18:54 | And as the third, we have shown the classical i-vector system.

0:19:03 | Here you can see that, as the T matrix is not learned with the sparse

0:19:09 | constraint, with a low number of atoms selected for the representation it gives a really bad performance. Whereas the

0:19:16 | KSVD based system gives a decent or comparable result with the i-vector, taking the same low

0:19:23 | number of atoms selected.

0:19:27 | When we increase the number of columns selected, ultimately it approaches the

0:19:33 | i-vector performance, because the i-vector is also a full representation. So here, as noted before, the

0:19:41 | SRSV system did perform slightly inferior to the

0:19:44 | i-vector system. |

0:19:47 | So, |

0:19:49 | So to conclude: the SRSV system with the T dictionary performs poorly with high sparsity and

0:19:56 | approaches i-vector performance |

0:19:57 | with lower sparsity. |

0:19:59 | And also, with the use of all atoms, the performance of the T-SRSV system matches

0:20:05 | that of i-vector system. |

0:20:08 | Now, we'll try to find the effect of, as mentioned in

0:20:14 | the previous talks also,

0:20:18 | larger dimension i-vectors. So here, basically, we'll go for a larger number of

0:20:25 | columns in the T matrix.

0:20:27 | This shows the effect of the size of the dictionary on the sparse representation

0:20:35 | based systems. |

0:20:36 | And this blue curve shows the performance of the i-vector system for, I mean,

0:20:42 | various sizes of the dictionary. And the green is the SRSV

0:20:50 | and the red is SRSV with the |

0:20:54 | T matrix as the dictionary. |

0:20:57 | Here the number of atoms or columns selected is matching with the KSVD one. And

0:21:05 | the i-vector, of course, is a full representation.

0:21:10 | With that number of atoms selected, the T-SRSV performs, I mean,

0:21:16 | not so well compared to the other two. In the case

0:21:22 | of i-vector, three hundred columns

0:21:28 | or four hundred hardly make any difference, but for the KSVD, the optimal or

0:21:33 | best performance

0:21:36 | came with that number of atoms selected.

0:21:40 | So, we tried to increase the number of atoms selected for representation in this

0:21:51 | case, and we found that when we increased the number,

0:21:54 | for all sizes of the dictionary, it approaches the performance of the i-vector system.

0:22:03 | Still, the i-vector based system is |

0:22:04 | giving the best performance. |

0:22:15 | Now, motivated by the performance of these three systems, basically the KSVD

0:22:26 | dictionary based SRSV, the T dictionary based SRSV and the i-vector, we tried to use the power of these and

0:22:37 | we proposed a fusion of the

0:22:42 | representations. The way we did this: we found the i-vector representation of the supervector using the T

0:22:51 | matrix in the conventional way, and then resynthesized the supervectors; we termed this

0:22:59 | T smoothing. As we know, projection to a lower dimensional space reduces

0:23:05 | dimensions and removes small nuisances

0:23:10 | also.
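The T-smoothing step can be sketched like this (a plain least-squares projection stands in for the full i-vector posterior estimate; dimensions are toy values):

```python
import numpy as np

rng = np.random.default_rng(1)

def t_smooth(s, T):
    """Project a mean-shifted supervector s onto the column space of the
    total-variability matrix T and resynthesize it, discarding whatever
    lies outside that low-dimensional subspace (the 'nuisances')."""
    w, *_ = np.linalg.lstsq(T, s, rcond=None)   # low-dimensional coordinates
    return T @ w                                 # smoothed supervector

T = rng.normal(size=(100, 10))   # toy 100-dim supervectors, rank-10 T
s = rng.normal(size=100)         # a mean-shifted supervector
s_hat = t_smooth(s, T)
print(s_hat.shape)  # (100,): same space, but confined to T's column space
```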

0:23:11 | And so, this helps in classification. So this diagram, or this intensity plot, shows the

0:23:21 | results of a control experiment with twenty five speakers, each

0:23:25 | speaker having five examples, and we find the similarity, with the cosine kernel, between the

0:23:32 | supervectors before this smoothing and after smoothing.

0:23:36 | So, in

0:23:38 | all combinations of cases, the similarity on the cosine kernel shows an improvement. But in the

0:23:46 | case of within-class pairs, the improvement is much better compared to the between-class cases,

0:23:52 | which, of course, will be helping the classification performance. So then, we have

0:23:59 | used these smoothed supervectors for learning the dictionary and for sparse representation.

0:24:16 | We will be coming to the results later.

0:24:20 | Here we can compare the performance |

0:24:24 | of the various methods we have tried.

0:24:30 | First, compare the i-vector with the KSVD dictionary system; as I mentioned, there is

0:24:37 | approximately a one

0:24:41 | percent equal error rate difference between these two.

0:24:43 | Then the T matrix based

0:24:45 | sparse representation with full atoms selected; this is, actually, with all atoms selected, and the

0:24:52 | best performance came with the full

0:24:54 | representation.

0:25:00 | Then another thing, which I already mentioned: by use of the discriminative dictionary, we have

0:25:06 | a huge improvement in performance. Comparing this number against the KSVD, we

0:25:15 | need to consider that an LDA-like discrimination is incorporated into it, and that justifies

0:25:22 | the improvement in the performance. And with smoothing using the T matrix, we recorded approximately

0:25:29 | thirty percent relative improvement in case of both the dictionaries. |

0:25:36 | So, we have

0:25:38 | also done some channel and session variability compensation, using joint factor analysis and LDA and

0:25:45 | WCCN. So, after this

0:25:54 | compensation,

0:25:56 | we have the i-vector with LDA and WCCN as the channel compensation methods.

0:26:08 | Actually, joint factor analysis, we used as a preprocessing of the supervectors before doing sparse |

0:26:15 | representation. |

0:26:17 | Also, a combination of these two,

0:26:20 | the sparse representation with LDA and WCCN, has been tried.

0:26:31 | And also, we have tried to do the

0:26:32 | score level fusion of the best performing systems,

0:26:35 | and this ended up with a

0:26:40 | performance of 0.99 equal error rate on

0:26:42 | the NIST 2003 database.

0:26:46 | To summarize |

0:26:49 | We have highlighted the close similarity between the i-vector and sparse representation based SV system. |

0:26:55 | We have studied the use of total variability matrix as a dictionary with the matching |

0:27:02 | pursuit as the |

0:27:05 | algorithm for sparse representation. |

0:27:08 | We found that, compared to the KSVD dictionary, the T matrix can be used as

0:27:14 | the dictionary with better results,

0:27:17 | but with a high number of atoms selected. Among all the dictionaries, we found that

0:27:22 | the supervised one

0:27:26 | performed much better than the others. And we also proposed a feature

0:27:35 | level fusion of the i-vector and |

0:27:36 | sparse representation based systems. And we found that, among the channel and session compensation methods,

0:27:49 | in the case of sparse representation, joint factor analysis

0:27:53 | based preprocessing worked

0:27:57 | better.

0:28:08 | Time for questions, any questions? |

0:28:15 | So let me ask one... for the sparse representations, you always work with mean supervectors

0:28:22 | from the system? Did you ever try to reconstruct the sufficient statistics, or do you

0:28:27 | obtain supervectors by adaptation? Or do you lose some information before?

0:28:35 | We actually normalized the supervectors within the cov...

0:29:11 | What is the motivation for using |

0:29:16 | sparse representation here? |

0:29:19 | besides the fact that it's a technique which is available?

0:29:54 | They are good speakers, and are representing all s |

0:30:02 | There can be some dimensions which |

0:30:04 | are closer to the particular |

0:30:27 | And just a question... you sure there? |

0:30:30 | You need not to be. |

0:30:32 | The representation |

0:30:48 | sparse representation |

0:31:25 | thank you |

0:32:25 | In my experiments I did not find |

0:32:28 | sparsity |

0:32:52 | Let's thank the speaker again. |