Speech Transcript - Hierarchical speaker clustering methods for the NIST i-vector Challenge

0:00:15	i will present the work that we did for
0:00:19	our participation in the i-vector challenge
0:00:21	and actually there is some more in the slides that was not
0:00:27	presented in the paper
0:00:29	but that was submitted in the system description i think that this is worth
0:00:33	sharing with you guys
0:00:35	so
0:00:38	here's the outline of my talk so first i will present the
0:00:41	progress of our system
0:00:43	and then i will detail two ideas that are the clustering
0:00:47	and the score normalisation before concluding the talk
0:00:52	so
0:00:54	so here is
0:00:54	the timeline of the mindcf
0:00:58	for our system
0:01:00	so starting from the baseline with a
0:01:03	mindcf of zero point three eight six
0:01:06	we end up with a mindcf of zero point two four seven
0:01:09	which makes a
0:01:11	relative improvement of about thirty six percent
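
The relative improvement quoted here can be checked with one line of arithmetic. The sketch below assumes the two minDCF values are read as 0.386 (baseline) and 0.247 (final system); the variable names are illustrative only.

```python
# assumed readings of the two minDCF values quoted in the talk
baseline_min_dcf = 0.386
final_min_dcf = 0.247

# relative improvement = reduction divided by the starting value
relative_improvement = (baseline_min_dcf - final_min_dcf) / baseline_min_dcf
print(f"relative improvement: {relative_improvement:.1%}")
```
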
0:01:14	so i'm gonna present the main directions in a graphical
0:01:20	manner so we have the development set and we have the evaluation set that is
0:01:23	split into enrollment and test
0:01:25	and we have these three steps that were in the baseline so
0:01:31	we have the whitening the length normalisation and the cosine scoring
0:01:34	and as we see only the whitening needs training
0:01:38	and so we don't really need the labels of
0:01:43	the development set for that
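
The three baseline steps named here (whitening, length normalisation, cosine scoring) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the challenge's reference code: the function names, the eigenvalue floor and the random toy data are all hypothetical. It shows why labels are not needed, since the whitening is estimated from the pooled development i-vectors alone.

```python
import numpy as np

def train_whitening(dev_ivectors, floor=1e-10):
    """Estimate the mean and a whitening matrix from unlabeled i-vectors."""
    mean = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    # scale each eigenvector by 1/sqrt(eigenvalue): W = V diag(eigval^-1/2)
    whitener = eigvec / np.sqrt(np.maximum(eigval, floor))
    return mean, whitener

def whiten(x, mean, whitener):
    """Centre and decorrelate one i-vector or a matrix of i-vectors."""
    return (x - mean) @ whitener

def length_normalize(x):
    """Project i-vectors onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_score(enrol, test):
    """Cosine similarity; a plain dot product once both sides have unit length."""
    return float(np.dot(enrol, test))

# toy data standing in for real i-vectors (random, for illustration only)
rng = np.random.default_rng(0)
dev = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
mean, whitener = train_whitening(dev)
enrol = length_normalize(whiten(dev[0], mean, whitener))
test = length_normalize(whiten(dev[1], mean, whitener))
score = cosine_score(enrol, test)
```

In a real system the enrolment side would typically be built from the target speaker's enrolment i-vectors rather than a single development vector; the toy usage above only exercises the functions.
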
0:01:45	so starting from this baseline something that can be done is
0:01:50	to better choose the data for the whitening
0:01:55	i mean if we take only
0:01:57	the utterances with more than thirty five seconds in our experiments
0:02:01	we are getting
0:02:03	some improvement with a mindcf of zero point three seven
0:02:06	two
0:02:08	so afterward i'm gonna use these conditioned i-vectors so
0:02:12	i'm gonna use this deaf twenty about two and later experiments
0:02:16	like to systems
0:02:18	so
0:02:19	so the next step that we did is the clustering
0:02:23	so
0:02:25	for the clustering we actually tried different kinds of clustering and i'm gonna come
0:02:29	back to this later on but one of the best clusterings that we are
0:02:33	getting is what i call the cosine-plda clustering
0:02:36	and so actually
0:02:39	after this clustering we take only
0:02:41	the clusters that have more than two i-vectors in them
0:02:46	and now we can apply
0:02:50	supervised techniques like lda plda wccn and others
0:02:54	so here we just add lda and clustering in the loop and you
0:02:59	can see that we can already get some improvement with a mindcf of zero point
0:03:04	three five six
0:03:09	so what we tried next is
0:03:12	to replace the cosine scoring by other kinds of scoring the first of
0:03:16	them was the svm
0:03:17	so here we trained a linear svm for every target speaker
0:03:23	where we have only one positive sample the length-normalized average
0:03:27	i-vector of the target speaker and the negative samples are the length-normalized i-vectors
0:03:32	of the processed
0:03:35	development set
0:03:37	so here we can get some more improvement with the mindcf of three
0:03:41	hundred two
0:03:43	you're two
0:03:44	so next we added the wccn in the loop
0:03:50	just after the lda
0:03:51	so here for the svm we did not get any improvement
0:03:55	but for
0:03:57	the plda that i will explain on the next slide the wccn
0:04:02	was helpful
0:04:03	so here
0:04:05	is the plda so we use our scalable implementation of the
0:04:08	standard plda
0:04:09	and so the scores are the likelihood ratio between the average i-vector of the
0:04:14	target speaker and the test i-vector note here that the average i-vector is
0:04:19	not normalized in this case which is not the case for the svm
0:04:24	so here again we can get additional improvement with a mindcf of zero point
0:04:28	two nine two
0:04:32	afterward we tried some
0:04:35	score normalisation ideas
0:04:38	actually i tried the
0:04:42	s-norm and others and the one that was working the best is the as-norm
0:04:47	but i will also come back to this later
0:04:50	so actually the as-norm was usually used only at the recognition level but
0:04:54	i also applied it for clustering so here when we apply it for clustering we can
0:04:59	get additional improvement to a mindcf of zero point
0:05:03	two eight six
0:05:06	then i applied this as-norm after the plda scoring and you can
0:05:13	also get another jump in performance with a mindcf of zero point two
0:05:16	five eight and this was the system that was submitted
0:05:21	at the deadline of the evaluation
0:05:25	afterward i also had another idea which is to replace this cosine-based
0:05:30	clustering by svm clustering which is also done in a hierarchical manner
0:05:36	and here also we can get
0:05:39	additional improvement to a mindcf of
0:05:42	zero point two four seven which is very close to the best performing
0:05:45	system
0:05:46	so this is more or less the idea of the
0:05:49	progression of our system
0:05:51	we don't have fusion and we don't have quality measure
0:05:53	functions
0:05:56	so that's it for the progress now i will start with the clustering
0:06:02	so for the clustering
0:06:05	clustering was already studied in the literature for i-vectors in either an unsupervised
0:06:12	manner or a supervised manner for example the work from mit on cosine-based k-means
0:06:18	clustering in which the number of clusters is known a priori and which
0:06:22	was applied to conversational telephone speech
0:06:26	and then they improved the system by using cosine-based spectral clustering along with
0:06:30	a simple heuristic for computing the number of clusters automatically
0:06:36	other work from crim was using the cosine-based mean shift clustering
0:06:41	so those methods if i'm not wrong all use cosine-based scoring
0:06:47	other methods use supervised clustering like the one from lium where they used
0:06:52	integer linear programming
0:06:55	but in their method the distance metric i think the mahalanobis distance
0:06:59	requires labeled training data
0:07:01	in order to compute the within-class covariance matrix
0:07:04	other works from the project at all were using the plda
0:07:09	based clustering but of course this plda needs labeled external data
0:07:14	in order
0:07:15	to compute the plda model and then to compute the
0:07:21	similarity measure and do the hierarchical clustering
0:07:27	so actually we tried different kinds of clustering i'm not going to mention
0:07:31	all of them one of those was the ward clustering
0:07:34	and actually it is a well-known hierarchical clustering where the goal
0:07:38	is to optimize an overall objective function by minimizing the within-class scatter
0:07:45	this clustering is very fast
0:07:48	since it uses the lance-williams algorithm
0:07:50	in a recursive manner
0:07:56	and actually the problem of this algorithm
0:08:00	is that it needs the euclidean distance to work well
0:08:05	and the problem
0:08:06	is that it was shown in this work that the euclidean distance is not as
0:08:10	good as the cosine distance
0:08:12	the other clustering that we tried is what i call the cosine-plda
0:08:16	clustering so it's a two-step clustering
0:08:18	where the first step is based on the
0:08:22	cosine measure
0:08:23	so
0:08:24	actually after each iteration the similarity measure is updated by computing the cosine measure
0:08:30	between the average i-vectors of the resulting clusters
0:08:32	and here we decide to stop early in the clustering process in order
0:08:37	to ensure high purity clusters
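
The first, cosine-based step described here can be sketched as a greedy bottom-up loop. This is a naive illustration, not the authors' implementation: the function name is hypothetical, and stopping at a fixed `n_clusters` is an assumed stand-in for the early-stopping rule, which the talk does not spell out. After every merge the cluster averages are recomputed and the cosine similarities refreshed.

```python
import numpy as np

def cosine_agglomerative(ivectors, n_clusters):
    """Greedy bottom-up clustering on cosine similarity of cluster means."""
    clusters = [[i] for i in range(len(ivectors))]
    while len(clusters) > n_clusters:
        # recompute the length-normalised average i-vector of every cluster
        means = np.stack([ivectors[c].mean(axis=0) for c in clusters])
        means = means / np.linalg.norm(means, axis=1, keepdims=True)
        sim = means @ means.T
        np.fill_diagonal(sim, -np.inf)  # never merge a cluster with itself
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        a, b = int(min(a, b)), int(max(a, b))
        # merge the most similar pair and drop the absorbed cluster
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# toy 2-d vectors: three point near the x axis, three near the y axis
toy = np.array([[1.0, 0.1], [1.0, 0.0], [1.0, -0.1],
                [0.1, 1.0], [0.0, 1.0], [-0.1, 1.0]])
result = cosine_agglomerative(toy, n_clusters=2)
```

Recomputing the full similarity matrix at every merge is cubic in the number of i-vectors, so this form only suits small toy sets; stopping while many clusters remain is what keeps the clusters pure.
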
0:08:41	so once we have this first set of clusters we can start
0:08:45	a second step of clustering
0:08:48	this second step of clustering is based on the plda
0:08:53	and actually we did it somehow differently from others so actually
0:08:58	after each iteration we could train the plda model and compute again
0:09:06	the similarity matrix
0:09:10	but since doing this is somehow costly we do it
0:09:14	every five hundred
0:09:17	merges
0:09:19	so i'm gonna show
0:09:22	this figure that shows the evolution of the mindcf in terms of
0:09:26	the clustering process
0:09:29	on the progress set using as backend the plda model scoring
0:09:35	so as we see both
0:09:38	ward clustering which is in blue and cosine-plda clustering which is in
0:09:42	red can get better performance than the
0:09:46	baseline system and also we can see that cosine-plda clustering is much better than the
0:09:50	ward clustering
0:09:52	and in this experiment the best results were obtained with
0:09:55	a number of clusters of sixteen fell
0:10:01	let me now talk a bit about the score normalisation
0:10:04	so as i said we tried different kinds of normalization one of
0:10:07	the most the successful one was introduced by professor can but and he's as soon
0:10:13	then i think energy models on the paper
0:10:16	so this actually works quite nicely with an unlabeled cohort set which is the case
0:10:22	in our scenario
0:10:24	so as i said i used it for both recognition and clustering so
0:10:29	for recognition
0:10:30	the cohort set
0:10:31	that i used was all the development set
0:10:33	so the thirty six thousand
0:10:37	i-vectors and what i took is the top-k nearest
0:10:41	neighbour i-vectors to both the target speaker i-vector and the test i-vector
0:10:48	so as you see in the formula it's a symmetric formula
0:10:52	so we have mu and sigma involved in this formula where the mean mu
0:10:57	k for instance just means that
0:10:59	we take the top one thousand five hundred scores
0:11:05	that are the highest scores for
0:11:08	the target speaker and then we take the mean and the
0:11:12	same for sigma the standard deviation and that's it
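
The normalisation described here can be sketched as follows. The slide formula itself is not in the transcript, so this is one common symmetric form written under assumptions: the averaging factor of 0.5, the use of the population standard deviation, and all function names are illustrative. For each side of the trial, the mean and standard deviation are taken over only the k highest cohort scores, and the two normalised scores are averaged.

```python
import numpy as np

def top_k_stats(vector, cohort, k):
    """Mean and standard deviation of the k highest cohort scores."""
    scores = cohort @ vector       # cosine scores for unit-length inputs
    top = np.sort(scores)[-k:]     # scores of the k nearest cohort neighbours
    return top.mean(), top.std()

def adaptive_s_norm(raw_score, enrol, test, cohort, k):
    """Symmetric normalisation using per-side top-k cohort statistics."""
    mu_e, sigma_e = top_k_stats(enrol, cohort, k)
    mu_t, sigma_t = top_k_stats(test, cohort, k)
    # the enrolment side and the test side are treated alike
    return 0.5 * ((raw_score - mu_e) / sigma_e + (raw_score - mu_t) / sigma_t)

def unit(x):
    """Scale vectors to unit length so dot products are cosine scores."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# random toy vectors standing in for the cohort and one trial
rng = np.random.default_rng(1)
cohort = unit(rng.normal(size=(200, 8)))
enrol, test = unit(rng.normal(size=8)), unit(rng.normal(size=8))
raw = float(enrol @ test)
normed = adaptive_s_norm(raw, enrol, test, cohort, k=20)
```

The talk quotes k as one thousand five hundred with the whole development set as cohort; the toy sizes above are only there to keep the example fast. No speaker labels are used anywhere, which is why an unlabeled cohort is sufficient.
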
0:11:15	so we have more or less the same formula that was used for
0:11:19	clustering
0:11:20	and here of course it's between
0:11:23	a pair of clusters
0:11:26	and the cohort set in this case is actually all the
0:11:30	average i-vectors that are not concerned in this measure
0:11:36	so all the other clusters
0:11:38	but not this one and this one
0:11:41	so
0:11:44	so with that i'm gonna
0:11:46	conclude so
0:11:47	actually this evaluation was very helpful for us we learned a lot
0:11:51	of things
0:11:53	and it was i mean also especially successful
0:11:57	so
0:11:59	and also we found that clustering is
0:12:02	very important
0:12:03	and also the adaptive symmetric normalization
0:12:06	these results can be reproduced with our open-source libraries
0:12:11	that you can see at this link and you can also
0:12:16	use you know what it and icassp paper
0:12:20	as future work and we already started working on it is
0:12:24	how to automatically
0:12:26	determine the stopping criterion of the clustering process and actually we have
0:12:30	some ideas
0:12:31	that i hope we can later share with you guys
0:12:34	so like the variation of the mindcf on the development set and
0:12:38	the variation of the number of clusters of nothing written a clusters
0:12:42	and also the possible use of spectral clustering
0:12:46	and a good idea maybe for the next evaluation that could
0:12:51	be considered
0:12:52	because of its potential applications is semi-supervised clustering
0:12:58	so actually here there are many techniques that are ready to use like
0:13:02	co-training and others
0:13:04	thank you for
0:13:27	congratulations that was a very good system getting these results without fusion is amazing i have
0:13:34	the slight impression that you make a distinction between supervised and unsupervised if you
0:13:40	can go back pieces like
0:13:42	i could easy to go back then slide
0:13:47	well i think this distinction is a little bit arbitrary a good as the unsupervised
0:13:53	we use the tree with that muhammad since i we used we try to use
0:13:59	labels and of course to what's better in the best results we demonstrated was it
0:14:04	was of course
0:14:05	it's always a good idea to use some labels if you have them and
0:14:08	my impression is that the only way to get a fully unsupervised clustering without knowing
0:14:14	the number of classes is something like a bayesian method although in the
0:14:20	mean shift there are some tricks if you check the original paper
0:14:24	of comaniciu you'll see that there are some tricks that you can do
0:14:28	and that are successful in image processing so you can somehow estimate the number
0:14:33	of classes but i think that also the guys
0:14:40	from lium that you have as the most supervised
0:14:44	they used it also with the standard prewhitening without even
0:14:49	caring about the labels and the system works fine as well so
0:14:54	it's a little bit arbitrary for me this distinction
0:14:58	i think in my sense
0:15:01	supervised and unsupervised are just
0:15:04	in the sense of labeled or unlabeled training data
0:15:08	and actually i think
0:15:21	i just have a question about your svm you said you used a single
0:15:25	positive example from the averaged i-vector instead of
0:15:30	five positive examples did you try both
0:15:33	i that the
0:15:35	so you see a number of summation i tried many and actually
0:15:38	as so this one this one was what you the best
0:15:42	and i forgot to mention that it's was in and by used the
0:15:46	it's not you will the weights would like zero point one for positive ends you
0:15:50	want mine for negative
0:15:52	so that's but i think it's not it's not
0:15:54	well we gain bit by doing this
0:15:57	it's more or less the same if use the
0:16:01	but i mean
0:16:03	by the
0:16:05	by nist in a t
0:16:07	but as an online
0:16:17	the new could just the
0:16:19	say what they wanted to say about is em so i just have a comment
0:16:27	i never had a progress
0:16:30	slide like the one you showed in
0:16:33	your third slide so
0:16:36	when you were developing the system you had only progress which is
0:16:41	wonderful a really wonderful situation to be in
0:16:46	but it's very interesting for us to know also your negative trials what you
0:16:52	tried and what was not efficient
0:16:56	during the development of your system it's a somewhat the but it's interesting for me
0:17:01	that's true and
0:17:02	well if i want to talk about the things that did
0:17:05	not work i think it takes
0:17:07	but whatever that's
0:17:20	so you show some
0:17:24	different approaches for clustering but like you is just a few of the system
0:17:33	when distance slightly and the stuff to get a the backend was only the lda
0:17:38	the system
0:17:39	you
0:17:41	i combination different back end and i try
0:17:47	it was also
0:17:49	a different from the others i
0:17:52	put me it was not
0:17:53	maybe
0:17:54	some small gain something forensic
0:17:57	i guess regression and you
0:18:02	use measure
0:18:10	i think that the adaptive score normalization was doing the work of what
0:18:15	the
0:18:17	quality measure was doing for others i think it was also that
0:18:23	you know find
0:18:24	what the

NIST i-vector Special Session

Elie Khoury, Laurent El Shafey, Marc Ferras and Sebastien Marcel