0:00:15 Hi. First I want to thank you all for staying here, because I thought that only me, my colleagues and the cameraman would be here.
0:00:26 The work was done by my co-author, and he should have presented it himself, but unfortunately, about ten days ago he got married. So he preferred to go to Costa Rica, or somewhere like that, rather than come here and present the work; he asked me instead, so I am stuck with this work, and you will have to suffer me for about ten minutes.
0:01:09 After that... most of this workshop has also been about DNNs, so I felt that I also want to say something about that. So I will briefly talk about it a little bit, then I will give the motivation for the clustering problem, describe the basic mean shift algorithm and the modifications we made, and then present the clustering system, some experiments, and a summary.
0:01:45 So, about the DNNs... okay, next.
0:01:57 Our problem is the following. We have a taxi station with many taxis; each car has not just one driver, the drivers change, and we have recordings over quite a long time, two or three days. They speak using push-to-talk, so we know exactly where each segment starts and where it ends, and we collected all of this from the recording devices.
0:02:36 At the end of the day we want to know which segments were said by one speaker and which by another. Each time someone talks we don't know who it is; it can be that one speaker talks now and the next time he speaks is after two or three hours, maybe from a different car.
0:03:04 What we have is a bag of unlabeled segments, and we want to cluster them. Mostly these segments are very short, on average one and a half to two seconds, so we want to cluster short segments, and usually the speaker population is quite wide: thirty speakers, forty speakers, and so on.
0:03:37 So this is our problem: given many short segments, we want to cluster them into homogeneous groups. That means we want good cluster purity, so that each cluster is occupied mostly by one speaker only, but we also want speaker purity, so that the same speaker is not spread over ten clusters.
0:04:19 The basic mean shift algorithm works like this: we have many vectors, we pick one vector, find the set of vectors close to it, that is, take all the vectors whose distance is below some threshold, and shift the point to the weighted mean of these vectors. We take that mean as the new reference point, again look for the neighbours below the threshold, calculate the mean, and so on, until we converge to some point. We do this for each point, for each vector. This algorithm has been discussed here many times, so for more details please refer to the papers.
0:05:23 After we find the stable point of each vector, we group together all the stable points that are close to each other according to some threshold. The number of groups we get is the number of clusters, and the points in each group are the members of that cluster. But we know that the Euclidean distance is not very good for the purpose of speaker clustering.
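As an illustration, here is a minimal Python sketch of the fixed-threshold mean shift and the final grouping step described above; it uses a flat (unweighted) window for simplicity, whereas the talk mentions a weighted mean, so treat the details as assumptions rather than the exact system.

import numpy as np

def basic_mean_shift(X, window, merge_thr, n_iter=100, tol=1e-6):
    # Shift every vector to the mean of its neighbours inside a fixed
    # Euclidean window until it converges to a stable point, then group
    # stable points that are close to each other: one group = one cluster.
    modes = X.copy()
    for i in range(len(X)):
        m = X[i].copy()
        for _ in range(n_iter):
            dist = np.linalg.norm(X - m, axis=1)
            neighbours = X[dist < window]          # all vectors below the threshold
            if len(neighbours) == 0:
                break
            new_m = neighbours.mean(axis=0)        # shift to the mean of the window
            if np.linalg.norm(new_m - m) < tol:    # converged to a stable point
                break
            m = new_m
        modes[i] = m
    labels = np.full(len(X), -1, dtype=int)
    centers = []                                   # one representative per group
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < merge_thr:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, centers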
0:05:57 So, what was presented before used the cosine distance; here we use PLDA scoring instead of the cosine. Instead of looking for the closest vectors in the sense of Euclidean distance, we look for the vectors with the highest PLDA score and calculate the new mean, where the function g is the weight, and the weight is basically the PLDA score. The other difference is that we do not use a threshold to find the vectors below it; instead we use k-nearest-neighbours: we set some k and take the k nearest vectors, meaning the ones with the highest PLDA score.
0:07:07 So basically we keep all these equations, but in this case the threshold is not fixed as in the original algorithm; it depends on the distance of the k-th nearest vector. We then run the same mean shift algorithm: we calculate the mean according to these k nearest vectors, shift the mean, and continue the process again.
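As a rough sketch of this kNN variant, assuming a function plda_score(a, b) that returns the PLDA score between two i-vectors; using the scores directly as the weights, and clipping them to stay positive, are my own simplifications, not necessarily what the authors did.

import numpy as np

def knn_plda_shift(X, m, k, plda_score):
    # One shift: instead of a fixed window, take the k vectors with the
    # highest PLDA score against the current point m and move m to their
    # score-weighted mean.
    scores = np.array([plda_score(m, x) for x in X])
    knn_idx = np.argsort(scores)[-k:]                # k "nearest" = k highest PLDA scores
    weights = np.maximum(scores[knn_idx], 0) + 1e-8  # assumption: keep weights positive
    return np.average(X[knn_idx], axis=0, weights=weights)

def knn_plda_mean_shift(X, k, plda_score, n_iter=100, tol=1e-6):
    # Run the shift for every i-vector until it converges to a stable point;
    # the stable points are then grouped into clusters as in the basic version.
    modes = X.copy()
    for i in range(len(X)):
        m = X[i].copy()
        for _ in range(n_iter):
            new_m = knn_plda_shift(X, m, k, plda_score)
            if np.linalg.norm(new_m - m) < tol:
                break
            m = new_m
        modes[i] = m
    return modes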
0:07:48 So here we have the i-vectors, though I will explain shortly that we apply a small modification to them, and then we apply the mean shift algorithm according to the PLDA score, and we get the results I just mentioned. We compare against the previous work, where the threshold was fixed, the cosine score was used as the distance, and a random mean shift was used, meaning that we do not go over all the points but only over a randomly chosen subset of them.
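The difference between the random and the full mean shift is only in which points serve as starting seeds; a minimal sketch of that choice (the 10% sampling fraction is purely illustrative):

import numpy as np

# Full mean shift: every segment's i-vector is shifted to its own stable point.
# Random mean shift: only a random subset is shifted, which is much cheaper
# computationally but, as reported in the talk, somewhat less accurate.
def pick_random_seeds(X, fraction=0.1, seed=None):
    rng = np.random.default_rng(seed)
    n_seeds = max(1, int(fraction * len(X)))
    idx = rng.choice(len(X), size=n_seeds, replace=False)
    return X[idx]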
0:08:32 So before clustering we of course need to train a UBM and the total variability matrix. But before using the PLDA score, we found that it is better to first apply PCA to the data, just an additional PCA step: we reduce our i-vectors from a dimension of four hundred down to two hundred fifty. We tried to compare this with simply extracting i-vectors of size two hundred fifty, and the PCA version worked better. We are not sure why, but in fact it was better. Next we do whitening, and then apply the PLDA scoring on these vectors.
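A minimal sketch of that preprocessing chain as described (PCA from 400 down to 250 dimensions, whitening, then PLDA scoring); the scikit-learn PCA and the placeholder plda_score call are assumptions for illustration only.

import numpy as np
from sklearn.decomposition import PCA

def preprocess_ivectors(ivectors_400d, target_dim=250):
    # Reduce the 400-dimensional i-vectors to 250 dimensions with PCA;
    # whiten=True also scales each retained component to unit variance.
    pca = PCA(n_components=target_dim, whiten=True)
    reduced = pca.fit_transform(ivectors_400d)
    return reduced, pca

# The reduced, whitened vectors are then compared with a PLDA model, e.g.
#   score = plda_score(reduced[i], reduced[j])
# where plda_score stands for whatever PLDA implementation is used.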
0:09:26 Okay, so this was explained before.
0:09:34 The experimental setup was that we took NIST 2008 data and cut it into short segments, on average about two and a half seconds, with an average of about five segments per speaker. We evaluate the results according to the average speaker purity, the average cluster purity and the K parameter, and another important measure is how many clusters we end up with compared to the true number of clusters.
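For reference, these purity measures can be computed along the following lines; the formulas for average cluster purity (ACP), average speaker purity (ASP) and K = sqrt(ACP * ASP) follow the definitions commonly used in speaker clustering work, so this is a sketch rather than the exact scoring script used for the talk.

import numpy as np

def purity_measures(cluster_ids, speaker_ids):
    # counts[i, j] = number of segments from speaker j placed in cluster i
    clusters = list(np.unique(cluster_ids))
    speakers = list(np.unique(speaker_ids))
    counts = np.zeros((len(clusters), len(speakers)))
    for c, s in zip(cluster_ids, speaker_ids):
        counts[clusters.index(c), speakers.index(s)] += 1
    n = counts.sum()
    n_per_cluster = counts.sum(axis=1)
    n_per_speaker = counts.sum(axis=0)
    acp = np.sum((counts ** 2).sum(axis=1) / n_per_cluster) / n  # average cluster purity
    asp = np.sum((counts ** 2).sum(axis=0) / n_per_speaker) / n  # average speaker purity
    k = np.sqrt(acp * asp)                                       # the K parameter
    return acp, asp, k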
0:10:22 So, starting from the beginning: we begin with the previous system, cosine distance with a fixed threshold; that is the red line. We see that we have a problem: we need to know exactly what the best threshold is for the clustering to work. When we use k-nearest-neighbours instead of the threshold, we see that there is a plateau, so it does not make much difference whether we choose, say, thirteen or seventeen; it is much more robust when we use k-nearest-neighbours. All these results are for thirty speakers.
0:11:22 Next, instead of using the random mean shift, we use the full mean shift, which is much more expensive computationally, but we see that we get some gain; this is still with the cosine distance. Then we switch from the cosine distance to the PLDA score, and we get a little bit more gain.
0:11:51 I have to say that both the PLDA training and the WCCN training in the cosine system were done on short segments, not on long segments; shortly we will see why we did it on short segments. When we train the PLDA on long segments we get certain results, but training it on short segments improves the results dramatically. The total variability matrix, on the other hand, was trained on long segments only; we did not use short segments for it, because that was very bad.
0:12:41 All these results are still only with thirty speakers. Here is a summary of all the results: we see that it is better to move from a fixed threshold to k-nearest-neighbours, to go from the random mean shift to the full mean shift, and to move to PLDA, and this gives the best overall results.
0:13:11 A completely non-trivial problem is how many clusters we end up with after the clustering process compared to the actual number of clusters. The red line on the plot is the fixed-threshold system, and if you look at the result it is not so nice: there are really forty-six speakers, but the number of clusters was estimated at about one hundred eighty. That means we get many small clusters; they are very pure, but small, and there are far too many of them. But when we use k-nearest-neighbours, we can see that we are within about a factor of two, about sixty-two clusters, so we get a better K, better clustering performance, with far fewer clusters.
0:14:23 These are the results when we use the cosine distance with a fixed threshold, for different numbers of speakers, from three up to one hundred eighty-eight. When we compare with the proposed algorithm, we will see that in this case the cluster purity is better, which is understandable: there are many small clusters and they are all pure, with one or two segments each. But the K, the overall result, is better in our case, with our algorithm. And for the average number of clusters, you can see that for three speakers it is okay, but by one hundred eighty-eight speakers it is off by almost a factor of ten; we get far more clusters than the true number of speakers.
0:15:25 When we go to the PLDA with k-nearest-neighbours, we get better speaker purity and far fewer clusters, within about a factor of one and a half to two.
0:15:46 This summarizes the results. For three and seven speakers we get slightly better results with the cosine and a fixed threshold, but from fifteen speakers onwards we prefer the PLDA score with k-nearest-neighbours; we see this both in the K results and in the number of clusters.
0:16:22 Okay, so we proposed a new system which gives better clustering performance with a much smaller number of clusters. We pay for this a little bit computationally, because we moved from a random mean shift to a full mean shift. And that is all I have to say. Thank you.
0:16:58 Do we have any questions?
0:17:11 Thank you. You remarked that for short-utterance clustering you got better results when training on the short segments, but removing the longer utterances from the training seems a bit disappointing; would it not be possible to improve the model by also training it on the longer utterances?
0:17:43 The thing is that if you train it on the long segments, there will be a big mismatch between the training condition and the testing: if we train on long segments and then calculate the scores on i-vectors from short segments, it would be somehow inappropriate.
0:18:07 But maybe the speaker subspace would be estimated more correctly with the longer utterances?
0:18:17 Basically it would be much more accurate, but not for our problem. So yes, okay.
0:18:24 So the answer is yes and no: I think there has to be some trade-off between the accuracy of the PLDA training and matching the true problem.
0:18:37 Yes.
0:18:38 ...perhaps as an extension of the proposed method.
0:18:46 Right.
0:18:52 Okay, thank you.
0:18:54 Okay, thank you for your presentation. Can you please go back to the results section where you showed the values of k and the number of speakers? Maybe that one... go further... here. You are increasing the number of speakers while the value of k stays fixed, and the results are going down. Did you try different values of k for the different numbers of speakers?
0:19:27 First, the capital K is the square root of the product of the average speaker purity and the average cluster purity, while the small k is the k of the k-nearest-neighbours. The k shown here is the best one, that is, the result with the best k.
0:19:50 But as you saw before, across the rows there is no big difference whether you use fourteen, fifteen or seventeen: for each fixed number of speakers you can use these different values and get almost the same results, and this holds for any number of speakers that we tested; it reaches a plateau and stays there. I assume that if we increased k to fifty or seventy, the results would start to go down at some point, but for a reasonable k you get almost the same results.
0:20:45 What data did you use to train your PLDA when you used the short segments?
0:20:54 The same data that we used for the UBM; I don't remember exactly, you would have to go to Costa Rica and ask. Anyway, it is not from the test set: let's say we used the same development set as for training the UBM, took part of it, cut it into short segments, and trained the PLDA on those.
0:21:21 Right, but when you cut the short segments, you are taking multiple short segments per telephone call? You take one phone call and make multiple segments out of it?
0:21:32 Yes, but they are chosen randomly, so they come from different sessions for the same speaker.
0:21:37 Okay, but could it still happen that you use several short segments from the same phone call?
0:21:42 It could happen that several of them are, but we just choose them randomly.
0:21:48 I only ask because, jumping back to the earlier question, what we have seen, not for clustering, so maybe it is a different thing, is that in terms of the PLDA parameters you do better training them on the longer utterances, even when doing short-duration tests. That was for speaker recognition, so it may not carry over; it is the only reason I was asking about the data. Since you did a random selection, yes, it is very unlikely that it was concentrated in the same call.
0:22:20 That was my main point. Also, all the segments for the clustering on the test set were chosen randomly, and we ran each experiment ten times, except for the last one with one hundred eighty-eight speakers, because there are only one hundred eighty-eight speakers in the dataset, so we could not sample them randomly.
0:23:02 First of all, one of the things that I liked in the original mean shift algorithm was its probabilistic interpretation, in the fact that the analysis starts with a non-parametric density estimation, meaning that at each point you place a small PDF, either Gaussian or, say, triangular. With a triangular kernel you end up with the kind of threshold window, which is uniform; with a Gaussian kernel you end up again with a Gaussian, because of the differentiation; and so the update rule is derived by simple differentiation in order to find the mode, the point to which it converges. I am wondering, when you choose PLDA, so that you use neither the cosine distance nor the standard squared Euclidean distance that was the initial choice, can you tell us whether this update rule still comes naturally from the same mechanism, from a non-parametric density estimation as you explained?
0:24:23 Okay, so have we done such a density estimation? The answer is no, we haven't; it is more heuristic, but it works, it is useful.
0:24:41 Any more questions? Okay, thank you. So, on to the next speaker.