Speech Transcript - STC Speaker Recognition System for the NIST i-Vector Challenge

0:00:16	i am sergey novoselov and i would like to
0:00:21	tell you
0:00:22	about our
0:00:24	system for the nist i-vector challenge
0:00:29	so
0:00:30	the outline of my talk is as follows
0:00:33	first
0:00:36	i would like to
0:00:37	show you our overall system description
0:00:41	then i will describe the clustering problem and
0:00:47	next
0:00:48	we will present
0:00:53	our subsystems
0:00:55	like the i-vector plda subsystem the be vector
0:00:59	rbm or dbn plda subsystem
0:01:04	and the last one the i-vector lda svm subsystem
0:01:10	next i will talk about
0:01:12	our quality measure function to incorporate
0:01:18	test duration information
0:01:20	in scoring
0:01:21	and
0:01:24	next
0:01:25	subsystem fusion will be presented and finally i will
0:01:32	present our results and make conclusions
0:01:40	let me
0:01:42	show you the overall system description
0:01:45	as you can see we
0:01:49	explored different systems
0:01:53	subsystems
0:01:54	i-vector plda that's a standard one and
0:01:58	state-of-the-art system for the speaker recognition task
0:02:04	as you know and
0:02:07	some novel systems also
0:02:11	and so we used
0:02:13	the rbm or dbn be vector
0:02:17	subsystems
0:02:19	which is based on a plda model
0:02:25	and the last one is
0:02:27	a
0:02:28	well-known lda svm subsystem based on i-vectors
0:02:37	we made a fusion of
0:02:39	different combinations of our subsystems
0:02:44	and also we took into account a quality measure function and so we
0:02:52	incorporated test duration information
0:02:56	to
0:02:57	get good scoring results
0:03:05	so
0:03:06	our system was developed by different authors simultaneously
0:03:11	and that led us to
0:03:14	different clustering algorithms
0:03:17	for the different subsystems
0:03:20	as you can see for
0:03:23	the plda and rbm
0:03:25	plda
0:03:26	subsystems we used
0:03:29	clustering algorithm one
0:03:32	and for the
0:03:33	lda svm subsystem we
0:03:37	have developed
0:03:38	its own clustering
0:03:41	algorithm which is named algorithm two
0:03:48	so a few words about the clustering problem
0:03:51	with which we
0:03:53	were
0:03:54	dealing
0:03:56	so
0:03:59	first we tried to use some standard
0:04:02	techniques for clustering such as
0:04:04	k-means and others
0:04:06	but we didn't succeed with
0:04:10	those techniques
0:04:12	and
0:04:14	there are two empirically established facts from speaker recognition
0:04:18	which can help us
0:04:21	the first of them is that the cosine metric is a meaningful comparison metric in
0:04:26	i-vector space and the second is
0:04:29	that the model averaging normalized i-vectors is
0:04:34	considered the most efficient multi-session
0:04:37	model
0:04:38	so
0:04:39	we decided to use for the initial clustering step only for the initial clustering step
0:04:45	the cosine distance
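The two facts mentioned here (cosine comparison of i-vectors, and a multi-session model built by averaging length-normalized i-vectors) can be illustrated with a minimal, dependency-free sketch. The function names and the toy 2-dimensional vectors are assumptions made for illustration; they are not taken from the talk.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def average_model(sessions):
    """Multi-session model: mean of the length-normalized session vectors."""
    normed = [l2_normalize(s) for s in sessions]
    dim = len(normed[0])
    return [sum(v[i] for v in normed) / len(normed) for i in range(dim)]

# Toy 2-dimensional "i-vectors" (real ones have hundreds of dimensions).
enroll_sessions = [[3.0, 4.0], [4.0, 3.0]]
model = average_model(enroll_sessions)
test_vector = [1.0, 1.0]
score = cosine_similarity(model, test_vector)
```

Because the cosine score ignores vector length, the averaged model does not need to be re-normalized before scoring.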
0:04:49	next we tried to build an iterative plda clustering strategy
0:04:55	after the coarse
0:04:58	cosine initial clustering step
0:05:01	it makes sense to use a more efficient plda metric
0:05:04	which explicitly takes into account
0:05:07	between speaker and within speaker variability
0:05:11	so you can see
0:05:13	the scheme of the
0:05:15	plda clustering on this slide
0:05:18	but we
0:05:20	managed
0:05:22	with only one iteration
0:05:24	we obtained good results after the first iteration of the plda
0:05:29	clustering
0:05:31	so we did
0:05:33	cosine initialization then plda training and
0:05:37	plda clustering
0:05:41	we did
0:05:43	this scenario for both
0:05:45	using both
0:05:47	algorithm one and algorithm two
0:05:52	now i should mention
0:05:56	the plda model because i will need
0:05:59	some
0:06:00	parameter names on the next slides
0:06:03	so we used our model
0:06:08	and the number of eigenvoices in the eigenvoice matrix was n one and
0:06:15	the number of eigenchannels was n two
0:06:22	well
0:06:23	the first
0:06:23	clustering algorithm consists of two
0:06:27	stages
0:06:28	and so
0:06:29	the first stage is
0:06:31	a heuristic search
0:06:33	for the clusters
0:06:35	it is
0:06:37	like a mean shift
0:06:38	clustering algorithm
0:06:41	so we step by step find
0:06:43	the clusters
0:06:46	using mean shift
0:06:48	algorithm
0:06:51	and at the second stage we try to compensate for the error of
0:06:57	dividing one speaker's i-vectors into
0:07:02	different clusters
0:07:06	so we used
0:07:07	a simple bottom-up stage of the
0:07:10	agglomerative hierarchical clustering
0:07:13	and so
0:07:14	used a simple repeat until
0:07:17	loop
0:07:20	here also you can see the reference
0:07:22	to the mean shift clustering
0:07:24	our reviewers told us
0:07:27	that our algorithm is very similar to the
0:07:32	one
0:07:33	that is described
0:07:35	in this work
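The two-stage idea described for the first clustering algorithm can be sketched as follows. This is only an illustration under stated assumptions: the talk does not give the exact update rule, so the seeding order, the convergence test, the use of cluster mean directions in the merge stage, and all function names are hypothetical. The thresholds `tau1` and `tau2` play the role of the stage-one and stage-two thresholds named in the talk.

```python
import math

def unit(v):
    """Return v scaled to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return list(v) if n == 0.0 else [x / n for x in v]

def cos(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def mean_direction(vectors):
    """Unit-length direction of the sum of the given vectors."""
    dim = len(vectors[0])
    return unit([sum(v[i] for v in vectors) for i in range(dim)])

def mean_shift_stage(vectors, tau1, max_iter=50):
    """Stage 1: grow one cluster at a time around a shifting centre."""
    remaining = list(range(len(vectors)))
    clusters = []
    while remaining:
        members = [remaining[0]]          # seed with the first unassigned vector
        centre = vectors[remaining[0]]
        for _ in range(max_iter):
            close = [i for i in remaining if cos(centre, vectors[i]) >= tau1]
            if not close or close == members:
                break                     # converged, or the centre drifted away
            members = close
            centre = mean_direction([vectors[i] for i in members])
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters

def merge_stage(vectors, clusters, tau2):
    """Stage 2: repeat merging the two closest clusters until none reach tau2."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        centres = [mean_direction([vectors[i] for i in c]) for c in clusters]
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cos(centres[a], centres[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < tau2:
            break
        a, b = pair
        clusters[a].extend(clusters.pop(b))
    return clusters

# Toy data: two well separated "speakers" in two dimensions.
ivectors = [unit(v) for v in ([1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99])]
stage1 = mean_shift_stage(ivectors, tau1=0.9)
clusters = merge_stage(ivectors, stage1, tau2=0.9)
```

The second stage exists to undo over-splitting: if stage one leaves one speaker's vectors in several small clusters, the merge loop joins them as long as their centres stay above `tau2`.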
0:07:39	our second algorithm is just a standard
0:07:45	agglomerative
0:07:47	bottom-up stage of the ahc algorithm and it
0:07:54	also uses of course
0:07:57	the cosine or plda metric
0:08:00	and so
0:08:01	the threshold tau three is involved
0:08:04	as
0:08:06	the stopping criterion
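A bottom-up agglomerative procedure with a threshold as the stopping criterion can be sketched as below. The scoring function is passed in as a parameter, so a cosine score or a PLDA-style score could be plugged in. The average-linkage rule, the function name `ahc`, and the toy scalar similarity are assumptions for illustration, since the talk does not specify the linkage.

```python
def ahc(items, similarity, tau3):
    """Merge the most similar pair of clusters until the best score drops below tau3."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        # Average pairwise similarity between two clusters (an assumed choice).
        total = sum(similarity(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_score, best_pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if best_score is None or s > best_score:
                    best_score, best_pair = s, (a, b)
        if best_score < tau3:
            break                      # stopping criterion: nothing similar enough
        a, b = best_pair
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters

# Toy example: scalar "embeddings" and negative distance as the similarity.
points = [0.0, 0.1, 5.0, 5.2]
result = ahc(points, lambda x, y: -abs(x - y), tau3=-1.0)
```

Lowering `tau3` makes the procedure merge more aggressively; with a very low threshold everything ends up in a single cluster.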
0:08:10	on the next slide i
0:08:15	show you
0:08:17	i will show you
0:08:19	the scheme with some parameters
0:08:21	and their values
0:08:23	for the initial cosine clustering we used
0:08:30	such conditions
0:08:33	that our thresholds from the
0:08:35	first and second stages
0:08:37	were equal
0:08:39	and so
0:08:41	they were equal to zero point twenty nine
0:08:46	we used
0:08:48	sixteen random clustering initializations
0:08:52	and also we
0:08:54	used the rule that no less than two and no more than
0:08:58	fifteen fifty i-vectors
0:09:01	could be
0:09:03	in
0:09:04	one cluster
0:09:06	because
0:09:08	also it should be mentioned that the plda clustering
0:09:14	was done using a simplified plda model
0:09:18	so we used
0:09:20	three hundred eigenvoices
0:09:23	and used a full covariance noise model
0:09:27	for such a case
0:09:29	the threshold tau one was equal to negative zero point two
0:09:34	and tau
0:09:35	two was
0:09:38	zero point twenty two
0:09:40	nine
0:09:42	and for the clustering we used the rule that no less than three
0:09:47	and no more than
0:09:48	fifty i-vectors
0:09:52	jolt
0:09:53	would be chosen
0:09:56	for algorithm two
0:09:58	we used the value
0:10:00	of tau three
0:10:01	which was equal to zero point forty three and we also used a simplified plda model but
0:10:08	the difference is that we used only
0:10:10	the diagonal covariance noise matrix
0:10:14	and
0:10:15	there was another rule
0:10:18	no less than three and no more than
0:10:20	so directors in clusters
0:10:26	well
0:10:27	for the purposes of our experiments
0:10:31	we used another plda model
0:10:36	which
0:10:38	took into account channel factors
0:10:41	and we used only a diagonal covariance matrix
0:10:45	so in our case
0:10:48	and one was required to achieve d and two was
0:10:54	fifty five
0:10:56	model training to build the i-vector plda system
0:11:03	was done using the results of the algorithm one clustering
0:11:10	for the initialisation of the eigenvoice matrix we used pca
0:11:16	and
0:11:19	it has to be mentioned that only one ml iteration maximum likelihood iteration is
0:11:25	needed
0:11:26	the
0:11:31	next iterations led us to some degradation
0:11:37	a few words about the rbm plda system
0:11:41	and we can use it to
0:11:44	extract
0:11:47	be vectors from our
0:11:50	i-vectors
0:11:51	so strictly speaking it is not
0:11:55	an extractor but it is a non-linear projector of the raw i-vector space to the be vector
0:12:01	space which incorporates the information for the
0:12:06	speaker verification task
0:12:09	so we simply used
0:12:12	training for the
0:12:14	classification task
0:12:17	to
0:12:18	obtain the joint
0:12:19	distribution of the i-vectors and their
0:12:24	labels
0:12:28	and also we tried to use an
0:12:30	additional hidden layer
0:12:33	with
0:12:34	unsupervised training
0:12:38	and in this case the number of neurons of the first
0:12:44	layer was two thousand and the number of
0:12:47	neurons of the softmax layer was five hundred
0:12:52	just as in the previous one
0:12:54	where
0:12:56	h
0:12:58	was equal to
0:13:01	five hundred
0:13:04	so what is the be vector
0:13:08	we used the posteriors of the softmax layer to obtain our be vectors by
0:13:14	using
0:13:15	pca
0:13:19	a pca projection of the log posteriors
0:13:23	into the low dimensional space
0:13:25	so in our case
0:13:28	the number was
0:13:30	and so it was equal to the
0:13:33	number of neurons of the hidden layer and
0:13:38	was equal to five
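The step of projecting softmax log posteriors into a lower-dimensional space with PCA can be sketched with NumPy as below. This is an assumption-laden illustration: the random logits stand in for a trained network's softmax inputs, the sizes (200 samples, 10 classes, 4 components) are arbitrary, and the function names are hypothetical.

```python
import numpy as np

def fit_pca(data, n_components):
    """Return the data mean and the top principal directions (as rows)."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(data, mean, components):
    """Project centered data onto the principal directions."""
    return (data - mean) @ components.T

rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 10))            # stand-in for softmax inputs

# Numerically stable softmax followed by a logarithm.
shifted = logits - logits.max(axis=1, keepdims=True)
posteriors = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
log_posteriors = np.log(posteriors)

mean, components = fit_pca(log_posteriors, n_components=4)
projected = project(log_posteriors, mean, components)
```

Working on log posteriors rather than raw posteriors spreads out values that would otherwise be squeezed near zero, which tends to suit a linear projection better.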
0:13:41	but for the be vector space we used
0:13:47	another plda model which is different from the i-vector space one
0:13:53	we used the number of eigenvoices four hundred and in such a
0:13:59	case we used a simplified plda model
0:14:05	so
0:14:07	the lda svm subsystem as has been mentioned
0:14:11	before used
0:14:13	clustering algorithm two and a score normalization procedure
0:14:21	a few words about the quality measure function
0:14:23	it is well-known that the threshold of the minimum decision cost
0:14:31	function depends on
0:14:34	test
0:14:35	and enroll
0:14:37	segment duration
0:14:39	and in the nist i-vector challenge we deal
0:14:44	with
0:14:46	multi-session enroll models
0:14:49	and the
0:14:52	average duration of the enroll model is much larger than the duration
0:14:59	of the test models
0:15:00	so we ignored the dependence
0:15:03	on the
0:15:05	enroll durations
0:15:07	and so we
0:15:09	focused
0:15:10	on the investigation of the dependence on the test
0:15:15	duration
0:15:17	so we did it using our
0:15:20	clustering results
0:15:21	we
0:15:23	prepared
0:15:24	some protocols
0:15:26	five-session
0:15:27	enroll protocols and we obtained several points
0:15:33	and also obtained a linear dependence
0:15:37	of the threshold
0:15:40	on the
0:15:41	logarithm of the
0:15:43	test duration
0:15:44	but
0:15:48	it should be mentioned that
0:15:51	the logarithm function could be replaced by the
0:15:56	power function for example
0:15:59	the
0:16:00	square root
0:16:02	because of the similar behaviour of
0:16:06	those functions
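The linear dependence of the threshold on the logarithm of the test duration can be illustrated with a small least-squares fit. The durations and threshold values below are synthetic, and subtracting the fitted threshold from the raw score is an assumed way of applying the correction; the talk does not state how the function enters the score.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic points: optimal threshold measured at several test durations (seconds).
durations = [5.0, 10.0, 20.0, 40.0]
thresholds = [1.0 + 0.5 * math.log(d) for d in durations]

intercept, slope = fit_line([math.log(d) for d in durations], thresholds)

def compensate(score, test_duration):
    """Shift a raw score by the duration-dependent threshold (assumed usage)."""
    return score - (intercept + slope * math.log(test_duration))
```

After this shift a single global threshold at zero corresponds to the duration-dependent threshold on the raw scores. Replacing `math.log` with `math.sqrt` in both places gives the square-root variant mentioned in the talk.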
0:16:09	for the system fusion we used a simple
0:16:14	linear combination weighted sum
0:16:17	of the scores but also
0:16:21	we need some sigma normalization in the fusion
0:16:26	for the lda svm subsystem
0:16:30	it equals one but for the other subsystems it is
0:16:38	before
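A weighted-sum fusion with a per-subsystem sigma can be sketched as below. Estimating sigma as the standard deviation of development scores is an assumption, as are the function names and the toy numbers; the talk only says that sigma equals one for the subsystem whose scores are already normalized.

```python
import statistics

def score_sigma(dev_scores):
    """Population standard deviation of one subsystem's development scores."""
    return statistics.pstdev(dev_scores)

def fuse(scores, weights, sigmas):
    """Weighted sum of per-subsystem scores, each divided by its sigma."""
    return sum(w * s / sg for s, w, sg in zip(scores, weights, sigmas))

# Hypothetical development scores for one unnormalized subsystem.
sigma_b = score_sigma([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subsystem A is assumed to be already normalized, so its sigma is 1.
fused = fuse(scores=[2.0, 8.0], weights=[1.0, 0.5], sigmas=[1.0, sigma_b])
```

Dividing by sigma puts the subsystems on a comparable scale, so the weights express relative trust rather than compensating for different score ranges.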
0:16:41	so to the results
0:16:43	first
0:16:45	i will show you
0:16:47	our results
0:16:49	with incorporating test duration information so you can see that using the
0:16:55	quality measure function lets us
0:17:00	significantly reduce the
0:17:04	minimum decision cost function
0:17:07	and it gives
0:17:08	a reduction
0:17:11	of the minimum decision cost function by ten percent
0:17:16	for the lda svm subsystem but for the final fusion with equal weights
0:17:25	it's also achieve achieves good performance break seven thousand
0:17:30	relative
0:17:35	now about the fusion of the i-vector and be vector
0:17:40	space plda models
0:17:42	and the scores of these models
0:17:45	so
0:17:47	it's
0:17:49	we obtained
0:17:51	we obtained
0:17:54	a reduction of the minimum decision cost function this is due to the fact that
0:17:59	the
0:18:01	rbm or dbn presents a non-linear
0:18:06	transform of the i-vector space and it allowed us to make the
0:18:12	fusion of
0:18:13	such systems
0:18:19	now for the
0:18:21	lda and rbm plda subsystems fusion we
0:18:25	achieved good results
0:18:29	but the weights were unequal
0:18:31	different we have optimized them by submissions
0:18:36	and they had the values
0:18:39	zero point two
0:18:41	four and one
0:18:45	and now to our best result
0:18:49	it consists of
0:18:51	three subsystems
0:18:53	the svm subsystem the rbm plda subsystem and the dbn plda subsystem
0:18:59	or
0:19:00	in such a case the dbn plda
0:19:04	gave us a little bit more information
0:19:07	for the verification and we managed to achieve
0:19:13	zero point two
0:19:15	three nine
0:19:17	result
0:19:18	which is the best one
0:19:21	and to the
0:19:22	conclusion
0:19:23	we have presented our system which consists of
0:19:27	plda svm and rbm systems
0:19:32	we presented agglomerative clustering algorithms
0:19:38	also the combination of the plda and lda svm systems
0:19:44	using
0:19:45	different clustering algorithms
0:19:47	resulted in effective fusion
0:19:50	and a nonlinear transformation of
0:19:53	i-vectors into the be vector space
0:19:55	also
0:19:57	leads to successful fusion
0:20:02	with classical i-vector systems
0:20:06	so that's all
0:20:32	congratulations i just want to ask you about the use of the
0:20:38	mean shift or rather your version of mean shift
0:20:43	did you compare it for example with a standard kind of clustering
0:20:48	to see how much you gain from using this algorithm
0:20:53	yes we did it and
0:20:56	you can see that we used algorithm two and we tried to
0:21:00	use algorithm two clustering for training the plda model
0:21:06	and algorithm two is just a bottom-up stage the same as in algorithm one and
0:21:13	it led us to
0:21:16	some degradation the mean shift was
0:21:20	better
0:21:21	for this task
0:21:23	especially for plda training
0:21:26	and

STC Speaker Recognition System for the NIST i-Vector Challenge

NIST i-Vector Special Session

Sergey Novoselov, Timur Pekhovsky and Konstantin Simonchik