0:00:15 In this speech I'm going to present our work on denoising autoencoders in the i-vector space for speaker recognition. Let me start the presentation from the motivation and goals of our work. Then I would like to tell you the details of our front end; in particular I will focus on the i-vector extractor, and a few words will be said about the back end and scoring.
0:01:08 The next section is dedicated to improvements of the denoising autoencoder: the regularisation technique we tried to apply, and the deep architectures, will be considered in this section. Next, the denoising autoencoder for the system in the domain-mismatch scenario will be presented, and finally I will conclude my presentation.
0:01:44 Okay, let me start from our motivation and goals. Last year we published our work about an implementation of a denoising autoencoder for the speaker verification task, and this system based on the DAE showed some improvements compared to the commonly used baseline system, I mean PLDA on the raw i-vectors. This motivated us toward a further, detailed investigation. Our goals were to study the proposed solution in the i-vector space; to analyse different strategies of initialisation and training of the PLDA back-end parameters; to investigate and explore the different deep architectures we propose; and to investigate the DAE-based system in domain-mismatch conditions.
0:02:57 Now to the dataset and experimental setup we used in our work. As training data we used telephone-channel recordings from the NIST SRE corpora; for evaluation we used the NIST 2010 SRE protocol, condition five extended. Our results are presented in terms of equal error rate and minimum detection cost function.
0:03:33 Now to our front end and i-vector extractor. As you can see, we used MFCCs with their first and second derivatives. Our extractor was based on DNN posteriors with an eleven-frame context window. We used two thousand one hundred three tied triphone states, with twenty non-speech states. Instead of using a hard VAD decision, we tried to use a soft one using the DNN outputs. As you can see in this formula, we tried to apply cepstral mean and variance normalization in this way, in the statistics space, and all the triphone states corresponding to the speech states are used to calculate the sufficient statistics. Finally, four-hundred-dimensional i-vectors were extracted for our first experiments.
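The soft-alignment statistics just described can be sketched roughly as follows. This is a minimal NumPy illustration, not the actual extractor from the talk; the array shapes, the function name, and the use of plain Baum-Welch zeroth/first-order statistics are all assumptions for the sake of the example.

```python
import numpy as np

def sufficient_stats(features, posteriors, speech_states):
    """Accumulate zeroth- and first-order statistics using DNN posteriors
    as soft frame-to-state alignments, restricted to speech states.

    features:      (T, D) acoustic features (e.g. MFCCs + deltas)
    posteriors:    (T, S) DNN state posteriors per frame
    speech_states: indices of the speech (non-silence) states
    """
    gamma = posteriors[:, speech_states]   # (T, S_speech) soft occupation weights
    N = gamma.sum(axis=0)                  # zeroth-order stats, shape (S_speech,)
    F = gamma.T @ features                 # first-order stats, shape (S_speech, D)
    return N, F
```

Non-speech states are simply excluded from the accumulation, which plays the role of a soft voice activity decision.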
0:04:54 A few words about the DAE system and the training procedure. For the denoising transform we used generative RBM pre-training with the contrastive divergence algorithm. To train our denoising transform we used speaker- and session-dependent i-vectors and, as targets, the means of all i-vectors of the same speaker. We modelled the joint distribution of these i-vectors, and then, after training, we unfold the RBM and fine-tune it to obtain a denoising autoencoder.
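The training target just described, each session i-vector mapped to its speaker's mean i-vector, can be illustrated with a toy sketch. This is an assumption-laden simplification: a single hidden layer, random initialisation, and a plain gradient-descent loop stand in for the RBM pre-training plus unfolding and fine-tuning used in the talk; all names and hyperparameters are illustrative.

```python
import numpy as np

def train_denoising_transform(X, speaker_ids, hidden=16, lr=0.05, epochs=300, seed=0):
    """Toy one-hidden-layer denoising autoencoder in i-vector space.
    Input: a session i-vector; target: the mean i-vector of its speaker.
    (The talk initialises the weights by generative RBM pre-training;
    here they are simply small random values.)"""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Target for every i-vector is its speaker's mean i-vector.
    T = np.vstack([X[speaker_ids == s].mean(axis=0) for s in speaker_ids])
    W1 = rng.normal(0, 0.01, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.01, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # encoder
        Y = H @ W2 + b2                   # linear decoder
        G = (Y - T) / len(X)              # gradient of the mean squared error
        gW2 = H.T @ G; gb2 = G.sum(axis=0)
        GH = (G @ W2.T) * (1 - H ** 2)    # backprop through tanh
        gW1 = X.T @ GH; gb1 = GH.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda x: np.tanh(x @ W1 + b1) @ W2 + b2
```

The returned callable is the denoising transform applied to new i-vectors before the back end.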
0:06:10 On the next slide I would like to present our systems under consideration. As you can see, we used a conventional PLDA-based system as our baseline, with whitening and length normalisation as pre-processing.
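The whitening and length-normalisation pre-processing mentioned here is standard and can be sketched as follows; the function names are illustrative, not from the talk.

```python
import numpy as np

def fit_whitening(X):
    """Estimate whitening from a training set of i-vectors:
    centre the data, then decorrelate and scale with the inverse
    square root of the covariance matrix."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # C^{-1/2}
    return mu, W

def preprocess(x, mu, W):
    """Whiten, then length-normalise onto the unit sphere."""
    y = (x - mu) @ W
    return y / np.linalg.norm(y)
```

After this step the whitened training set has identity covariance and every vector has unit length, which is the usual pre-processing before PLDA.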
0:06:32 The next system is based on the RBM autoencoder, also with whitening and length normalisation as pre-processing. And finally, our next system is DAE-based: it is an autoencoder which is fine-tuned from the RBM, and the dashed line denotes this fine-tuning procedure. The arrow here denotes the parameter transmission, or substitution; I will focus on that on my next slides, because it turned out to be important in our system.
0:07:29 For scoring we used the two-covariance model. It can be viewed as a simple case of PLDA, and the score can be expressed in terms of the between-speaker and within-speaker covariance matrices.
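The two-covariance score can be illustrated as a log-likelihood ratio between the same-speaker and different-speaker Gaussian hypotheses. This is a minimal sketch, not the closed-form expression from the talk's slides; `mu`, `B`, and `W` stand for the global mean and the between- and within-speaker covariances, and all names are assumptions.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def two_cov_score(x1, x2, mu, B, W):
    """Two-covariance model LLR: speaker mean ~ N(mu, B),
    i-vector | speaker ~ N(mean, W).  Under the same-speaker
    hypothesis the pair is jointly Gaussian with cross-covariance B."""
    total = B + W
    joint_mu = np.concatenate([mu, mu])
    same = np.block([[total, B], [B, total]])
    diff = np.block([[total, np.zeros_like(B)],
                     [np.zeros_like(B), total]])
    x = np.concatenate([x1, x2])
    return gauss_logpdf(x, joint_mu, same) - gauss_logpdf(x, joint_mu, diff)
```

A positive score favours the same-speaker hypothesis, a negative one the different-speaker hypothesis.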
0:07:50 A few words about the parameter substitution. During our experiments we figured out that the best performance of the DAE-based system is obtained when we substitute the whitening and PLDA back-end parameters from the RBM system into the DAE-based system, the denoising autoencoder basis. It is an empirical finding, but it is very important for this system.
0:08:35 Let me show you our first results with these systems. As you can see, we observed a gain over the baseline system when we applied our DAE-based system with the parameter replacement, both on the commonly used NIST SRE 2010 protocol and on our second corpus, called the RusTelecom test. Some information about the RusTelecom corpus is presented on the slide.
0:09:25 For the analysis of the DAE-based system we decided to use a cluster variability criterion; it is also called the Fisher criterion. It is based on the within-speaker and between-speaker covariance matrices. If you take a look at this figure, you can see that the autoencoder-based projections have stronger cluster variability. Note that in this case we did not apply any normalization to our RBM- and DAE-based projections; by normalization I mean that no whitening was applied to the RBM and DAE outputs.
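The cluster variability criterion mentioned here is not spelled out in the talk; a common Fisher-style variant, assumed for illustration, is the trace of the within-speaker scatter inverse times the between-speaker scatter.

```python
import numpy as np

def fisher_criterion(X, speaker_ids):
    """Cluster-variability (Fisher-style) criterion for a projection:
    ratio of between-speaker to within-speaker scatter, tr(Sw^{-1} Sb).
    Larger values mean more tightly clustered speakers."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for s in np.unique(speaker_ids):
        Xs = X[speaker_ids == s]
        ms = Xs.mean(axis=0)
        Sw += (Xs - ms).T @ (Xs - ms)                 # within-speaker scatter
        Sb += len(Xs) * np.outer(ms - mu, ms - mu)    # between-speaker scatter
    return np.trace(np.linalg.solve(Sw, Sb))
```

Comparing this number across the baseline, RBM, and DAE projections gives a scalar summary of how well the speakers separate.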
0:10:31 Additionally, we decided to use cosine scoring as an independent estimation, to assess the properties of our projections. You can see from these results that, without whitening, the DAE-based system achieves the best performance among all the systems. By the way, we also tried to use a simple autoencoder in speaker recognition, but it turned out to be not as good as the DAE-based system.
0:11:20 Now to the whitening and length normalization. When we applied these parameters to the RBM- and DAE-based projections, we obtained these results, and you can see that the lines are very similar and close to each other. In this situation we applied the DAE-based whitening for the DAE-based system, and it turned out to be not so good for the system. So, on the next slide, we applied the parameter substitution: we decided to use the whitening parameters from our RBM system, and in this situation we achieved good performance of the system.
0:12:24 As you can see, it is above the baseline. From the figure you can also see that the discriminative properties are in this case stronger for the DAE-based projections. To summarize it all, I prepared a table with all the common results, and among the systems the DAE-based system with the parameter substitution, I mean the whitening, achieves the best performance.
0:13:15 Now to the PLDA-based scoring. In this table you can see the results we obtained over different experiments, in different configurations of our system. And again, in the last line of the table, you can see that a good improvement can be achieved by using the parameter substitution from the RBM system. But the question of why this happens is still open for us; we did not manage to answer it.
0:13:57 Now I will discuss some improvements for the DAE-based system. First, we decided to apply dropout regularisation, both to the RBM training and to the fine-tuning. As you can see, dropout helps to improve the system when we used it in the RBM training stage, shown in the orange curve, but unfortunately applying it during the stage of discriminative fine-tuning was not helpful for us.
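For reference, the dropout regularisation mentioned above amounts to randomly zeroing hidden units during training. A minimal sketch of the common inverted-dropout formulation (the exact variant used in the talk is not specified, so this is an assumption):

```python
import numpy as np

def dropout(h, rate, rng, train=True):
    """Inverted dropout: randomly zero hidden activations during
    training and rescale the survivors, so the expected activation
    is unchanged and nothing needs to be done at test time."""
    if not train or rate == 0.0:
        return h
    mask = rng.random(h.shape) >= rate   # keep each unit with prob 1 - rate
    return h * mask / (1.0 - rate)
```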
0:14:42 For the deep architectures we tried to use two schemes. You can see the first one on the slide; it is called stacking autoencoders. After training the first autoencoder, its output can be used as the input for the next autoencoder, and then we try to fine-tune them all together, jointly. But this did not manage to improve the system. The second stacking scheme did manage to obtain good results, but in this scenario we again need to substitute the whitening parameters from the RBM system, the generative pre-trained system, and we get a little bit of improvement from that.
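The greedy stacking scheme just described can be sketched as follows. The `train_one` callable is a placeholder for any single-autoencoder trainer; the joint fine-tuning of the whole stack mentioned in the talk is deliberately omitted here, so this is only the stage-wise part.

```python
def stack_denoisers(X, speaker_ids, n_layers, train_one):
    """Greedy stacking of denoising autoencoders: train one layer,
    push the training data through it, then train the next layer on
    the outputs.  `train_one(X, speaker_ids)` must return a callable
    transform; joint fine-tuning of the stack is not shown."""
    transforms = []
    for _ in range(n_layers):
        f = train_one(X, speaker_ids)
        transforms.append(f)
        X = f(X)          # next layer sees the denoised representation

    def apply(x):
        for f in transforms:
            x = f(x)
        return x
    return apply
```

As noted in the talk, whitening and length normalisation may need to be re-injected between the layers for this to work well.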
0:15:52 The next question I would like to focus on is the domain mismatch. We investigated our DAE-based system in domain-mismatch conditions. We used the Domain Adaptation Challenge dataset and setup. As a back end we used cosine scoring, the two-covariance model, referred to as PLDA, and simplified PLDA with a four-hundred-dimensional speaker subspace, referred to as SPLDA. It should be noted that in our experiments we completely ignored the labels of the in-domain data: we used it only to estimate the whitening parameters of the systems.
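The key point here is that whitening needs only the first and second moments of the data, so unlabeled in-domain i-vectors suffice, while all labeled training stays on the out-of-domain set. A short illustrative sketch (function name assumed):

```python
import numpy as np

def estimate_whitening_unsupervised(X_indomain):
    """Estimate whitening from unlabeled target-domain i-vectors.
    Only the mean and covariance are needed, so no speaker labels
    are required; labeled training (DAE, PLDA) can remain on the
    out-of-domain data."""
    mu = X_indomain.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X_indomain, rowvar=False))
    W = vecs * (vals ** -0.5) @ vecs.T   # inverse square root of covariance
    return mu, W
```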
0:16:49 Now to the results. For the baseline system, when we use in-domain data for training, we obtained these results for cosine and PLDA scoring, and you can see the gain from applying the DAE-based system. But when we used out-of-domain data to train our systems, you can see the degradation, both for cosine and for PLDA scoring. And finally, in this table, you can see the improvement when we used the whitening parameters estimated from the in-domain data. The same results, but for the simplified PLDA scoring, are just a little bit better than for PLDA.
0:18:04 Now to conclude. We presented a study of the denoising autoencoder in the i-vector space. We figured out that the best performance of the DAE-based system is achieved by employing the back-end parameters directly from the RBM output. The question is still open why the RBM transform provides better back-end parameters for this setup. Dropout helps to improve the results when applied to the RBM training stage, but it did not help when we implemented it in the fine-tuning. A deep architecture in the form of stacked denoising autoencoders provides a small further improvement. All our findings regarding the speaker verification system in matched conditions hold true in the mismatched-condition case as well. And the last one: estimating the whitening parameters on the target domain, along with the DAE trained on the out-of-domain set, allows us to avoid the significant performance gap caused by domain mismatch. That's it.
0:19:43 Time for questions.
0:19:51 Michael?
0:19:57 In the slide where you show the stacked denoising autoencoders, did you try more than two layers?
0:20:09 Yes, but in this scheme we need to inject whitening and length normalization between the layers; this slide shows five, I mean five autoencoders, with the whitening and length normalization injection.
0:20:32 And when you use, let's say, two denoising autoencoders, you improve the results; did you try a third one, more than one on top of the other?
0:20:56 I see. Well, we decided not to go deeper in this, because we found out that the result is very similar to our first one, based on only one autoencoder.
0:21:32 Although we have probably already discussed this issue: about your question of why copying the PLDA and length normalization variables from the RBM, rather than from the final stage, gives better performance. My initial guess would be overfitting during the backpropagation, since you are using the same set; maybe, let's say, the residual matrix used in PLDA becomes artificially small in terms of its trace. So you could check the traces of the two matrices, the one you estimate from the RBM and the one you estimate afterwards, to see whether the covariance matrices being too small might be a result of overfitting.
0:22:26 Well, this is a natural assumption, and we tried to check it after submitting our paper; our paper was already submitted, but we figured out that it was not the reason, because we tried to split our datasets into two parts and to use separate data to train the PLDA-based back-end parameters. The results show that it is not overfitting occurring while we trained the system on the same data. We also tried to explain the situation using a Gaussianity assumption: I mean, after the DAE projection we can obtain a more or less Gaussian distribution, and that could be the reason in this case, it seems to us, but that is also not the answer.
0:23:54 We have time for another question.
0:24:03 Just a question on the first step of your system, which I think might be important: you said that you are using twenty non-speech states. I am quite amazed by this huge number; could you say something about that?
0:24:20 You mean the huge number of non-speech states? We used a standard Kaldi recipe from our speech recognition department, and we chose to use this configuration of the system; we trained our DNN in this way. It provides good voice activity detection for our system. And, as I mentioned, we also used these capabilities to apply a soft VAD decision in the statistics space: I mean, we have done the cepstral mean and variance normalization in the statistics space by excluding the non-speech; well, the non-speech is dropped from our consideration.
0:25:34 Let's thank the speaker again. Thank you.