Speech Transcript - i-Vector Modeling with Deep Belief Networks for Multi-Session Speaker Recognition

0:00:15	my name is omid ghahabi from the talp research centre
0:00:20	and the topic of my talk is i-vector modeling with
0:00:25	deep belief networks
0:00:27	for multi session speaker recognition
0:00:32	as you know acoustic modeling using deep belief networks has been shown to be
0:00:37	effective in the speech recognition area and it is getting popular nowadays
0:00:43	but very few attempts using only rbms restricted boltzmann machines or
0:00:49	generative dbns have been carried out in the speaker recognition area
0:00:54	in our previous work which was published in
0:01:00	icassp two thousand fourteen
0:01:01	we used both generative and discriminative dbns
0:01:07	in that work we used only single session target i-vectors as the inputs
0:01:12	to the networks
0:01:15	in this paper we extend our previous work from a single session to a multi
0:01:20	session task
0:01:22	we have used the nist
0:01:24	i-vector challenge database in these experiments
0:01:28	and also we have modified our proposed impostor selection method
0:01:34	to be more accurate and more robust against its parameters
0:01:41	first i will give a short background about deep belief networks and then i
0:01:46	will go
0:01:48	i will describe our dbn based system and then i will go
0:01:54	into more detail on our proposed impostor selection method
0:01:58	and then i will show the experimental results and at the end the conclusions
0:02:07	deep belief networks are originally
0:02:11	probabilistic generative models
0:02:14	in which every two adjacent layers are treated as a restricted boltzmann machine
0:02:21	and the outputs of each rbm will be the inputs to the
0:02:29	rbm above and it is trained layer by layer
0:02:33	however by adding a top
0:02:37	label layer this generative dbn can be converted to a discriminative one by
0:02:42	doing the standard back propagation
0:02:48	in this slide i have some information about how the rbm
0:02:54	is trained and
0:02:56	how it is a good fit for pre-training neural
0:03:04	networks but i think i can skip this
0:03:09	and it is better to focus on our method
0:03:17	let's recall what the problem is
0:03:19	the problem is to model each target speaker with the available i-vectors what we
0:03:26	have here are five i-vectors per target speaker and a
0:03:32	large amount of background i-vectors in the development set
0:03:37	our proposal is to use deep belief networks for two main reasons
0:03:42	first is to
0:03:46	first is to take advantage of unsupervised learning using the
0:03:52	unlabeled background data in the development set
0:03:55	and second to take advantage of supervised learning to train each target model
0:04:00	discriminatively
0:04:04	this is the whole block diagram of our proposed method let's
0:04:12	divide it into three main steps
0:04:15	the first step is balanced training
0:04:19	what is the problem of imbalanced training here in this case we have a large
0:04:25	amount of background i-vectors as negative samples and a small amount of
0:04:31	target data as positive samples
0:04:33	as we are going to model each target speaker discriminatively
0:04:41	training the network with such imbalanced data will lead to
0:04:48	overfitting
0:04:51	so the solution we have proposed here is to decrease the number of background i-vectors as
0:04:57	much as possible in an effective way
0:05:02	we do this in the following steps first
0:05:05	we select only those background i-vectors that are more informative
0:05:14	and then we cluster the selected impostors by the k-means algorithm using the
0:05:21	cosine distance criterion
0:05:24	and then we use the
0:05:28	impostor cluster centroids as negative samples
0:05:33	and then finally we distribute the positive and negative samples equally
0:05:39	among the minibatches
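The balanced training step just described (cluster the selected impostors with cosine-distance k-means, use the centroids as negatives, spread positives and negatives evenly over the minibatches) can be sketched roughly as below. This is only an illustration under assumptions: the function names `cosine_kmeans` and `balanced_minibatches`, the number of clusters, the iteration count and the way samples are split are all hypothetical and are not given in the talk.

```python
import numpy as np

def cosine_kmeans(x, k, iters=20, seed=0):
    """Cluster the rows of x into k groups using cosine similarity."""
    rng = np.random.default_rng(seed)
    # Length-normalise so that a dot product equals the cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to the most similar centroid.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            mean = members.sum(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centroids[j] = mean / norm
    return centroids

def balanced_minibatches(targets, centroids, num_batches):
    """Yield (inputs, labels) so that every minibatch holds both classes."""
    pos_parts = np.array_split(targets, num_batches)
    neg_parts = np.array_split(centroids, num_batches)
    for pos, neg in zip(pos_parts, neg_parts):
        inputs = np.vstack([pos, neg])
        labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        yield inputs, labels
```

With five target i-vectors and five minibatches, each minibatch in this sketch receives one positive sample and an equal share of the impostor centroids.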
0:05:47	the second step is the adaptation process that we proposed in our previous work
0:05:54	for adaptation using all the background i-vectors we have we train a deep belief
0:06:03	network
0:06:04	unsupervisedly that is without any labels
0:06:07	and we call the trained model the universal deep belief network
0:06:12	and then each target speaker network will be adapted from this
0:06:19	universal dbn
0:06:21	but how does the adaptation work
0:06:25	in adaptation
0:06:26	we initialize the networks not randomly but with the
0:06:33	universal dbn parameters
0:06:35	and then do the unsupervised learning
0:06:40	on the balanced data
0:06:44	for only a few iterations
0:06:50	in our previous work we have shown that
0:06:53	pre-training in this case
0:06:56	works better than random initialization
0:06:58	and the proposed adaptation works better than pre-training
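The adaptation idea described above (start from the universal dbn parameters instead of random values, then run unsupervised learning on the balanced data for only a few iterations) could look roughly like this for a single rbm layer. It is a minimal sketch under assumptions not stated in the talk: a Gaussian-Bernoulli rbm with unit-variance visible units, one-step contrastive divergence (CD-1), and hypothetical names `cd1_epoch` and `adapt_from_universal` with illustrative learning rate and epoch count.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_epoch(x, w, b_vis, b_hid, lr, rng):
    """One CD-1 update of an rbm with Gaussian visible and binary hidden units."""
    h_prob = sigmoid(x @ w + b_hid)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(x.dtype)
    x_recon = h_sample @ w.T + b_vis           # mean of the Gaussian visibles
    h_recon = sigmoid(x_recon @ w + b_hid)
    m = len(x)
    w = w + lr * (x.T @ h_prob - x_recon.T @ h_recon) / m
    b_vis = b_vis + lr * (x - x_recon).mean(axis=0)
    b_hid = b_hid + lr * (h_prob - h_recon).mean(axis=0)
    return w, b_vis, b_hid

def adapt_from_universal(universal, x_balanced, epochs=5, lr=0.01, seed=0):
    """Adapt a target network starting from the universal parameters."""
    rng = np.random.default_rng(seed)
    # Copy so that the universal model stays shared across all targets.
    w, b_vis, b_hid = (p.copy() for p in universal)
    for _ in range(epochs):                    # only a few iterations
        w, b_vis, b_hid = cd1_epoch(x_balanced, w, b_vis, b_hid, lr, rng)
    return w, b_vis, b_hid
```

Copying the universal parameters keeps one shared starting point, so each target speaker network only needs a short adaptation run of its own.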
0:07:05	the last step is fine tuning which is actually
0:07:10	backpropagation in
0:07:13	the neural network using the label layer
0:07:17	but we have changed something here in comparison to the standard method we first
0:07:25	do only one layer error back
0:07:29	propagation
0:07:30	for a few iterations before the full back propagation is carried out
0:07:35	the experimental results in our previous work
0:07:41	have shown that this works better because for the top
0:07:47	label layer
0:07:50	this is something like pre-training the top layer as well and it
0:07:54	works better than doing the whole backpropagation
0:07:59	without doing this
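The two-stage fine tuning just described (update only the top label layer for a few iterations, then run full backpropagation) can be sketched as follows. This is an illustration under assumptions: a single hidden layer, a single sigmoid output unit with cross-entropy loss, plain gradient descent, and hypothetical names `backprop_step` and `fine_tune` with illustrative iteration counts. The label layer used in the actual system may differ.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, params, lr, top_only):
    """One gradient step; with top_only=True the hidden layer is frozen."""
    w1, b1, w2, b2 = params
    h = sigmoid(x @ w1 + b1)                   # hidden activations, shape (m, H)
    p = sigmoid(h @ w2 + b2)                   # target posterior, shape (m,)
    d_out = (p - y) / len(x)                   # cross-entropy gradient at the output
    new_w2 = w2 - lr * (h.T @ d_out)
    new_b2 = b2 - lr * d_out.sum()
    if top_only:
        return w1, b1, new_w2, new_b2
    # Propagate the error one layer further down.
    d_hid = np.outer(d_out, w2) * h * (1.0 - h)
    new_w1 = w1 - lr * (x.T @ d_hid)
    new_b1 = b1 - lr * d_hid.sum(axis=0)
    return new_w1, new_b1, new_w2, new_b2

def fine_tune(x, y, params, lr=0.1, top_iters=5, full_iters=20):
    """Top-layer-only updates first, then full backpropagation."""
    for _ in range(top_iters):
        params = backprop_step(x, y, params, lr, top_only=True)
    for _ in range(full_iters):
        params = backprop_step(x, y, params, lr, top_only=False)
    return params
```

In this sketch `w1` and `b1` would come from the adapted network, while `w2` and `b2` are the newly added label layer, which is why it gets a short warm-up of its own.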
0:08:03	on the other hand we can divide our block diagram
0:08:09	into two main phases the first phase is
0:08:15	target independent and the second is target dependent
0:08:19	in the target independent phase using all the background i-vectors we have we train a universal deep
0:08:26	belief network
0:08:27	and we compute the impostor centroids
0:08:32	this process is carried out only once for all the target speakers we
0:08:38	have
0:08:40	in the second phase
0:08:41	and using
0:08:45	using the universal dbn and the impostor centroids
0:08:49	and the available target i-vectors we train our networks discriminatively
0:09:00	let's go into more detail on the proposed impostor selection method
0:09:04	this method
0:09:07	is similar to the
0:09:09	support vector based
0:09:13	approach proposed by mitchell mclaren but we
0:09:19	have used here the cosine distance criterion and we have changed some other things
0:09:28	it is composed of four main steps
0:09:31	assume that we have all the background i-vectors on one hand and
0:09:36	on the other hand we have the client i-vectors
0:09:39	each client i-vector
0:09:42	which in this case is the average of the five i-vectors of each client
0:09:47	is compared with all the background i-vectors we have
0:09:51	using the cosine distance criterion
0:09:53	and the top n closest background i-vectors to each client
0:09:59	will be kept at this
0:10:04	step
0:10:05	and we do the same for all the client i-vectors
0:10:10	until all the client i-vectors that we have are covered
0:10:15	and then we compute the impostor frequencies at this stage and we normalize them
0:10:22	by n the number of top i-vectors kept for each client and
0:10:29	the total number of client i-vectors
0:10:33	and with this normalisation
0:10:37	the impostor frequencies are more robust against the threshold that we will define
0:10:45	on these frequencies
0:10:48	then we set a threshold on these normalized impostor frequencies and those impostors that have higher
0:10:55	frequencies than this threshold will be selected as the most informative impostors
0:11:05	actually we have
0:11:10	we have the impostor frequencies for each of the background i-vectors we will have one
0:11:15	frequency after these iterations and those impostors that have higher impostor frequencies
0:11:23	than the defined threshold will be selected
0:11:27	this threshold and the n parameter will be determined experimentally
0:11:33	in the experimental section
0:11:41	if we sort the impostor frequencies of the
0:11:46	impostors we will see that many impostors have the same
0:11:53	impostor frequency
0:11:55	that's why we have
0:11:58	defined a threshold on the impostor frequencies rather than just selecting the top
0:12:06	fixed number of impostors
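The impostor selection steps described above (cosine scoring of each client against all background i-vectors, keeping the top n, counting how often each background i-vector is kept, normalizing, thresholding) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `select_impostors` is hypothetical, and since the talk only says the frequencies are normalized using n and the number of clients, dividing by their product is an assumption.

```python
import numpy as np

def select_impostors(clients, background, n, threshold):
    """Return (selected background indices, normalised impostor frequencies)."""
    # Length-normalise so that dot products are cosine similarities.
    c = clients / np.linalg.norm(clients, axis=1, keepdims=True)
    b = background / np.linalg.norm(background, axis=1, keepdims=True)
    scores = c @ b.T                           # shape (num_clients, num_background)

    # For each client keep the n closest background i-vectors.
    top_n = np.argsort(-scores, axis=1)[:, :n]

    # Impostor frequency: how many clients kept each background i-vector.
    freq = np.bincount(top_n.ravel(), minlength=background.shape[0])

    # Assumed normalisation: divide by n times the number of clients.
    norm_freq = freq / (n * clients.shape[0])

    # Thresholding, rather than a fixed top-k cut, keeps every tied impostor.
    selected = np.flatnonzero(norm_freq > threshold)
    return selected, norm_freq
```

In the setup described in the talk, each row of `clients` would be the average of the five i-vectors of one target speaker.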
0:12:12	in the experimental section the dataset that we have used is the
0:12:18	nist two thousand fourteen i-vector challenge the i-vector size as you know is
0:12:23	six hundred
0:12:25	the post processing that we have carried out on the i-vectors is
0:12:29	mean normalization plus whitening
0:12:33	one hidden layer is used in these experiments and the hidden
0:12:37	layer size is four hundred
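The post processing mentioned above (mean normalization followed by whitening) is commonly done as in the sketch below. The talk does not say how the whitening transform was estimated, so estimating it from the covariance of the development i-vectors via an eigendecomposition is an assumption, and the names `fit_whitening` and `apply_whitening` are hypothetical.

```python
import numpy as np

def fit_whitening(dev):
    """Estimate the mean and a whitening matrix from development i-vectors."""
    mean = dev.mean(axis=0)
    cov = np.cov(dev, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)       # cov is symmetric
    eigval = np.maximum(eigval, 1e-10)         # guard against tiny eigenvalues
    # Scale each eigenvector column by 1 / sqrt(eigenvalue).
    w = eigvec / np.sqrt(eigval)
    return mean, w

def apply_whitening(x, mean, w):
    """Subtract the mean and decorrelate to unit variance."""
    return (x - mean) @ w
```

The same `mean` and `w` would then be applied to the target and test i-vectors before they are fed to the networks.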
0:12:42	for tuning the
0:12:43	two parameters of the impostor selection method that is
0:12:49	the threshold and the n parameter if we plot the minimum dcf
0:12:57	versus this threshold for different n
0:13:01	we will see that if n is
0:13:03	too small
0:13:05	the results are not good and if n is too high
0:13:11	the performance of the system won't be stable while changing the
0:13:17	threshold
0:13:18	and the best choice according to our experiments is choosing
0:13:23	n equal to one hundred and it shows that
0:13:28	by setting the threshold to this value we will have a minimum
0:13:34	value for the minimum dcf with this threshold and setting n equal to one hundred
0:13:44	in the experimental results in this challenge we had one
0:13:50	baseline system that everyone knows what's the baseline
0:13:55	our proposed dbn based system with the target independent impostors that is
0:14:01	global impostors which are the same for all the
0:14:05	target speakers
0:14:06	if we
0:14:08	do this experiment we will have these results
0:14:11	there is a big difference between
0:14:15	the baseline system and our system
0:14:18	and if we add the target dependent
0:14:22	impostors
0:14:22	to the target independent impostors which in this case are one hundred that is the
0:14:29	n parameter and pool the target dependent and target independent impostors then we will
0:14:34	have
0:14:36	better performance that is
0:14:39	this
0:14:40	one
0:14:41	but in this case if we add the target dependent impostors the complexity of the
0:14:47	system will be higher than the first one because
0:14:54	for each target speaker we need to do the clustering separately
0:15:01	but in this case we just compute the impostor centroids once for all
0:15:06	the speakers
0:15:11	if we do z norm score normalisation on our
0:15:17	dbn based system we see that without z normalization the results are this
0:15:24	but if we add z normalization using the whole impostor database we have
0:15:29	in the development set we will have worse results
0:15:33	if we select only the top one thousand closest impostor i-vectors we will
0:15:40	have it a bit better but it is still worse than without using
0:15:45	z normalization
0:15:47	but
0:15:50	but if
0:15:52	we use the same impostor selection method for z normalization
0:16:01	and set the threshold and n parameters again for this z normalization
0:16:07	we will see that we have an improvement here
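The score normalization discussed above can be illustrated with the standard z-norm formula, where a raw score is shifted and scaled by the statistics of the same model's scores against an impostor cohort. Reading the transcript's normalization as z-norm follows the questioner's wording later in the session, and the function name `z_norm` is hypothetical. The point made in the talk is that the cohort matters: using the selected impostors instead of the whole development set is what changes the result.

```python
import numpy as np

def z_norm(raw_score, cohort_scores):
    """Normalise a trial score with the model's scores against impostors."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    mu = cohort_scores.mean()
    sigma = cohort_scores.std()
    return (raw_score - mu) / sigma
```

Here `cohort_scores` would hold the outputs of one target network for the impostor i-vectors chosen by the selection method.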
0:16:15	and
0:16:20	and in comparison to the baseline system we will see that we will
0:16:26	have
0:16:27	twenty three percent improvement
0:16:31	actually this twenty three percent improvement is in comparison with these results
0:16:37	compared with the other results the improvement is more than this
0:16:44	but
0:16:46	in these experiments for the impostor selection method we have used the client i-vectors
0:16:53	our new experimental results have shown that if we don't use the
0:16:58	client i-vectors
0:17:00	the client i-vectors
0:17:03	and just select the i-vectors that play the role of client i-vectors from only the development
0:17:10	set we will see that
0:17:14	we will have almost the same results as these they are very similar so actually
0:17:20	a
0:17:23	for our proposed system it doesn't matter whether we use the client i-vectors in
0:17:28	our impostor selection method or just randomly choose the i-vectors
0:17:36	from only the background i-vectors
0:17:44	and the main conclusions
0:17:46	in this paper we have proposed an impostor selection method
0:17:51	which we have shown helps our system to
0:18:00	the
0:18:01	have a good performance in the multi session task
0:18:07	and that having more i-vectors available for each target
0:18:13	speaker helps the dbn system to capture more speaker and session variabilities in comparison to
0:18:21	the single session task
0:18:25	and also the final discriminative dbn based approach showed a considerable performance
0:18:33	in comparison to the conventional baseline system provided by nist in this
0:18:41	challenge
0:18:42	thank you
0:18:51	we have time for questions
0:18:58	thanks for the talk i like the extension of the background dataset selection that you have done
0:19:03	one question that comes to mind is when you are doing the selection you are looking at
0:19:08	all the clients that are going to be enrolled in the system sorry i
0:19:12	am not close enough again so when you are doing this dataset selection you are
0:19:16	looking at what is statistically important for the clients that are going to be
0:19:21	enrolled in the system so your
0:19:23	system itself has information about who you are going to test on
0:19:27	why wouldn't you just do closed set speaker id
0:19:33	so reading it
0:19:34	when you are choosing your impostors before your dbn training or z norm
0:19:41	that selection process itself is aware of all your target speakers
0:19:47	yes that's correct
0:19:48	so why not take it further and just do closed set speaker id for
0:19:52	the i-vector challenge
0:19:53	yes that's why in the experimental results section i told you
0:20:00	if we don't use the target i-vectors and just randomly select the same
0:20:08	number of i-vectors only from the development set
0:20:14	and we use these in an iterative process for instance we take one thousand
0:20:21	three hundred i-vectors randomly from the development set and do the same process
0:20:27	computing the impostor frequencies
0:20:30	and then again choose another set of random i-vectors and do the same computing
0:20:37	the impostor frequencies and then taking the average over all the impostor frequencies and doing the
0:20:43	same setting the threshold and the parameters
0:20:46	we had very similar results to these results in which we have used
0:20:51	the target i-vectors so that's a very
0:20:56	client specific selection then you are not aware of the other clients in that sense
0:21:01	very nice
0:21:03	but yes technically looking at the other clients would be against the rules of the
0:21:06	i-vector challenge but he has a solution that didn't need that the other thing is
0:21:10	closed set scoring wouldn't actually work here because they are all different
0:21:15	speakers

Neural Nets for Speaker and Language Modeling

Omid Ghahabi and Javier Hernando