0:00:36 | OK so |
---|
0:00:38 | my name is Sibel Yaman |
---|
0:00:40 | today I will talk about my work |
---|
0:00:42 | with my colleagues |
---|
0:00:47 | this is a picture to illustrate speaker recognition |
---|
0:01:08 | this is the outline; I'll start with the introduction |
---|
0:01:13 | I'll give you a big picture of speaker recognition |
---|
0:01:20 | I'll also present an overview of speaker recognition |
---|
0:01:22 | methodology, and I'll talk about the details |
---|
0:01:26 | the next picture shows an example |
---|
0:01:32 | there are two key ideas: the first one is the development of a conversation-level training criterion |
---|
0:01:39 | and the second one is the incorporation of a separate system in discriminative training |
---|
0:01:46 | I will report the experimental results and conclude with a summary |
---|
0:01:54 | OK, we listened to the keynote speaker this morning; in the speech recognition literature |
---|
0:02:01 | deep networks have been reported as the new platform over HMMs |
---|
0:02:10 | however, in speaker recognition we have not got there yet, but |
---|
0:02:14 | we are improving our results every day |
---|
0:02:23 | here is the bottleneck |
---|
0:02:26 | architecture that we used: we have an input layer |
---|
0:02:32 | fed with normalized features from a short window of speech |
---|
0:02:38 | these are raw features, which means there are no deltas and delta-deltas |
---|
0:02:43 | just fourteen-dimensional MFCC features |
---|
0:02:47 | they are fully connected to a layer of |
---|
0:02:51 | about a thousand hidden nodes |
---|
0:02:54 | which are in turn connected to a narrow layer of hidden nodes, which acts as the bottleneck layer |
---|
0:03:02 | they are connected to another hidden layer, and finally they are connected to the output layer |
---|
0:03:16 | input feature statistics |
---|
0:03:18 | are fed into the network |
---|
0:03:20 | and they are passed to the bottleneck layers |
---|
0:03:23 | in the back-propagation mode, the speaker |
---|
0:03:27 | information |
---|
0:03:30 | propagates down to the lower layers |
---|
0:03:32 | and that is how we train the network |
---|
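To make the architecture concrete, here is a minimal numpy sketch of a forward pass through such a bottleneck network. The layer sizes (294-dimensional input, 1000 hidden nodes, a 42-node bottleneck, 500 further nodes, 173 speaker outputs) are taken from the numbers quoted later in the talk; the sigmoid nonlinearity, the linear output layer, and the random initialization are illustrative assumptions rather than the speaker's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes quoted later in the talk: 294 -> 1000 -> 42 (bottleneck) -> 500 -> 173 speakers.
sizes = [294, 1000, 42, 500, 173]
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(frames):
    """Forward pass; returns speaker-layer pre-activations and bottleneck activations."""
    h = frames
    bottleneck = None
    for i, (W, b) in enumerate(zip(weights, biases)):
        pre = h @ W + b
        h = pre if i == len(weights) - 1 else sigmoid(pre)  # no squashing at the output
        if i == 1:                      # the second hidden layer is the 42-node bottleneck
            bottleneck = h
    return pre, bottleneck

# One utterance worth of stacked input frames (hypothetical).
frames = rng.normal(size=(300, 294))
speaker_scores, bn_features = forward(frames)
print(speaker_scores.shape, bn_features.shape)   # (300, 173) (300, 42)
```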
0:03:39 | this kind of network strategy |
---|
0:03:43 | was first studied |
---|
0:03:46 | in 1998 |
---|
0:03:48 | by Konig and Heck |
---|
0:03:54 | let me tell you what happened before |
---|
0:03:56 | if you just use this bottleneck directly to do that |
---|
0:04:03 | it doesn't work |
---|
0:04:08 | here we use a standard setup |
---|
0:04:14 | we compare |
---|
0:04:17 | the traditionally trained |
---|
0:04:22 | bottleneck features |
---|
0:04:25 | shown in red |
---|
0:04:27 | while the blue color shows |
---|
0:04:30 | the MFCC baseline systems |
---|
0:04:33 | let's just keep this number in our minds |
---|
0:04:39 | and when we compare |
---|
0:04:42 | performance in terms of equal error rate |
---|
0:04:46 | we see |
---|
0:04:48 | a 40% decrease in performance |
---|
0:04:52 | on the same-microphone test set of NIST 2010 |
---|
0:04:57 | and about a 45% |
---|
0:05:00 | decrease on the different-microphone task |
---|
0:05:08 | so we need techniques that will improve this performance; that is where we are |
---|
0:05:20 | here is an overview of the methodology |
---|
0:05:25 | we start with MFCC features |
---|
0:05:27 | one way is that we could apply some linear transformation |
---|
0:05:31 | and obtain deltas and delta-deltas |
---|
0:05:34 | as higher-order features |
---|
0:05:37 | but what we want to do is |
---|
0:05:41 | we want to perform a nonlinear transformation on these |
---|
0:05:45 | features so that when we |
---|
0:05:51 | choose the right sort of data |
---|
0:05:54 | we can transform the |
---|
0:05:56 | MFCC features into more robust features |
---|
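As a contrast between the two options, here is a small sketch: deltas are a fixed linear combination of neighbouring MFCC frames, whereas the bottleneck network applies a learned nonlinear mapping. The delta window of 2 and the 14-dimensional frames are assumptions for illustration, not the speaker's exact settings.

```python
import numpy as np

def deltas(mfcc, window=2):
    """Delta features: a fixed linear combination of neighbouring frames."""
    padded = np.pad(mfcc, ((window, window), (0, 0)), mode="edge")
    num = sum(k * (padded[window + k:window + k + len(mfcc)] -
                   padded[window - k:window - k + len(mfcc)])
              for k in range(1, window + 1))
    return num / (2 * sum(k * k for k in range(1, window + 1)))

mfcc = np.random.randn(300, 14)                   # 14-dim MFCC frames, as in the talk
linear_feats = np.hstack([mfcc, deltas(mfcc)])    # linear transform: append deltas

# The nonlinear alternative discussed here: pass context-stacked MFCC frames through
# a trained bottleneck network and keep its 42-dim bottleneck activations instead.
```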
0:06:01 | in other words |
---|
0:06:02 | we identified two |
---|
0:06:05 | ways of using a deep belief network to do that |
---|
0:06:09 | the first one is |
---|
0:06:12 | we can change the training algorithm in such a way that |
---|
0:06:16 | it fits the speaker recognition application better |
---|
0:06:21 | the second one is that most speaker |
---|
0:06:24 | recognition systems deal with system combination |
---|
0:06:27 | so |
---|
0:06:29 | we explored whether there is a way to incorporate |
---|
0:06:32 | a separate system in the training |
---|
0:06:42 | next I will talk about these |
---|
0:06:45 | ideas in some detail |
---|
0:06:49 | first of all, there is the problem with frame-level training |
---|
0:06:54 | learning the speaker information is constrained to |
---|
0:07:00 | the context of the current frame |
---|
0:07:02 | what if you want to increase the context |
---|
0:07:11 | our conversation-level training algorithm |
---|
0:07:17 | is the solution to these problems |
---|
0:07:21 | first of all, if the training is at the conversation level |
---|
0:07:28 | it will be making one single decision per recording |
---|
0:07:33 | which should be the case |
---|
0:07:35 | and another advantage could be |
---|
0:07:43 | there are several ways of doing this to explore |
---|
0:07:47 | I will explain that later |
---|
0:07:52 | so the first key idea is |
---|
0:07:56 | using a speaker recognition training criterion |
---|
0:07:59 | the speaker recognition training criterion is |
---|
0:08:02 | a log-likelihood-ratio-based training criterion |
---|
0:08:07 | developed by Brummer |
---|
0:08:09 | it is a weighted sum of costs |
---|
0:08:13 | over the target trials plus the non-target trials |
---|
0:08:20 | this uses this kind of objective to |
---|
0:08:27 | try to find the target |
---|
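For reference, the log-likelihood-ratio cost introduced by Brummer is commonly written as below, where s_t is the log-likelihood-ratio score of trial t and N_tar, N_non count the target and non-target trials; the transcript does not show the speaker's exact slide, so take this as the standard form rather than their precise objective.

```latex
C_{\mathrm{llr}} \;=\; \frac{1}{2 N_{\mathrm{tar}}} \sum_{t \in \mathrm{tar}} \log_2\!\left(1 + e^{-s_t}\right)
\;+\; \frac{1}{2 N_{\mathrm{non}}} \sum_{t \in \mathrm{non}} \log_2\!\left(1 + e^{s_t}\right)
```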
0:08:38 | and I will remind you that the |
---|
0:08:40 | upper layer is the speaker layer |
---|
0:08:48 | as I mentioned earlier, for each recording we should make one decision |
---|
0:08:55 | there are several ways of doing that, of how to feed back |
---|
0:09:00 | make a decision on one frame |
---|
0:09:03 | make a decision on another frame of the recording |
---|
0:09:08 | we took another approach; here is what we do |
---|
0:09:12 | the scores are averaged at the output layer before the nonlinearity |
---|
0:09:18 | which means that for each frame over the recording we compute the statistics and |
---|
0:09:24 | sum them and average them |
---|
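A minimal sketch of that pooling rule, assuming the per-frame speaker-layer pre-activations come from a forward pass like the earlier sketch: the pre-nonlinearity outputs of all frames in a recording are averaged, and the output nonlinearity (a softmax here, which is an assumption) is applied only once, giving a single decision per recording.

```python
import numpy as np

def conversation_level_scores(frame_preactivations):
    """Average speaker-layer pre-activations over all frames of one recording,
    then apply the output nonlinearity once -> one decision per recording."""
    pooled = frame_preactivations.mean(axis=0)        # (num_speakers,)
    pooled = pooled - pooled.max()                    # numerical stability
    return np.exp(pooled) / np.exp(pooled).sum()      # softmax (assumed nonlinearity)

frame_scores = np.random.randn(4500, 173)   # e.g. 45 s of frames x 173 speakers
recording_probs = conversation_level_scores(frame_scores)
decision = recording_probs.argmax()          # single speaker decision for the recording
```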
0:09:32 | as I mentioned earlier, the second key idea of my methodology is |
---|
0:09:38 | using a separate system in the training |
---|
0:09:43 | in this diagram, here is the top layer of the network |
---|
0:09:49 | as before, we have a BN score generation scheme here |
---|
0:09:54 | we have a standard MFCC system |
---|
0:09:58 | but since we are using a likelihood-ratio-based |
---|
0:10:01 | training criterion, this score must be weighted |
---|
0:10:05 | against the score based on the |
---|
0:10:09 | bottleneck features |
---|
0:10:12 | so we do that with a linear combination |
---|
0:10:16 | of these two types of scores and use that in the training criterion |
---|
0:10:25 | so one question is how the calibration is achieved |
---|
0:10:30 | as we see here we have three parameters in the bottom equation |
---|
0:10:36 | w1, w2, and kappa are estimated |
---|
0:10:44 | by minimizing our training objective while the network weights are fixed |
---|
0:10:53 | after these parameters are estimated |
---|
0:10:55 | they are held fixed and the network weights are estimated |
---|
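A hedged sketch of that calibration step: with the network weights frozen, only w1, w2, and kappa are fit by minimizing a Cllr-style objective over a set of calibration trials. The synthetic scores, the use of scipy's Nelder-Mead optimizer, and the exact form of the cost are illustrative assumptions, not the speaker's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def cllr(scores, labels):
    """Log-likelihood-ratio cost; labels are 1 for target trials, 0 otherwise."""
    tar, non = scores[labels == 1], scores[labels == 0]
    return 0.5 * (np.mean(np.log2(1 + np.exp(-tar))) + np.mean(np.log2(1 + np.exp(non))))

def fused(params, s_bn, s_mfcc):
    w1, w2, kappa = params
    return w1 * s_bn + w2 * s_mfcc + kappa   # the linear combination described on the slide

# Hypothetical calibration trials: bottleneck scores, MFCC scores, and trial labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
s_bn = rng.normal(labels * 2.0, 1.0)
s_mfcc = rng.normal(labels * 1.5, 1.0)

# Network weights stay fixed; only (w1, w2, kappa) are optimized in this step.
res = minimize(lambda p: cllr(fused(p, s_bn, s_mfcc), labels),
               x0=[1.0, 1.0, 0.0], method="Nelder-Mead")
print(res.x)   # estimated w1, w2, kappa
```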
0:11:03 | as many of the speakers have mentioned, we use a |
---|
0:11:09 | standard speaker recognition system after we extract the features |
---|
0:11:15 | namely we use a UBM, we have i-vectors |
---|
0:11:19 | we have PLDA |
---|
0:11:23 | I will skip this part; next I will report the experimental results |
---|
0:11:32 | we ran experiments on the same- and different-microphone tasks of NIST SRE 2010 |
---|
0:11:41 | this is our main interest, our target |
---|
0:11:47 | for the bottleneck network training we use microphone recordings only |
---|
0:11:55 | we use all SRE 2004, 2005, 2006 and Switchboard |
---|
0:12:02 | data |
---|
0:12:05 | in our experiments |
---|
0:12:09 | we used microphone recordings for the bottleneck network training |
---|
0:12:14 | we have 173 |
---|
0:12:17 | speakers in the training and validation sets |
---|
0:12:21 | this gives us about 4341 recordings in training |
---|
0:12:26 | and 865 recordings; in terms of number of |
---|
0:12:32 | input samples |
---|
0:12:34 | we have about a few million samples in training |
---|
0:12:39 | and two million in calibration |
---|
0:12:46 | the exact numbers were |
---|
0:12:48 | like this: we have a |
---|
0:12:51 | 294-dimensional input, 14 times 21; we use |
---|
0:12:59 | plus and minus ten frames as context frames |
---|
0:13:02 | the network has 1000 hidden nodes |
---|
0:13:08 | followed by a 42-node bottleneck |
---|
0:13:12 | layer; these are fully connected to another 500 nodes |
---|
0:13:18 | and an output layer of 173 speakers |
---|
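As a sanity check on those numbers, a small sketch of how the 294-dimensional input can be formed: 14-dimensional MFCC frames stacked over the current frame plus and minus ten neighbours, i.e. 21 frames in total. The edge padding at utterance boundaries is an assumption.

```python
import numpy as np

def stack_context(mfcc, left=10, right=10):
    """Stack each 14-dim frame with +/-10 neighbours -> 14 * 21 = 294-dim inputs."""
    padded = np.pad(mfcc, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(mfcc)] for i in range(left + right + 1)])

mfcc = np.random.randn(500, 14)
inputs = stack_context(mfcc)
print(inputs.shape)   # (500, 294)
```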
0:13:24 | I would like to mention |
---|
0:13:27 | the processing of the input features and bottleneck features |
---|
0:13:35 | the input features are mean and variance normalized |
---|
0:13:41 | before entering the network |
---|
0:13:43 | the mean and variance are estimated over a window of three seconds of speech |
---|
0:13:50 | and this is also applied to the bottleneck features |
---|
0:13:54 | to make them compatible with the diagonal-covariance GMM assumption |
---|
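A sketch of that normalization under stated assumptions: a window of roughly three seconds centred on each frame (300 frames at an assumed 10 ms frame shift), with the same routine applied to the bottleneck features before the diagonal-covariance GMM stage.

```python
import numpy as np

def window_mvn(features, win=300):
    """Mean/variance-normalize each frame using statistics from a window
    of `win` frames (~3 s at a 10 ms frame shift) centred on it."""
    half = win // 2
    out = np.empty_like(features)
    for t in range(len(features)):
        seg = features[max(0, t - half):t + half + 1]
        out[t] = (features[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out

feats = np.random.randn(1000, 42)     # e.g. 42-dim bottleneck features
normalized = window_mvn(feats)
```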
0:14:02 | this one shows the effect of the training criteria |
---|
0:14:08 | presented in the blue and red columns here |
---|
0:14:13 | the red color is the traditionally |
---|
0:14:17 | trained network, and now we have the green color |
---|
0:14:21 | where we use the conversation-level training |
---|
0:14:24 | that I just described |
---|
0:14:26 | you may remember that the decrease was 40% |
---|
0:14:31 | on the same-microphone test and it was 45% |
---|
0:14:37 | on the different-microphone test |
---|
0:14:41 | it became 30% and 34% respectively |
---|
0:14:46 | the difference is more |
---|
0:14:50 | clearly observed in the |
---|
0:14:53 | different-microphone performance in terms of |
---|
0:15:01 | equal error rate; relative to the traditionally trained network, it is now 30% |
---|
0:15:09 | we also explore the effect of the bottleneck layer size in the next slide |
---|
0:15:15 | the effect of the layer size, yes |
---|
0:15:17 | is observed on the training set |
---|
0:15:21 | as we increase the bottleneck feature vector size we got an improvement |
---|
0:15:27 | which we did explore |
---|
0:15:34 | this slide shows the combination strategy I just mentioned |
---|
0:15:40 | the blue column shows the MFCC baseline |
---|
0:15:45 | the red column shows the linear combination of the two scores using a toolkit |
---|
0:15:52 | the green column shows |
---|
0:15:55 | training the network with the separate system |
---|
0:16:01 | yes, we also get an improvement of 18% |
---|
0:16:06 | with this strategy |
---|
0:16:10 | I would like to |
---|
0:16:16 | conclude: we showed |
---|
0:16:19 | one way to train the bottleneck network using a conversation-level |
---|
0:16:23 | speaker recognition training criterion |
---|
0:16:26 | and we also showed how to use |
---|
0:16:28 | a separate system in training the network |
---|
0:16:32 | and thank you, questions are welcome |
---|
0:16:47 | these are just features, yes |
---|
0:16:52 | it's just the same; so instead of MFCC features we use bottleneck features |
---|
0:17:23 | yes, we do use a GMM |
---|
0:17:26 | so what we said is that MFCC deltas and delta-deltas are a product of a linear transformation |
---|
0:17:41 | so our question is, if we perform a nonlinear transformation, can we do better |
---|
0:18:35 | we did; we started with two thousand |
---|
0:18:49 | we actually first started with the baseline system; it has 34 |
---|
0:19:00 | the baseline features are obtained the same way |
---|
0:19:46 | actually this combination |
---|
0:19:48 | was a two-step combination |
---|
0:19:51 | first we use these MFCC scores in training the network |
---|
0:19:54 | after that we get the PLDA scores and we combine these scores |
---|
0:20:13 | oh, I have those results actually |
---|
0:20:18 | the linear combination, this red column, shows that |
---|
0:20:26 | so we got perhaps 30% |
---|
0:20:38 | we don't have any deltas |
---|
0:20:42 | we actually wanted to avoid that |
---|
0:21:21 | we test on 2010 |
---|
0:21:25 | the training data are from 2004, 2005, and 2006 |
---|