0:00:15 Hi everybody. In this talk I am going to present a method for aligning frames to the states, and also to the Gaussian components, of an HMM in text-dependent speaker verification, and also the use of deep neural networks for improving the performance of text-dependent speaker verification.
0:00:43 Text-dependent speaker verification is the task of verifying both the speaker and the phrase. The phrase information is known, and we can use it for improving the performance.
0:01:00 We propose phrase-dependent HMM models for aligning frames to the states and also to the Gaussian components. By using an HMM we use the phrase information, and we can also take into account the frame order.
0:01:22 Using the HMM also reduces the uncertainty in the i-vector estimation. If we take the trace of the i-vector posterior covariance matrix as a measure of uncertainty, this method reduces the uncertainty by about twenty percent compared to the GMM.
0:01:48 In addition, we try using deep neural networks for reducing the gap between the GMM and HMM alignments, and also for improving the performance of both methods.
0:02:06 Let me start with the general i-vector-based system. In the i-vector system we model the utterance-dependent mean supervector s with the equation s = m + Tw, where m is the UBM mean supervector, T is a low-rank matrix, and w is the i-vector. We need zero- and first-order statistics for training the extractor and for extracting i-vectors; you can see the equations on the slide. In these equations, gamma_t(c) shows the posterior probability that frame t was generated by one specific Gaussian component c, and the statistics are computed from gamma: the zero-order statistic is N_c = sum_t gamma_t(c) and the first-order statistic is F_c = sum_t gamma_t(c) o_t. Gamma is the only thing that differs between the GMM-UBM, the HMM, and the DNN alignment methods.
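As a sketch, the zero- and first-order statistics above can be computed like this (the variable names are my own; only the way gamma is obtained differs between GMM, HMM, and DNN alignment):

```python
import numpy as np

def baum_welch_stats(frames, gamma):
    """Zero- and first-order sufficient statistics for an i-vector extractor.

    frames: (T, D) acoustic feature vectors o_t
    gamma:  (T, C) alignment posteriors gamma_t(c); each row sums to 1.
            These can come from a GMM-UBM, an HMM forced alignment,
            or a DNN -- the statistics are computed the same way.
    """
    N = gamma.sum(axis=0)   # zero-order: N_c = sum_t gamma_t(c), shape (C,)
    F = gamma.T @ frames    # first-order: F_c = sum_t gamma_t(c) o_t, shape (C, D)
    return N, F
```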
0:02:58 When you want to use an HMM instead of a GMM-UBM in text-dependent speaker verification, you have several choices.
0:03:08 The first one is using phrase-dependent HMM models; in this case you have to train an i-vector extractor for each phrase. This is suitable for a common-passphrase scenario and also for text-prompted speaker verification, but we need sufficient training data for each phrase, and so it is not practical for real applications of text-dependent speaker verification.
0:03:38 Another choice is tied-mixture HMMs, and the last method, the proposed one, is using phrase-independent models. In this method we use a monophone structure, the same as in speech recognition, and build each phrase model by concatenating the monophone models. We then extract sufficient statistics for each phrase and convert them into the same shape for all phrases, so that we can, for example, train one i-vector extractor for all phrases. In this method we do not need a large amount of training data for each phrase, and the monophone HMMs can be trained using any transcribed data.
0:04:27 The first stage of this method is training a phone recognizer and constructing a left-to-right model for each phrase. We then do Viterbi forced alignment to align the frames to the states, and in each state we extract sufficient statistics in the same way as for a simple GMM.
0:04:59 Since the statistics for each phrase have different shapes, we have to change them to a unique shape in order to be able to train one i-vector extractor for all of the phrases. At the bottom of this figure you can see the phrase-specific zero- and first-order statistics. To convert the phrase-specific statistics to a fixed shape, we simply sum the parts of the statistics that are associated with the same state of the same phoneme. After that, we train an i-vector extractor exactly as in text-independent speaker verification.
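A minimal numpy sketch of this pooling step, under an assumed layout where each phrase's statistics are stored per HMM state and `state_map` records which monophone state each phrase state came from (both names are hypothetical):

```python
import numpy as np

def pool_stats(N_phrase, F_phrase, state_map, n_states):
    """Map phrase-specific statistics to a fixed, phrase-independent shape.

    N_phrase:  (S,)   zero-order stats of one phrase, one entry per phrase state
    F_phrase:  (S, D) first-order stats of the same phrase
    state_map: length-S sequence; state_map[s] is the monophone-state index
               that phrase state s was built from (repeated phonemes share it)
    n_states:  total number of monophone states, the same for every phrase
    """
    N = np.zeros(n_states)
    F = np.zeros((n_states, F_phrase.shape[1]))
    for s, m in enumerate(state_map):
        N[m] += N_phrase[s]   # sum the parts of the statistics that belong
        F[m] += F_phrase[s]   # to the same state of the same phoneme
    return N, F
```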
0:05:58 For channel compensation and scoring, text-dependent speaker verification has a problem with PLDA: it has been shown that the performance of PLDA is not so good, and sometimes the performance of the baseline GMM-UBM system is better than PLDA. Also, because in text-dependent speaker verification the training data are really limited, in both the number of speakers and the number of samples per phrase, we cannot use a simple LDA or WCCN, so we instead use a regularized WCCN to reduce the effect of the small sample size.
0:06:49 In regularized WCCN we just add some regularization to the covariance matrix of each class; everything else is exactly the same as in simple WCCN.
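A rough numpy sketch of what regularized WCCN could look like (the function name and the value of alpha are my own; the only difference from plain WCCN is the added regularization term):

```python
import numpy as np

def regularized_wccn(ivecs, labels, alpha=0.1):
    """Regularized within-class covariance normalization.

    ivecs:  (N, D) length-normalized i-vectors
    labels: (N,)   class label per i-vector (e.g. speaker, or speaker+phrase)
    alpha:  regularization weight; with alpha = 0 this is plain WCCN
    Returns B such that i-vectors are projected as w' = B.T @ w.
    """
    D = ivecs.shape[1]
    classes = np.unique(labels)
    W = np.zeros((D, D))
    for c in classes:
        x = ivecs[labels == c]
        x = x - x.mean(axis=0)
        W += x.T @ x / len(x)                    # covariance of class c
    W = W / len(classes) + alpha * np.eye(D)     # the regularization term
    # Cholesky factor of W^-1 whitens the within-class covariance
    return np.linalg.cholesky(np.linalg.inv(W))
```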
0:07:11 Also, in text-dependent speaker verification, because the utterances are very short, we have to use a phrase-dependent transform and also phrase-dependent score normalization, especially when we use the HMM for aligning the frames. We use cosine similarity for scoring and s-norm for score normalization.
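A small sketch of cosine scoring with symmetric score normalization (s-norm); the phrase-dependent part would be that `cohort` contains only i-vectors of the same phrase (names are illustrative):

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two i-vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_norm(enroll, test, cohort):
    """Symmetric score normalization of a cosine score against a cohort.

    cohort: (N, D) i-vectors of other speakers; for text-dependent
    verification this set would be phrase-dependent.
    """
    raw = cosine_score(enroll, test)
    se = np.array([cosine_score(enroll, c) for c in cohort])
    st = np.array([cosine_score(test, c) for c in cohort])
    return 0.5 * ((raw - se.mean()) / se.std() + (raw - st.mean()) / st.std())
```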
0:07:43 For reducing the gap between the HMM and GMM alignments we can use a DNN in two scenarios. The first one is using the DNN for calculating the alignment posterior probabilities, exactly the same as was proposed for text-independent speaker verification. The other choice is using the DNN for extracting bottleneck features in order to improve the GMM alignment; in this case, because bottleneck features cluster in a phoneme-like way, the performance of the GMM-based alignment improves.
0:08:28 In this work we use stacked bottleneck features. In this topology, two bottleneck networks are connected to each other: the bottleneck-layer output of the first stage forms the input of the second stage, and we use the bottleneck-layer output of the second stage as features.
0:09:04 We used two different networks: one is used only for extracting bottleneck features and has about eight thousand senones, and the other one is used both for extracting bottleneck features and for calculating the posterior probabilities, and has about one thousand senones. As input features we used log Mel-scale filterbank outputs and also fundamental frequency features.
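A toy forward pass showing the stacked-bottleneck wiring (the layer sizes and random weights are purely illustrative; the real networks are trained on senone targets and take context windows of stacked frames as input):

```python
import numpy as np

def bottleneck_output(x, sizes, bn_layer, rng):
    """Forward a random-weight MLP and return the bottleneck-layer activations."""
    h = x
    for i, (din, dout) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.normal(scale=din ** -0.5, size=(din, dout))
        h = np.tanh(h @ w)
        if i == bn_layer:
            return h          # stop at the narrow bottleneck layer
    return h

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 24))   # e.g. 24 log Mel filterbank outputs

# Stage 1: first network; its bottleneck output is kept.
bn1 = bottleneck_output(frames, [24, 512, 80, 512], bn_layer=1, rng=rng)

# Stage 2: the first-stage bottleneck outputs form the input of the second
# network; its bottleneck layer gives the final stacked bottleneck features.
sbn = bottleneck_output(bn1, [80, 512, 80, 512], bn_layer=1, rng=rng)
```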
0:09:40 For the experiments we used Part 1 of the RSR2015 dataset. In the RSR2015 dataset there are three hundred speakers, one hundred fifty-seven males and one hundred forty-three females, each of whom pronounces thirty different phrases, taken from TIMIT, in nine distinct sessions. Three sessions are used for enrollment, by averaging the i-vectors, and the others for testing. We just used the background set for training, and the results are reported on the evaluation set.
0:10:20 For training the DNNs we used the Switchboard dataset. As features we used different acoustic features: thirty-nine-dimensional PLP features and also MFCC features, both of them extracted from the 16 kHz signals, and two versions of the bottleneck features, both extracted from 8 kHz data.
0:10:48 For VAD we used a supervised silence model, just dropping the initial and final silence using the original transcripts. After that we applied cepstral mean and variance normalization. We used four-hundred-dimensional i-vectors that are length-normalized before the regularized WCCN, and, as I said, we used phrase-dependent regularized WCCN and s-norm, with cosine distance for scoring.
0:11:27 In this table you can see a comparison between different features and also different alignment methods. In the first section of the table you can compare the performance of the GMM and HMM alignments, and you can see that the HMM significantly improves the performance. Comparing the DNN alignment with the HMM, you can see that the DNN alignment also improves the performance; especially for the females, the performance is better than the HMM alignment when we use just cepstral features.
0:12:09 When we use bottleneck features, the performance of the GMM alignment increases, as you can see by comparing these two numbers and also the others. For the HMM-based system, the performance for the females is better, while for the males we got some deterioration in performance. And when we use both bottleneck features and the DNN alignment, you can see some deterioration in performance from using both of them.
0:12:52 In the last section you can see the results of the bottleneck features concatenated with the MFCC features. In this case we got the best results. For both the HMM and the GMM case, you can see that when we use these features the performance of the GMM is very close to the HMM one, but again for the DNN the performance is not so good.
0:13:24 Because the performance of the HMM alignment is better than the others, we just report the results for it in the next table.
0:13:34 In this table, in the first section, we compare the performance of the different features: MFCC, PLP, and two bottleneck features, one of them extracted from a smaller network. You can see that for the males MFCC and PLP perform approximately the same, and the bottleneck is worse for the males but better for the females. When we reduce the size of the network, the performance of the bottleneck decreases, as you can see. For both PLP and MFCC, when we concatenate them with the bottleneck features, we get a big improvement.
0:14:28 In the last section of this table you can see the results of fusion in the score domain. Comparing it with the second section, which is fusion in the feature domain, you can see that in almost all cases the performance of score-domain fusion is better than feature-domain fusion in text-dependent speaker verification. This differs from text-independent speaker verification, where the performance of concatenation is better than fusing the scores of the two features. The problem here is the training data: the training data are very limited, and for a larger feature dimensionality we would need more training data.
0:15:26 You can see that when we fuse the bottleneck features with PLP and MFCC we get a big improvement, and the best result comes from fusing the scores of three different features.
0:15:48 To conclude: we showed that HMMs can achieve very good results with i-vectors in text-dependent speaker verification. We verified that in text-dependent speaker verification the performance of the DNN alignment is good, and in some cases it gives similar or better results than the HMM alignment. We also got excellent results using bottleneck features in text-dependent speaker verification, especially when they are concatenated with the other cepstral features. In text-dependent speaker verification, score-domain fusion is better than feature-level fusion, and we got the best results from fusing three different features.
0:16:48 The last point is that in text-dependent speaker verification you have to use phrase-dependent transforms and score normalization, because the duration is very short; and if you use the HMM for aligning frames to the states, it is better not to use a phrase-independent score normalization.
0:17:21 Questions?
0:17:36 [Question, partly unintelligible] Okay, maybe a quick question on the i-vector work... did you try this one, the red dots?
0:17:43 Yes, these are also results from our work; you can see the results that we submitted to Interspeech. You can see a comparison between GMM-UBM, GMM i-vector, and also HMM i-vector on three different non-target trial types. You can see that, especially for the target-wrong trials, the phrase that is pronounced is important for us, the content is important, and the performance of the HMM alignment is much better than the other two methods. Also for the impostor-correct case the performance of the HMM is better too.
0:18:40 Time for one more question.
0:18:45 [Question, partly unintelligible] Just a quick question on the GMM systems... the HMMs, did you try using context-dependent (CD) units?
0:19:00 No, I did not try that.