0:00:15 Hi everybody. In this talk I am going to present a method for aligning frames to the states, and also to the Gaussian components, of an HMM in text-dependent speaker verification, and also the use of deep neural networks for improving the performance of text-dependent speaker verification.
0:00:43 Text-dependent speaker verification is the task of verifying both the speaker and the phrase. The phrase information is known, and we can use it for improving the performance.
0:01:00 We propose phrase-dependent HMM models for aligning frames to the states and also to the Gaussian components. By using an HMM we use the phrase information, and we can also take into account the frame order.
0:01:22 Using the HMM also reduces the uncertainty in the i-vector estimation. If we take the trace of the i-vector posterior covariance matrix as a measure of uncertainty, this method reduces the uncertainty by about twenty percent compared to the GMM.
0:01:48 In addition, we try using deep neural networks for reducing the gap between the GMM and HMM alignments, and also for improving the performance of both methods.
0:02:06 Let me start with the general i-vector-based system. In the i-vector system we model the utterance-dependent mean supervector s with the equation s = m + Tw, where m is the UBM mean supervector, T is a low-rank matrix, and w is the i-vector. We need zero- and first-order statistics for training the extractor and for extracting i-vectors; you can see the equations on the slide. In these equations, gamma_t(c) shows the posterior probability that frame t was generated by one specific Gaussian component c, and the statistics are computed from gamma: the zero-order statistic is N_c = sum_t gamma_t(c) and the first-order statistic is F_c = sum_t gamma_t(c) o_t. Gamma is the only thing that differs between the GMM-UBM, the HMM, and the DNN alignment methods.
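As a sketch, the zero- and first-order statistics above can be computed like this (the variable names are my own; only the way gamma is obtained differs between GMM, HMM, and DNN alignment):

```python
import numpy as np

def baum_welch_stats(frames, gamma):
    """Zero- and first-order sufficient statistics for an i-vector extractor.

    frames: (T, D) acoustic feature vectors o_t
    gamma:  (T, C) alignment posteriors gamma_t(c); each row sums to 1.
            These can come from a GMM-UBM, an HMM forced alignment,
            or a DNN -- the statistics are computed the same way.
    """
    N = gamma.sum(axis=0)   # zero-order: N_c = sum_t gamma_t(c), shape (C,)
    F = gamma.T @ frames    # first-order: F_c = sum_t gamma_t(c) o_t, shape (C, D)
    return N, F
```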
0:02:58 When you want to use an HMM instead of a GMM-UBM in text-dependent speaker verification, you have several choices.
0:03:08 The first one is using phrase-dependent HMM models; in this case you have to train an i-vector extractor for each phrase. This is suitable for a common-passphrase scenario and also for text-prompted speaker verification, but we need sufficient training data for each phrase, and so it is not practical for real applications of text-dependent speaker verification.
0:03:38 Another choice is tied-mixture HMMs, and the last method, the proposed one, is using phrase-independent models. In this method we use a monophone structure, the same as in speech recognition, and build each phrase model by concatenating the monophone models. We then extract sufficient statistics for each phrase and convert them into the same shape for all phrases, so that we can, for example, train one i-vector extractor for all phrases. In this method we do not need a large amount of training data for each phrase, and the monophone HMMs can be trained using any transcribed data.
0:04:27 The first stage of this method is training a phone recognizer and constructing a left-to-right model for each phrase. We then do Viterbi forced alignment to align the frames to the states, and in each state we extract sufficient statistics in the same way as for a simple GMM.
0:04:59 Since the statistics for each phrase have different shapes, we have to change them to a unique shape in order to be able to train one i-vector extractor for all of the phrases. At the bottom of this figure you can see the phrase-specific zero- and first-order statistics. To convert the phrase-specific statistics to a fixed shape, we simply sum the parts of the statistics that are associated with the same state of the same phoneme. After that, we train an i-vector extractor exactly as in text-independent speaker verification.
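A minimal numpy sketch of this pooling step, under an assumed layout where each phrase's statistics are stored per HMM state and `state_map` records which monophone state each phrase state came from (both names are hypothetical):

```python
import numpy as np

def pool_stats(N_phrase, F_phrase, state_map, n_states):
    """Map phrase-specific statistics to a fixed, phrase-independent shape.

    N_phrase:  (S,)   zero-order stats of one phrase, one entry per phrase state
    F_phrase:  (S, D) first-order stats of the same phrase
    state_map: length-S sequence; state_map[s] is the monophone-state index
               that phrase state s was built from (repeated phonemes share it)
    n_states:  total number of monophone states, the same for every phrase
    """
    N = np.zeros(n_states)
    F = np.zeros((n_states, F_phrase.shape[1]))
    for s, m in enumerate(state_map):
        N[m] += N_phrase[s]   # sum the parts of the statistics that belong
        F[m] += F_phrase[s]   # to the same state of the same phoneme
    return N, F
```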
0:05:58 For channel compensation and scoring, text-dependent speaker verification has a problem with PLDA: it has been shown that the performance of PLDA is not so good, and sometimes the performance of the baseline GMM-UBM system is better than PLDA. Also, because in text-dependent speaker verification the training data are really limited, in both the number of speakers and the number of samples per phrase, we cannot use a simple LDA or WCCN, so we instead use a regularized WCCN to reduce the effect of the small sample size.
0:06:49 In regularized WCCN we just add some regularization to the covariance matrix of each class; everything else is exactly the same as in simple WCCN.
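A rough numpy sketch of what regularized WCCN could look like (the function name and the value of alpha are my own; the only difference from plain WCCN is the added regularization term):

```python
import numpy as np

def regularized_wccn(ivecs, labels, alpha=0.1):
    """Regularized within-class covariance normalization.

    ivecs:  (N, D) length-normalized i-vectors
    labels: (N,)   class label per i-vector (e.g. speaker, or speaker+phrase)
    alpha:  regularization weight; with alpha = 0 this is plain WCCN
    Returns B such that i-vectors are projected as w' = B.T @ w.
    """
    D = ivecs.shape[1]
    classes = np.unique(labels)
    W = np.zeros((D, D))
    for c in classes:
        x = ivecs[labels == c]
        x = x - x.mean(axis=0)
        W += x.T @ x / len(x)                    # covariance of class c
    W = W / len(classes) + alpha * np.eye(D)     # the regularization term
    # Cholesky factor of W^-1 whitens the within-class covariance
    return np.linalg.cholesky(np.linalg.inv(W))
```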
0:07:11 Also, in text-dependent speaker verification, because the utterances are very short, we have to use a phrase-dependent transform and also phrase-dependent score normalization, especially when we use the HMM for aligning the frames. We use cosine similarity for scoring and s-norm for score normalization.
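A small sketch of cosine scoring with symmetric score normalization (s-norm); the phrase-dependent part would be that `cohort` contains only i-vectors of the same phrase (names are illustrative):

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two i-vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_norm(enroll, test, cohort):
    """Symmetric score normalization of a cosine score against a cohort.

    cohort: (N, D) i-vectors of other speakers; for text-dependent
    verification this set would be phrase-dependent.
    """
    raw = cosine_score(enroll, test)
    se = np.array([cosine_score(enroll, c) for c in cohort])
    st = np.array([cosine_score(test, c) for c in cohort])
    return 0.5 * ((raw - se.mean()) / se.std() + (raw - st.mean()) / st.std())
```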
0:07:43 For reducing the gap between the HMM and GMM alignments we can use a DNN in two scenarios. The first one is using the DNN for calculating the alignment posterior probabilities, exactly the same as was proposed for text-independent speaker verification. The other choice is using the DNN for extracting bottleneck features in order to improve the GMM alignment; in this case, because bottleneck features cluster in a phoneme-like way, the performance of the GMM-based alignment improves.
0:08:28 In this work we use stacked bottleneck features. In this topology, two bottleneck networks are connected to each other: the bottleneck-layer output of the first stage forms the input of the second stage, and we use the bottleneck-layer output of the second stage as features.
0:09:04 We used two different networks: one is used only for extracting bottleneck features and has about eight thousand senones, and the other one is used both for extracting bottleneck features and for calculating the posterior probabilities, and has about one thousand senones. As input features we used log Mel-scale filterbank outputs and also fundamental frequency features.
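A toy forward pass showing the stacked-bottleneck wiring (the layer sizes and random weights are purely illustrative; the real networks are trained on senone targets and take context windows of stacked frames as input):

```python
import numpy as np

def bottleneck_output(x, sizes, bn_layer, rng):
    """Forward a random-weight MLP and return the bottleneck-layer activations."""
    h = x
    for i, (din, dout) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.normal(scale=din ** -0.5, size=(din, dout))
        h = np.tanh(h @ w)
        if i == bn_layer:
            return h          # stop at the narrow bottleneck layer
    return h

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 24))   # e.g. 24 log Mel filterbank outputs

# Stage 1: first network; its bottleneck output is kept.
bn1 = bottleneck_output(frames, [24, 512, 80, 512], bn_layer=1, rng=rng)

# Stage 2: the first-stage bottleneck outputs form the input of the second
# network; its bottleneck layer gives the final stacked bottleneck features.
sbn = bottleneck_output(bn1, [80, 512, 80, 512], bn_layer=1, rng=rng)
```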
0:09:40 For the experiments we used Part 1 of the RSR2015 dataset. In the RSR2015 dataset there are three hundred speakers, one hundred fifty-seven males and one hundred forty-three females, each of whom pronounces thirty different phrases, taken from TIMIT, in nine distinct sessions. Three sessions are used for enrollment, by averaging the i-vectors, and the others for testing. We just used the background set for training, and the results are reported on the evaluation set.
0:10:20 For training the DNNs we used the Switchboard dataset. As features we used different acoustic features: thirty-nine-dimensional PLP features and also MFCC features, both of them extracted from the 16 kHz signals, and two versions of the bottleneck features, both extracted from 8 kHz data.
0:10:48 For VAD we used a supervised silence model, just dropping the initial and final silence using the original transcripts. After that we applied cepstral mean and variance normalization. We used four-hundred-dimensional i-vectors that are length-normalized before the regularized WCCN, and, as I said, we used phrase-dependent regularized WCCN and s-norm, with cosine distance for scoring.
0:11:27 In this table you can see a comparison between different features and also different alignment methods. In the first section of the table you can compare the performance of the GMM and HMM alignments, and you can see that the HMM significantly improves the performance. Comparing the DNN alignment with the HMM, you can see that the DNN alignment also improves the performance; especially for the females, the performance is better than the HMM alignment when we use just cepstral features.
0:12:09 When we use bottleneck features, the performance of the GMM alignment increases, as you can see by comparing these two numbers and also the others. For the HMM-based system, the performance for the females is better, while for the males we got some deterioration in performance. And when we use both bottleneck features and the DNN alignment, you can see some deterioration in performance from using both of them.
0:12:52 In the last section you can see the results of the bottleneck features concatenated with the MFCC features. In this case we got the best results. For both the HMM and the GMM case, you can see that when we use these features the performance of the GMM is very close to the HMM one, but again for the DNN the performance is not so good.
0:13:24 Because the performance of the HMM alignment is better than the others, we just report the results for it in the next table.
0:13:34 In this table, in the first section, we compare the performance of the different features: MFCC, PLP, and two bottleneck features, one of them extracted from a smaller network. You can see that for the males MFCC and PLP perform approximately the same, and the bottleneck is worse for the males but better for the females. When we reduce the size of the network, the performance of the bottleneck decreases, as you can see. For both PLP and MFCC, when we concatenate them with the bottleneck features, we get a big improvement.
0:14:28 In the last section of this table you can see the results of fusion in the score domain. Comparing it with the second section, which is fusion in the feature domain, you can see that in almost all cases the performance of score-domain fusion is better than feature-domain fusion in text-dependent speaker verification. This differs from text-independent speaker verification, where the performance of concatenation is better than fusing the scores of the two features. The problem here is the training data: the training data are very limited, and for a larger feature dimensionality we would need more training data.
0:15:26 You can see that when we fuse the bottleneck features with PLP and MFCC we get a big improvement, and the best result comes from fusing the scores of three different features.
0:15:48 To conclude: we showed that HMMs can achieve very good results with i-vectors in text-dependent speaker verification. We verified that in text-dependent speaker verification the performance of the DNN alignment is good, and in some cases it gives similar or better results than the HMM alignment. We also got excellent results using bottleneck features in text-dependent speaker verification, especially when they are concatenated with the other cepstral features. In text-dependent speaker verification, score-domain fusion is better than feature-level fusion, and we got the best results from fusing three different features.
0:16:48 The last point is that in text-dependent speaker verification you have to use phrase-dependent transforms and score normalization, because the duration is very short; and if you use the HMM for aligning frames to the states, it is better not to use a phrase-independent score normalization.
0:17:21 Questions?
0:17:36 [Question, partly unintelligible] Okay, maybe a quick question on the i-vector work... did you try this one, the red dots?
0:17:43 Yes, these are also results from our work; you can see the results that we submitted to Interspeech. You can see a comparison between GMM-UBM, GMM i-vector, and also HMM i-vector on three different non-target trial types. You can see that, especially for the target-wrong trials, the phrase that is pronounced is important for us, the content is important, and the performance of the HMM alignment is much better than the other two methods. Also for the impostor-correct case the performance of the HMM is better too.
0:18:40 Time for one more question.
0:18:45 [Question, partly unintelligible] Just a quick question on the GMM systems... the HMMs, did you try using context-dependent (CD) units?
0:19:00 No, I did not try that.