Hi everybody. In this talk I am going to present an HMM-based method for aligning frames to the states and also to the Gaussian components in text-dependent speaker verification, and also the use of deep neural networks for improving the performance of text-dependent speaker verification.

Text-dependent speaker verification is the task of verifying both the speaker identity and the phrase content, and we can use the phrase information for improving the performance.

We proposed phrase-dependent HMM models for aligning frames to the states and also to the Gaussian components. By using an HMM we can use the phrase information and also take the temporal structure of the utterance into account in the framework.

Using the HMM alignment also reduces the uncertainty in the i-vector estimation. If we take the trace of the i-vector posterior covariance matrix as a measure of uncertainty, the HMM alignment reduces the uncertainty by about twenty percent compared to the GMM.

In addition, we tried using deep neural networks for reducing the gap between the GMM and HMM alignments and also for improving the performance of the HMM-based system. Let me first describe the general i-vector based system.

In the i-vector system we model the utterance-dependent GMM mean supervector s with this equation: s = m + T·w, where m is the UBM mean supervector, T is the total variability matrix, and w is the i-vector.

In the i-vector system we need the zero- and first-order statistics for training the extractor and for extracting i-vectors. You can see the equations: N_c = Σ_t γ_t(c) and F_c = Σ_t γ_t(c)·o_t. In these equations, γ_t(c) is the posterior probability that frame o_t was generated by one specific Gaussian component c; this is how we compute the gammas.
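As a minimal illustration (not the authors' code), assuming numpy arrays of frames and per-frame posteriors, the statistics can be accumulated like this:

```python
# Minimal sketch of accumulating zero- and first-order statistics.
import numpy as np

def collect_stats(features, posteriors):
    """features: (T, D) frames o_t; posteriors: (T, C) gammas gamma_t(c)."""
    N = posteriors.sum(axis=0)    # zero-order: N_c = sum_t gamma_t(c)
    F = posteriors.T @ features   # first-order: F_c = sum_t gamma_t(c) * o_t
    return N, F
```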

The gammas can be computed not only with the GMM-UBM but also with an HMM or a DNN model. If we want to use an HMM instead of the GMM-UBM in text-dependent speaker verification, we have several choices.

The first choice is using phrase-dependent HMM models; in this case we have to train an i-vector extractor for each phrase. This is suitable for a common passphrase and also for text-prompted speaker verification, but we need sufficient training data for each phrase, so this is not practical for real applications of text-dependent speaker verification.

The other choices are tied-mixture HMMs and, the last method, phrase-independent HMM models.

In this last method we use a monophone HMM structure, the same as in speech recognition, and create each phrase model by concatenating the monophone models. We extract sufficient statistics for each phrase, convert them into the same shape for all phrases, and train one i-vector extractor for all phrases. With this method we don't need a large amount of training data for each phrase, and the HMMs can be trained using any transcribed data.

The first stage of this method is training a phone recognizer and constructing a left-to-right model for each phrase. We then do Viterbi forced alignment to align the frames to the states, and in each state we extract sufficient statistics in the same way as with a simple GMM.
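A minimal sketch of this step, assuming the Viterbi state path is already available: the hard alignment can be written as one-hot posteriors, so per-state statistics are accumulated exactly like per-component GMM statistics.

```python
# Sketch: turn a Viterbi state path into one-hot "posteriors" so that
# per-state statistics can be accumulated like per-component GMM statistics.
import numpy as np

def hard_posteriors(state_path, n_states):
    """state_path: (T,) int array of aligned state indices."""
    gamma = np.zeros((len(state_path), n_states))
    gamma[np.arange(len(state_path)), state_path] = 1.0
    return gamma
```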

Because of this, the statistics have a different shape for each phrase, and we have to change them to a unique shape to be able to train one i-vector extractor for all of the phrases.

At the bottom of this figure you can see the phrase-specific zero- and first-order statistics. To convert the phrase-specific statistics to the final shape, we just sum the parts of the statistics that are associated with the same state of the same phoneme.
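A small sketch of this pooling, where the mapping from each phrase-specific HMM state to a global monophone-state index is an assumed input:

```python
# Sketch: sum phrase-specific statistics over occurrences of the same
# monophone state, giving one common shape for every phrase.
import numpy as np

def pool_stats(N_phrase, F_phrase, state_to_mono, n_mono):
    """state_to_mono[i]: global monophone-state index of phrase state i."""
    N = np.zeros(n_mono)
    F = np.zeros((n_mono, F_phrase.shape[1]))
    for i, j in enumerate(state_to_mono):
        N[j] += N_phrase[i]
        F[j] += F_phrase[i]
    return N, F
```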

After that, we train an i-vector extractor exactly as in text-independent speaker verification.

For channel compensation and scoring, text-dependent speaker verification has a problem with PLDA: it has been shown that the performance of PLDA is not so good, and sometimes the performance of the GMM-UBM baseline is better than PLDA. Also, because in text-dependent speaker verification the training data is really limited, both in the number of speakers and in the number of sessions per phrase, we cannot use a simple LDA. Instead, we suggest using regularized WCCN to reduce the effect of the small sample size.

In regularized WCCN we just add some regularization to the covariance matrix of each class; everything else is exactly the same as simple WCCN.
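A rough sketch of the idea; the exact form of the regularizer is an assumption here (a scaled identity added to each class covariance):

```python
# Sketch of regularized WCCN: add a regularization term to each class
# covariance; the projection itself is computed as in standard WCCN.
import numpy as np

def regularized_wccn(X, labels, r=0.1):
    """X: (n, d) i-vectors; labels: (n,) class (speaker-phrase) labels."""
    classes = np.unique(labels)
    d = X.shape[1]
    W = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c] - X[labels == c].mean(axis=0)
        W += Xc.T @ Xc / len(Xc) + r * np.eye(d)  # regularized covariance
    W /= len(classes)
    # standard WCCN: B with B @ B.T = inv(W); project x -> B.T @ x
    return np.linalg.cholesky(np.linalg.inv(W))
```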

Also, in text-dependent speaker verification, because the utterances are very short, we have to use a phrase-dependent transform, and for the same reason a phrase-dependent score normalization, especially when we use the HMM for aligning the frames. We use cosine similarity for scoring and s-norm for normalization.
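For illustration, a minimal sketch of cosine scoring with symmetric score normalization (s-norm) against a phrase-dependent cohort:

```python
# Sketch of cosine scoring plus s-norm with a phrase-dependent cohort.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def snorm(enroll, test, cohort):
    """cohort: list of i-vectors from the same phrase as the trial."""
    raw = cosine(enroll, test)
    s_e = np.array([cosine(enroll, c) for c in cohort])
    s_t = np.array([cosine(test, c) for c in cohort])
    return 0.5 * ((raw - s_e.mean()) / s_e.std()
                  + (raw - s_t.mean()) / s_t.std())
```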

For reducing the gap between the HMM and GMM alignments we can use DNNs in two scenarios. The first one is using the DNN for calculating the frame posterior probabilities; this is exactly the same as was proposed for text-independent speaker verification. The other choice is using the DNN for extracting bottleneck features to improve the GMM alignment. In this case a better, phoneme-like clustering of the feature space is obtained and the performance of the GMM-based alignment improves.
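As a minimal sketch of the first scenario (names and shapes are assumptions): the DNN's softmax posteriors simply take the place of the GMM component posteriors when accumulating the statistics.

```python
# Sketch: DNN senone posteriors replace GMM posteriors in the statistics.
import numpy as np

def dnn_stats(features, logits):
    """features: (T, D) frames; logits: (T, C) unnormalized DNN outputs."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    gammas = e / e.sum(axis=1, keepdims=True)  # per-frame softmax posteriors
    return gammas.sum(axis=0), gammas.T @ features
```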

In this work we used stacked bottleneck features. In this topology, two bottleneck networks are connected to each other: the bottleneck layer output of the first stage forms the input of the second stage, and we use the bottleneck layer output of the second stage as features.
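To make the topology concrete, here is an illustrative PyTorch sketch; the layer sizes, context stacking, and activation choices are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative sketch of the stacked bottleneck idea (sizes are assumed).
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, in_dim, hidden, bn_dim, n_targets):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, bn_dim))   # bottleneck
        self.post = nn.Sequential(nn.Tanh(), nn.Linear(bn_dim, n_targets))

    def forward(self, x):
        bn = self.pre(x)            # bottleneck activations (the features)
        return bn, self.post(bn)    # plus senone scores for training

first = BottleneckNet(in_dim=330, hidden=1500, bn_dim=80, n_targets=8000)
second = BottleneckNet(in_dim=80 * 5, hidden=1500, bn_dim=80, n_targets=8000)

x = torch.randn(10, 330)                        # 10 frames with context
bn1, _ = first(x)                               # first-stage bottlenecks
stacked = bn1.unfold(0, 5, 1).reshape(-1, 400)  # stack 5 neighboring frames
bn2, _ = second(stacked)                        # final stacked-BN features
```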

We used two different networks: one is used only for extracting bottleneck features and has about eight thousand senone targets, and another one is used for both extracting bottleneck features and calculating the posterior probabilities, and has about one thousand senone targets. As DNN input features we used log Mel-scale filterbank outputs together with fundamental frequency features.

For the experiments we used part one of the RSR2015 dataset. In this dataset there are three hundred speakers, one hundred fifty-seven males and one hundred forty-three females, each of which pronounces thirty different phrases from TIMIT in nine distinct sessions. Three sessions are used for enrollment, by averaging the i-vectors, and the others for testing. We just used the background set for training, and the results are reported on the evaluation set.
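As a small illustration of the enrollment step (assuming the session i-vectors are already extracted):

```python
# Sketch: enroll a speaker model by averaging session i-vectors.
import numpy as np

def enroll(session_ivectors):
    """session_ivectors: (3, d) i-vectors of the enrollment sessions."""
    m = np.mean(session_ivectors, axis=0)
    return m / np.linalg.norm(m)   # length-normalize, as used downstream
```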

For training the DNNs we used the Switchboard dataset. As features we used different acoustic features: thirty-nine-dimensional PLP features and also sixty-dimensional MFCC features, both of them extracted from sixteen kHz speech, and two versions of the bottleneck features extracted from eight kHz data.

For VAD we used a supervised silence model, just dropping the initial and final silences of the original utterances. After that, we applied cepstral mean and variance normalization.
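A minimal sketch of the per-utterance normalization step:

```python
# Sketch of per-utterance cepstral mean and variance normalization.
import numpy as np

def cmvn(feats, eps=1e-8):
    """feats: (T, D) cepstral features of one utterance (after VAD)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)
```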

We used four-hundred-dimensional i-vectors that are length-normalized before the regularized WCCN, and as I said, we used phrase-dependent regularized WCCN and s-norm, with cosine distance for scoring.

In this table you can see the comparison between the different features and also the different alignment methods. In the first section of the table you can compare the performance of the GMM and HMM aligners, and you can see that the HMM significantly improves the performance. Comparing the DNN alignment with the HMM, you can see that the DNN aligner also improves the performance; especially for females the performance is better than with the HMM alignment. When we used bottleneck features instead of cepstral features, the performance of the GMM alignment increased, as you can see by comparing these two numbers and also the others. For the HMM-based alignment, the performance with bottleneck features is better for females, while for males we got some degradation in performance.

When we use the bottleneck features together with the DNN alignment, you can see some degradation in performance from using both of them.

In the last section you can see the results of the bottleneck features concatenated with the MFCC features. In this case we got the best results. For both the HMM and GMM cases, you can see that with these features the performance of the GMM is very close to the HMM one, but again for the DNN alignment the performance is not so good. Because the performance of this concatenation is better than the others, we report only its results in the next table as well.

In this table, in the first section, we compare the performance of the different features: MFCC, PLP, and two bottleneck variants, one of them extracted from a smaller network. You can see that MFCC and PLP perform almost the same, and the bottleneck scores are worse for males but better for females. When we reduce the size of the network, the performance of the bottleneck features decreases, as you can see. For both PLP and MFCC, when we concatenate them with the bottleneck features, we get a big improvement.

In the last section of this table you can see the results of fusion in the score domain, compared with the second section, which is fusion in the feature domain. You can see that in almost all cases the performance of fusion in the score domain is better than in the feature domain for text-dependent speaker verification, whereas in text-independent speaker verification the performance of feature concatenation is better than fusing the scores of the two features. The problem is that here the training data is very limited, and for a higher feature dimensionality we would need more training data.

You can see that by fusing the scores of the bottleneck features with the PLP and MFCC systems we get a big improvement, and the best result comes from fusing the scores of three different features.
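As a sketch of what score-level fusion amounts to (equal weights assumed here; in practice the weights would be trained on a development set):

```python
# Sketch of score-level fusion as a weighted sum of system scores.
import numpy as np

def fuse_scores(score_matrix, weights=None):
    """score_matrix: (n_systems, n_trials) of per-system trial scores."""
    S = np.asarray(score_matrix)
    w = np.ones(len(S)) / len(S) if weights is None else np.asarray(weights)
    return w @ S
```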

In conclusion, we showed that HMMs can achieve very good results with i-vectors in text-dependent speaker verification.

We verified that in text-dependent speaker verification the performance of the DNN alignment is good, and in some cases it gives similar or better results than the HMM alignment.

We also got excellent results using bottleneck features in text-dependent speaker verification, especially when they are concatenated with the other cepstral features.

In text-dependent speaker verification, score-domain fusion is better than feature-level fusion, and we got the best results from fusing three different features.

Another point is that in text-dependent speaker verification you have to use phrase-dependent transforms and score normalization, because the duration is very short; and when you use the HMM for aligning frames to the states, it is better not to use phrase-independent transforms and score normalization.

Questions?

Okay, maybe a quick question before lunch. Very good work on the i-vectors; did you try this one, the red dots?

Yes, these are also results from our previous paper; you can see the results that we published at Interspeech. You can see a comparison between the GMM-UBM, the GMM i-vector, and the HMM i-vector systems on three different non-target trial types. You can see that, especially for the target-wrong trials, where the phrase content is important for us, the performance of the HMM alignment is much better than the other two methods, and also for the impostor-correct case the performance of the HMM is better.

Time for one more question. Just a quick question about the HMM systems: were the HMMs trained using context-dependent units, or did you use context-independent ones? No, I didn't try context-dependent units.