Speech Transcript - i-Vector Modeling with Deep Belief Networks for Multi-Session Speaker Recognition

My name is Omid Ghahabi, from the TALP Research Center in Barcelona, Spain,

and the topic of my talk is i-vector modeling with deep belief networks for multi-session speaker recognition.

As you know, acoustic modeling using deep belief networks has been shown to be effective in speech recognition, and it is getting popular nowadays.

But very few attempts using only RBMs (restricted Boltzmann machines) or generative DBNs have been carried out in speaker recognition.

In our previous work, which was published at ICASSP 2014, we used both generative and discriminative DBNs.

In that work we used only single-session target i-vectors as the inputs to the networks.

In this paper we extend our previous work from the single-session to the multi-session task.

We have used the NIST i-vector challenge database in these experiments,

and we have also modified our proposed impostor selection method to be more accurate and more robust against its parameters.

First I will give a short background on deep belief networks,

then I will describe our DBN-based system, and then I will go into more detail on our proposed impostor selection method.

Then I will show the experimental results and, at the end, the conclusions.

Deep belief networks are originally probabilistic generative models

in which every two adjacent layers are treated as a restricted Boltzmann machine.

The outputs of one RBM are the inputs to the RBM above it, and the network is trained layer by layer.

However, by adding a top label layer, this generative DBN can be converted to a discriminative one by doing standard backpropagation.

On this slide I have some information about how an RBM is trained

and why it is a good fit for pre-training neural networks, but I think I can skip it.

It is better to focus on our method.

Let's recall what the problem is.

The problem is to model each target speaker with the available i-vectors.

What we have is five i-vectors per target speaker and a large amount of background i-vectors in the development set.

Our proposal is to use deep belief networks for two main reasons.

The first is to take advantage of unsupervised learning using the unlabeled background data of the development set,

and the second is to take advantage of supervised learning to train each target model discriminatively.

This is the whole block diagram of our proposed method. Let's divide it into three main steps.

The first step is balanced training.

What is the problem with imbalanced training? In this case we have a large amount of background i-vectors as negative samples and only a small amount of target data as positive samples.

As we are going to model each target speaker discriminatively, training the network with such unbalanced data will lead to overfitting.

So the solution we have proposed here is to decrease the number of background i-vectors as much as possible in an effective way.

We do this in two main steps.

First, we select only those background i-vectors that are more informative.

Then we cluster the selected impostors with the k-means algorithm, using the cosine distance criterion,

and we use the impostor cluster centroids as negative samples.

Finally, we distribute the positive and negative samples equally among the minibatches.
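
As a rough illustration of this step, the sketch below clusters the selected impostors with a cosine-based (spherical) k-means and then splits positives and negatives evenly over the minibatches. The function names, the iteration count and the exact way samples are spread over the minibatches are assumptions made for the example; the talk does not give these details.

```python
import numpy as np

def cosine_kmeans(x, k, iters=20, seed=0):
    """Spherical k-means: cluster vectors by cosine similarity (assumed variant)."""
    rng = np.random.default_rng(seed)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Start from k distinct, randomly chosen samples.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign every sample to the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) == 0:
                continue                      # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[j] = total / norm   # re-normalise to unit length
    return centroids

def balanced_minibatches(positives, negatives, num_batches):
    """Give every minibatch an equal share of positives and of negatives."""
    batches = []
    for pos, neg in zip(np.array_split(positives, num_batches),
                        np.array_split(negatives, num_batches)):
        data = np.vstack([pos, neg])
        labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        batches.append((data, labels))
    return batches
```

With five target i-vectors and five minibatches, for example, each minibatch would hold one target i-vector and one fifth of the impostor centroids.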

The second step is the adaptation process that we proposed in our previous work.

For adaptation, using all the background i-vectors we train a deep network

in an unsupervised way, without any labels,

and we call the trained model the universal deep belief network (UDBN).

Then each target speaker network is adapted from this universal DBN.

But how does adaptation work?

In adaptation we do not initialize the networks randomly; we initialize them with the UDBN parameters,

and then we do unsupervised learning on the balanced data for only a few iterations.
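
The idea of starting from the universal model and running only a few unsupervised updates can be sketched as below. The talk does not describe the update rule; this example assumes a single RBM with Gaussian (unit-variance) visible units and binary hidden units trained with one-step contrastive divergence (CD-1). All names, the learning rate and the number of epochs are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, w, b_vis, b_hid, lr, rng):
    """One CD-1 update for a Gaussian-Bernoulli RBM (assumed model type)."""
    h0 = sigmoid(v0 @ w + b_hid)                       # hidden probabilities given data
    h_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = h_sample @ w.T + b_vis                        # mean reconstruction of the input
    h1 = sigmoid(v1 @ w + b_hid)                       # hidden probabilities given reconstruction
    n = len(v0)
    w = w + lr * (v0.T @ h0 - v1.T @ h1) / n
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (h0 - h1).mean(axis=0)
    return w, b_vis, b_hid

def adapt(udbn_params, data, epochs=5, lr=0.01, seed=0):
    """Adapt a speaker network: start from the UDBN parameters, not from random values."""
    rng = np.random.default_rng(seed)
    w, b_vis, b_hid = (p.copy() for p in udbn_params)  # copies keep the UDBN itself untouched
    for _ in range(epochs):                            # only a few iterations
        w, b_vis, b_hid = cd1_step(data, w, b_vis, b_hid, lr, rng)
    return w, b_vis, b_hid
```

Because the UDBN parameters are copied, the same universal model can be reused as the starting point for every target speaker.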

In our previous work we have shown that pre-training in this case works better than random initialization,

and that the proposed adaptation works better than pre-training.

The last step is fine-tuning, which is actually backpropagation through the neural network using the label layer.

But we have changed something here in comparison to standard backpropagation: we do error backpropagation on only one layer, the top layer,

for a few iterations before full backpropagation is carried out.

The experimental results in our previous work have shown that this works better, because of the top label layer.

It is something like pre-training the top layer as well, and it works better than doing the whole backpropagation without this step.
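
A minimal sketch of this two-stage fine-tuning is shown below for a network with one hidden layer. It assumes a single sigmoid output unit (target versus impostor) with a cross-entropy loss and plain batch gradient descent; the function name, the iteration counts and the learning rate are illustrative assumptions rather than values from the talk.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fine_tune(x, y, w1, b1, top_only_iters=10, full_iters=50, lr=0.1, seed=0):
    """Stage 1 trains only the top (label) layer; stage 2 is full backpropagation."""
    rng = np.random.default_rng(seed)
    w1, b1 = w1.copy(), b1.copy()                     # hidden layer, e.g. from adaptation
    w2 = 0.01 * rng.standard_normal(w1.shape[1])      # new label layer, random start
    b2 = 0.0
    for it in range(top_only_iters + full_iters):
        h = sigmoid(x @ w1 + b1)                      # hidden activations
        p = sigmoid(h @ w2 + b2)                      # predicted target probability
        d_out = (p - y) / len(y)                      # cross-entropy gradient at the output
        if it >= top_only_iters:
            # Stage 2: propagate the error down and update the hidden layer too.
            d_hid = np.outer(d_out, w2) * h * (1.0 - h)
            w1 -= lr * (x.T @ d_hid)
            b1 -= lr * d_hid.sum(axis=0)
        # The label layer is updated in both stages.
        w2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum()
    return w1, b1, w2, b2
```

Setting `full_iters=0` leaves the hidden layer exactly as it was, which makes the "pre-training the top layer" reading of the first stage easy to check.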

On the other hand, we can divide our block diagram into two main phases: the first phase is target-independent and the second is target-dependent.

In the target-independent phase, using the whole set of background i-vectors we have, we train a universal deep belief network

and we compute the impostor centroids.

This process is carried out only once for all the target speakers we have.

In the second phase, using the UDBN, the impostor centroids

and the available target i-vectors, we train our networks discriminatively.

Let's go into more detail on the proposed impostor selection method.

This method is similar to the support vector based approach proposed by Mitchell McLaren,

but here we have used the cosine distance criterion and we have changed some other things.

It is composed of four main steps.

Suppose that we have the whole set of background i-vectors on one hand,

and on the other hand we have the client i-vectors.

Each client i-vector, which in this case is the average of the five i-vectors per client,

is compared with all the background i-vectors we have, using the cosine distance criterion,

and the top N closest background i-vectors to each client are kept at this stage.

We do the same for all the client i-vectors, until the last client i-vector that we have.

Then we compute the impostor frequencies at this stage and we normalize them by N, the number of top i-vectors kept for each client, and by the total number of client i-vectors.

We believe that with this normalization the impostor frequency is more robust against the threshold that we will define on these frequencies.

Then we set a threshold on these normalized impostor frequencies, and those impostors that have a higher frequency than this threshold are selected as the most informative impostors.

Actually, for each of the background i-vectors we have one impostor frequency, and those i-vectors whose impostor frequency is higher than the defined threshold are selected.

The threshold and the parameter N are determined experimentally, in the experimental section.
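
The selection steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, each client row is assumed to be the already averaged client i-vector, and dividing the counts by N times the number of clients is one reading of the normalization described in the talk.

```python
import numpy as np

def select_impostors(clients, background, top_n, threshold):
    """Return (indices of selected background i-vectors, normalised frequencies)."""
    # Length-normalise so that a dot product equals the cosine similarity.
    c = clients / np.linalg.norm(clients, axis=1, keepdims=True)
    b = background / np.linalg.norm(background, axis=1, keepdims=True)
    sims = c @ b.T                                   # shape (num_clients, num_background)

    # For every client, keep the top_n closest background i-vectors.
    nearest = np.argsort(-sims, axis=1)[:, :top_n]

    # Impostor frequency: how many clients kept each background i-vector.
    freq = np.bincount(nearest.ravel(), minlength=background.shape[0])

    # Assumed normalisation by top_n and the number of clients.
    norm_freq = freq / (top_n * clients.shape[0])

    # Keep every background i-vector whose frequency exceeds the threshold.
    return np.flatnonzero(norm_freq > threshold), norm_freq
```

Because the selection is a threshold on frequencies rather than a fixed count, the number of selected impostors can change with `top_n` and `threshold`.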

If we sort the impostors by their impostor frequencies,

we will see that many impostors have the same impostor frequency.

That is why we have defined a threshold on the impostor frequencies, rather than just selecting a fixed number of top impostors.

In the experimental section, the dataset that we have used is the NIST 2014 i-vector challenge.

The i-vector size, as you know, is six hundred.

The post-processing that we have carried out on the i-vectors is mean normalization plus whitening.
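
For reference, mean normalization followed by whitening can be sketched as below. The talk does not say how the transform was estimated; this example assumes that the mean and a PCA whitening matrix are computed on the development i-vectors and then applied to any other i-vector. The function names are hypothetical.

```python
import numpy as np

def fit_whitener(dev):
    """Estimate the mean and a PCA whitening transform from development i-vectors."""
    mean = dev.mean(axis=0)
    cov = np.cov(dev - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)             # symmetric eigendecomposition
    # Scale each eigenvector by 1/sqrt(eigenvalue); the floor avoids division by ~0.
    transform = eigvec / np.sqrt(np.maximum(eigval, 1e-10))
    return mean, transform

def apply_whitener(x, mean, transform):
    """Subtract the development mean and whiten."""
    return (x - mean) @ transform
```

After this transform the development i-vectors have zero mean and an identity covariance matrix.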

One hidden layer is used in these experiments, and the hidden layer size is four hundred.

For tuning the two parameters of the impostor selection method, that is,

the threshold and the parameter N, we plot the minimum DCF

versus the threshold for different values of N.

We will see that if N is small, the results are not good, and if N is too high,

the performance of the system is not stable when changing the threshold.

According to our experiments, the best choice is N equal to one hundred,

and by setting this threshold and N equal to one hundred we obtain the minimum value of the minimum DCF.
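
A minimum DCF of this kind can be computed from target and non-target trial scores as sketched below. The cost function here, the miss rate plus 100 times the false-alarm rate, is an assumption about the challenge metric and is exposed as the parameter `fa_weight`; the function name is hypothetical.

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, fa_weight=100.0):
    """Minimum over all decision thresholds of P_miss + fa_weight * P_fa."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.concatenate([np.unique(np.concatenate([tar, non])), [np.inf]])
    best = np.inf
    for t in thresholds:
        p_miss = np.mean(tar < t)        # targets rejected at this threshold
        p_fa = np.mean(non >= t)         # non-targets accepted at this threshold
        best = min(best, p_miss + fa_weight * p_fa)
    return float(best)
```

Tuning then amounts to running the system for each pair of threshold and N and keeping the pair with the lowest value of this function.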

In the experimental results, in this challenge we had one baseline system, and everyone knows what the baseline is.

Our proposed DBN-based system uses target-independent impostors, that is, global impostors that are the same for all the target speakers.

If we run these experiments we obtain these results,

and there is a big difference between the baseline system and our system.

And if we add the target-dependent impostors to the target-independent impostors (in this case one hundred, the parameter N) and pool the target-dependent and target-independent ones, then we have better performance, which is this one.

But in this case, if we add the target-dependent impostors, the complexity of the system will be higher than in the first one, because for each target speaker we need to do the clustering separately,

whereas in the first case we compute the impostor centroids just once for all the speakers.

If we apply z-norm score normalization to our DBN-based system, these are the results without normalization.

If we add z-norm using the whole impostor database we have, that is, the development set, we get worse results.

If we select only the top one thousand closest impostor i-vectors, the results are better, but still worse than without z-norm.

But if we use the same impostor selection method for z-norm, setting the threshold and the parameter N again for this normalization,

we will see that we have an improvement here.
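
As a reminder of what the normalization does, a z-norm sketch is given below. It assumes the standard definition, in which a raw score is shifted and scaled by the mean and standard deviation of the scores that the same target model gives to a cohort of impostors; the function name is hypothetical, and the cohort would be whichever impostor set is chosen (the whole development set, the closest one thousand, or the selected impostors).

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Normalise a trial score with the target model's impostor score statistics."""
    cohort = np.asarray(impostor_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()
```

For instance, a raw score of 3.0 with cohort scores 0.0 and 2.0 (mean 1, standard deviation 1) is normalized to 2.0.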

And in comparison to the baseline system we will see that we have

a twenty-three percent improvement.

Actually, this improvement is measured against these results; against the other results the improvement is larger than this.

But in these experiments we have used the client i-vectors for the impostor selection method.

Our new experimental results have shown that if we do not use the client i-vectors,

and just randomly select the same number of i-vectors from only the development set,

we get almost the same results, very similar to these.

So for our proposed system it does not matter whether we use the client i-vectors in our impostor selection method or just randomly choose i-vectors from only the background i-vectors.

And the main conclusions.

In this paper we have proposed an impostor selection method,

and we have shown that it helps our DBN system to have good performance in the multi-session task.

The availability of more i-vectors per target speaker helped the DBN system to capture more speaker and session variabilities in comparison to the single-session task.

And finally, the discriminative DBN-based approach showed a considerable improvement in performance in comparison to the conventional baseline system provided by NIST in this challenge.

Thank you.

We have time for questions.

Thanks for the talk. I like the extension of the background dataset selection that you have done.

One question that comes to mind: when you are doing the selection, you are looking at all the clients that are going to be enrolled in the system.

Sorry, I was not close enough. Again: when you are doing this dataset selection, you are looking at what is statistically important across the clients that are going to be enrolled in the system,

so your system itself already has information about what you are going to test on.

Why wouldn't you just do closed-set speaker ID?

So, rephrasing it:

when you are choosing your impostors, before your DBN training or z-norm, that selection process itself is aware of all your target speakers.

Yes, that's correct.

So why not take it further and just do closed-set speaker ID for the i-vector challenge?

Yes, that is why I told you about the experimental results at the end.

If we do not use the target i-vectors and just randomly select the same number of i-vectors only from the development set,

and we use these in an iterative process: for instance, we take one thousand three hundred i-vectors randomly from the development set and do the same process of computing the impostor frequencies,

and then again choose another random set of i-vectors and do the same, computing the impostor frequencies, and then take the average over all the impostor frequencies and, in the same way, set the threshold and the parameters.

We obtained results very similar to these results, which use the target i-vectors.

OK, so that is a selection where you are not aware of the other clients, in that sense.

Very nice.

Yes, technically, looking at the other clients would be against the rules of the i-vector challenge, but he has a solution that did not do that.

The other thing is that closed-set scoring wouldn't actually work here, because they are all different speakers.


Neural Nets for Speaker and Language Modeling

Omid Ghahabi and Javier Hernando