Okay, hi everyone. I work at Brno University of Technology, and I will be giving this tutorial about end-to-end speaker verification.

The topics to discuss in this tutorial: we'll start with some background and a definition of end-to-end training, then discuss some alternative training procedures which are often used, then talk about the motivation for end-to-end training, and continue with some difficulties of end-to-end training. After that we will briefly review some existing work on end-to-end speaker recognition, although not in great detail, and then we will wrap up with a summary and some open questions.

I would also like to give some acknowledgement and thanks to my colleagues from BUT and Omilia, with whom I have discussed these topics a lot.

So let's start with pattern recognition. This is a kind of typical pattern recognition scenario. In supervised learning we assume we have some features x and some labels y, and we wish to find some function, parameterized by theta let's say, which, given the features, predicts a label that should be close or equal to the true label.

To be more precise, we would like the prediction to be such that some loss function, which compares the predicted label with the true label, is as small as possible on unseen data. The loss function, for example if we do classification, can be something that is zero if the predicted label is the same as the true label and one otherwise; this is a kind of error, or hit-or-miss, loss.

Of course, ideally what we want to do is to minimize the expected loss on unseen test data, which we could calculate like this, and here we use capital X and Y to denote that they are unseen random variables. But since we don't know the probability distribution of X and Y, we cannot do this exactly or explicitly.

So in the supervised learning problem we have access to some training data, that is, many examples of features and labels, which we can collect into a training set. We then compute the average loss on the training data and try to minimize that, and we hope that this procedure means we will also get a low loss on unseen test data. This is called empirical risk minimization, and it is expected to work as long as the classifier we use is not too powerful (to be precise, its VC dimension should be finite), and it also requires that the distribution of the loss does not have heavy tails. But for typical scenarios this procedure is expected to work.
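As a worked equation for what was just described (standard notation, not copied verbatim from the slide; the symbols for the predictor and loss are my own choice):

```latex
% Expected risk on unseen data (X, Y are random variables)
R(\theta) = \mathbb{E}_{X,Y}\big[\,\ell\big(f_\theta(X),\,Y\big)\,\big]

% Empirical risk on a training set {(x_i, y_i)}, i = 1..N
\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i),\,y_i\big),
\qquad
\hat{\theta} = \arg\min_\theta \hat{R}(\theta)
```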

So then, let's talk about speaker recognition. As probably most people in the audience here know, we have these three subtasks of speaker recognition. There is speaker identification, which basically is to classify among a closed set of speakers; this is a very standard classification scenario. Then we have speaker verification, where we deal with an open set, as we say: the speakers that we may see in testing are not the same as the ones we had access to in training when building the model, and our task is typically to say whether two segments, or utterances, are from the same speaker or not. And then there is also speaker diarization, which is basically to assign each time region in a long recording to a speaker.

Here I will focus on speaker verification, because the speaker identification task is quite easy, at least conceptually, and speaker diarization is hard and end-to-end approaches to it are still at a very early stage; although some great work has been done, it is maybe too early to focus on it in a tutorial.

Generally it is preferable if a classifier outputs not a hard prediction, like "it is this class" or "it is not this class", but rather probabilities of the different classes. So we would like a classifier that gives us an estimate of the probability of some label given the data. In the case of speaker verification we would rather prefer it to output the log-likelihood ratio, because from that we can obtain the probability of a class given the data (the classes here are just target versus non-target) based on a specified prior probability. So it gives us a bit more flexibility in how to use the system.
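To make the relation explicit (this is the standard identity, not a quote from the slide): with a log-likelihood ratio s and a target prior P_tar, the posterior probability of a target trial is

```latex
P(\text{tar}\mid x) \;=\; \sigma\!\Big(s + \log\tfrac{P_{\text{tar}}}{1-P_{\text{tar}}}\Big),
\qquad
s \;=\; \log\frac{p(x\mid \text{tar})}{p(x\mid \text{non})},
\qquad
\sigma(a) \;=\; \frac{1}{1+e^{-a}}
```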

So let's now talk about end-to-end training. My impression is that it is not completely well defined in the literature, but it seems to involve these two aspects. First, all parameters of the system should be trained jointly, and that could be anything from feature extraction, to producing some speaker embedding, to the backend comparison of speaker embeddings that produces the score. The second aspect is that an end-to-end system should be trained specifically for the intended task, which in our case would be verification. One could go even stricter and say that it should match the exact evaluation metric we are interested in, for example the error rate.

So in this tutorial I will try to discuss how important these criteria are, what the gains can be from imposing them, and what it means if we don't.

So first, let's look at what a typical end-to-end speaker verification architecture looks like. As far as I know, this was first attempted for speaker verification in 2016, in the paper mentioned here.

We start with some enrollment utterances, here three of them, and we have some test utterance. All of these go through an embedding extractor neural network (there are many different architectures for this), which produces embeddings, that is, fixed-size utterance representations, one for each utterance: in this case three enrollment embeddings and one test embedding. Then we create one enrollment model by some kind of pooling, for example taking the mean of the enrollment embeddings. Then we have some similarity measure, and in the end a score comes out which gives the log-likelihood ratio for the hypothesis that the test segment is from the same speaker as the enrollment segments. And all these parts of the model should be trained jointly.
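As a minimal sketch of this scoring pipeline (assuming an already trained embedding extractor; the cosine-similarity backend used here is just one simple choice for illustration, not necessarily what any particular paper used, and `extract_embedding` is a placeholder):

```python
import numpy as np

def extract_embedding(utterance_features):
    """Placeholder for a trained embedding extractor network."""
    raise NotImplementedError

def score_trial(enroll_feature_list, test_features):
    """Pool enrollment embeddings by averaging, then score against the test embedding."""
    enroll_embs = np.stack([extract_embedding(f) for f in enroll_feature_list])
    enroll_model = enroll_embs.mean(axis=0)          # enrollment pooling
    test_emb = extract_embedding(test_features)
    # Simple cosine-similarity backend; an end-to-end system would train this
    # comparison jointly with the extractor and calibrate it to a log-likelihood ratio.
    return np.dot(enroll_model, test_emb) / (
        np.linalg.norm(enroll_model) * np.linalg.norm(test_emb))
```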

To be a bit fair, and maybe of historical interest, we should say that this is not a new idea. We had it already in 1993, at least that is the earliest I am aware of. One paper at that time was about handwritten signature recognition and another paper was about fingerprint recognition, but they used exactly this idea.

Okay, so we talk about end-to-end training and modeling; what would be the alternatives?

One alternative would be generative modeling, so we train a generative model, which means a model that can generate the data, both the observations x and the labels y, and which can also give us the probability, or probability density, of such observations. We typically train it with maximum likelihood, and if the model is correctly specified, for example if the data really comes from a normal distribution and we have assumed that in our model, then with enough training data we will find the correct parameters. It may be worth pointing out that the LLRs from such a model are the best we can have: if we had access to the log-likelihood ratios from the model that really generated the data, then we could make optimal decisions for classification or verification, and no other classifier would do better. The problem with this is that when the model assumptions are not correct, the parameters we find with maximum likelihood may not be optimal for classification, and sometimes maximum likelihood training is also difficult.

Other approaches would be some type of discriminative training. End-to-end training can be seen as one type of discriminative training, but in other discriminative approaches we can try to train the neural network, i.e. the embedding extractor, for speaker identification, which seems to be the most popular approach right now. We then use the output of some intermediate layer as the embedding and train some other backend on top of that.

Then there is this school of metric learning, where we kind of train the embedding extractor together with a distance metric, which can sometimes be simple. So in principle the embedding extractor and the distance metric, or backend, are trained jointly, but typically not for the speaker verification task. So this is end-to-end training according to the first criterion but not according to the second.

So now, as we said, we will discuss why end-to-end training would be preferable. We had two criteria: one is that we should train the modules jointly, and the other is that we should train for the intended task.

In the case of joint training it is actually quite obvious. Consider a system consisting of two modules, A and B, and we have theta_A, which is the parameters of module A, and theta_B, which is the parameters of module B. If we just first train module A and then module B, it is essentially like doing one iteration of coordinate descent, or block coordinate descent: we train module A and we get here, then we train module B and we get here, but we will not get further than that, which is not the optimum. Of course we could continue and do a few more iterations, and we might end up in the optimum, and this is in principle equivalent to joint optimization. When we have a non-convex model, as we usually do, we may not actually get to the same optimum as if we had trained all the parameters in one go; what happens also depends on which optimizer we use. But in principle this is why joint training is preferable: it really makes sure that you find the optimum over both modules, and that is clearly better than just training one module first and then the other. So I think there is no real argument here; this part of end-to-end training, the joint training of the modules, is justified.

Then the task-specific training, the idea that we should train for the intended task: if in our application we want to do speaker verification, why should we train for verification and not for identification, for example?

Well, first we should say that the guarantee we have that minimizing the loss on training data will give good performance on test data, the empirical risk minimization idea, only holds if we are training for the metric, or the task, that we are interested in. If we train for one task and evaluate on another, we don't really have any guarantee that we will find the optimal model parameters for that task.

But one can of course ask: shouldn't it work reasonably well anyway to train for identification and use the model for verification, since they are kind of similar tasks? And it does, as we know. But let's just discuss a little bit what could go wrong, or why it wouldn't be optimal.

So here is a kind of toy example. We are looking at one-dimensional embeddings, so imagine that these are the distributions of one-dimensional embeddings. The embedding space is here, and each of these colors represents the distribution of embeddings for some speaker: blue is one speaker, red is another speaker, and so on. Of course it is a little bit arbitrary which shape of distributions I show; I chose these for simplicity. In this first example we assume that the means of the speakers are equidistant, at a distance like this.

So what would be the identification error in this case? Whenever we observe an embedding, we assign it to the closest speaker. So if we observe an embedding in this region, we will assign it to the blue speaker; if we observe it here, we will assign it to the green speaker. Of course that means that sometimes something sampled from the blue speaker will land here, but we will assign it to the green speaker's area, so we will have some error in this situation. If we consider only the neighboring speakers, the error rate will be 12.2% in this example.

What would be the verification error rate? For this type of data we assume that we have speakers whose means are equidistant, like these stars. For a target trial we sample two embeddings from one speaker and check whether they are closer to each other than some threshold, set here to the optimal threshold for this distribution; if they are farther apart than the threshold we make an error on that target trial, and similarly for non-target trials that happen to fall closer than the threshold. For the distributions in this image we would have an error rate of fourteen percent. Again, I am only considering non-target trials from neighboring speakers, which is why the error rate is this high.

Now I change the within-speaker distributions a little bit, while, as before, the speaker means stay at the same distance, like this. We have made the within-speaker distribution a little bit more narrow here and a little bit more broad here; the overall within-speaker variance is the same, we just obtain a slightly different shape. And we see that the identification error has increased to 13.7%, whereas the verification error is better.

In a more extreme situation we have made the within-speaker distributions equally peaky, or broad, for all speakers, as these two-component mixtures. The speaker means are still all at the same distance, like this, and the within-speaker variance is also the same as before. Here we would actually get zero identification error, but the verification error would be worse than in any of the other examples, because if you sample a target trial you will very often have embeddings that are far from each other, and similarly, for non-target trials you will very often have embeddings that are close to each other.

So this example should illustrate that the within-speaker distribution that is optimal for identification is not necessarily the distribution that is optimal for verification.

Okay, so as another example, let us consider the triplet loss, which is another popular loss function. It looks like this: for each training example you have an embedding from some speaker, which we call the anchor embedding, then you have an embedding from the same speaker which we call the positive example, and an embedding from another speaker which we call the negative example. Basically we want the distance between the anchor and the positive example to be small, and the distance between the anchor and the negative example to be big. If the anchor-to-negative distance is bigger than the anchor-to-positive distance plus some margin, then the loss is going to be zero.
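Written out, this is the standard form of the triplet loss (the notation for the distance d, the margin m, and the anchor/positive/negative embeddings follows the usual convention rather than the slide):

```latex
\mathcal{L}_{\text{triplet}}
  = \max\!\Big(0,\;
      d(\mathbf{e}_a, \mathbf{e}_p) \;-\; d(\mathbf{e}_a, \mathbf{e}_n) \;+\; m
    \Big)
```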

However, this is not an ideal criterion for speaker verification, and to show this I have a rather complicated figure here. It illustrates three speakers, and the embeddings of these three speakers in a two-dimensional space: we have speaker A with embeddings distributed in this area, speaker B with embeddings in this area, and speaker C with embeddings in this area.

If we use some anchor from speaker A, the worst case would be to have it here on the border; then the biggest distance to a positive example would be to have it here on the other side, and the smallest distance to a negative example would be to take something here. We simply want this distance to the positive example, plus some margin, to be smaller than the distance from the anchor to the negative example, and that is okay in this situation.

Now consider speaker C, which has a bigger, wider distribution of data. If we have an anchor here, we need the distance to the closest other speaker to be bigger than the internal distance plus some margin, and that is the case in this figure, so the triplet loss is completely fine with this situation.

But if we want to do verification on data that is distributed in this way, then, if we want to have good performance on target trials from speaker C, we need to accept trials as target trials whenever the distance is smaller than this; otherwise we will have errors on target trials of speaker C. But this means that with a threshold like this we will have quite a bit of confusion between speakers A and B.

So again, of course there could be ways to compensate for this in one way or another, but this is just to show that optimizing this kind of metric is not going to lead to optimal performance for verification.

So if we try to summarize a little bit the idea of task-specific training: minimizing identification error does not necessarily minimize verification error. But of course I was showing this on toy examples, and the reality is much more complicated. We usually don't optimize classification error directly but rather the cross-entropy or something like that, we may use some loss to encourage margin between the speaker embeddings, and maybe the assumptions I made about the distributions here are not completely realistic. And it is maybe not completely clear what would happen with new test speakers that were not in the training set, and so on. So one thing I want to say is that this should not be interpreted as some kind of proof that other objectives would fail; maybe they would even be really good. But based on the training data alone it is not really completely justified to use them, and this is of course something that ideally should be studied much more in the future.

So we have discussed that end-to-end training has some good motivation, but still it is not really the most popular strategy for building speaker recognition systems today; at least my impression is that multiclass training is still the most popular.

So why is that? Well, there are many difficulties with end-to-end training. It seems to be more prone to overfitting; we have issues with statistical dependence of the training trials, which we will go into in more detail in a few of the slides; and it is also maybe questionable how the system should be trained when we want to allow many enrollment utterances, which will also be mentioned a bit later.

So one of the issues with using a kind of verification objective, let's call it that, where we are comparing two utterances and want to say whether they are from the same speaker or not, is that the data violates the statistical independence assumption, as I will explain in a minute.

Generally, this idea of training by minimizing some training loss assumes that the training data are independent samples from whatever distribution they come from, and this is often the case, I mean, when we have data that has been independently collected. But in speaker verification the data, the observation, is a pair consisting of an enrollment utterance and a test utterance, and the label indicates whether it is a target trial or a non-target trial. For notation I will use y equal to one for target trials and y equal to minus one for non-target trials.

The issue here is that, typically, at least if we have a limited amount of training data, we create many trials from the same speaker and from the same utterance; each speaker and each utterance is used in many different trials, and then these trials, which are the training data, are not statistically independent, which is something the training procedure assumes they are.

So this can be a problem. Exactly how big the problem is, I think, is still something that needs to be investigated more, but let's elaborate a bit on what happens.

So here I have written the training objective that we would use for a kind of verification loss when we train the system end-to-end for verification. It looks complicated but it is not really anything special: it is just the average training loss of the target trials here and the average training loss of the non-target trials here, and they are weighted with a factor, the probability of target trials and the probability of non-target trials, which is a parameter we use to steer the system to fit better to the application we are interested in. And again, what we hope is that this would minimize the expected loss of target trials and non-target trials, weighted with these probabilities of target and non-target trials, on some unseen data. The loss function here is often the cross-entropy, but it could be other things.
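In symbols, the objective just described would be something like the following (the notation is mine, chosen to match the verbal description: T and N are the sets of target and non-target training trials, s_i(theta) the score of trial i, and the loss is, e.g., binary cross-entropy):

```latex
\hat{R}(\theta)
  = P_{\text{tar}} \,\frac{1}{|\mathcal{T}|}\sum_{i\in\mathcal{T}} \ell\big(s_i(\theta),\,+1\big)
  \;+\;
  \big(1-P_{\text{tar}}\big) \,\frac{1}{|\mathcal{N}|}\sum_{i\in\mathcal{N}} \ell\big(s_i(\theta),\,-1\big)
```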

So what are the desirable properties of a training objective? Here we have R-hat, which is the training loss, and since the training data can be assumed to be generated from some probability distribution, this R-hat is also a random variable. We want it to be close to the expected loss, where the expectation is calculated according to the true probability distribution of the data, and we want this for every value of theta.

Because, in that case: if the expected loss is this black line here, and let's say we have some training set, the blue one, and we check the average loss as a function of theta, it may look like this; for another training set it may look like this, the red line, and a third one would be the purple one. The point is that it is a little bit random and it is not going to be exactly like the expected loss. But ideally it should be close to it, because if we find a theta that minimizes the training loss, for example here in the case of the red training set, then we know that it will also be a good value for the expected loss, which means a low loss on unseen test data.

So we want the training loss, as a function of the model parameters, to be close to the expected loss for all values of the parameters.

So in order to study the effect of statistical dependencies in the training data in this context, we write the training objective slightly more generally than before. It is the same as before, except that for each trial we have a weight, beta. If we set the weights to one over N, then it is the same as before, but now we consider that we can choose some other values for these weights of the training trials.

We want the training objective, the weighted average training loss, to have an expected value which is the same as the expected value of the loss on test data, so it should be an unbiased estimator of the test loss, or the expected loss. And we also want it to be good in the sense that it has a small variance.

The expected value of the training loss is calculated like this, so we end up with the expected value of the loss, and this is exactly what we usually denote R. So in order for this to be unbiased, we simply want the sum of the weights to be one, and of course this is the case when we use the standard choice of beta, which is one over N, the number of trials in the training data.

The variance of this empirical loss is going to look like this: it is the weight vector for all the trials, times a matrix, times the weight vector, and this matrix is the covariance matrix of the losses of all trials with this label t, that is, t equal to one for the target trials or minus one for the non-target trials. And one can derive that the optimal choice of betas, the one that minimizes this variance, looks like this.

So this is what we can call the BLUE training objective, a best linear unbiased estimate; that is the meaning of BLUE. It is the best linear unbiased estimate of the test loss, using the training data to estimate what the test loss would be.
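Sketching the quantities that were just described verbally (my notation; the exact closed-form expression for the optimal weights from the slide is not reproduced here):

```latex
\hat{R}(\theta) = \sum_{i} \beta_i\,\ell_i(\theta),
\qquad
\mathbb{E}\big[\hat{R}(\theta)\big] = R(\theta)
\;\Longleftrightarrow\;
\sum_i \beta_i = 1,
\qquad
\operatorname{Var}\big[\hat{R}(\theta)\big]
  = \boldsymbol{\beta}^{\!\top} \boldsymbol{\Sigma}\,\boldsymbol{\beta}
```

where Sigma is the covariance matrix of the per-trial losses; minimizing this variance subject to the weights summing to one gives the best linear unbiased estimate (BLUE) of the test loss.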

Some details about this: we don't really need the covariances between the losses, but rather the correlations, because we assume the diagonal elements of this covariance matrix are equal; then it turns out like this. And in practice we would assume that the elements of this covariance matrix do not depend on theta, which could be questioned.

The objective that we have discussed is not really specific to speaker verification, in the sense that whenever you have dependencies in the training data you could use this idea. But the structure of the covariance matrix, which describes the covariances of the losses of the training data, depends on the specific problem that you are studying. So now we will look into how to create such a matrix for speaker verification.

So here we will use x with a subscript i to denote utterance i of speaker A, and so on. We will assume that the correlation coefficients depend on what the trials have in common. For example, here we have a trial of speaker A utterance one against speaker A utterance two, and some loss for that, and another trial of speaker A utterance one against speaker A utterance three, and some loss for that, and they have some correlation because they involve the same speaker and even share an utterance; so we assume there is a correlation coefficient, denoted c with a subscript, here.

In total we have these kinds of situations in verification if we consider target trials. Let's look here: two target trials which have one utterance in common; this is a target trial of speaker A with utterances one and two, and here we have utterances one and three, so utterance one is used in both trials, and there is some correlation between these trials. Here there is no common utterance, but the speaker is still the same. And this is as opposed to the situation where you have a target trial of speaker A and a target trial of another speaker; they have nothing in common, so we assume the correlation is zero for such trials.

For the non-target trials you have a more complicated situation, but all possible situations are listed here. For example, you may have that the trials have one utterance in common, so we have this utterance in common, and in addition to that a speaker in common; that is what I mean with this notation here. And so on.

If we have such correlation coefficients, one can derive the optimal weights: the weight for a target trial of a speaker with this many utterances is going to look like this. The exact form is maybe not so important, but we should note that one can derive how to weight the trials of a speaker, and it depends on how many utterances the speaker has. For the non-target trials the formulas are more complex; the weight of a trial involving two speakers depends on how many utterances each of these speakers has.

Then comes the question of how to estimate the correlation coefficients. One could look at sample correlations from some trained model, or we could learn them somehow, which we will mention briefly later, or we can just make some assumption and tune it. For example, one simple assumption is to set this first correlation coefficient of target trials to alpha, and this one, which we assume should be smaller, to alpha squared, and then tune alpha in this range, and similarly for the non-target trials.

Just to get some idea of how this would change the weights for the target trials: here on the x-axis is the number of utterances of the speaker, and on the y-axis the corresponding weight, for different values of the correlation. If the correlation is small, then even when we have many utterances, up to twenty here, we will still give a reasonable weight to each utterance. But if the correlation is large, then we will not give so much weight to each utterance when a speaker has many utterances, which means that the total weight for this speaker is not going to increase much even if it has a lot of utterances.

In the past I was exploring a little bit how big these kinds of correlations really are. This was on an i-vector system with PLDA scoring. Here, in the first column, it is a PLDA model trained with the EM algorithm, with the scores of this generatively trained system followed by calibration, and the other column is for discriminatively trained PLDA. The main thing to observe here is that we do have correlations between trials that have, for example, an utterance in common and so on, and the correlations can be quite large in some situations. So the problem seems to exist.

And doing this kind of correlation compensation when training, this is again on the discriminative PLDA, it does help a bit, so it is something to possibly take into account. Of course this was for discriminative PLDA, where we train a PLDA model using all the trials that can be constructed from the training set, but the same problem with the dependencies exists in an end-to-end system.

So now, some problems that we could encounter if we try to do this. The results, or the compensation formulas, that we derived assume that all trials that can be created from the training set are used equally often, which is the case if you train a backend like PLDA discriminatively and use all the trials. But when we train a kind of end-to-end system involving neural networks, we use mini-batches. One could mimic the earlier situation by making a list of trials and just sampling trials from it: okay, here is a trial, this speaker compared to this one, and another trial, this speaker compared to this one, and so on; this is a long list of all trials that can be formed, and we just select some of them into the mini-batch.

The point is, of course, that if we have these speakers in the mini-batch, and we compare this utterance with this one, and this one with this one, and so on, we are not using all the trials that we could: we are, for example, not comparing this one with this one within the mini-batch. That is maybe a bit of a waste, because we are anyway using this deep neural network to produce the embeddings, so we could just as well use all of them in the scoring part too. But then we will have a slightly different balance of the trials, globally, compared to what we had before, so the formulas that we derived would not be exactly valid in this situation.

So the question then is: if we do decide that all the segments in the mini-batch for which we extract embeddings should also be used in the scoring, how are we going to select the data for the mini-batch? There can be different strategies here. We could consider, for example, strategy A: we select some speakers, and then for each speaker we take all the segments that they have; let's say this red speaker has three segments and this yellow speaker has four segments, and then we use all the trials that can be formed. So we have segment one of the red speaker scored against segment two, segment one scored against segment three, and so on. We don't use the diagonal, because we don't consider trials where a segment is scored against itself, and this score here is just the same as this one, scoring segment two against segment one.

So that would be one way. Another way, strategy B, would be to select speakers but then take only two utterances for each speaker in the mini-batch, so you have just one target trial per speaker. The difference is that we are going to have fewer target trials overall in the mini-batch, but they will be from different speakers; we will have target trials from more speakers, typically.
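A minimal sketch of strategy B (two utterances per sampled speaker, one target trial per speaker, and all cross-speaker pairs used as non-target trials); the data structures and function names here are mine, not taken from the speaker's toolkit:

```python
import random
from itertools import combinations

def sample_batch_strategy_b(spk2utts, n_speakers):
    """Pick n_speakers speakers with at least 2 utterances, two utterances each."""
    eligible = [s for s, utts in spk2utts.items() if len(utts) >= 2]
    speakers = random.sample(eligible, n_speakers)
    return {s: random.sample(spk2utts[s], 2) for s in speakers}

def trials_from_batch(batch):
    """Form one target trial per speaker and all cross-speaker non-target trials."""
    target = [(u1, u2) for (u1, u2) in batch.values()]
    nontarget = []
    for s1, s2 in combinations(batch.keys(), 2):
        for u1 in batch[s1]:
            for u2 in batch[s2]:
                nontarget.append((u1, u2))
    return target, nontarget
```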

It is not exactly clear what would be the right thing to do, but from some somewhat informal experiments we have done, it seems that strategy B is better. Then again, the formulas that we derived before for how to weight the trials are not completely valid here: they were not derived under the assumption that we are sampling like this, so they do not hold exactly and would need to be modified a bit; I will come to that in a minute.

The second problem that can occur in end-to-end training, irrespective of these issues, is that we do want a system that can deal with multi-session enrollment. Of course, multi-session trials can be handled within an end-to-end system, as we discussed on the initial slide, by having some pooling over the enrollment utterances, but how to create the training data is again a little bit complicated. Already in the case of single-session trials we had a complicated situation, with many different kinds of dependencies that can occur, and in the multi-session case it is going to be even more complicated. You can have situations like these: in one trial, for example, these two are the enrollment and this is the test, and in another trial these two are the enrollment and this is the test, and then you have one utterance in common; or a more extreme situation where both enrollment utterances in the two trials are the same but the test utterance is different. So the number of possible dependencies that can occur is much larger, and I think it is very difficult to derive some kind of formula for how the trials should be weighted.

So, to deal both with the fact that we are using mini-batches and with multi-session trials, and to estimate proper trial weights for that, maybe one strategy could be to learn them. This is not something I have tried; I just think it is something that maybe should be tried.

Well, so we can define a training loss, again, as an average of losses over the training data with some weights, and we also define a development loss, which is an average of losses over a development set. The training-trial weights here should depend only on the number of utterances of the speaker, or speakers, involved in that trial.

Then one can imagine a scheme like this: we send both training and development data through the neural network and we get some training loss and some development loss. As usual, we estimate the gradient of the training loss with respect to the model parameters; this gradient is now a function of the trial weights. We can update the model parameters, keeping in mind that the updated values are a function of the training-trial weights. And then, on the development set, we can calculate the gradient with respect to these training-trial weights, and use it to update the training-trial weights.
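This is not something the speaker implemented; the following is only a rough sketch of how one step of such a scheme could look in PyTorch. The helpers `model.trial_losses` and `model.trial_losses_with_params` are hypothetical hooks (returning a vector of per-trial losses, optionally with explicitly supplied parameters), and `trial_weights` is assumed to be a leaf tensor with `requires_grad=True` parameterizing the per-trial weights.

```python
import torch

def meta_step(model, trial_weights, train_batch, dev_batch, lr_model=1e-3, lr_w=1e-2):
    """Update the model with a weighted training loss, then update the trial
    weights to reduce the development loss of the *updated* model."""
    w = torch.softmax(trial_weights, dim=0)           # positive weights summing to one
    per_trial_loss = model.trial_losses(train_batch)  # assumed hook: per-trial losses
    train_loss = torch.dot(w, per_trial_loss)

    # Gradient w.r.t. model parameters, keeping the graph so the parameter
    # update remains a differentiable function of trial_weights.
    params = list(model.parameters())
    grads = torch.autograd.grad(train_loss, params, create_graph=True)
    updated = [p - lr_model * g for p, g in zip(params, grads)]

    # Development loss evaluated with the updated parameters (assumed hook).
    dev_loss = model.trial_losses_with_params(dev_batch, updated).mean()

    # Gradient of the dev loss w.r.t. the trial weights, simple SGD update.
    grad_w, = torch.autograd.grad(dev_loss, trial_weights)
    with torch.no_grad():
        trial_weights -= lr_w * grad_w
        # Commit the model update, detached from the weight graph.
        for p, new_p in zip(params, updated):
            p.copy_(new_p.detach())
    return train_loss.item(), dev_loss.item()
```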

A second thing to explore, or like a final note on this statistical dependence issue: we just discussed some ideas for re-weighting the training trials for better optimization, but, for example, in the case where all speakers have the same number of utterances, this re-balancing has no effect. Still, of course, the dependencies are there, so one might think: shouldn't we do something more than just re-balancing the training data? And one possibility that I think would be worth trying is the following. We assume that the covariance between, let's say, the scores of two trials of speaker A which have one utterance in common should be bigger than the covariance between two trials of this speaker which have no utterance in common, which in turn should be bigger than the covariance between two target trials of different speakers, which should actually be zero. So one could consider regularizing the model to behave in that way.

So now, after discussing the issues with end-to-end training, I will briefly mention some of the papers on end-to-end training. This should not be considered as a kind of literature review, or as describing the best architectures or anything like that; it is more just a few selected papers that illustrate some points, some of which give good take-away messages about end-to-end training.

This paper, called "End-to-end text-dependent speaker verification", was, as far as I know, the first paper on end-to-end training in speaker verification. It has an architecture like this: the features go in, pass through a neural network, and in the end the network says whether it is the same speaker or not. The important thing here is that the input is of fixed size, so the input to the neural network is the feature dimension times the number of frames, that is, the duration, and there was no temporal pooling, which is done in many other situations. This is suitable when you do text-dependent speaker verification, as they did in this paper, because it means the network is aware of the word and phoneme order.

I would say the main conclusion from this paper is that the verification loss was better than the identification loss, especially with big amounts of training data; for small amounts of training data there was not as big a difference. One can also say that t-norm could, to a large extent, make the models trained with these two losses more similar. But I would still say that this paper suggests the verification loss is beneficial if you have large amounts of training data.

This is another paper; here they were doing text-independent speaker verification, and the difference from the previous one is that they do have a temporal pooling layer. That removes the dependence on the order of the input, to some extent at least, and is maybe a more suitable architecture for text-independent speaker verification. This was compared to an i-vector PLDA baseline, down here, and it was found that a really large amount of training data is needed even to beat something like an i-vector PLDA system.

some study that we did and

it was

use also again text independent speaker recognition or verification

but trained on smaller amount of data and to make it work we instead constrained

these neural network here this big and time system to behave

something like a

another i-vector and p lda baseline so we cannot constrain did not to be two

different from the

i-vector purely a baseline

and

we found there that training model blocks jointly with their verification also was improving

so as can be seen here

you

little bit regrettably we data as a separate

clearly whether that improvement came from the fact that we were doing joint training

or the fact that we were

using the verification loss

another interesting thing here is that

we found that

training we verification most requires very large batches

and this was an experiment done only on the

scoring art and of course lda discriminatively lda

so if we train is gonna be p lda with

a and b if yes using full batches

that

so not i mean you match

training scheme

you achieve some

loss

like this on the development set

and this dash

blue line

whereas if we trained with adam with mini batch just for different slices front end

up to five thousand

we see that we need really be batches to actually

get close to be of q s

trained model which was trained on full marshes

so that kind of little bit suggests that you really need to have many trials

within the mini batch for you know what of four

training these kind of

system with a verification lots which is a bit of a problem and maybe a

challenge to deal with

in future

This is a more recent paper, and its interesting point was that they trained the whole system all the way from the waveform instead of from features, as the others did. It was a bit difficult for me to understand completely whether the improvement came from the fact that they were training from the waveform, or whether it was because of the choice of architecture and so on, but it is interesting that systems going all the way from the waveform to the end can work well.

And this is a paper from this year's Interspeech. It is interesting because it is one of the more recent studies that really showed good performance from using a verification loss. Here it was a joint, or let's say multitask, training: they were training using both an identification loss and a verification loss. That is actually something I have tried without getting any benefit from it, but one thing they did here was to start with a large weight for the identification loss and gradually increase the weight for the verification loss, and this is interesting and maybe actually the right way to go; I am curious about it.

So now comes a little summary of this talk. We discussed the motivation for end-to-end training, and we said that it has some good motivation. We referred to some experimental results, ours as well as others', which show that it seems to work quite well for text-dependent tasks with large amounts of training data; in such a case it is probably preferable to preserve the temporal structure, that is, to avoid temporal pooling. In text-independent benchmarks one would need strong regularization or a mixed training objective in order to benefit from end-to-end training, and typically we would want to do some temporal pooling there. One could guess that end-to-end training would be the preferable choice in scenarios where we have many training speakers with few utterances each, where we have less of the statistical dependence problem.

Something that to me seems to be an open question, and which it would be great if someone explored, is why it is difficult to train end-to-end systems, especially for the text-independent task. Is it because of overfitting, or training convergence, or the dependency issue we discussed? It is not really clear, I would say. A practical question is how to adapt such systems, because with the more block-wise systems we would often adapt the backend; or could we train the system in a way that we don't need adaptation? And also, how could we input some human knowledge about speech into this training if we need it, something we know about the data distribution, or the number of phonemes, or whatever. We also discussed that maybe training a model for speaker identification is not ideal for speaker verification, but is there some way to find embeddings that are good for all these tasks? Another interesting question is how well the LLRs that come from end-to-end architectures can actually approximate the true LLR; in other words, what kind of distributions can be arbitrarily accurately modeled by these architectures? This is not completely clear either.

Okay, so thank you for your attention. Bye.

Hello again; now I will present the hands-on session for end-to-end speaker verification. If this informal recording does not work out so well, I apologize; I am not really used to recording these, but let's see.

I want to talk about two things: first, I will go through the code that I am using in most of my experiments, and after that I will explain a few tricks that I have used to solve various implementation issues.

Okay, so first, the code for the end-to-end system. This is code that I started to work on during my postdoc at BUT, from 2016. It was at that time written in Theano, which by now has been discontinued for a while, and I guess it is time to switch to TensorFlow or PyTorch or something else. The link to the repository is here.

Most of the stuff in this repository is for multiclass training, well, mostly it uses the multiclass training loss, maybe in combination with other things, but there is not much that uses end-to-end training with the verification loss. The papers that we have published are actually based on old code that I think there is not much point in maintaining any more. But I do have one script here that uses the verification loss in combination with the identification loss, so that is the script we will look at.

Generally, let me put it like this: I am trying to point out things in this code that I think have served me well and worked well, and also mention what I would now do differently, to maybe give some guidance. At least I can say that it has served me well over the past years as a small toolkit for speaker verification.

I should say that I did not see any big benefit here from adding the verification loss to the identification loss, and, going back to the paper I mentioned in the tutorial, it could be that their quite complicated scheme for changing the balance between the losses throughout training is really needed; this may be something I will look at at some point.

And this script, I think, runs; if you want to try it, there are instructions in the repository for how to run it. But I have not been running it recently, so you may need to fiddle a little bit with it, because I have been cleaning it up so that it is easier to read here, and some small adjustments might be needed if you actually want to run it.

I have tried, when organizing my experiments, to do it in such a way that there is one script where everything that is specific to the experiment is set, and that includes which data to use, the configuration of the model, and so on. I never found it appealing to have input arguments to this script saying which data to use and so on, because anyway you more or less always have to change something in the script for a new experiment, so you can just as well keep everything in one place.

Other things that are a bit more general and shared between experiments are just loaded from modules, such as model or architecture definitions and so on. Usually I use one underscore suffix to denote symbolic variables and another for placeholders, and so on.

The kinds of models are similar to Keras models, though maybe a little bit less fancy in some features. I did not use Keras here initially because, when I started with this years ago, Keras was not flexible enough, or at least I found it difficult to do the things I wanted with it, but nowadays it is definitely flexible enough.

So, for example, here in this file there are things that maybe someone would think should go into a separate config file, but since it seems anyway necessary to change things in this part for every experiment, I prefer to keep them here. So here is where the training data is, how long the shortest and the longest segments we train on are, and some other parameters related to training: the batch size, the maximum number of epochs, and the number of batches in an epoch. So I don't really define an epoch as one pass over the data, but rather by setting the number of batches that we run in an epoch. Also patience, which probably most of you are familiar with but is worth mentioning: for how long we keep training without improvement in the validation score.

The next part of the script is for defining how to load and prepare data, and here is one important point: the batches we use will contain chunks of features from different utterances, that is, randomly selected segments. If you load these from a normal hard disk, randomly selecting different segments from different utterances will be way too slow, I would say. What you can do is put the data in memory, or cache it ahead of time, or prepare many batches in advance; that is one way. On my setup I instead store the data on SSDs, and then it can be loaded as you wish: feature chunks can be loaded randomly, fast enough for this.

This is good because it allows a lot more flexibility in experiments. For example, sometimes you may want to load two segments from the same utterance, or one whole utterance, for some experiments, or sometimes you just want to change the duration of the segments; if you use pre-made archives, then you have to prepare new archives for this. So I would say that storing the data on SSDs and just loading the features as you go is a very good thing, and SSDs are really worth investing in if you want to do this kind of experiments.

I define some functions, for example one for loading training features: given some list of files, this one will load the data. So if you want to load specific parts of utterances and so on, you define that here, and if you want to do, for example, the thing I mentioned, loading two segments from the same utterance, then you would have to change the function here. This was quite a useful way of organizing it, at least in my experiments.

Another important thing in this part is that it creates dictionaries holding, for example, the mapping from utterances to speakers and so on; they are created here, and these mappings are then used to create the mini-batches. A little bit further down here I create a generator for mini-batches, and it takes this dictionary of mappings, the utterance-to-speaker mapping and so on. I have different generators depending on what kind of mini-batches I want: for example, randomly selected speakers and all the data of those speakers, or randomly selected speakers and, say, two utterances each, or something like that. So that is achieved by changing the generator.

Then the next step is to set up the model, and here I am using a TDNN, essentially an x-vector-like embedding extractor, and then also a PLDA model on top of that, to score the embeddings, or x-vectors or whatever we should call them, that come from it, to do the verification.

One minor difference from the Kaldi architecture that I should mention is that I found it necessary to have some kind of normalization layer after the temporal pooling, either batch normalization or at least a feature standardization estimated on the data at the beginning; that works fine as well. I guess this is needed here because we use a simpler optimizer, just stochastic gradient descent, as compared to Kaldi, which uses a more elaborate optimizer.

So in this config there is the definition of the architecture: the number of layers, their sizes, the activation functions, and so on; whether we should have normalization of the features and normalization after the pooling, and whether these normalizations are updated during training or just fixed based on the data in an initial pass; and options for regularization and so on.

We initialize the model here, and when we do this we provide the iterator, or the generator, for the training data, and this is used to initialize the model's normalization layers. This is something that creates a bit of a mess, and I would probably do it differently if I were writing it now: maybe do some ordinary initialization and then run a few iterations before starting the training, just to initialize the normalization layers.

Then we apply the model to the data, which is in these placeholders here, and what comes out are the embeddings and the classifications. The embeddings, in this particular case, we will send to the PLDA model: basically, here we make some settings for it, the probabilistic LDA model, and from it we can get the scores for all pairwise comparisons within the batch, and also a loss for that if we provide labels.

The next part is to define the loss and train functions and so on. We have the loss as a weighted combination of losses: the classification loss and the binary, let's call it the verification, loss, and they are averaged with some fixed weights set in the script. Maybe one important thing here is that these losses are normalized, by minus the log of the probability of a random guess; in the case of the classification loss that means the log of the number of speakers, and we do the same kind of normalization for the verification loss. The reason to do this is that if the model is just initialized, or just making random predictions, then each loss will be approximately one, which means the losses are on a similar scale and it becomes easier to choose how to interpolate between them.
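As a small sketch of this normalization idea (my own minimal version, not the actual Theano code; `n_speakers`, `p_target`, `alpha`, and the two loss values are placeholders, and the random-guess baseline for the binary loss is one possible definition):

```python
import math

def combined_loss(class_loss, verif_loss, n_speakers, p_target, alpha=0.5):
    """Normalize each loss by its value under random guessing, then interpolate.

    class_loss: mean multiclass cross-entropy over the batch
    verif_loss: mean binary cross-entropy over the trials in the batch
    """
    # Cross-entropy of a uniform guess over n_speakers classes.
    class_norm = math.log(n_speakers)
    # Binary cross-entropy of always predicting the target prior (one way to
    # define the random-guess baseline for the verification loss).
    verif_norm = -(p_target * math.log(p_target)
                   + (1.0 - p_target) * math.log(1.0 - p_target))
    return alpha * class_loss / class_norm + (1.0 - alpha) * verif_loss / verif_norm
```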

At the end of this part of the script we define a training function, which takes the data for a batch and updates the model.

The next part is for defining functions for setting and getting the parameters of the model, and for defining a function to check some kind of validation loss after each epoch. So this part is just for setting and getting parameters, maybe not so important, and here you can find the function for checking the validation loss.

Finally, the training combines these: this function here takes the function for checking the validation loss, many other parameters, and things that we defined above, for example the function for training, and so on.

The way we train here is basically: we iterate for an epoch, which was defined as a certain number of batches. This is because we don't really have a fixed training set; we just keep generating random segments, so there is no really clear notion of what one pass over the data would be. But anyway, we do training, and if an epoch does not improve the validation loss, we try a few more times, up to the patience number of times, and if it still does not improve, we reset the parameters to the best ones so far and reduce the learning rate and continue. I don't know if this is the best scheme that can be conceived, but it has worked well enough for me.

So that is it for the whole script.

Moving on, I would like to mention a few tricks. They are not very complicated things, but they were maybe slightly difficult for me to figure out, and they are related to backpropagation and the things I wanted to modify there.

so let's just first briefly review the back propagation algorithm

basically

you know that the neural network is just

some serious of affine transformation followed by nonlinearity then again affine transformation and again only

the install

so that's a result in some you will be applied affine transformation

i guess is set here and then we apply some nonlinearity and

i mean yes the a that's going to and we do that over and over

and that's called a final

i'll put four and then we have some cost function

i'm on that for example cross entropy

and we if we you know function composition bit is the reading here's basically means

the compositional g and h is just like

an h on the data and energy and still then we know that we can

write the whole neural network s

applying the first affine transformation of the input

next door first the nonlinearity

all the way

but the output

it can be written like these

and it is also easy to write the gradient of the loss with respect to the input using the chain rule. it is basically: the derivative of C with respect to the input is just the chain of the derivative of C with respect to a, times the derivative of a with respect to z, and so on. i have these square brackets here just to remind you that these are Jacobians, so the multivariate chain rule looks the same as the scalar one, except that we need to use Jacobians instead of ordinary derivatives. so, first, the derivative of C with respect to a: this is the gradient, really a row vector, with elements like these here. the derivative of a with respect to z is just going to be a diagonal matrix, because f is applied element-wise. and the other one, the derivative of z with respect to a, if we look at this term here we will see that it is just the weight matrix.
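written out, the chain rule just described looks roughly like this, the bracketed factors being the Jacobians:

$$
\frac{\partial C}{\partial x}
= \frac{\partial C}{\partial a^{(L)}}
  \left[\frac{\partial a^{(L)}}{\partial z^{(L)}}\right]
  \left[\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\right]
  \cdots
  \left[\frac{\partial a^{(1)}}{\partial z^{(1)}}\right]
  \left[\frac{\partial z^{(1)}}{\partial x}\right],
$$
$$
\left[\frac{\partial a^{(l)}}{\partial z^{(l)}}\right] = \operatorname{diag}\!\big(f'(z^{(l)})\big),
\qquad
\left[\frac{\partial z^{(l)}}{\partial a^{(l-1)}}\right] = W^{(l)} .
$$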

so then back propagation is: we start by calculating the derivative of C with respect to a at the output, which is just this term, and then we can continue to the derivative of C with respect to some earlier z by taking what we already have and multiplying it, for example by these two factors, and then we get the next one, and so on. so it is a recursive process like that. but of course what we want is not the derivative of the loss with respect to the input; what we want is the derivative with respect to the model parameters, which are the biases and the weights, which we have here and here. those are given by these expressions here: for the biases it is just this delta term, and for the weights we also need to multiply by the corresponding part of the activation; so if we are interested in the derivative with respect to this element of the weight matrix, we need to multiply by the corresponding element of the activation.
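in the usual notation, the recursion and the parameter gradients just described are (treating the gradients as row vectors, so conventions may differ slightly from the slides):

$$
\delta^{(L)} = \frac{\partial C}{\partial a^{(L)}}\, \operatorname{diag}\!\big(f'(z^{(L)})\big),
\qquad
\delta^{(l)} = \delta^{(l+1)}\, W^{(l+1)}\, \operatorname{diag}\!\big(f'(z^{(l)})\big),
$$
$$
\frac{\partial C}{\partial b^{(l)}_i} = \delta^{(l)}_i,
\qquad
\frac{\partial C}{\partial W^{(l)}_{ij}} = \delta^{(l)}_i\, a^{(l-1)}_j .
$$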

okay, so that was just a quick refresher, and here there are also two really good references on this if you want to go further into it. now that we have refreshed this, let me go through a few different issues that i ran into that required a little bit of thinking in relation to it.

the first thing is that, as you see here, in order to calculate the derivatives with respect to the weights you need the outputs of each layer, the a's here. that means we need to keep all of those in memory from the forward pass, since they are needed again in the backward pass, and if you have big batches, many utterances or long utterances, this can become too much; it can go up to many gigabytes, several gigabytes for example, or more for larger batches.

so both Theano and TensorFlow have some built-in way of getting around this, and that is that you loop over the data: in the case of TensorFlow, and also in the case of Theano, you have the option to discard the intermediate outputs from the forward pass, and then in the backward pass they will be recalculated when you need them, so you basically only keep the outputs for one chunk in memory at a time. TensorFlow actually handles this a little bit better, because there you have the option to offload the stored outputs to CPU memory, which is generally bigger than GPU memory.

so in that case, to use this, we can loop over the inputs up until the pooling layer, and after the pooling layer we put all the outputs together so that we have a kind of tensor with all the embeddings stored, and then that can be processed normally; then you just calculate the loss and back propagate, or at least that is the idea, you have to think a bit carefully about how to set it up. this of course also has the advantage that we can have segments of different durations in the same batch, which would otherwise be a bit complicated or maybe not even possible.

i'm not showing the code for this, because in our scripts it is mixed with so many other things that it is very difficult to see what is going on. i do have it in the hands-on scripts, but i was hoping to write some small clean example and i didn't manage to do it in time.
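just to give the flavour, a rough sketch of the general idea, in PyTorch rather than the Theano/TensorFlow scripts used here, could look something like this; the layer sizes and the simple average pooling are made up for the example:

```python
# Rough illustration of the memory-saving idea (PyTorch, not the actual scripts):
# the frame-level part of the network is recomputed during the backward pass via
# gradient checkpointing, so its intermediate outputs need not be kept in memory.
import torch
from torch.utils.checkpoint import checkpoint

frame_net = torch.nn.Sequential(            # frame-level layers before pooling
    torch.nn.Linear(40, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
)
emb_net = torch.nn.Linear(512, 200)         # embedding layer after pooling

def embed(segment):                         # segment: (num_frames, 40)
    frames = checkpoint(frame_net, segment, use_reentrant=False)
    return emb_net(frames.mean(dim=0))      # simple average pooling

# segments of different lengths can be handled one at a time before pooling
segments = [torch.randn(300, 40), torch.randn(450, 40)]
embeddings = torch.stack([embed(s) for s in segments])

loss = embeddings.pow(2).mean()             # placeholder for the real verification loss
loss.backward()                             # frame_net activations are recomputed here
```

the stack of embeddings after pooling is the only tensor that is kept around; the frame-level activations exist only for one segment at a time while they are recomputed in the backward pass.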

okay, so that is one trick. the second trick is related to parallelization.

so suppose that we have an architecture like this: first feature extraction, then the frame-level network and the pooling, and then some processing of the embeddings and finally the scoring. now, normally, if we want to do parallel training for some multiclass objective, it is not really a problem, because we just distribute the data to different workers, each of them calculates some gradients, and we can average the gradients, or we can average the updated models. but in this case, since the scoring, when we use the verification loss, should compare all trials, all possible trials, we need to run the time-delay network and extract the embeddings on the individual workers, then send all the embeddings to the master, which does the scoring. then we do back propagation down to the embeddings, and we send those back to each worker so that they can continue the back propagation.

the thing is, this is not exactly how the normal toolkits work: in the normal case you calculate the loss and then back propagation runs all the way down to the inputs in one go. here, what each worker gets is just the derivative of the loss with respect to its embeddings, and the question is how to use that to continue the back propagation on the individual nodes.

one simple trick to do this is to define, on each worker, an auxiliary loss like this here: i define a new loss which is the derivative of the cost with respect to the embedding elements, which is what we got from the master node, times the embedding itself, just a dot product like this. and if we now optimize this loss, we will get what we want. to see that, consider the derivative of this loss with respect to some model parameter of the neural network: if we apply the chain rule here, this term here is just a constant, so it stays as it is, and this term here is exactly the derivative we are looking for; so the derivative of this auxiliary loss with respect to the model parameters will be exactly the same as for the loss we are actually interested in. it is possible that some newer toolkits can actually do this without such a trick, i'm not sure, but this was a simple way to achieve it.
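here is a small sketch of that dot-product trick, in PyTorch and with random numbers standing in for the real data and for the gradient received from the master:

```python
# Dot-product trick: given only dC/d(embedding) from the master node, a worker
# can continue back propagation through its own part of the network.
import torch

net = torch.nn.Sequential(torch.nn.Linear(40, 512), torch.nn.ReLU(),
                          torch.nn.Linear(512, 200))

segment = torch.randn(300, 40)
embedding = net(segment).mean(dim=0)        # this worker's embedding

grad_wrt_embedding = torch.randn(200)       # dC/d(embedding), sent back by the master

# auxiliary loss = <dC/d(embedding), embedding>; since the first factor is a
# constant, its gradient w.r.t. any network parameter equals dC/d(parameter)
auxiliary = (grad_wrt_embedding.detach() * embedding).sum()
auxiliary.backward()                        # fills param.grad on this worker
```

and indeed, newer toolkits seem to support this directly: in PyTorch, for example, calling embedding.backward(gradient=grad_wrt_embedding) should give the same parameter gradients without the auxiliary loss.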

the final trick is related to something i would call repairing saturated ReLU units.

so, the ReLU is this max operation as the activation function. let us remember that we have an affine transformation followed by the activation function, and if the activation is the ReLU, then whenever the input z is below zero the ReLU outputs zero. so if all the inputs are below zero, this ReLU is basically never outputting anything, and the unit is useless. and there is also the opposite problem: if the input is always above zero, then the ReLU is just a linear unit, so it doesn't really add any modelling power. we want the inputs to the ReLU to be sometimes positive and sometimes negative; then the ReLU really acts as a nonlinearity and the network is doing something interesting.
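the check itself is simple; over a batch of pre-activations z it could look like this (shapes made up for the example):

```python
# Check which ReLU units are stuck, over one batch of pre-activations z.
import torch

z = torch.randn(64, 512)                  # pre-activations: (batch, units)
dead   = (z <= 0).all(dim=0)              # never fires: output always zero
linear = (z >  0).all(dim=0)              # always fires: effectively an affine unit
useful = ~(dead | linear)                 # changes sign: acts as a real nonlinearity
print(f"dead: {dead.sum()}, linear: {linear.sum()}, nonlinear: {useful.sum()}")
```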

so the way we handle this is that we periodically check whether a ReLU unit has a problem like this, and in that case we add some small noise where the delta would be, that is, to the derivative of C with respect to z. the problem with doing this in a standard neural network toolkit is that we don't really have an easy way to manipulate this delta that is used inside the back propagation.

so instead we have to manipulate the derivatives with respect to the model parameters directly, remembering how those derivatives are related to the delta. for the bias it is easy: the derivative with respect to the bias is just the delta itself, so whatever we wanted to add to the delta we can add directly to the bias gradient, which we can usually get from the toolkit. and similarly for the weights, it is just that we also need to multiply this term by the activation a, because that is how the weight gradient is calculated.
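a minimal sketch of that fix, assuming PyTorch-style .grad attributes and a small random nudge as the fake delta (the magnitude, sign and use of batch-averaged activations are design choices for the example, not something specified here):

```python
# Repairing stuck ReLU units by editing the parameter gradients directly,
# using dC/db = delta and dC/dW = delta * a^T.
import torch

layer = torch.nn.Linear(512, 512)
a_prev = torch.randn(64, 512)             # activations entering this layer
z = layer(a_prev)                         # pre-activations feeding the ReLU

stuck = (z <= 0).all(dim=0) | (z > 0).all(dim=0)   # same check as above
fake_delta = 1e-4 * torch.randn(512) * stuck       # small nudge, stuck units only

with torch.no_grad():
    for p in (layer.bias, layer.weight):
        if p.grad is None:                # make sure .grad exists before adding to it
            p.grad = torch.zeros_like(p)
    # dC/db = delta, so the fake delta goes straight into the bias gradient
    layer.bias.grad += fake_delta
    # dC/dW = delta * a^T, here using the batch-averaged input activations
    layer.weight.grad += fake_delta.unsqueeze(1) * a_prev.mean(dim=0).unsqueeze(0)
```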

so those were some small tricks, and as a summary i can say that it is quite helpful, when you work with neural networks, to understand the back propagation properly, so that you know what is going on and you can easily do small fixes like these.

so that is all from the hands-on session, thank you for your attention, and bye.