Speech Transcript - Short-Duration Speaker Modelling with Phone Adaptive Training

Good morning everybody, my name is Giovanni Soldi and I'm going to present our work, short-duration speaker modelling with phone adaptive training.

I will first start with a brief introduction, in which I will explain the main motivation of our work.

I will then continue by explaining our approach, phone adaptive training, and then present the experimental setup and results.

I will finally conclude with some conclusions and some future work directions.

Phonetic variation is a significant source of degradation in many automatic speech processing applications, such as short-duration speaker verification and speaker diarization.

In both cases we find ourselves dealing with short-duration segments when we want to learn a speaker model.

So let's suppose we have a UBM model, and we want to learn a speaker model through MAP adaptation.

When we use a long utterance, or we have plenty of data, the estimated speaker model will be near to the ideal speaker model, since the phonetic variation will be marginalised.

While when we use a short utterance, for example three seconds, in the MAP adaptation process, then the estimated speaker model is far from the ideal speaker model, since the phonetic variation is not marginalised.
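A minimal sketch of the effect described above, assuming the common relevance-MAP mean update reduced to a single Gaussian component. The function name `map_adapt_mean` and the relevance factor of 16 are illustrative assumptions, not details given in the talk.

```python
def map_adapt_mean(ubm_mean, frames, relevance=16.0):
    """Relevance-MAP update of one Gaussian mean (single-component simplification)."""
    n = len(frames)
    if n == 0:
        return list(ubm_mean)  # no data: the speaker model stays at the UBM
    data_mean = [sum(col) / n for col in zip(*frames)]
    alpha = n / (n + relevance)  # weight of the data grows with the frame count
    return [alpha * d + (1.0 - alpha) * u for d, u in zip(data_mean, ubm_mean)]


ubm = [0.0, 0.0]
few_frames = [[4.0, 2.0]] * 16      # stands in for a short utterance
many_frames = [[4.0, 2.0]] * 1584   # stands in for a long utterance
print(map_adapt_mean(ubm, few_frames))   # halfway between UBM and data mean
print(map_adapt_mean(ubm, many_frames))  # close to the data mean
```

With few frames the estimate stays close to the UBM and depends heavily on which phones happened to be observed. With many frames the data mean dominates and the phonetic content averages out.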

So the objective of our work is to improve the speaker modelling by decreasing the variation due to the phonetic content while increasing the speaker discrimination.

To do this, we started from a method called speaker adaptive training, SAT.

SAT is a technique commonly used in automatic speech recognition. It is used to reduce the speaker variation, to estimate models that are more phonetically discriminant, and to get better estimation.

So the idea is this: suppose we have the original acoustic feature space, in which we can discriminate between speakers and also between phones.

In an ideal scenario, what SAT does is to project the acoustic features into a space in which all the phonetic information is retained while the speaker information is discarded.

If we interchange the roles of speakers and phones, we can reach the opposite result.

That means suppressing the phonetic variation while emphasizing the speaker variation.

So we have again our original acoustic feature space, in which we can discriminate between phones and speakers. What PAT does in this ideal scenario is to project the features into another acoustic feature space, in which all the speaker information is retained while we cannot discriminate anymore between phones.

PAT was first applied to speaker diarization. We had a relative increase in speaker discrimination of twenty seven percent and a relative decrease in phone discrimination of [unclear] percent.

The speaker and phone discrimination was calculated through the Fisher score, as we will see later.

However, the improvement in the diarization error rate was disappointing.

So what we do in this work is to try to optimize and evaluate PAT in isolation from the convolutive complexity of speaker diarization.

This is done by addressing the problem of speaker modelling in the case when the training data is scarce, and by performing small-scale speaker verification experiments using a database that is manually labelled at the phonetic level.

So I will proceed to our approach. For phone adaptive training we make extensive use of constrained maximum likelihood linear regression, CMLLR.

CMLLR is a technique to reduce the mismatch between an adaptation dataset and an initial model by estimating an affine transformation.

So let's assume we have a Gaussian mixture model with initial mean mu and initial covariance sigma. The CMLLR transform estimates an n times n matrix A, where n is the dimension of the feature space, and an n-dimensional bias vector b.

With these we can adapt the mean and the covariance of our initial model through this equation: we just shift the mean, and we calculate the variance in this way.

The really important thing about CMLLR, and the reason we use it in PAT, is that we have the possibility of transforming an acoustic feature: an adaptation feature can be mapped back to our initial model by applying the inverse transformation.
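The equations on the slide are not in the transcript, so the following is only a sketch of a CMLLR-style affine transform. It assumes the convention that the adapted mean is A·mu + b and the adapted covariance is A·sigma·Aᵀ, so a feature is mapped back with A⁻¹(x − b). The sign convention, the 2-dimensional example and all function names are assumptions.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inv2(A):
    """Inverse of a 2x2 matrix (kept to 2x2 so the sketch stays short)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("transform matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def adapt_model(mu, sigma, A, b):
    """Model-space view: shift the mean, transform the covariance."""
    mu_hat = [m + bi for m, bi in zip(matvec(A, mu), b)]
    sigma_hat = matmul(matmul(A, sigma), transpose(A))
    return mu_hat, sigma_hat

def normalize_feature(x, A, b):
    """Feature-space view: map an adapted feature back with the inverse transform."""
    return matvec(inv2(A), [xi - bi for xi, bi in zip(x, b)])


A = [[2.0, 0.0], [0.0, 0.5]]
b = [1.0, -1.0]
mu_hat, sigma_hat = adapt_model([1.0, 2.0], [[1.0, 0.0], [0.0, 4.0]], A, b)
print(mu_hat, sigma_hat)
print(normalize_feature(mu_hat, A, b))  # recovers the initial mean
```

Because the same A acts on both the mean and the covariance, adapting the model and inverse-transforming the features are two views of the same operation, which is what makes feature normalisation possible.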

At the base of PAT there is CMLLR, and PAT, as we said, aims to suppress the phonetic variability in order to provide more speaker-discriminative features.

So suppose we have a set of S speakers and P phones, and a set of utterances that are parameterized by a set of acoustic features.

We suppose also to have a set of initial speaker models, again Gaussian mixture models.

What PAT does is to estimate a set of Gaussian mixture models that are normalized across phones. For each phone p, we aim to estimate a CMLLR transform that captures the phone variation across the speakers.

This is done by solving a maximum likelihood problem, and it is done iteratively.

So let's suppose we have a fixed number of iterations. In the initial iteration we start by giving as input the initial features and the initial speaker models.

For each phone we estimate the CMLLR transform that is common to all the speaker models, and we estimate this transform by using the data for that particular phone from all the speakers.

We then use the transform to normalize the feature vectors, using the transformation that we saw before, that is, by applying the inverse transformation.

The normalized features are then used to estimate a set of speaker models that are normalized across phones.

If we are at the last iteration, we get our normalized features and our final speaker models, which are normalized across the phones.

Otherwise we give as input to the next iteration the obtained features and the obtained speaker models.
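The loop just described can be illustrated with a deliberately reduced sketch. It is not the CMLLR estimation used in the work: each speaker model is a single mean vector and each phone transform is a bias only (A fixed to the identity), which has a closed-form estimate as the average residual of that phone's frames over all speakers. The function names and the toy data are assumptions.

```python
from collections import defaultdict

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def speaker_means(labels, feats):
    """One mean vector per speaker (stand-in for a GMM speaker model)."""
    by_spk = defaultdict(list)
    for (spk, _), x in zip(labels, feats):
        by_spk[spk].append(x)
    return {spk: mean(xs) for spk, xs in by_spk.items()}

def pat_bias_only(labels, feats, n_iter=3):
    """labels: list of (speaker, phone) pairs, one per feature vector."""
    feats = [list(x) for x in feats]
    models = speaker_means(labels, feats)  # initial speaker models
    for _ in range(n_iter):
        # 1) per-phone transform, estimated from that phone's data for all speakers
        residuals = defaultdict(list)
        for (spk, ph), x in zip(labels, feats):
            residuals[ph].append([xi - mi for xi, mi in zip(x, models[spk])])
        bias = {ph: mean(r) for ph, r in residuals.items()}
        # 2) normalise the features with the inverse transform (subtract the bias)
        feats = [[xi - bi for xi, bi in zip(x, bias[ph])]
                 for (_, ph), x in zip(labels, feats)]
        # 3) re-estimate the speaker models on the normalised features
        models = speaker_means(labels, feats)
    return feats, models


labels = [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")]
feats = [[1.0], [-1.0], [11.0], [9.0]]
print(pat_bias_only(labels, feats))
```

In this toy data the phone offsets (+1 and −1) are removed while the speaker offsets (0 and 10) are kept, which is the behaviour the talk attributes to PAT in the ideal case.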

In our case, since we deal with short-duration utterances, we don't have that much data to estimate a transform for each phone.

So what we do is to estimate a transform for an acoustic class, that is, a set of phones.

The acoustic classes are determined by using a binary regression tree, in which the main node is initialized with all the phones.

According to linguistic rules, we split this main node by choosing the split that maximizes the likelihood of the training data.

This is done as long as the increase in the likelihood is higher than a fixed threshold.

So when we reach the last iteration we calculate a transform for each acoustic class, and the phones in an acoustic class share the same transform.

In our experimental setup, what we need in order to evaluate and optimize PAT in an ideal scenario is a database in which we have short-duration sentences, a clear and accurate phonetic transcription, and a limited level of noise and channel variation.

This is because we want to estimate the performance of PAT in an ideal and optimized scenario.

Taking these considerations into account, we concluded that the NIST databases do not fit our needs, due to the lack of target speaker phonetic transcriptions, to the channel variation, and to the different types of noise that compromise the recordings.

So we based our choice on the TIMIT database, because it is a collection of high-quality read speech sentences.

Each sentence lasts three seconds on average and is manually transcribed at the phonetic level. The database has limited noise and there is no channel variation.

The database is composed of six hundred thirty speakers, of which four hundred thirty eight are male and one hundred ninety two are female. Each speaker contributes ten sentences with an average duration of three seconds.

We divided the database so that data from four hundred sixty two speakers were used to learn the UBM, while the recordings from the remaining one hundred sixty eight speakers were used for the ASV experiments, the automatic speaker verification experiments.

PAT performance is analysed by using from one to seven sentences to learn the speaker model.

So the first step was to segment our utterances into speech and non-speech segments according to the ground-truth transcriptions.

We then extracted the features, which were canonical mel-frequency cepstral coefficients: twelve plus energy, plus delta and acceleration coefficients.

We then estimated speaker models by MAP adaptation from the UBM models, which were estimated with from four to one thousand twenty four GMM components.

Then, using the initial features and the initial speaker models, we applied PAT, starting from acoustic classes that were obtained from the initial set of thirty eight phones, and we finally got our normalized features and our normalized speaker models.

PAT performance was assessed on two different ASV systems: a traditional GMM-UBM system and a state-of-the-art i-vector PLDA system.

We performed our baseline experiments with the initial set of features that we defined before.

The experiments with PAT were performed by using the phone-normalized features.

So I will go on with the experimental results.

As I said before, to assess the speaker and the phone discrimination we decided to use the Fisher score.

Suppose we have a set of S classes and a set of N labelled features. By labelled I mean that each feature is annotated with the class it belongs to.

The speaker and phone discrimination is calculated through the Fisher score, in which we have at the numerator the inter-class distance, where mu is the mean of each class, and at the denominator the intra-class distance.

The intra-class distance basically represents the spread of the features around their own mean in the class.

If the inter-class distance increases, it means that the numerator is higher, while if the features are more spread out around their mean, it means that the denominator is higher.
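The exact formula from the slide is not in the transcript. The sketch below assumes a common trace-based form of the Fisher score: the sum over classes of the class size times the squared distance between the class mean and the global mean, divided by the sum of squared distances of each feature to its own class mean. The function name and toy data are assumptions.

```python
from collections import defaultdict

def fisher_score(features, labels):
    """Ratio of inter-class to intra-class scatter (trace-based form, assumed)."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    n = len(features)
    global_mean = [sum(col) / n for col in zip(*features)]
    between = 0.0
    within = 0.0
    for xs in groups.values():
        mu = [sum(col) / len(xs) for col in zip(*xs)]
        between += len(xs) * sum((m - g) ** 2 for m, g in zip(mu, global_mean))
        within += sum(sum((xi - m) ** 2 for xi, m in zip(x, mu)) for x in xs)
    if within == 0.0:
        raise ValueError("intra-class scatter is zero; score is undefined")
    return between / within


feats = [[0.0], [2.0], [10.0], [12.0]]
print(fisher_score(feats, ["spk1", "spk1", "spk2", "spk2"]))  # speaker labels
print(fisher_score(feats, ["a", "b", "a", "b"]))              # phone labels
```

Scoring the same features once with speaker labels and once with phone labels gives the two quantities tracked in the talk, speaker discrimination and phone discrimination.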

In our experiments we calculated the speaker discrimination and the phone discrimination over ten iterations of PAT.

We show that the speaker discrimination has a relative increase of forty percent after the ten iterations, while the phone discrimination has a relative decrease of fifty percent.

This is good, because it goes along with the results that we got in our previous work.

I will pass now to the automatic speaker verification experiments.

As it is possible to observe, we performed our speaker verification experiments using model sizes from four to one thousand twenty four GMM components, both for the GMM-UBM and for the i-vector PLDA system.

The first thing to note is that the i-vector PLDA system performs much better than the GMM-UBM system; the scale is different between the two plots.

Also, we can see that with PAT we always have better performance than the baseline system.

Another thing to notice is that for lower model complexity we can reach better performance than the baseline, or similar performance to the baseline.

These are the results for models trained with one sentence and with seven sentences.

We get comparable performance with the baseline, but with lower model complexity. For example, with forty two GMM-UBM components we get the same performance as the baseline using two hundred fifty six components.

The same holds in the i-vector system, where by using PAT we got better performance with forty two components compared to the baseline system with two hundred fifty six components.

In these two tables I'm going to present the results independently from the model size used.

So these are the results obtained by using the optimal model size for the speaker model.

We can see that for the i-vector PLDA system we have a fifty percent increase in performance, of course in an ideal and optimized environment.

For the GMM-UBM system we got better performance when using one and three training sentences, and comparable results when using five and seven sentences.

In these plots we can see the same results as before. These are the results when using one single sentence.

We can see that we have a fifty percent decrease in the EER of the i-vector PLDA system. In the GMM-UBM system the lines are more or less far apart, but we have a decrease in EER from forty two to three six percent.

To conclude, in this work we addressed the problem of speaker modelling in the case when training data is scarce, that is, when using short-duration utterances.

We optimized and evaluated PAT at the speaker modelling level by performing small-scale speaker verification experiments, using the TIMIT database, which is labelled at the phonetic level.

We showed that PAT is successful in reducing phone bias, and that it improves significantly the performance of both systems, GMM-UBM and i-vector PLDA.

What is worth noting is also that PAT is able to provide equivalent or better performance by using a lower model complexity.

For the future work we aim to go back to our original goal, that is, PAT for speaker diarization.

We want to explore automatic approaches to infer acoustic class transcriptions, because PAT actually doesn't need a phonetic transcription.

As long as we are able to label the audio in a way that we can map the features to a particular acoustic class, we can calculate the transform for that particular class and finally improve the performance of the system.

And one final direction is to look at PAT and speaker-independent approaches to phone normalization.

effect of rotation

Was your i-vector extractor trained with short sentences as well? Because the PLDA was obviously trained, I think, with the short sentences.

Okay, so you didn't merge, for example, sentences from the same speaker to create a big sentence and train the i-vector extractor that way?

For example, for one speaker we used all the sentences, we put them together and we

Okay, so the correct thing to understand was that you used short sentences. So that is exactly what

wise that's used you don't have a couple of minutes

i think balls

r is a

and the team it

a much too simple databases for

because of all the

you a beetle

well close to zero percent

So what are the challenges in text-dependent? I mean, this should be applied everywhere, if it works so well.

What are the challenges?

Sorry, which?

In real life, I mean, this is not being used in many systems, right?

i mean you should be employed if we have no ever

So actually the aim of this work was to optimize and evaluate PAT in an ideal scenario, because, as I said, we tried to apply PAT to speaker diarization, but the problem was that [unclear], because the result was disappointing: we got really little improvement in the diarization error rate.

So we tried to find out the upper limit of the performance that can be achieved with PAT.

But, I mean, you have been using TIMIT. There are many versions of TIMIT, with noise conditions or telephone bandwidth conditions. It has been transformed in many ways, so why don't you use those?

Because TIMIT comes with the phonetic transcription, and also because we wanted to optimize PAT in the ideal condition.

so i would think would be interesting to see that this is one

of the risk of a primary all is very quickly

The major impediment to progress in this field is lack of data.

i towards been very generous and making available we also data but that's pretty much

three

the only

dataset a recent phone search the so that we have so to work on

i mean one was experience with realtors novels that the problem really as part of

our

We're not going to be able to make progress, as far as I can see, unless we find some way of sharing data among researchers.

You are working with an industrial partner that probably collects some data, all right?

There is a mutual benefit to sharing data.

Then we could probably make some progress this way; otherwise it's not clear how we're going to do it.

Thank you for those points, Patrick.

I just want to mention that in Odyssey two thousand one, when we became Odyssey, there was considerable effort put forward to creating standard text-dependent corpora to distribute to the participants.

to distribute to the participants both per se converse and new ones

put together these nice text-dependent datasets we distributed them to the odyssey members in advance

and plan to have a whole track with a text dependent speaker verification

and the sad news with only a couple of sites participating

So I think Craig Greenberg was implying maybe a similar issue with the HASR evaluations.

So with a lot of these, you know, it has to be a two-way street: to go to the effort and expense to put together corpora, and then to have a reasonable number of participants who want to take on the challenge.

So if there's been a shift in interest to text-dependent verification, I think it would be good as a community to get together and figure that out, and put together some evaluation.

Short-Duration Speaker Modelling with Phone Adaptive Training

Text-dependent Speaker Recognition

Giovanni Soldi, Simon Bozonnet, Federico Alegre, Christophe Beaugeant and Nicholas Evans