
As a student of computer science in college, I am glad to show you our study of the effects of intrinsic variation on speaker verification using i-vectors. First I will introduce the main challenge in speaker verification.

Then I will talk about the problem of this research and our proposal. After that I will present the i-vector framework and the discrimination models, including the definition of the intrinsic variation forms of speech. Then I will give a description of the speaker verification systems and the experimental results, and I will end with the conclusions.

The variability in speaker verification comes from two parts. The first one is the extrinsic variability, and the other one is the intrinsic variability. The extrinsic variability is associated with factors that come from outside the speakers, such as mismatched channels or environmental noise. The intrinsic variability is associated with factors that come from the speakers themselves, such as speaking style, emotion, speech rate, and health state.

There has been a lot of research focusing on the extrinsic variability, and many methods for it have been proposed. In this paper we focus on the intrinsic variability.

The problem we focus on is the effect of the intrinsic variability on the performance of speaker verification. So there are two questions. The first one is: how does a speaker verification system perform when enrollment and testing are under mismatched conditions between the variation forms? And the second one is: how do the compensation models perform in addressing the effects of the intrinsic variation in speaker verification?

Our proposal is to model the intrinsic variability with the i-vector framework and compensation techniques. First we have to define the variation forms, because the intrinsic variability comes from factors associated with the speakers themselves, and these factors are diverse in practice.

So first we define the base form, which is neutral spontaneous speech at a normal rate; it serves as the reference for most cases. Then we define the variation forms from six aspects, including speech rate, physical state, speaking manner, emotional state, speaking style, and speaking language. For example, in the speech rate aspect we have fast speech and slow speech.

In the physical state aspect, for example, muffled speech means that the speakers hold a candy in the mouth and talk in that way, so the recognizer has to use speech data of degraded quality. The speaking manner includes shout and whisper.

In the emotional state aspect we have happy emotion and angry emotion, and for the speaking style we have the reading style. For the speaking language, the speakers, whose native language is Chinese, speak in a different language. So from these six aspects we obtain the variation forms, and we recorded the data accordingly for the experiments.

Then we use the i-vector framework to model the variation, since i-vector modeling has been very successful in applications for channel compensation. The i-vector framework is composed of two parts. First, we project the supervector into the i-vector in the total variability space, which is a low-dimensional space. The second part is the scoring.
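As a minimal sketch of this projection step (illustrative only, not the system from the talk: the matrix sizes, the variable names, and the use of NumPy are all assumptions), the standard point estimate of the i-vector from an utterance's Baum-Welch statistics looks like this:

```python
import numpy as np

def extract_ivector(T, sigma, N, F):
    """Point estimate of the i-vector w for one utterance, using the
    standard relation M = m + T w over the UBM supervector.

    T     : (C*D, R) total variability matrix
    sigma : (C*D,)   diagonal of the UBM covariance supervector
    N     : (C,)     zeroth-order Baum-Welch statistics per component
    F     : (C*D,)   centered first-order Baum-Welch statistics
    """
    CD, R = T.shape
    D = CD // len(N)
    n = np.repeat(N, D)                          # expand N to supervector size
    precT = T / sigma[:, None]                   # Sigma^{-1} T
    L = np.eye(R) + T.T @ (n[:, None] * precT)   # I + T' N Sigma^{-1} T
    return np.linalg.solve(L, T.T @ (F / sigma)) # L^{-1} T' Sigma^{-1} F
```

With many frames (large N), the estimate converges to the underlying low-dimensional factor, which is what makes the total variability space usable as a compact utterance representation.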

We can use the cosine similarity score to measure the similarity between a test utterance and the enrollment utterance, with both modeling and scoring done in the i-vector framework.
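The cosine scoring step is simple enough to sketch directly; this is a generic illustration in NumPy, not the talk's code:

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine similarity between enrollment and test i-vectors."""
    u = np.asarray(w_enroll, dtype=float)
    v = np.asarray(w_test, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

The trial is accepted when the score exceeds a tuned decision threshold.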

Because there have already been studies on the i-vector framework for modeling channel compensation, we want to see whether it is also suitable for the intrinsic variability, which is what we are interested in.

Second, to alleviate the effects of the intrinsic variation, we use a set of techniques which have been used to deal with channel variability. Here we use LDA and WCCN. The idea behind LDA is to minimize the within-speaker variability while maximizing the between-speaker variability. The LDA projection matrix is composed of the eigenvectors of the corresponding eigenvalue equation, sorted by decreasing eigenvalue.
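A hedged sketch of how such an LDA projection matrix can be computed (illustrative NumPy; the small ridge term is an assumption I add so the within-speaker scatter stays invertible):

```python
import numpy as np

def lda_projection(ivectors, labels, dim):
    """LDA projection matrix: eigenvectors of Sw^{-1} Sb for the top
    `dim` eigenvalues, in decreasing order."""
    X = np.asarray(ivectors, dtype=float)
    y = np.asarray(labels)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))      # within-speaker scatter
    Sb = np.zeros((d, d))      # between-speaker scatter
    for spk in np.unique(y):
        Xs = X[y == spk]
        mu_s = Xs.mean(axis=0)
        Sw += (Xs - mu_s).T @ (Xs - mu_s)
        Sb += len(Xs) * np.outer(mu_s - mu, mu_s - mu)
    Sw += 1e-6 * np.eye(d)     # small ridge so Sw is invertible
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs[:, order[:dim]].real
```

Projecting i-vectors with this matrix compresses speaker-irrelevant directions while keeping the directions that separate speakers.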

The other technique is within-class covariance normalization, WCCN. The idea is to attenuate the directions of high intra-speaker variability. The WCCN projection matrix is obtained by the Cholesky decomposition of the inverse of the within-class covariance matrix.
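A corresponding sketch for WCCN (again illustrative NumPy, not the authors' code; the ridge term is an assumption for numerical stability):

```python
import numpy as np

def wccn_projection(ivectors, labels):
    """WCCN matrix B with B B^T = W^{-1}, where W is the averaged
    within-class covariance; B comes from a Cholesky factorization.
    Apply it to an i-vector as w' = B^T w."""
    X = np.asarray(ivectors, dtype=float)
    y = np.asarray(labels)
    d = X.shape[1]
    speakers = np.unique(y)
    W = np.zeros((d, d))
    for spk in speakers:
        Xs = X[y == spk]
        Xc = Xs - Xs.mean(axis=0)
        W += Xc.T @ Xc / len(Xs)
    W /= len(speakers)
    W += 1e-6 * np.eye(d)      # ridge for numerical stability
    return np.linalg.cholesky(np.linalg.inv(W))
```

After the projection, the within-speaker covariance of the training data becomes (approximately) the identity, so no intra-speaker direction dominates the cosine score.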

Now I will present the experiments on how these techniques perform under the intrinsic variations. First I will introduce the intrinsic variation corpus that we recorded and how we divide it for training and testing, and then I will give a description of the systems.

For the speaker recognition systems, we use a GMM-UBM baseline system and i-vector based speaker recognition systems with different intrinsic variability compensation techniques. After that I will show the experimental results.

The intrinsic variation corpus that we use contains about one hundred student speakers, all native Chinese. Each student speaks for about three minutes for each variation form. Each recording is then divided into about ten parts, and each part lasts about eighteen seconds; these parts are used for training and testing.

We divided the data in the intrinsic variation corpus as follows. For training the UBMs, we used thirty female and thirty male speakers to train gender-dependent and gender-independent UBMs; this data lasts for about eighteen hours and covers all the variation forms.

Then we used another thirty speakers' data to train the total variability space, that is, the T matrix; this data also lasts for about eighteen hours.

Of course we also have to train the intrinsic variability compensation models. For LDA and WCCN, we used a further set of speakers' recordings for training the projection matrices.

At last we used another set of speakers, with two thousand four hundred utterances in total, for the test, covering all the variation forms. Then we evaluated five speaker recognition systems.

We use the GMM-UBM speaker recognition system as the baseline system. The features are thirteen-dimensional MFCCs, and the UBM is composed of several hundred Gaussian mixtures. The i-vector based speaker verification systems use LDA, WCCN, and their combination for compensation, and the i-vector dimension is two hundred.
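For reference, a GMM-UBM baseline scores a test utterance by the average per-frame log-likelihood ratio between the speaker model and the UBM. A minimal diagonal-covariance sketch (illustrative toy sizes, not the MFCC/UBM configuration above):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (n, d) frames; weights: (k,); means, variances: (k, d)."""
    X = np.atleast_2d(X)
    diff = X[:, None, :] - means[None, :, :]
    comp = (-0.5 * (diff ** 2 / variances).sum(-1)
            - 0.5 * np.log(2 * np.pi * variances).sum(-1)
            + np.log(weights))
    m = comp.max(axis=1, keepdims=True)          # log-sum-exp trick
    return m.squeeze(1) + np.log(np.exp(comp - m).sum(axis=1))

def llr_score(X, spk_gmm, ubm):
    """GMM-UBM verification score: mean frame log-likelihood ratio."""
    return float(gmm_loglik(X, *spk_gmm).mean() - gmm_loglik(X, *ubm).mean())
```

A positive score means the frames are better explained by the claimed speaker's model than by the background model.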

This table shows the results for each enrollment condition, when the testing utterances cover all the variation forms. For the base form we choose the spontaneous speech. Then we have the six aspects, including speaking style, speaking rate, emotional state, physical state, speaking manner, and language, which give the whole set of variation forms. Each variation form in turn is used as the enrollment condition, and testing is done against all the variation forms.

Here, tested over all the variation forms, we can see that the i-vector based systems perform much better than the GMM-UBM baseline system, and the best results are obtained by the combination of LDA and WCCN.
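The combined system can be sketched as applying the LDA projection A and the WCCN projection B to both i-vectors before cosine scoring (illustrative only; A and B would come from the training procedures described earlier):

```python
import numpy as np

def compensated_score(w_enroll, w_test, A, B):
    """Cosine score after LDA (A, applied as A^T w) and WCCN
    (B, applied as B^T w) compensation of both i-vectors."""
    def proj(w):
        return B.T @ (A.T @ np.asarray(w, dtype=float))
    u, v = proj(w_enroll), proj(w_test)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

With identity matrices for A and B this reduces to the plain cosine score, which makes the role of each compensation stage easy to isolate.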

Also, looking at the different variation forms, we found that if whisper is used for enrollment, the EER is much higher, so the performance is poor. We also calculated the average EER for each speaker recognition system over all the variation conditions.
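For completeness, a simple sketch of how an EER like the ones reported here can be computed from trial scores (an illustrative threshold sweep, not the talk's evaluation code):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep thresholds over the observed scores and
    return the operating point where miss rate ~= false-alarm rate."""
    tgt = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(np.concatenate([tgt, non])):
        miss = np.mean(tgt < t)        # target trials rejected
        fa = np.mean(non >= t)         # nontarget trials accepted
        if abs(miss - fa) < best_gap:
            best_gap, best_eer = abs(miss - fa), (miss + fa) / 2
    return float(best_eer)
```

The same miss/false-alarm pairs, plotted over all thresholds, give the DET curve shown later.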

From this table we can see that the i-vector based speaker recognition systems are better than the GMM-UBM system on the intrinsic variation corpus, and the best results are obtained by the i-vector based system with the compensation combination.

Next, this figure shows the DET curves of the speaker verification systems: the GMM-UBM baseline system and the i-vector based systems with LDA and WCCN. We can see clearly the improvements in performance.

This table shows the comparison between the GMM-UBM system and the i-vector system under matched and mismatched conditions. The first two columns are for matched conditions and the last two are for mismatched conditions, and the EER is computed for each variation form. We can see that for each variation form the EER under mismatched conditions is much bigger than under matched conditions.

Second, we can compare the GMM-UBM system and the i-vector based system. We can see, for example for the spontaneous form, and especially for the whisper variation, that the i-vector based system has a significant advantage.

This table shows the results for each testing condition when spontaneous utterances are used for enrollment, since in most cases we speak spontaneously. We can see, when testing with each variation form, that if the test is also spontaneous, that is, matched with the enrollment, the EER is small, and the best results are obtained in that case.

Also, when the enrollment uses spontaneous speech but the test uses the whisper variation, we found that the EER is much higher and the whole performance drops sharply.

Since the whisper variation is so different from the other variation forms, we present this table, which shows the EER for each testing condition when the enrollment uses whisper utterances.

We can see that the results become much worse. For example, for the GMM-UBM system the EER becomes very high, and even the best result, obtained in the matched condition, is seventeen percent. Still, on the whole, the i-vector based systems perform better than the GMM-UBM system, and the combination of LDA and WCCN again performs best.

So we have the following conclusions. First, mismatch in the intrinsic variation forms causes degradation of speaker recognition performance. Second, the i-vector framework works better than the GMM-UBM in modeling the intrinsic variations, and especially the combination of LDA and WCCN gets the best results. Third, the whisper utterances are so different from the other variation forms that they degrade speaker recognition performance even under the matched condition. For future work, in the model domain we will try more powerful methods for intrinsic variation compensation.

Also, in the feature domain we will propose some improvements. For example, for the whisper variation we found that after the VAD the remaining whisper speech is much shorter than the normal speech, and the whisper speech is very different from the other speech sounds, so we can do some work in the feature domain to improve the performance of the speaker verification system. That's all. Thank you.

Yes.

We recorded this database by ourselves, and the speakers are all students. We gave them instructions on paper which told them how they had to act the emotions.

i

i

yes for example

i

i

yes for example if you if you speech parameter we may be

we have to you can alter so those listed you model and motion stays at

so when we are part of the database we try to just a change you

one mation also some

some of deformities relation

so we just to try to

asked to separate are the eyes signals on

elation

assume we have

investigation

in future work

some of it

thank you