Okay, the last paper of this session is likewise on speaker recognition, and it is going to be presented now.

Welcome to my talk. This is some collaborative work with the SRI International speech lab. The title is "Recent progress in prosodic speaker verification", so it is about prosodic speaker verification, but really it is a follow-up work on what we call the subspace multinomial model, which we presented at the last Interspeech. This is mainly a modeling technique, similar to total variability modeling for obtaining i-vectors, but it is not used for the Gaussian mean parameters; rather, it models multinomial parameters.

So what are the claims of this work? The first one is to illustrate the quite complex system building process with a simple toy example and some figures. The main claim is to introduce a new modeling approach: the i-vectors we obtain are modeled with probabilistic linear discriminant analysis. We also wanted to compare this approach to the two main prosodic systems that are out there in the field. And finally, of course, because prosody always captures a higher level of speech, we wanted to combine it with a state-of-the-art cepstral baseline system.

Okay, now comes the toy example to explain the system building process. Just imagine we have some conversational speech utterance, an excerpt from the NIST data, for example: "And where are you, which state are you in?"

As we want to use prosodic features, we only extract a few measurements. First, some fundamental frequency measures, so we extract the pitch. Second, we have some loudness measures, and we just use a normalized energy for that. Finally, we have some duration measures, and these come from an LVCSR system, so they are quite accurate. In this case we have ten syllables, all one-syllable words, so the syllable segmentation is the same as the word segmentation.

Now we have these three measurements, and from them we obtain the SNERFs. SNERF stands for syllable-based non-uniform extraction region features. We take very many measurements of pitch, energy and duration: for instance, we cut out segments based on the syllables and measure durations, say for the onset or the coda, or we measure the rise and fall of the pitch and the energy, their mean, their maximum, and so on.

I will just exemplify this with a few measurements: we take the mean of each segment for the pitch (the blue squares) and for the energy (the red crosses), and as the duration measure in this example we just use the number of frames for each syllable.
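As a toy sketch of what these per-syllable measurements look like (the frame values and syllable boundaries below are made up, not the actual SNERF extraction code):

```python
import numpy as np

# Toy frame-level measurements for one utterance (hypothetical values):
# pitch and energy per frame, plus syllable boundaries from the recognizer.
pitch = np.array([120, 122, 125, 130, 128, 110, 108, 105, 100, 98], dtype=float)
energy = np.array([0.5, 0.6, 0.7, 0.8, 0.7, 0.4, 0.3, 0.3, 0.2, 0.2])
syllables = [(0, 5), (5, 10)]  # (start_frame, end_frame) per syllable

features = []
for start, end in syllables:
    features.append({
        "pitch_mean": pitch[start:end].mean(),    # the blue squares in the figure
        "energy_mean": energy[start:end].mean(),  # the red crosses
        "duration": end - start,                  # number of frames per syllable
    })

print(features[0])
```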

So now we have a combination of discrete values and continuous values; how do we model these? The idea was: okay, we try to bin these into some discrete classes, and the best way we found was to use a soft binning, which works by training small Gaussian mixture models on these measurements.

So we really train the mean parameters of such a model and so on: we extract these measurements from audio background data, and then, in our example, we train a mixture of two components on this one-dimensional duration measure, as you see here. We do the same for the pitch, where in this example we use three mixture components, and for the energy we use four mixtures.

So now we have this background model for each of the SNERFs, and we can start parameterizing our speech with the features we extracted as SNERFs. In this example, I just plot the ten values against the GMMs; you can see where they hit the GMMs, and now we can compute the posterior probability that each Gaussian generated each frame. So this is like a soft binning, and it actually generates counts, soft counts: you see that for the duration the first Gaussian gets a count of 6.3 and the other one 3.7, and so on. And as you can see, this is the same thing you would compute if you computed the weights of a GMM.
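The soft-count computation can be sketched like this (a toy illustration with made-up duration values and a hypothetical two-component background GMM; the real system trains the GMMs on background data):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical two-component background GMM for the duration SNERF.
weights = np.array([0.5, 0.5])
means = np.array([5.0, 15.0])
variances = np.array([4.0, 4.0])

durations = np.array([4.0, 6.0, 5.0, 14.0, 16.0, 3.0, 5.0, 15.0, 4.0, 13.0])

# Posterior probability that each component generated each value.
likelihoods = weights * gaussian_pdf(durations[:, None], means, variances)
posteriors = likelihoods / likelihoods.sum(axis=1, keepdims=True)

# Soft counts: summed posteriors per component (zeroth-order statistics).
soft_counts = posteriors.sum(axis=0)
print(soft_counts)                      # the counts sum to the number of values
print(soft_counts / soft_counts.sum())  # normalized: the multinomial parameters
```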

So we do not compute any means or variances; in GMM terms, these are only the zeroth-order statistics, so the underlying model is a multinomial model. If you plotted the multinomial parameter space for the duration GMM, you would see that it is just a line, because the parameters are always restricted to sum up to one. For the pitch you would get a two-dimensional simplex, which can move around but is always restricted, and for the energy in this example you would get a three-dimensional simplex.

So this would be the full multinomial model space. What we try to do with the subspace multinomial model is to learn some structure in the data, some low-dimensional structure along which the parameters move when the speaker changes. Furthermore, these features are really independent, one per GMM, but we also try to learn the correlations between them; we can do that by tying all of these subspaces together with shared parameters.
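The tying of the subspaces can be sketched like this: a single low-dimensional vector is mapped through one shared loading matrix, and each GMM's block of parameters is normalized on its own simplex (toy dimensions and random, untrained matrices, just to show the mechanics):

```python
import numpy as np

def smm_params(w, m, T, blocks):
    """Map a low-dimensional vector w to multinomial parameters.

    m: concatenated log-scale offsets, T: loading matrix shared by all blocks,
    blocks: (start, end) index ranges, one per SNERF's GMM; each block is
    softmax-normalized separately so it sums to one (its own simplex).
    """
    eta = m + T @ w
    phi = np.empty_like(eta)
    for start, end in blocks:
        e = np.exp(eta[start:end] - eta[start:end].max())  # stable softmax
        phi[start:end] = e / e.sum()
    return phi

rng = np.random.default_rng(0)
# Duration GMM has 2 components, pitch 3, energy 4: 9 parameters in total.
blocks = [(0, 2), (2, 5), (5, 9)]
m = rng.normal(size=9)
T = rng.normal(size=(9, 2))  # a two-dimensional subspace tied across blocks

phi = smm_params(np.array([0.5, -1.0]), m, T, blocks)
# Moving the two numbers in w moves all three parameter blocks at once.
print([phi[s:e].sum() for s, e in blocks])
```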

What is shown now is a plot of a single-dimensional subspace that controls all of these spaces at once. You see that we project back into the multinomial spaces to really get these one-dimensional nonlinear curves. So by moving just one number you move the i-vector: you move along the colors as the number increases towards red, and the black dot is just one extracted i-vector.

We then use this model to extract i-vectors, like with the standard i-vector extractor, to obtain a low-dimensional representation of the whole utterance.

Let us zoom into this four-dimensional energy part and look at a two-dimensional subspace. Now we get a nonlinear surface inside the three-dimensional simplex, and the movement is restricted to it. The black dots you see here are really real data projected back, and if we zoom in, the colors show the different speakers: ten speakers, with ten utterances each.

The funny thing is that although we have a two-dimensional subspace here, we already see that the data somehow lies in a one-dimensional space. So the true subspace seems to be even smaller; probably one dimension would be enough for this example. And you can already see that you can distinguish the speakers quite well here.

Of course, this is still the multinomial model space, but in the end we use the i-vectors and model them in the i-vector space. If we plot the data in the i-vector space instead of the parameters, we really get these nice clusters for the speakers, and this is without any compensation or anything.

And that is where the PLDA model comes in. Let us go to another artificial data set, also in a two-dimensional space. So again we have i-vectors in 2D for four speakers: the big dots are the speaker means, and then we have several utterances each.

Now we use a linear Gaussian model assumption. We have some across-class variability, shown by the solid lines, so you really see the variability between the classes, and then we have a shared within-class covariance matrix: you see that the individual utterances of all the speakers share the same covariance matrix.

From this you can really see that even the distant dots from the red speaker are recognized as coming from the same speaker, because of the big variability in this direction, while if you go from here to there you would say: okay, these are different speakers, they do not belong together.

What we use here is a probabilistic model, and the nice thing is that we can really train its parameters, the matrices and covariances, with the EM algorithm. Secondly, we can use the PLDA model to directly evaluate likelihoods, like the likelihood ratio you would compute with the UBM; we can even compute a proper likelihood ratio that really answers: were these two i-vectors generated by the same speaker or not?

If you look at the formula, okay, it is quite complicated, but if you look at the numerator, you see that we have the two i-vectors w1 and w2, conditioned on a given speaker, and the speaker prior p(y), and then we just integrate over all speakers. So we do not really care which speaker it is; we just ask whether they come from the same one or not. In the denominator we have the marginal probabilities, as they would be if the i-vectors came from different speakers.
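The ratio just described can be written down as follows (a reconstruction from the verbal description; $w_1, w_2$ are the two i-vectors and $y$ is the latent speaker variable):

$$\mathrm{score}(w_1, w_2) = \frac{\int p(w_1 \mid y)\, p(w_2 \mid y)\, p(y)\, \mathrm{d}y}{p(w_1)\, p(w_2)}$$

The numerator assumes both i-vectors share the same speaker variable $y$, which is integrated out; the denominator is the product of the marginals, i.e. the hypothesis that they come from two independent speakers.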

The nice thing is that this can be evaluated analytically, so we can solve it, and the scoring can be performed very efficiently.
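One way to sketch this analytic scoring is the two-covariance view of PLDA (the matrices below are toy values, not trained parameters): speaker means are drawn with a between-class covariance B and utterances scatter around them with a within-class covariance W, so the log-likelihood ratio reduces to comparing two Gaussian log-densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(w1, w2, B, W):
    """Log-likelihood ratio: same speaker vs. different speakers.

    Stacking [w1; w2], the same-speaker hypothesis gives one joint Gaussian
    with cross-covariance B; the different-speaker hypothesis is the
    product of the two marginals N(0, B + W).
    """
    d = len(w1)
    joint = np.vstack([np.hstack([B + W, B]), np.hstack([B, B + W])])
    same = multivariate_normal.logpdf(np.concatenate([w1, w2]),
                                      mean=np.zeros(2 * d), cov=joint)
    diff = (multivariate_normal.logpdf(w1, mean=np.zeros(d), cov=B + W)
            + multivariate_normal.logpdf(w2, mean=np.zeros(d), cov=B + W))
    return same - diff

# Toy 2-D example: large between-speaker spread in the first dimension.
B = np.diag([4.0, 0.5])
W = np.diag([0.5, 0.5])
a = np.array([2.0, 0.1])
print(plda_llr(a, a, B, W))   # identical i-vectors score high
print(plda_llr(a, -a, B, W))  # mirrored i-vectors score low
```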

Now to the experiments, which I present on the NIST SRE 2008 task; more precisely, on the part we used as development data for the NIST 2010 evaluation, so let me define it. It is the telephone-only condition. The number of target samples is the same as usual, but the number of impostor samples was increased a lot, because of the new DET measurements and the new DCF, which emphasizes the very low false alarm region. The UBM, the subspace model, the PLDA, everything, is trained on SRE 2004 and 2005 and Switchboard data.

We evaluate three different systems: the first two are state-of-the-art systems that are already out there, and the third is the one we propose here. The first is the polynomial JFA system, that is, joint factor analysis modeling of the Gaussian mean parameters. It uses quite simple polynomial contour features; these are a subset of what is in the SNERF features, but they are just thirteen-dimensional and really just approximate the contours over the syllables. The second is the SNERF-SVM system: up to the point where we have the soft counts of the SNERFs, it is exactly the same as our system, but the modeling is then done by taking these really high-dimensional vectors, about thirty thousand dimensions, putting them into an SVM and training it, so it is quite demanding.

We instead use the i-vector extractor to go down from about thirty thousand dimensions to a dimensionality of about two hundred, and in this low-dimensional space we can use really nice machine learning; the PLDA model seems to be a very good way to do this.

Finally, we have the cepstral baseline system, the SRI system for the 2010 evaluation, and we fuse with it. First, here is the DET plot showing the single prosodic systems.

The red line is the polynomial system, the blue is the SNERF-SVM system, and the green is the PLDA system. The dots mark the new DCF and the equal error rate. We see that we get a big improvement over both other systems on the equal error rate: we reach about 6.9 percent with the PLDA modeling, while all the others are around ten percent. Also on the old DCF we get a big improvement. But, quite strangely, the SNERF-SVM system always somehow performs slightly better in the very low false alarm region; this is some behavior that we really cannot explain at the moment.

The next results are on the fusion. The baseline system has an equal error rate of 1.6 percent and the corresponding new DCF value. We did a score-level fusion by logistic regression with a jackknifing approach.
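A minimal sketch of such a score-level fusion (toy scores and a plain gradient-descent logistic regression; the jackknifing setup from the talk is omitted):

```python
import numpy as np

def fit_fusion(scores, labels, lr=0.1, steps=2000):
    """Learn weights w and bias b so that sigmoid(scores @ w + b)
    predicts the target/impostor label from the per-system scores."""
    n, k = scores.shape
    w, b = np.zeros(k), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))  # sigmoid
        grad = p - labels                            # gradient of the log-loss
        w -= lr * scores.T @ grad / n
        b -= lr * grad.mean()
    return w, b

# Toy scores from two systems (hypothetical): rows are trials, columns are
# [cepstral_score, prosodic_score]; label 1 marks a target trial.
scores = np.array([[2.0, 1.5], [1.8, 0.9], [2.2, 1.1],
                   [-1.5, -0.8], [-2.0, -1.2], [-1.2, -1.5]])
labels = np.array([1, 1, 1, 0, 0, 0])

w, b = fit_fusion(scores, labels)
fused = scores @ w + b  # the fused score is a weighted sum of system scores
print(np.round(w, 2), round(b, 2))
```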

We see that the fusion with the polynomial system and the fusion with the SNERF-SVM system both improve over the baseline, and on the new DCF we get better results; on the equal error rate we even get the best system. But it is quite confusing: we get the best result on the equal error rate, 1.47 percent, with one of the other systems, but we get the best improvement on the new DCF with our system.

On the single systems it was the other way around: there we were better on the equal error rate and the other system was better on the new DCF, and in the fusion it somehow flipped. So we really want to try this again with the 2008 data, so that we can train the fusion on 2008 and apply it to the 2010 data; maybe the fusion behavior would change in that case.

So, in conclusion, we can say that the PLDA clearly outperformed the alternatives. Okay, I did not mention this before: in the last paper we did cosine distance scoring with LDA and WCCN, and compared to that we get about twenty percent relative improvement with the PLDA. Generally, the i-vector PLDA system gives the best performance of 6.9 percent equal error rate, and that is, to our knowledge, the best score for a single prosodic system.

We still have to investigate the degradation in the low false alarm region. The fusion gives around ten percent relative improvement on the new DCF measure, which is quite nice. For future work, we want to investigate other channels and speech styles; this was just telephone speech, but in the NIST evaluation there is also microphone and non-conversational speech.

Another thing we are already trying is the i-vector modeling with PLDA for the simple polynomial features that were used in the JFA modeling before, and then to combine both systems: one based on multinomial modeling, the other on Gaussian modeling, and one based on the SNERFs, the other on the polynomial features. Hopefully you can see that at Interspeech. So that's it, and thank you.

Okay, time for questions.

What was your baseline result without adding any prosodic system, and how much did adding the prosodics improve over your baseline?

You mean the cepstral i-vector baseline? That was 1.6 percent equal error rate; it is the first line at the top of the table.

Oh, sorry, I did not see that, okay.

I cannot remember, did you actually do any score normalization?

That is a nice thing I did not mention here: usually for PLDA we do not need any score normalization, but we do for the SVM system.

Yeah, about the SVM system, this is just a bit of speculation, but sometimes having the background data set in the SVM training can act as a score normalization as well.

Depending on the background set, it can actually rotate the DET curve a little bit and take care of the low false alarm region you were talking about. So perhaps what is happening there is still just a score-normalization effect: you might be seeing the improved minimum DCF in the SVM system for this reason. And perhaps in the fusion, what the SVM system contributes is mostly this normalization-like effect rather than a really good system, so in combination it does not help anymore.

Okay, thanks.

One more question from the back? Go ahead.

Did you normalize the i-vectors, or did you use them without any of these fancy tricks? I was wondering because of the Gaussian modeling assumption.

The i-vectors, in the end, really are Gaussian distributed if you look at their distribution.

But did you try it anyway?

Score normalization and things like that are not needed in this case, and they never even helped: I have always tried all these tricks and they do not help me at all for this system. I do not know why, they just do not work for me.

Good to know.

Okay, as there are no further questions, let us thank the speaker again. That is the end of the session.