Good afternoon, and thank you for coming. I am presenting work we have done at our university on intersession variability compensation for speaker segmentation of two-speaker telephone conversations. We also present a technique to generate several segmentation hypotheses for a given recording and to select the best one.

This work is focused on the segmentation of two-speaker conversations, so it is a speaker diarization problem:
we are answering the question "who spoke when?". It is an easier task than general diarization, since the number of speakers is known and limited to two. In this case, finding the boundaries between the speaker turns is most of the diarization problem, so we can treat it as a segmentation problem. Eigenvoice modeling has proven useful in the field of speaker verification, and this has motivated new approaches for the segmentation of two-speaker conversations.
Many of them are based on factor analysis using eigenvoices. In these approaches the speaker is modeled with a GMM supervector, which can be represented by a low-dimensional vector that we call the speaker factors, whose dimension is much lower than that of the GMM supervector. The main idea is that, with such a compact speaker representation, we can estimate the parameters of the representation on very short segments, and that is what we do for speaker segmentation.
We extract a stream of speaker factors over the input signal: sliding a one-second window frame by frame, we obtain a sequence of speaker-factor vectors. Then we cluster these speaker factors into two clusters using PCA plus k-means clustering. Once we have the two classes, we fit a single full-covariance Gaussian for each speaker, and with them we obtain a first segmentation. Finally, we refine this segmentation with a resegmentation step using MFCC features and GMM speaker models.
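As a rough illustration of this first clustering pass, here is a minimal numpy sketch, assuming the speaker factors have already been extracted. The function name and the median-split initialization are our own choices for the sketch, not necessarily the exact implementation used in the system.

```python
import numpy as np

def cluster_speaker_factors(F, n_iter=20):
    """Cluster a stream of speaker-factor vectors F (T x d) into two
    clusters with PCA + k-means, then fit one full-covariance
    Gaussian per cluster (hypothetical reconstruction of the talk's
    first pass)."""
    F = F - F.mean(axis=0)                      # center the factors
    # PCA via SVD; project onto the first principal component
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    proj = F @ Vt[0]                            # 1-D projection
    # initialize the two clusters by splitting the projection
    labels = (proj > np.median(proj)).astype(int)
    for _ in range(n_iter):                     # k-means on full vectors
        mus = np.stack([F[labels == k].mean(axis=0) for k in (0, 1)])
        d = ((F[:, None, :] - mus[None]) ** 2).sum(axis=2)
        new = d.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    # one full-covariance Gaussian per cluster
    gauss = [(F[labels == k].mean(axis=0), np.cov(F[labels == k].T))
             for k in (0, 1)]
    return labels, gauss
```

On well-separated factor streams the two clusters map directly onto the two speakers; the resegmentation step then refines the boundaries.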
The main contribution of this work is the study of the types of variability affecting the speaker factors. First, if we have a set of recordings containing different speakers and we analyze the variability present across these recordings, we find that there is variability mainly due to the different speakers present in the recordings; this is usually referred to as the speaker variability.
But if we analyze a set of recordings belonging to the same speaker, we can see that there is also variability among these recordings, usually due to aspects like the channel or the mood of the speaker. This variability is usually known as intersession variability.
In addition, if we analyze a recording containing a single speaker, divide it into small slices, and analyze the variability across these slices, we see that there is also variability within the recording, usually due to the phonetic content or to variations of the channel along the recording. We will refer to it as intra-session variability.
In our approach to speaker segmentation we are only modeling the speaker variability, so the question is: are the other two types of variability, intersession and intra-session, affecting the segmentation performance? That is, do we need to compensate for intersession variability and for intra-session variability?
We know that intersession variability compensation is very important for speaker recognition, but one could argue that it is not so important for speaker segmentation, and even that the channel factors could help. We had some preliminary experiments suggesting that the compensation helped a little, but we believe it should not help much, because in the diarization task you do not see the same speaker over different sessions: all the speakers are within a single session, so you have no prior information about the speakers. Actually, we believed that intersession variability might even help to separate the speakers in the diarization task, because the channel carries information that can help to tell the speakers apart.
And what about intra-session variability? In the field of speaker recognition, state-of-the-art systems take into account only intersession variability, not intra-session variability, since they use the whole conversation to train a model. But we think it is very important for speaker segmentation and diarization, because many state-of-the-art systems are based on clustering very short segments; if we can compensate the variability between the segments of a given speaker, the clustering process should be easier.
That is what we try to do. Given a dataset containing several speakers and several recordings per speaker, we extract a stream of speaker factors from each recording. Then we consider every session as a different class, and we model the speaker and intersession variability as between-class variability, since we believe that both speaker and intersession variability help to separate the two speakers within a recording, and we model the intra-session variability as within-class variability.
With this framework it is easy to apply well-known techniques such as linear discriminant analysis (LDA), which maximizes the between-class variance while minimizing the within-class variance, or within-class covariance normalization (WCCN), which normalizes the within-class covariance of every class to the identity matrix. Both techniques have been successfully applied for intersession compensation in speaker recognition systems.
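A minimal numpy sketch of the WCCN computation just described, assuming speaker factors `X` with one session label per row (the function name is ours):

```python
import numpy as np

def wccn_transform(X, y):
    """Within-class covariance normalization (WCCN): estimate the
    average within-class covariance W over the classes (sessions)
    and return the Cholesky-based transform A such that
    A.T @ W @ A = I.  X is (N x d) speaker factors, y the class
    label of each row.  Sketch, not the exact system code."""
    d = X.shape[1]
    W = np.zeros((d, d))
    classes = np.unique(y)
    for c in classes:
        W += np.cov(X[y == c].T, bias=True)
    W /= len(classes)
    # lower-triangular A with A @ A.T = W^{-1}
    A = np.linalg.cholesky(np.linalg.inv(W))
    return A  # apply as X @ A
```

After applying `X @ A`, the average within-class covariance of the transformed factors is the identity matrix, which is exactly the normalization described above.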
To evaluate these two approaches we used the NIST SRE summed-channel condition, containing more than two thousand five-minute telephone conversations. The speech/non-speech marks are given, and we measure performance in terms of the speaker segmentation error, the speaker error part of the diarization error rate. Since we assume the speech/non-speech segmentation and we do not take overlapped speech into account, the diarization error rate is the same as the segmentation error. For scoring we use a 0.25-second collar.
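Under these assumptions (two speakers, speech/non-speech given, no overlap), the score reduces to a frame-level labeling error with the best of the two possible speaker mappings. A simplified stand-in for the NIST scoring, without the collar handling:

```python
import numpy as np

def segmentation_error(ref, hyp):
    """Frame-level segmentation error for a two-speaker task: since
    cluster labels are arbitrary, score both possible label mappings
    and keep the better one.  Simplified sketch of DER scoring with
    no collar and no overlapped speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    err = np.mean(ref != hyp)
    return min(err, 1.0 - err)  # best of the two mappings
```

Real NIST scoring additionally ignores frames inside a collar around reference boundaries and handles overlapped speech; this sketch does neither.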
Here we have the results for a system using a small UBM with 256 Gaussians and MFCC features; in this case we do not use the resegmentation step. With twenty speaker factors, using intersession variability compensation with WCCN, the segmentation error is reduced to 2.5%. Another baseline with fifty speaker factors is slightly better, not much, but slightly. We also tried LDA for dimensionality reduction; LDA is helping, but WCCN alone is better, obtaining a 2% segmentation error, and even the combination of both is not better than WCCN directly.
But when we tried these systems after the resegmentation step, it was surprising that the results became more or less equal whether we used twenty or fifty speaker factors: the intra-session variability compensation with WCCN was still working and giving an improvement, but it no longer seemed useful to increase the number of speaker factors. We were a little disappointed with this, because we had expected it to help, so we ran a new set of experiments, not in the paper, that we are presenting here,
with a larger UBM and more features. In this case, increasing the number of speaker factors does help. Our baseline with fifty speaker factors gives a 1.8% segmentation error, lower than the 2.1% we had before, and when we use channel compensation with WCCN the error goes down to 1.4%. We also increased the number of speaker factors to test LDA, and we see that LDA is helping even more than before; our best configuration now is the combination of LDA plus WCCN. So it seems that, although the baseline with more speaker factors is not better than the baseline with fifty, with LDA we can take advantage of more speaker factors. Our best result is a 1.3% segmentation error.
On the other hand, we propose a technique to generate several segmentation hypotheses and to select the best one based on a set of confidence measures. What we do is iteratively apply a binary decomposition of the recording, obtaining four levels of splitting, as we can see in the figure. We segment every slice with the proposed system; then, for every level, we select the best slices and combine them to build the two speaker models. With these two speaker models we resegment the whole recording, using iterative Viterbi resegmentation with MFCC features and GMM speaker models.
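Our understanding of the splitting scheme can be sketched as follows. This is a hypothetical reconstruction that splits uniformly; the actual system may split adaptively, and the slice scoring is not included:

```python
def split_levels(n_frames, n_levels=4):
    """Multi-level binary splitting of a recording used to produce
    alternative hypotheses: level k slices the recording into 2**k
    equal parts, returned as (start, end) frame-index pairs."""
    levels = []
    for k in range(n_levels):
        parts = 2 ** k
        bounds = [(i * n_frames // parts, (i + 1) * n_frames // parts)
                  for i in range(parts)]
        levels.append(bounds)
    return levels
```

Each slice at each level is then segmented independently, giving one candidate hypothesis per level.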
To select the best slices and the best level among the four, we use confidence measures and also a majority voting step. The confidence measures used in this work were, first, the Bayesian information criterion (BIC), computed using MFCC features and the GMM speaker models, and second, the KL divergence in the speaker-factor space: we fit Gaussian speaker models in that space and compute the KL distance between both models. To fuse both confidence measures we use the FoCal toolkit, well known in speaker verification; the fusion weights were optimized to separate the hypotheses with a segmentation error below one percent on the summed channel.
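The second confidence measure has a closed form. Here is a sketch of the KL divergence between two Gaussians, as it would be computed between the two speaker models in the speaker-factor space (a symmetric variant simply averages the two directions):

```python
import numpy as np

def gauss_kl(mu0, S0, mu1, S1):
    """Closed-form KL divergence KL(N(mu0,S0) || N(mu1,S1)) between
    two multivariate Gaussians; sketch of the distance used as a
    confidence measure in the speaker-factor space."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    dm = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + dm @ S1_inv @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))
```

Intuitively, well-separated speaker models give a large divergence, which makes the corresponding segmentation hypothesis more trustworthy.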
Here we have the results for this hypothesis generation and selection strategy. When we are not using intersession variability compensation, this solution improves the results over our baseline: the baseline is at 2.1% and we get 1.9% with our strategy. If we had an ideal confidence measure and could always select the best level, we could go down to a 1.1% segmentation error, but our confidence measures were far from ideal. Moreover, when using the intra-session variability compensation, the improvement we obtained was not statistically significant.
So we tried to work out what would help, and we wanted to find a better set of confidence measures, because the number of possible confidence measures for fusing segmentation hypotheses is very large, and we had started with fairly simple ones. We were not really happy with these results, so we tried again with the large UBM and more features, with our best configuration for the variability compensation, and also with a new set of confidence measures. These are new results, not in the paper. We could reduce the segmentation error from 1.3% to 1.2%, and in the best case to 1.0%; and if we could always select the best level, we could reduce it to a 0.7% speaker error, which is quite good compared to the baseline.
As conclusions of this work: we have presented two techniques for intra-session variability compensation, and we have shown that they help for speaker segmentation. WCCN obtains better performance than LDA alone, and is somewhat similar to the combination of LDA plus WCCN. Since increasing the number of speaker factors increases the computational cost, WCCN seems the better choice for low computational cost applications. But of course, when computational cost is not a problem, our best configuration uses a high number of speaker factors together with LDA.
Here we have a summary of the results. Our best segmentation error is 1.3%: with the variability compensation we went from 1.9% to 1.3%.
Also note that WCCN is probably helping so much because of the particular initialization used in this study: we use PCA plus k-means as the initialization, and normalizing the within-class covariance probably helps the k-means, which assumes that all the classes have the same covariance. That is probably why we see such a clear benefit from the WCCN.
We have also presented a hypothesis generation and selection technique which can improve the segmentation results; for our best configuration it reduces the segmentation error from 1.3% to 1.2% with a large UBM. I think that's all. Thank you very much.
We have time for questions.

(Question from the audience.)
Just one dimension. I didn't mention it because it is in another paper, but it makes the system much more robust with the PCA. We keep just one dimension: to run the k-means we use all the dimensions, but we initialize the means of the k-means from the first dimension of the PCA output.
Yes, but I mean, in our experiments I am keeping one dimension. Maybe it is not the best you can do, but the first dimension of the PCA is usually the best one for this task, and we are getting about a 1.8% diarization error rate just using one dimension. So we are not sure it is the best representation.
We have also tried to just plug the PCA output, with more dimensions, into the k-means.
Are there any other questions? Then let's thank the speaker again.