So, this is the second talk about speaker diarization, and here we are trying to focus on a multistream approach. The baseline technique we are using is the same as in the previous talk, the Information Bottleneck system, and as was already said, we are trying to look at the combination of the outputs, or actually the combination of different streams on different levels. These are acoustic streams only, so no prior information from other sources.
Again, the work itself was done by the first author of the paper, who could not be here today.
The motivation is the same, or at least close: again we assume that the recordings we are working with are recorded with multiple distant microphones.
As features, we are using two kinds of acoustic features: the MFCC features, which are kind of standard, and then the time delay of arrival (TDOA) features. The TDOA features are pretty complementary to the MFCCs, and nowadays people use them quite a lot for diarization. Actually this is the winning acoustic feature combination: with the agglomerative Information Bottleneck technique it gives state-of-the-art results on meeting diarization.
So, back to our motivation. Usually the feature streams are combined at the model level: there are separate models, for example GMM models, for the different feature streams, and the log-likelihoods of these models are combined with some weighting. There are also some other approaches, like voting schemes between diarization systems, or integrated approaches where the initialization of one system is done on the output of the other system.
Our question is whether these two kinds of acoustic features can be integrated using independent diarization systems rather than independent models; in other words, is there some advantage of using system combination rather than model combination? What we mean by system and model combination will hopefully become clear in a slide or two.
So, the outline of the talk. First let me say a few words about the Information Bottleneck principle which we use, applied to single-stream diarization, so with no combination of features yet; then a few words about the model-based combination, the system-based combination, and some hybrid combination; and finally the experimental results. Again, we are getting state-of-the-art results with such a system using this agglomerative Information Bottleneck technique, and there is not too much computational complexity in it, so that is kind of the advantage.
How does it work, this Information Bottleneck principle? It is a kind of intuitive approach which has been borrowed from document clustering. At the beginning, suppose we have some documents that we want to cluster into C clusters, in our terminology. What is added as side information is some variable Y, which we call the relevance variable, and which carries information about the clustering. In document clustering these relevance variables can be the words of the vocabulary, which of course carry information about the clusters. We also assume that the conditional distribution P(y|x), so Y given the input X, is available.
Going back to our problem of speaker diarization, our X is a set of elements of the speech: speech segments, obtained by uniform segmentation, and these need to be clustered into C clusters. The Information Bottleneck principle states that the clustering should preserve as much information as possible between C and Y while minimizing the distortion. This distortion, which in our terms is the mutual information between X and C, we can see as compression, or in our setting as some regularization: if you do not have this distortion term, the clustering is probably going to collapse into one global cluster, which is not the desired solution.
So this is the intuitive approach, but in the end it can be proved that if you maximize this objective function, which is the mutual information between C and Y minus some Lagrange multiplier times the mutual information between X and C, you move the problem to a form where the distributions P(y|x) are compared using the Jensen-Shannon divergence. That is the point: we do not need to look for some special divergence measure to decide which clusters should be merged together in this intuitive approach; by doing the derivation we find out that it should be the Jensen-Shannon divergence that is used for the clustering.
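Written out (a sketch based on what was just said; β is the Lagrange multiplier, and the merge cost is shown with only its relevance term), the objective and the resulting merge cost are roughly:

```latex
\max \; I(C;Y) \;-\; \frac{1}{\beta}\, I(X;C),
\qquad
d(c_i, c_j) \;=\; \bigl(p(c_i) + p(c_j)\bigr)\,
\mathrm{JS}_{\pi}\bigl[\,p(y \mid c_i),\; p(y \mid c_j)\,\bigr],
```

where the JS mixture weights π are the normalized cluster priors.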
So in the end the approach is pretty simple. Here is the algorithm: it is agglomerative, so in each iteration we merge two clusters together based on this Jensen-Shannon divergence: we take the pair of clusters with the smallest divergence and we just merge them, and we iterate until some stopping criterion is reached. The stopping criterion is again pretty simple: it is the normalized mutual information between C and Y, so again something to finalize this iterative approach. That is all we need: we have a stopping criterion, we have a way to measure the similarity between clusters, and it is pretty simple to code.
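Since the loop is simple to code, here is a toy pure-Python sketch of the greedy agglomerative step. This is an illustration under assumptions, not the actual system: the stopping criterion here is a fixed target number of clusters instead of the normalized mutual information threshold mentioned above, and all helper names are made up.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, w1=0.5, w2=0.5):
    """Weighted Jensen-Shannon divergence used as the merge criterion."""
    m = [w1 * pi + w2 * qi for pi, qi in zip(p, q)]
    return w1 * kl(p, m) + w2 * kl(q, m)

def agglomerative_ib(p_y_given_x, n_clusters):
    """Greedily merge the pair of clusters whose relevance distributions
    p(y|c) have the smallest prior-weighted JS divergence."""
    n = len(p_y_given_x)
    clusters = [[i] for i in range(n)]      # start: one cluster per segment
    dists = [list(p) for p in p_y_given_x]  # p(y|c)
    priors = [1.0 / n] * n                  # p(c), uniform over segments
    while len(clusters) > n_clusters:
        best, pair = float("inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                w = priors[i] + priors[j]
                d = w * js(dists[i], dists[j], priors[i] / w, priors[j] / w)
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        w = priors[i] + priors[j]
        # merged cluster's relevance distribution is the prior-weighted mix
        dists[i] = [(priors[i] * a + priors[j] * b) / w
                    for a, b in zip(dists[i], dists[j])]
        priors[i] = w
        clusters[i] += clusters[j]
        del clusters[j], dists[j], priors[j]
    return clusters
```

On toy data, segments with matching relevance distributions get merged first: `agglomerative_ib([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]], 2)` groups the first two and the last two segments.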
Just a few words about the distributions which appear here. The probability P(c|x), where c is a cluster and x is an input segment, is going to be a hard partition, meaning each segment belongs to only one cluster; there is no soft weighting between several clusters. And the probability P(y|c), which is the relevance-variable distribution of a cluster, is what is actually used to do the merging.
Everything should be clearer in this figure. Suppose we have input speech which is uniformly segmented, with for example MFCC features, in this single-stream approach: these segments are the elements X. As for the relevance variables, I still did not say what they are, but in our case Y is just a universal background model, a GMM trained on the entire speech; this is what defines the relevance variables. So the matrix which you see in the middle holds the vectors P(y|x), which are the posterior probabilities of the components of Y given the input segments. The clustering, which is again the agglomerative technique, then gives some initial segmentation, and finally we do a refinement, training a GMM and doing Viterbi decoding.
Now let us go back to the feature combination. In the case of feature combination based on the background models, suppose we have two features, again the MFCCs and the TDOAs, and we have two background models, each trained on one of the features. What we can simply do is linearly weight these P(y|x) vectors of probabilities with some weight, and this gives us a new matrix of segment probabilities. These weights, of course, we train or estimate on development data, since we will be diarizing different data. Once we have these combined P(y|x), the rest of the diarization system is the same: the combination happens just at the beginning, where we combine these relevance-variable distributions, and then we run the same iterative clustering approach.
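A minimal sketch of this model-level combination, with made-up toy numbers (the weight value is only illustrative; in the talk it is estimated on development data):

```python
# Hypothetical per-segment posterior vectors p(y|x) from two background
# models, one trained on MFCC and one on TDOA features (toy numbers).
p_mfcc = [[0.7, 0.3], [0.2, 0.8]]
p_tdoa = [[0.6, 0.4], [0.4, 0.6]]

w = 0.7  # stream weight, in practice estimated on development data

# linear weighting of the two posterior matrices, row by row
p_combined = [[w * a + (1 - w) * b for a, b in zip(row_m, row_t)]
              for row_m, row_t in zip(p_mfcc, p_tdoa)]

# each combined row is still a valid distribution over relevance variables
assert all(abs(sum(row) - 1.0) < 1e-9 for row in p_combined)
```

The combined matrix then goes into the same agglomerative clustering as in the single-stream case.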
This is actually not new; it has already been published by us at last year's Interspeech. Here is just a figure of how it is done: again there are the matrices holding the P(y|x) probability vectors, they are simply combined by a weighted sum, and then there is the clustering operation and the refinement.
Now, what is actually new, and what we try in this paper, is the multiple system combination. Instead of doing the combination before clustering, what would happen if we do the combination after clustering? Again suppose that there are two background models trained on the different features, but now there are two complete diarization systems. Each of them iteratively produces some clusters; the stopping criteria can actually be different, meaning there can be a different number of clusters for each feature stream. At the end we get the distributions P(y|c), and to go back from these clusters to the initial segmentation, that is, to recover P(y|x), all we need is a simple Bayes operation.
Here is again a figure of how this is done. There are two diarization systems which do the complete clustering, and in the end we get some clusters; to get back to the per-segment distributions P(y|x), we just apply this simple operation and integrate over all the clusters C. Why this should actually work is again pretty intuitive: in this case the P(y|x) after the combination are estimated on a large amount of data, so they are not estimated on the short segments as in the case of the combination before clustering. Each P(y|x) is now estimated on a lot of data, because there are just a few clusters in the end.
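The Bayes step, p(y|x) = sum over c of p(y|c) p(c|x), is especially simple with the hard partition from above; a toy sketch with hypothetical numbers:

```python
def posteriors_from_clusters(p_y_given_c, cluster_of_x):
    """p(y|x) = sum_c p(y|c) p(c|x); with a hard partition this just copies
    the cluster's relevance distribution to each of its segments."""
    return [list(p_y_given_c[c]) for c in cluster_of_x]

# hypothetical output of one diarization system: 2 clusters, 4 segments
p_y_given_c = [[0.8, 0.2], [0.1, 0.9]]
cluster_of_x = [0, 0, 1, 1]          # hard segment-to-cluster assignment
p_y_x = posteriors_from_clusters(p_y_given_c, cluster_of_x)
# these per-segment distributions were estimated on whole clusters and can
# now be weighted with the other stream's matrix and re-clustered
```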
The third approach is a hybrid system, which is just the combination of the two previous ones, both before and after clustering. For one stream we first do the system combination, and then we combine such output with the other stream as if before clustering. Maybe it is easier to see here: there are the two streams; for one stream we do the system combination, so we do the clustering, and from these P(y|c) distributions we go back to P(y|x) to get the initial segmentation, or rather the initial probabilities for the segmentation; and for the second stream, say the TDOA stream, we just do the combination before the clustering. Then those two streams are simply combined: we have some P(y|x) matrix from one and the P(y|c)-derived matrix from the other, and the weighting is the same as before.
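One of the two hybrid variants can be sketched like this (toy numbers again; here the TDOA stream is the one clustered first, matching the variant that worked better in the results discussed later):

```python
# Hybrid sketch with toy numbers: the TDOA stream is clustered first and
# mapped back to per-segment posteriors; the MFCC stream keeps its raw
# posteriors; the two matrices are then weighted before re-clustering.
p_mfcc = [[0.7, 0.3], [0.2, 0.8]]        # raw MFCC posteriors per segment
p_y_c_tdoa = [[0.6, 0.4], [0.3, 0.7]]    # TDOA per-cluster distributions
tdoa_cluster_of_x = [0, 1]               # hard segment-to-cluster map
p_tdoa = [p_y_c_tdoa[c] for c in tdoa_cluster_of_x]  # back to segment level

w = 0.8  # illustrative stream weight, estimated on development data
p_hybrid = [[w * a + (1 - w) * b for a, b in zip(rm, rt)]
            for rm, rt in zip(p_mfcc, p_tdoa)]
# p_hybrid is then clustered with the same agglomerative procedure
```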
Of course there are two possible cases, depending on which scheme is applied to which kind of stream, and this is going to be seen in the results table. But first let me say a few words about the experiments.
We are using the same Rich Transcription meeting data, so no AMI data, only Rich Transcription. We use the MFCC features and the TDOA features; all the information is coming from the audio, again a single enhanced speech signal. Again, the stream weights which need to be estimated are estimated on a development set. As before, we are only measuring the diarization error rate with respect to the speaker errors, not the speech/nonspeech errors.
Here are the results. If you remember from the previous talk, the baseline was around fifteen, fifteen point five percent, using the single-stream technique, so just the MFCC features. If we do the combination of the MFCC and TDOA features, in the case of the Information Bottleneck technique and of the HMM/GMM, we may see that we get down to around ten and twelve percent. The second column is just about the weights, the weights for weighting the different features. These are different quantities: in our case it is posterior probabilities which we are combining, while in the case of the HMM/GMM those are log-likelihoods, and that is also why the weights are different. Again, in our case the combination is done over the relevance variables, and as you can see it performs comparably to the other system.
These are the results for the combination done after the clustering, the combination at the system level, as we call it. The baseline of ten point six percent comes from the previous table. If we do the system combination, meaning we combine these relevance-variable distributions after the clustering, you may see we are getting a pretty high improvement, almost forty percent. Then there are of course the two possible combinations of system and model weighting. Again it is pretty straightforward: it is better to do the system combination, the system weighting, with the TDOA features, because they are usually more noisy, and they probably need more data to be well estimated, or at least their relevance variables need more data to be well estimated. In the case of the MFCC features, it looks like the model combination works much better; that is the reason. You may also look at the weights in the table: the weights which we need to estimate move closer to the system-combination side, so instead of zero point seven and zero point three we go to zero point eight; they are estimated on different data, but they generalize for this case.
Now, just a bit of explanation of why we are possibly getting such an improvement. If you look at the single-stream results for each meeting, seventeen meetings in this case, together with the model combination and the system combination, and you look at the bottom rows, which are just the simple MFCC and TDOA single-stream Information Bottleneck techniques with no combination of features, you may see that most of the improvement comes in the cases where there is a big gap between the two single-stream techniques. Where the gap is small you of course do not get much improvement, but where there is a big gap between the MFCC and TDOA single streams, the system combination works pretty well for such a meeting.
So, to conclude the paper. Here we presented a new technique, a new way of combining acoustic feature streams. Rather than weighting the acoustic features before the clustering, as we did before, here we presented a technique which tries to do it after the clustering. The reason is simple: the relevance variables which are used to match the different clusters or different segments are going to be estimated on more data, not just on short segments. And as was seen in the results, we are getting a pretty good improvement with such a technique, around forty percent over all seventeen meetings.
I think I'm done.
(applause)
(inaudible) I think if there are no specific questions... (inaudible) Thank you.