come come from us

and today i would like to present our work entitled linguistic influences on

bottom-up and top-down clustering

for speaker diarization

so let me

give a short overview of the work

so first

we give a short introduction giving the motivation of this work

and go on with the

formulation of the problem

to finally move on to comparing these two clustering systems

and finally illustrate our ideas with

some experimental work

so

as we have seen during the last NIST RT'09 evaluation there were basically two main approaches for

speaker diarization, one is bottom-up, also called agglomerative hierarchical clustering

and the other is top-down

or divisive hierarchical clustering

um

we presented, actually last year, a cluster

purification process

for

speaker diarization systems

and we saw that it gave some consistent improvement for the top-down system

but when trying to apply it to the bottom-up

we saw that the results were totally inconsistent

so that's

the motivation of this work, to know why it does not work, and this leads us to

try to have a look at what the influence of linguistic nuisances is on bottom-up and

top-down

so let's start with the formulation of the problem

so here you have an audio stream

and we want to solve the problem of who spoke when, so we propose to call G the segmentation

so

it is the group of boundaries at each speaker turn

and

S

which is

the speaker sequence, so the list of the successive speakers

so we see S and G in this case

and so we can

summarise this

setting by the following equation, finding the optimum S and the optimum G as the argument of the

maximum

of

the probability of S and G given a set of observations, in this case O, the audio stream

so just using Bayes' rule

on this equation

we can get the second line you see on the screen

and the denominator can be removed as it does not depend on S or G

so giving equation number one there
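
Written out, the derivation described here might look as follows. This is a sketch reconstructed from the spoken description only: the slide is not part of the transcript, and the tilde notation for the optimum is an assumption.

```latex
\begin{align}
(\tilde{S}, \tilde{G})
  &= \arg\max_{S, G} P(S, G \mid O) \\
  &= \arg\max_{S, G} \frac{P(O \mid S, G)\, P(S, G)}{P(O)} \\
  &= \arg\max_{S, G} P(O \mid S, G)\, P(S, G)
\end{align}
```

The last step holds because P(O) is the same for every candidate pair (S, G), so dropping it does not change which pair maximises the expression.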

so with this equation we can see that

there are two models which are required to solve this optimization task

the first one, to compute

P of O given S and G

is the acoustic speaker models

so in this case it is often a GMM in most of the approaches we use

currently in the

state of the art

and the second

model, so P of

S and G

which is often omitted

maybe except in the

previous work that has just been presented

uh and

so looking at this equation we see that we have two main difficulties

first

of course we do not know who the speakers

are

and secondly the acoustic models depend, in

a perfect world

only on the speaker, but they can depend as well

on other nuisances like the linguistic content

so for the next part of this presentation we make the following assumption

that the major nuisance

variation is due only

to the linguistic content

so that's what we are going to focus

on, the different phonemes

and they are going to be written Q

so

considering this assumption we can just reformulate our equation

to take in the

speaker inventory delta, that is, all the possible speakers

so now we are looking for the optimum S and G plus the optimal speaker inventory delta

so considering now the influence of the phonemes we can move on to the second line on

the screen

which corresponds to marginalising the probability over all the different phonemes Q

and the third line is just the same, expanded with the Bayes rule

um

and next

we can propose to make two assumptions, first

in speaker diarization we make the following assumption, that all the speaker sequences are equiprobable

so we can just drop this model, so P

of S and G can just disappear

and the second assumption is

that

we can expect

the phonemes to be independent of the speaker and independent of G as well

so that's why we can just write the phoneme term as the prior of Q

so finally

we get two equations, the first for a simple approach

the second line for a maybe more complete approach

and comparing both of them, which should lead to the same results in a perfect world

we see that

the second equation is phone normalized

and

that in the first one

we should have a normalized model as well

it means that P of O given S and G has to be trained

so that

it is normalized across the different phonemes
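
As a sketch of the two expressions being compared here, assuming O denotes the audio observations, delta the speaker inventory from which the speakers in S are drawn, and Q the phone classes, the derivation could be written as below. The exact form on the slide is not available in the transcript, so the layout and the annotations are assumptions.

```latex
% More complete approach: phones are modelled and marginalised out.
\begin{align}
(\tilde{S}, \tilde{G}, \tilde{\Delta})
  &= \arg\max_{S, G, \Delta} \sum_{Q} P(S, G, Q \mid O) \\
  &= \arg\max_{S, G, \Delta} \sum_{Q} P(O \mid S, G, Q)\, P(Q \mid S, G)\, P(S, G)
     && \text{(Bayes' rule, } P(O) \text{ dropped)} \\
  &= \arg\max_{S, G, \Delta} \sum_{Q} P(O \mid S, G, Q)\, P(Q)
     && \text{(uniform } P(S, G)\text{; } Q \text{ independent of } S, G\text{)}
\end{align}

% Simple approach: phones are not modelled explicitly.
\begin{equation}
(\tilde{S}, \tilde{G}, \tilde{\Delta}) = \arg\max_{S, G, \Delta} P(O \mid S, G)
\end{equation}
```

The two forms agree when P(O | S, G) equals the sum over Q of P(O | S, G, Q) P(Q), that is, when the speaker models are themselves averaged over the phone classes. This is one way to read the requirement that the simple approach also needs phone-normalized speaker models.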

so

to summarise

we see from this equation that

the speaker inventory delta has to be optimized together with S and G

and so that a an called solution for the top

so um that the reason why it is to the fine was try to and you are search

um um

which are mainly the bottom-up and top-down approaches

so if we move on now to comparing these two approaches

we see that

top-down starts with

just one cluster and divides it

iteratively in order to get the optimum number of clusters, while bottom-up is the opposite scenario, we start

with plenty of clusters and merge them iteratively
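
The two search directions can be illustrated with a toy example. The sketch below is not the diarization systems discussed in the talk, which cluster speech segments using speaker models; it only shows the merge-versus-split control flow on one-dimensional points. The function names, the centroid-distance merge criterion, the largest-gap split criterion and the fixed target number of clusters are all assumptions made for illustration.

```python
def centroid(cluster):
    """Mean of a non-empty list of 1-D points."""
    return sum(cluster) / len(cluster)


def bottom_up(points, n_clusters):
    """Agglomerative: start with one cluster per point, merge the closest pair."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def top_down(points, n_clusters):
    """Divisive: start with one cluster, split the widest one at its largest gap."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [sorted(points)]
    while len(clusters) < n_clusters:
        # Only clusters with at least two points can be split.
        splittable = [k for k in range(len(clusters)) if len(clusters[k]) > 1]
        idx = max(splittable, key=lambda k: clusters[k][-1] - clusters[k][0])
        c = clusters[idx]
        cut = max(range(1, len(c)), key=lambda k: c[k] - c[k - 1])
        clusters[idx:idx + 1] = [c[:cut], c[cut:]]
    return clusters


if __name__ == "__main__":
    data = [1.0, 1.2, 0.9, 5.0, 5.1, 9.0, 9.3]
    print(bottom_up(data, 3))
    print(top_down(data, 3))
```

On well-separated data like this both directions reach the same grouping. The talk's point is that on real speech, where phone variation competes with speaker variation, the two searches can settle in different local maxima.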

um

so bottom-up is by far the more popular approach

it got the best results at the last

RT'09 evaluation

top-down is

maybe a bit less popular but achieves competitive results

our work, for instance

shows that for single distant microphone it can lead to comparable results

but the question is, okay, the system will converge to some clusters

but how can we be sure that these clusters correspond to speakers

or to other acoustic nuisances like the phonemes

so

we know that these approaches converge to a local maximum

in the perfect world

inter-speaker variation dominates over the intra-speaker variation

and if

i mean

we could say, okay, bottom-up and top-down should lead to exactly

the same

results

but

of course nothing is perfect

there is as well the influence of the linguistic content, which can be very significant

and it means that when the speaker models are not well normalized

the system can converge to a local maximum and we can get

not

speaker units but other acoustic units like the phonemes Q

so in the case of top-down

the new speakers are introduced from a

normalized background model

so this model is trained with all the available speech so we can expect the model

to be well normalized

and then the

speakers are iteratively introduced with a large amount of data, so

we can expect

this new model to be quite well normalized as well

but there is a huge risk as well

that in normalizing against the linguistic nuisance

we normalize the speaker variation as well, and that's of course what we don't want, we want to

get a highly speaker-discriminative system

in comparison, with the bottom-up

we start the system with some very small clusters

which can

i mean reach a local maximum and be highly discriminative

so from this point of view

there is better

discrimination with the bottom-up

but the problem is

as the clusters are very small at the beginning

there is a huge risk that the system converges to some

other acoustic units and is poorly

normalized

so finally, just to sum up

i think both of the systems have their own drawbacks and their own advantages

with respect to speaker discrimination and normalization against linguistic nuisances

so let's illustrate this now

with some

experimental work

so

here is our experimental setup, so

we have a speech activity detector which is the same for both systems

on the left, i think it's on the left for you as well, yeah

you have the bottom-up system, so it's a classical

state-of-the-art system

in the following reference, as you can see, and i am not going

to spend too much time

talking about this, and on the right side you have the top-down system

so a typical top-down system as well

so these are the two core

systems

and next we use the purification

as in the following paper shown here

so this is an optional step, we will see the difference it makes

and it is followed by a MAP-based resegmentation and after this a normalisation of

the features and a final resegmentation

for the datasets, so

we took training data from conference meetings

from the NIST RT'04, '05 and '06 evaluations

and for the evaluation sets we propose to use the RT'07 and RT'09

datasets, which are

a of T V shows right cord want to shook is a function of T B shows corpus

here ah

no additional preference

so the first call um is that a can be the better

and

is the score one

of speech

of course as our system

the help

does not process the overlapped speech, we just focus on the second one

and

so a for

the we can see

and and just looking at the is that was also apply

occasions fixations that

and see that okay first sub of the system

a a better to an ounce for two Y for

top down a well there is a uh

um is a result a much worse for of

for uh um

the T V shows

yeah signal

it is not always the same system which provides the best result

according to the dataset, see for example RT'07, where top-down gives better

results, while for RT'09 that

is the bottom-up

and

we can also consider

the results with purification

so as you can see

purification

just

brings a degradation in performance for the bottom-up

while for the top-down

it always improves the system

so

so the question is, okay, maybe purification improves the discrimination between clusters

i i am a as has a down

the propagation

bottom-up

is less well normalized against phones

so in this case the purification is useless

so to explain it a bit, let's take a look at the cluster purity

so we propose to look at all the clusters output by one of the systems and compute the purity

for all of these clusters

so the purity is computed

as follows

we take the dominant speaker time in seconds

and we divide by the total number

of seconds in the cluster
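
The purity measure described here (dominant speaker time divided by total cluster time) can be sketched in a few lines. The function name and the input format, a mapping from a reference speaker label to the seconds of that speaker found in the cluster, are assumptions for illustration.

```python
def cluster_purity(speaker_seconds):
    """Purity of one cluster: dominant speaker time over total cluster time.

    speaker_seconds maps a reference speaker label to the number of seconds
    of that speaker assigned to the cluster (hypothetical input format).
    """
    total = sum(speaker_seconds.values())
    if total <= 0:
        raise ValueError("cluster contains no speech")
    return max(speaker_seconds.values()) / total


if __name__ == "__main__":
    # 30 s of one speaker and 10 s of another in the same cluster.
    print(cluster_purity({"spk_a": 30.0, "spk_b": 10.0}))
```

A cluster that holds a single speaker has purity 1.0; the more speakers share the cluster, the lower the value.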

so there are different situations

if we have a high purity and a small number

of

cluster

yeah well i i one has a pretty is a purity of

cluster

we can expect the system

to be likely to have converged to speaker units

and with a very high number of clusters

it is likely that the system has converged to other acoustic units

we as it to as their have been

a

uh we do and what happened difference in audio was a are possible and the same for the last case

so looking at the purity and the number of clusters

um the for tab and so we see him in

when we don't do the purification process

top-down has a comparable purity

more or less, with

bottom-up

the top down as the that's class

and C

on the right is the number of clusters

we have a as uh clusters them the bottom-up up and them about cluster it's clusters none the idea and

number of cluster to for the ground truth

so we can expect top-down to be in the first situation, so converging

to speakers

whereas bottom-up is probably the fourth case

so

well

let's see what happens

with

the purification

we see that

first, for the top-down, the purification

helps, as the purity is improved

um

i cluster

so for sure it helps the system to converge to speakers, compared to without purification

uh

there is a consistent in purity

uh i i have a cluster them for the top that down so

that is not or even i have to say a in which it situation we however

uh

the last part of this experimental part is

looking at the phone normalisation

for this case, so

we take the different clusters

we take all the clusters

for a system and for each of these clusters

we compute the

histogram of the different phonemes

we do this for all the clusters generated by the top-down and the same for the bottom-up

and for

all of

them we

compute the inter-cluster distance between the histograms

this is the Kullback-Leibler distance

so

and

we take the average of all

these distances for each of the systems
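
A minimal sketch of this measurement follows: one phone histogram per cluster, a Kullback-Leibler distance for every pair of clusters, then the average. The talk does not say whether the divergence is symmetrised or how empty histogram bins are handled, so the symmetric form and the additive smoothing constant `eps` used here are assumptions, as are the function names and the input format.

```python
import math
from itertools import combinations


def to_distribution(counts, phones, eps=1e-6):
    """Turn phone counts into a smoothed probability distribution over `phones`."""
    total = sum(counts.get(p, 0) for p in phones) + eps * len(phones)
    return [(counts.get(p, 0) + eps) / total for p in phones]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two aligned distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrised KL, so that the value does not depend on argument order."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def average_inter_cluster_distance(cluster_phone_counts):
    """Average symmetric KL over all pairs of cluster phone histograms.

    cluster_phone_counts is a list with one dict per cluster, mapping a phone
    label to how often it occurs in that cluster (hypothetical input format).
    """
    phones = sorted({ph for counts in cluster_phone_counts for ph in counts})
    dists = [to_distribution(counts, phones) for counts in cluster_phone_counts]
    pairs = list(combinations(dists, 2))
    if not pairs:
        raise ValueError("need at least two clusters")
    return sum(symmetric_kl(p, q) for p, q in pairs) / len(pairs)


if __name__ == "__main__":
    balanced = [{"a": 10, "e": 10}, {"a": 5, "e": 5}]
    skewed = [{"a": 19, "e": 1}, {"a": 1, "e": 19}]
    print(average_inter_cluster_distance(balanced))
    print(average_inter_cluster_distance(skewed))
```

Clusters that share the same phone distribution give an average near zero, while clusters that each favour different phones give a large value.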

so

we can expect this average distance to be small

when there is the same

distribution of the different phones in the different clusters

which means that the system is well normalized

and we can expect it to be high when there is a higher degree of convergence towards phones

and so

the distribution is not equal in the different clusters

the table shows the results, first

without the purification step

i

are used

is a bottom-up

which shows really that in this case the clusters are better normalized

applying now the purification

we see that there is an improvement for both of the systems

but bottom-up plus purification

is still very much higher than the top-down

plus purification

which might just explain why the purification does not improve the results

of the bottom-up

so to conclude

um

we have seen in these slides that

both approaches, bottom-up and top-down

give some comparable results but

have

different behaviours

bottom-up is more discriminative

but

often

ends up with some clusters which are less normalized against linguistic content

whereas the top-down

often ends up with some clusters which are better normalized but less

speaker discriminative

so

i think

one of the conclusions of this work is that a good thing to do is a combination

of these two approaches

so we recently published a bit on that but i think there are a lot of other

things to try

and as future work

we can expect to maybe design a specific purification process

for the bottom-up

taking into consideration this linguistic influence, which is quite particular

or or on a of this approach

he france's

and that's it

thanks

okay

any question

okay that and if i one a quick question

i the can i think with

and are going to take a hard thing and

um

can i oh oh oh

for

right

as a we stick like to use

i have seen that is

the core of these two approaches, which is not the purification, that was just a motivation

which led to this work

but the core of the bottom-up and the core of the top-down act differently

against the linguistic influence, which is in this case the phone content

of the speech

so uh

but it

but to a question you

i and

you

i think i think