come come from us
And today I would like to present our work entitled "Linguistic influences on bottom-up and top-down clustering for speaker diarization".
So let me give a short overview of the work. First, I will give a short introduction with the motivation of this work, and continue with the formulation of the problem, to finally move on to compare these two clustering systems and illustrate our ideas with some experimental work.
So, as we have seen during the last RT'09 evaluation, there were basically two main approaches for speaker diarization: one is bottom-up, also called agglomerative hierarchical clustering, and the other is top-down, or divisive hierarchical clustering.
We presented last year a paper giving a purification process for speaker diarization systems, and we saw that it gave some consistent improvement for the top-down system. But when trying to apply it to the bottom-up system, we saw that the results were totally inconsistent.
So that is the motivation of this work: to know why it does not work. And this leads us to have a look at what the influence of linguistic variation is on bottom-up and top-down.
So let's start with the formulation of the problem. Here you have an audio stream, and we want to solve the problem of who spoke when. We propose to call G the segmentation, that is, the group of boundaries at each speaker turn, and S the speaker sequence, so the list of the successive speakers. So "who" is S and "when" is G in this case. And we can summarise this setting by the following equation: finding the optimum S and the optimum G as the argument of the maximum of the probability of S and G given a set of observations, in this case O, the audio stream.
So just using Bayes' rule, from that first line we can get the second line you see on the screen. And as the denominator does not depend on S or G, it can be dropped, giving equation number one there.
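The slide is not part of the transcript, so the following is only a reconstruction of the three lines described above, using the symbols named in the talk (S, G and O); the hat notation is an assumption.

```latex
\begin{align}
(\hat{S}, \hat{G}) &= \arg\max_{S,G} P(S, G \mid O) \\
                   &= \arg\max_{S,G} \frac{P(O \mid S, G)\, P(S, G)}{P(O)} \\
                   &= \arg\max_{S,G} P(O \mid S, G)\, P(S, G) \tag{1}
\end{align}
```

The last step holds because P(O) is the same for every candidate pair (S, G), so removing it does not change the argument of the maximum.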
So with this equation we can see that there are two models which are required to solve this optimization task. The first one, needed to compute P(O | S, G), is the acoustic speaker models, so in this case often GMMs, which is the approach currently used in the state of the art. And the second model, P(S, G), is often omitted, maybe except in some previous work such as the one that has just been presented now.
So looking at this equation, we see that we have two main difficulties. First, of course, we do not know who the speakers are. And secondly, the acoustic models defined would, in a perfect world, depend only on the speaker, but they can depend as well on other nuisances like the linguistic content.
So for the next part of this presentation we make the following assumption: that the major nuisance variation is due only to the linguistic content. That is what we are going to represent as the different phonemes, and they are going to be written Q.
So considering this assumption, we can just reformulate our equation. We take the speaker inventory, delta, that is, all the possible speakers, so now we are looking for the optimum S and G plus the optimal speaker inventory delta. Considering now the influence of the phonemes, we can move on to the second line on the screen, which corresponds to marginalising the probability over all the different phonemes Q, and the third line is just obtained with Bayes' rule.
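As a hedged reconstruction of the three lines described here (the slide itself is not in the transcript), with delta written as \Delta and understood as the set of speakers from which the labels in S are drawn:

```latex
\begin{align}
(\hat{S}, \hat{G}, \hat{\Delta})
  &= \arg\max_{S,G,\Delta} P(S, G \mid O) \\
  &= \arg\max_{S,G,\Delta} \sum_{Q} P(S, G, Q \mid O) \\
  &= \arg\max_{S,G,\Delta} \sum_{Q} \frac{P(O \mid S, G, Q)\, P(S, G, Q)}{P(O)}
\end{align}
```

The second line sums the joint posterior over every phoneme sequence Q, and the third applies Bayes' rule to each term of that sum.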
And next, for speaker diarization, we can make the following assumptions. First, all the speakers are equiprobable, so the model P(S, G) can just disappear. And the second assumption is that we can expect the phonemes to be independent of the speaker, and independent of G as well. That is why we are just left with the prior of Q.
So finally we get two equations: the first line for a simple approach, the second line for a maybe more complete approach. Comparing both of them, which should lead to the same results in a perfect world, we see that the second equation is phone-normalized, and that in the first one we should have phone-normalized models as well. It means that P(O | S, G) has to be trained taking into account the different phonemes.
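The two equations compared here can be sketched as follows. This is a reconstruction from the spoken description rather than a copy of the slide. It assumes that P(S, G) is constant (equiprobable speakers), that Q is independent of S and G so that P(S, G, Q) = P(S, G) P(Q), and that P(O) is dropped as a constant.

```latex
\begin{align}
\text{simple approach:}\quad
  (\hat{S}, \hat{G}, \hat{\Delta}) &= \arg\max_{S,G,\Delta} P(O \mid S, G) \\
\text{more complete approach:}\quad
  (\hat{S}, \hat{G}, \hat{\Delta}) &= \arg\max_{S,G,\Delta} \sum_{Q} P(O \mid S, G, Q)\, P(Q)
\end{align}
```

Since \sum_{Q} P(O \mid S, G, Q)\, P(Q) = P(O \mid S, G) under the same independence assumption, the two lines agree only when the model used for P(O | S, G) has itself averaged over the phonemes, which is the phone normalization referred to in the talk.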
So to summarise, we see from this equation that the speaker inventory delta has to be optimized together with S and G, and there is no direct solution for that. That is the reason why we have to find other ways to carry out the search, which are mainly the bottom-up and top-down approaches.
So if we move on to comparing these two approaches: top-down starts with just one cluster and divides it iteratively in order to get the optimum number of clusters, while bottom-up is the opposite scenario, where we start with plenty of clusters and merge them iteratively.
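To make the two search directions concrete, here is a minimal toy sketch. It is not the GMM/HMM-based systems discussed in the talk: it clusters 2-D points with Euclidean distance, merges by centroid distance in the bottom-up case, and splits with a small 2-means step in the top-down case. All function names, the distance measure and the fixed target number of clusters are assumptions made for illustration.

```python
import math

def centroid(cluster):
    """Mean point of a non-empty list of equal-length coordinate tuples."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def spread(cluster):
    """Sum of squared distances to the centroid (0 for a single point)."""
    c = centroid(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)

def bottom_up(points, n_clusters):
    """Agglomerative: start from one cluster per point, merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > max(1, n_clusters):
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

def split_in_two(cluster, iterations=10):
    """2-means split, seeded with two far-apart points of the cluster."""
    c = centroid(cluster)
    a = max(cluster, key=lambda p: math.dist(p, c))
    b = max(cluster, key=lambda p: math.dist(p, a))
    for _ in range(iterations):
        left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
        if not right:  # degenerate cluster: every point coincides
            break
        a, b = centroid(left), centroid(right)
    return left, right

def top_down(points, n_clusters):
    """Divisive: start from a single cluster, keep splitting the widest one."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        widest = max(clusters, key=spread)
        left, right = split_in_two(widest)
        if not right:  # nothing left that can be split
            break
        clusters.remove(widest)
        clusters.extend([left, right])
    return clusters

# Three well-separated toy groups standing in for three speakers.
demo_points = [(0, 0), (0, 1), (1, 0),
               (10, 10), (10, 11), (11, 10),
               (20, 0), (20, 1), (21, 0)]
```

On cleanly separated data such as `demo_points`, both directions recover the same three groups, which mirrors the "perfect world" case in the talk. The two directions only start to differ when the data contain competing sources of variation.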
So bottom-up is by far the more popular approach, and it got the best results at the last RT'09 evaluation. Top-down is maybe a bit less popular but achieves competitive results. Our work, for instance, shows that for single distant microphone conditions it can lead to comparable results.
But the question is: okay, we start with an audio stream, we converge to some clusters, and how can we be sure that these clusters correspond to speakers, and not to some other acoustic unit like the phonemes?
These approaches converge to a local maximum. In the perfect world, inter-speaker variation dominates over the intra-speaker variation, and in that case we should say, okay, bottom-up and top-down should lead to exactly the same results.
But of course nothing is perfect. There is as well the influence of the linguistic content, which can be very significant. It means that when the speaker models are not well normalized, the system can converge to a local maximum where we model not a speaker unit but some other acoustic unit, like the phonemes Q.
So in the case of top-down, the new speaker models are introduced from a well-normalized general model. This model is trained with all of the available speech, so we can expect this model to be well normalized. And then the speakers are iteratively introduced with a large amount of data, so we can expect these new models to be quite well normalized as well. But there is a huge risk as well that, in normalizing against the linguistic content, we normalize the speaker variation as well. That is of course what we do not want to get: we want to get a highly speaker-discriminative system.
By comparison, with bottom-up we start the system with some very small clusters, which can capture a local maximum and are highly discriminative. So that is, from this point of view, an advantage of the bottom-up approach. But the problem is that, as the clusters are very small at the beginning, there is a huge risk that the system converges to some other acoustic units, and the clusters are not well normalized.
So finally, just to summarise, I think both of the systems have their own drawbacks and their own advantages with respect to speaker discrimination and normalization against linguistic nuisances. So let's illustrate this right now with some experimental work.
Here is our experimental setup. We have a speech activity detector, which is the same for both of the systems. On the left (I think it is on the left for you as well, yeah) you have the bottom-up system, so it is a classical, state-of-the-art system. You can see the reference here, and I am not going to spend too much time talking about it. And on the right side you have the top-down system, a typical top-down system as well. So these are the two core systems.
And next we use a purification step, as in the paper shown here. This is an optional step, and we will see the difference it leads to. Then there is a Viterbi and MAP-based resegmentation, and after this a normalization of the features and a final resegmentation.
For the datasets, we have a development set of conference meetings from the NIST RT'04, '05 and '06 evaluations. And for the evaluation sets, we propose to use the RT'07 and RT'09 datasets, and a dataset of TV shows, which is a corpus of TV shows.
Here are the diarization performances. The first column is the score with overlapped speech, and the second is the score without overlapped speech. Of course, as our system does not process the overlapped speech, we just focus on the second one.
So first, just looking at the results without applying the purification step, we can see for both of the systems that the results are much worse for the TV shows. Second, it is not always the same system which provides the best result: it depends on the dataset. We see for example that for RT'07 top-down gives the better result, while for RT'09 it is the bottom-up.
And we can also consider the results with purification. As you can see, purification gives a degradation in performance for the bottom-up, while for the top-down it always improves the system.
So the question is: okay, maybe purification improves the discrimination between clusters in the case of top-down, whereas for the bottom-up, which is less well normalized against phones, in this case the purification is useless. So let's explain it a bit more clearly with the cluster purity.
So we propose to look at all the clusters output by one of the systems and compute the purity for all of these clusters. The purity is computed this way: we take the dominant speaker's time in seconds and we divide by the total number of seconds of the cluster.
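The purity measure just described (dominant speaker time divided by total cluster time) can be sketched in a few lines. The input format, a list of `(speaker_label, seconds)` pairs for one cluster, and the function name are assumptions made for this illustration.

```python
from collections import defaultdict

def cluster_purity(segments):
    """Fraction of a cluster's duration that belongs to its dominant speaker.

    segments: iterable of (speaker_label, duration_in_seconds) pairs,
    where the labels come from the ground-truth reference.
    """
    time_per_speaker = defaultdict(float)
    for speaker, seconds in segments:
        time_per_speaker[speaker] += seconds
    total = sum(time_per_speaker.values())
    if total <= 0:
        raise ValueError("cluster has no speech time")
    return max(time_per_speaker.values()) / total

# Hypothetical cluster: 50 s of speaker A and 10 s of speaker B.
example_cluster = [("A", 30.0), ("B", 10.0), ("A", 20.0)]
example_purity = cluster_purity(example_cluster)  # 50 / 60
```

A purity of 1.0 means the cluster contains a single reference speaker. On its own, the measure says nothing about whether one speaker has been spread over several clusters, which is why the talk reads it together with the number of clusters.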
So there are different situations. If we have a high purity and a small number of clusters, we can expect the system to be likely to have converged to some speaker units. With a high purity and a very high number of clusters, it is likely that the system converged to some other acoustic units. For the other situations we do not really know what happened, different scenarios are possible, and the same goes for the last case.
So looking at the purity and the number of clusters when we do not use the purification process, the top-down has a comparable purity, more or less, with the bottom-up, but the top-down has fewer clusters. And we see, on the right, the number of clusters: we have more clusters for the bottom-up, and for the top-down the number of clusters is nearer the ideal number of clusters from the ground truth. So we can expect the top-down to be in the first situation, that is, to converge to some speakers, whereas the bottom-up is probably in the fourth case.
So, well, to see what happens with the purification: we see first that, for the top-down, the purification improves the purity of the clusters, so for sure it helps the system to converge to speakers more than without purification. For the bottom-up, the change in purity is not as consistent, and there are more clusters than for the top-down, so it is not obvious to say which situation we are in.
And last, for this experimental part, we are looking at the phone normalization. For this, we take all the clusters for a system, and for each of these clusters we build the histogram of the different phonemes. We do this for all the clusters generated by the top-down, and the same for the bottom-up. And for all of the pairs of clusters, we compute the inter-cluster distance between the histograms, using the Kullback-Leibler distance. Finally, we take the average of all of these distances for each of the systems.
So we can expect this average distance to be small if the distribution of the different phones is similar in the different clusters, which means that the system is well phone-normalized. And we can expect it to be high when there is a higher degree of convergence to phonemes, so that the distribution of phones is not equal across the different clusters.
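The averaging of pairwise histogram distances described above can be sketched as follows. The talk only names the Kullback-Leibler distance; the symmetric form used here, the additive smoothing constant `eps`, and all function names are assumptions made so that the example is well defined and runnable.

```python
import math
from collections import Counter
from itertools import combinations

def phone_distribution(phones, inventory, eps=1e-6):
    """Smoothed relative frequency of each phone of the inventory in a cluster."""
    counts = Counter(phones)
    total = sum(counts[p] + eps for p in inventory)
    return [(counts[p] + eps) / total for p in inventory]

def kl_divergence(p, q):
    """Kullback-Leibler divergence between two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrised variant, so the result does not depend on cluster order."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def average_inter_cluster_distance(clusters, inventory):
    """Mean pairwise distance between the phone histograms of all clusters.

    clusters: list of phone-label sequences, one per cluster (at least two).
    """
    distributions = [phone_distribution(c, inventory) for c in clusters]
    pairs = list(combinations(distributions, 2))
    if not pairs:
        raise ValueError("need at least two clusters")
    return sum(symmetric_kl(p, q) for p, q in pairs) / len(pairs)

# Hypothetical four-phone inventory and two toy systems.
inventory = ["a", "e", "s", "t"]
phone_balanced = [list("aest") * 5, list("aest") * 3]  # same phone mix everywhere
phone_skewed = [list("aaaae"), list("sssst")]          # clusters follow the phones
```

With these toy inputs, `average_inter_cluster_distance(phone_balanced, inventory)` is close to zero while the value for `phone_skewed` is much larger, matching the reading given in the talk: a small average distance suggests phone-normalized clusters, a large one suggests clusters that have converged toward phone content.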
Here are the results. Looking first at the case without the purification step, the distance is higher for the bottom-up, which really shows that in this case the clusters are not as well normalized. Applying now the purification, we see that there is an improvement for both of the systems, but the bottom-up plus purification is still very much higher than the top-down plus purification, which explains why the purification does not help the bottom-up. So, to conclude.
We have seen in these slides that both approaches, bottom-up and top-down, give some comparable results but have different behaviours. Bottom-up is more discriminative, but often produces some clusters which are less normalized against linguistic content, whereas the top-down often produces some clusters which are better normalized but less speaker-discriminative.
So I think one of the conclusions of this work is that there is a good thing to do in the combination of these two approaches. We recently published a bit about this, but I think there are a lot of other things to try. And as future work, we can maybe expect to design a specific purification process for the bottom-up, taking into consideration this linguistic influence, which is quite particular to this approach. And that's it, thanks.
Okay, any questions?
Okay, then if I may, one quick question.
i the can i think with
and are going to take a hard thing and
um
can i oh oh oh
for
right
as a we stick like to use
What I have seen is that it is the core of these two approaches which matters; the purification was just a motivation which led to this work. It is the core of the bottom-up and the core of the top-down that act differently with respect to the linguistic influence, which is in this case the phonetic content of the speech.
so uh
but it
but to a question you
i and
you
i think i think