So, this is the second talk about speaker diarization, and here we are trying to focus on a multistream approach. The baseline technique we are using is the same as in the previous talk, the Information Bottleneck system, and as was already said, we are trying to look at the combination of the outputs, or actually the combination of different streams on different levels. These are acoustic streams only, so no prior information from other sources.
Again, the work itself was done by the first author of the paper, who could not be here today.
The motivation is the same, or at least close: again we assume that the recordings we are working with are recorded with multiple distant microphones.
As features, we are using two kinds of acoustic features: the MFCC features, which are kind of standard, and then the time delay of arrival (TDOA) features. The TDOA features are pretty complementary to the MFCCs, and nowadays people use them quite a lot for diarization. Actually this is the winning acoustic feature combination: with the agglomerative Information Bottleneck technique it gives state-of-the-art results on meeting diarization.
So, back to our motivation. Usually the feature streams are combined at the model level: there are separate models, for example GMM models, for the different feature streams, and the log-likelihoods of these models are combined with some weighting. There are also some other approaches, like voting schemes between diarization systems, or integrated approaches where the initialization of one system is done on the output of the other system.
Our question is whether these two kinds of acoustic features can be integrated using independent diarization systems rather than independent models; in other words, is there some advantage of using system combination rather than model combination? What we mean by system and model combination will hopefully become clear in a slide or two.
So, the outline of the talk. First let me say a few words about the Information Bottleneck principle which we use, applied to single-stream diarization, so with no combination of features yet; then a few words about the model-based combination, the system-based combination, and some hybrid combination; and finally the experimental results. Again, we are getting state-of-the-art results with such a system using this agglomerative Information Bottleneck technique, and there is not too much computational complexity in it, so that is kind of the advantage.
How does it work, this Information Bottleneck principle? It is a kind of intuitive approach which has been borrowed from document clustering. At the beginning, suppose we have some documents that we want to cluster into C clusters, in our terminology. What is added as side information is some variable Y, which we call the relevance variable, and which carries information about the clustering. In document clustering these relevance variables can be the words of the vocabulary, which of course carry information about the clusters. We also assume that the conditional distribution P(y|x), so Y given the input X, is available.
Going back to our problem of speaker diarization, our X is a set of elements of the speech: speech segments, obtained by uniform segmentation, and these need to be clustered into C clusters. The Information Bottleneck principle states that the clustering should preserve as much information as possible between C and Y while minimizing the distortion. This distortion, which in our terms is the mutual information between X and C, we can see as compression, or in our setting as some regularization: if you do not have this distortion term, the clustering is probably going to collapse into one global cluster, which is not the desired solution.
So this is the intuitive approach, but in the end it can be proved that if you maximize this objective function, which is the mutual information between C and Y minus some Lagrange multiplier times the mutual information between X and C, you move the problem to a form where the distributions P(y|x) are compared using the Jensen-Shannon divergence. That is the point: we do not need to look for some special divergence measure to decide which clusters should be merged together in this intuitive approach; by doing the derivation we find out that it should be the Jensen-Shannon divergence that is used for the clustering.
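Written out (a sketch based on what was just said; β is the Lagrange multiplier, and the merge cost is shown with only its relevance term), the objective and the resulting merge cost are roughly:

```latex
\max \; I(C;Y) \;-\; \frac{1}{\beta}\, I(X;C),
\qquad
d(c_i, c_j) \;=\; \bigl(p(c_i) + p(c_j)\bigr)\,
\mathrm{JS}_{\pi}\bigl[\,p(y \mid c_i),\; p(y \mid c_j)\,\bigr],
```

where the JS mixture weights π are the normalized cluster priors.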
So in the end the approach is pretty simple. Here is the algorithm: it is agglomerative, so in each iteration we merge two clusters together based on this Jensen-Shannon divergence: we take the pair of clusters with the smallest divergence and we just merge them, and we iterate until some stopping criterion is reached. The stopping criterion is again pretty simple: it is the normalized mutual information between C and Y, so again something to finalize this iterative approach. That is all we need: we have a stopping criterion, we have a way to measure the similarity between clusters, and it is pretty simple to code.
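Since the loop is simple to code, here is a toy pure-Python sketch of the greedy agglomerative step. This is an illustration under assumptions, not the actual system: the stopping criterion here is a fixed target number of clusters instead of the normalized mutual information threshold mentioned above, and all helper names are made up.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, w1=0.5, w2=0.5):
    """Weighted Jensen-Shannon divergence used as the merge criterion."""
    m = [w1 * pi + w2 * qi for pi, qi in zip(p, q)]
    return w1 * kl(p, m) + w2 * kl(q, m)

def agglomerative_ib(p_y_given_x, n_clusters):
    """Greedily merge the pair of clusters whose relevance distributions
    p(y|c) have the smallest prior-weighted JS divergence."""
    n = len(p_y_given_x)
    clusters = [[i] for i in range(n)]      # start: one cluster per segment
    dists = [list(p) for p in p_y_given_x]  # p(y|c)
    priors = [1.0 / n] * n                  # p(c), uniform over segments
    while len(clusters) > n_clusters:
        best, pair = float("inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                w = priors[i] + priors[j]
                d = w * js(dists[i], dists[j], priors[i] / w, priors[j] / w)
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        w = priors[i] + priors[j]
        # merged cluster's relevance distribution is the prior-weighted mix
        dists[i] = [(priors[i] * a + priors[j] * b) / w
                    for a, b in zip(dists[i], dists[j])]
        priors[i] = w
        clusters[i] += clusters[j]
        del clusters[j], dists[j], priors[j]
    return clusters
```

On toy data, segments with matching relevance distributions get merged first: `agglomerative_ib([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]], 2)` groups the first two and the last two segments.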
Just a few words about the distributions which appear here. The probability P(c|x), where c is a cluster and x is an input segment, is going to be a hard partition, meaning each segment belongs to only one cluster; there is no soft weighting between several clusters. And the probability P(y|c), which is the relevance-variable distribution of a cluster, is what is actually used to do the merging.
Everything should be clearer in this figure. Suppose we have input speech which is uniformly segmented, with for example MFCC features, in this single-stream approach: these segments are the elements X. As for the relevance variables, I still did not say what they are, but in our case Y is just a universal background model, a GMM trained on the entire speech; this is what defines the relevance variables. So the matrix which you see in the middle holds the vectors P(y|x), which are the posterior probabilities of the components of Y given the input segments. The clustering, which is again the agglomerative technique, then gives some initial segmentation, and finally we do a refinement, training a GMM and doing Viterbi decoding.
Now let us go back to the feature combination. In the case of feature combination based on the background models, suppose we have two features, again the MFCCs and the TDOAs, and we have two background models, each trained on one of the features. What we can simply do is linearly weight these P(y|x) vectors of probabilities with some weight, and this gives us a new matrix of segment probabilities. These weights, of course, we train or estimate on development data, since we will be diarizing different data. Once we have these combined P(y|x), the rest of the diarization system is the same: the combination happens just at the beginning, where we combine these relevance-variable distributions, and then we run the same iterative clustering approach.
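A minimal sketch of this model-level combination, with made-up toy numbers (the weight value is only illustrative; in the talk it is estimated on development data):

```python
# Hypothetical per-segment posterior vectors p(y|x) from two background
# models, one trained on MFCC and one on TDOA features (toy numbers).
p_mfcc = [[0.7, 0.3], [0.2, 0.8]]
p_tdoa = [[0.6, 0.4], [0.4, 0.6]]

w = 0.7  # stream weight, in practice estimated on development data

# linear weighting of the two posterior matrices, row by row
p_combined = [[w * a + (1 - w) * b for a, b in zip(row_m, row_t)]
              for row_m, row_t in zip(p_mfcc, p_tdoa)]

# each combined row is still a valid distribution over relevance variables
assert all(abs(sum(row) - 1.0) < 1e-9 for row in p_combined)
```

The combined matrix then goes into the same agglomerative clustering as in the single-stream case.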
This is actually not new; it has already been published by us at last year's Interspeech. Here is just a figure of how it is done: again there are the matrices holding the P(y|x) probability vectors, they are simply combined by a weighted sum, and then there is the clustering operation and the refinement.
Now, what is actually new, and what we try in this paper, is the multiple system combination. Instead of doing the combination before clustering, what would happen if we do the combination after clustering? Again suppose that there are two background models trained on the different features, but now there are two complete diarization systems. Each of them iteratively produces some clusters; the stopping criteria can actually be different, meaning there can be a different number of clusters for each feature stream. At the end we get the distributions P(y|c), and to go back from these clusters to the initial segmentation, that is, to recover P(y|x), all we need is a simple Bayes operation.
Here is again a figure of how this is done. There are two diarization systems which do the complete clustering, and in the end we get some clusters; to get back to the per-segment distributions P(y|x), we just apply this simple operation and integrate over all the clusters C. Why this should actually work is again pretty intuitive: in this case the P(y|x) after the combination are estimated on a large amount of data, so they are not estimated on the short segments as in the case of the combination before clustering. Each P(y|x) is now estimated on a lot of data, because there are just a few clusters in the end.
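The Bayes step, p(y|x) = sum over c of p(y|c) p(c|x), is especially simple with the hard partition from above; a toy sketch with hypothetical numbers:

```python
def posteriors_from_clusters(p_y_given_c, cluster_of_x):
    """p(y|x) = sum_c p(y|c) p(c|x); with a hard partition this just copies
    the cluster's relevance distribution to each of its segments."""
    return [list(p_y_given_c[c]) for c in cluster_of_x]

# hypothetical output of one diarization system: 2 clusters, 4 segments
p_y_given_c = [[0.8, 0.2], [0.1, 0.9]]
cluster_of_x = [0, 0, 1, 1]          # hard segment-to-cluster assignment
p_y_x = posteriors_from_clusters(p_y_given_c, cluster_of_x)
# these per-segment distributions were estimated on whole clusters and can
# now be weighted with the other stream's matrix and re-clustered
```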
The third approach is a hybrid system, which is just the combination of the two previous ones, both before and after clustering. For one stream we first do the system combination, and then we combine such output with the other stream as if before clustering. Maybe it is easier to see here: there are the two streams; for one stream we do the system combination, so we do the clustering, and from these P(y|c) distributions we go back to P(y|x) to get the initial segmentation, or rather the initial probabilities for the segmentation; and for the second stream, say the TDOA stream, we just do the combination before the clustering. Then those two streams are simply combined: we have some P(y|x) matrix from one and the P(y|c)-derived matrix from the other, and the weighting is the same as before.
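One of the two hybrid variants can be sketched like this (toy numbers again; here the TDOA stream is the one clustered first, matching the variant that worked better in the results discussed later):

```python
# Hybrid sketch with toy numbers: the TDOA stream is clustered first and
# mapped back to per-segment posteriors; the MFCC stream keeps its raw
# posteriors; the two matrices are then weighted before re-clustering.
p_mfcc = [[0.7, 0.3], [0.2, 0.8]]        # raw MFCC posteriors per segment
p_y_c_tdoa = [[0.6, 0.4], [0.3, 0.7]]    # TDOA per-cluster distributions
tdoa_cluster_of_x = [0, 1]               # hard segment-to-cluster map
p_tdoa = [p_y_c_tdoa[c] for c in tdoa_cluster_of_x]  # back to segment level

w = 0.8  # illustrative stream weight, estimated on development data
p_hybrid = [[w * a + (1 - w) * b for a, b in zip(rm, rt)]
            for rm, rt in zip(p_mfcc, p_tdoa)]
# p_hybrid is then clustered with the same agglomerative procedure
```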
Of course there are two possible cases, depending on which scheme is applied to which kind of stream, and this is going to be seen in the results table. But first let me say a few words about the experiments.
We are using the same Rich Transcription meeting data, so no AMI data, only Rich Transcription. We use the MFCC features and the TDOA features; all the information is coming from the audio, again a single enhanced speech signal. Again, the stream weights which need to be estimated are estimated on a development set. As before, we are only measuring the diarization error rate with respect to the speaker errors, not the speech/nonspeech errors.
Here are the results. If you remember from the previous talk, the baseline was around fifteen, fifteen point five percent, using the single-stream technique, so just the MFCC features. If we do the combination of the MFCC and TDOA features, in the case of the Information Bottleneck technique and of the HMM/GMM, we may see that we get down to around ten and twelve percent. The second column is just about the weights, the weights for weighting the different features. These are different quantities: in our case it is posterior probabilities which we are combining, while in the case of the HMM/GMM those are log-likelihoods, and that is also why the weights are different. Again, in our case the combination is done over the relevance variables, and as you can see it performs comparably to the other system.
These are the results for the combination done after the clustering, the combination at the system level, as we call it. The baseline of ten point six percent comes from the previous table. If we do the system combination, meaning we combine these relevance-variable distributions after the clustering, you may see we are getting a pretty high improvement, almost forty percent. Then there are of course the two possible combinations of system and model weighting. Again it is pretty straightforward: it is better to do the system combination, the system weighting, with the TDOA features, because they are usually more noisy, and they probably need more data to be well estimated, or at least their relevance variables need more data to be well estimated. In the case of the MFCC features, it looks like the model combination works much better; that is the reason. You may also look at the weights in the table: the weights which we need to estimate move closer to the system-combination side, so instead of zero point seven and zero point three we go to zero point eight; they are estimated on different data, but they generalize for this case.
Now, just a bit of explanation of why we are possibly getting such an improvement. If you look at the single-stream results for each meeting, seventeen meetings in this case, together with the model combination and the system combination, and you look at the bottom rows, which are just the simple MFCC and TDOA single-stream Information Bottleneck techniques with no combination of features, you may see that most of the improvement comes in the cases where there is a big gap between the two single-stream techniques. Where the gap is small you of course do not get much improvement, but where there is a big gap between the MFCC and TDOA single streams, the system combination works pretty well for such a meeting.
So, to conclude the paper. Here we presented a new technique, a new way of combining acoustic feature streams. Rather than weighting the acoustic features before the clustering, as we did before, here we presented a technique which tries to do it after the clustering. The reason is simple: the relevance variables which are used to match the different clusters or different segments are going to be estimated on more data, not just on short segments. And as was seen in the results, we are getting a pretty good improvement with such a technique, around forty percent over all seventeen meetings.
I think I'm done.
(applause)
(inaudible) I think if there are no specific questions... (inaudible) Thank you.