Hello. In this talk I am going to present a binary approach to speaker diarization. This is joint work with a colleague from the University of Avignon; I myself come from Telefónica Research.
Here is the outline of the talk. I am first going to review what speaker diarization is, at least for those of you who do not remember it from the previous talks.
Then I will talk about binary speaker modeling, then about how we merge the two things into the binary speaker diarization system that we have developed, and finally about experiments, conclusions and future work.
First, speaker diarization. As most of you know, given a recording we split it among the speakers: we determine who spoke when, without knowing a priori who the speakers are or how many speakers there are. So where does the state of the art stand?
As mentioned in the previous talks, for broadcast news we got down to around seven to ten percent diarization error rate, although that is something from around 2004; it is no longer part of the NIST evaluations, and I bet nowadays the error would be even lower. For meetings we got down to twelve to fourteen percent, maybe even nine percent now.
These are great results; they mean that diarization should be usable as a building block for other applications, for example speaker ID when there are multiple speakers in the recording. But we still have a problem: it is too slow.
To give some numbers: a standard diarization system, if you do not do anything about speed, is most probably going to run well above one times real time.
If you do try to do something about it, there are two systems I know of whose authors optimized for speed. The first one, from a couple of years ago, got down to 0.97 times real time on a single processor by applying some tweaks to the GMM-based algorithms of a hierarchical bottom-up system, so just under real time.
Later on they said, okay, let's move to the GPU, so we can use sixteen cores or however many the architecture gives us, and they went down to 0.07 times real time; nowadays it is probably even faster. But this is tied to the GPU: you cannot run it on a mobile phone, it has to be used on one particular architecture.
What we want is a system that is really, really fast, and for which it does not matter what architecture it is running on. In our case we obtain this by adapting a recently proposed technique called binary speaker modeling. We also have a poster at this conference on using it for speaker ID. Here we adapted it to speaker diarization, and I will tell you how we did it.
To understand what we did, you need to know the basics of binary speaker modeling, so let me explain it a little first.
So, this is the gist of it: we have some input acoustic data, and from it we derive a vector of J zeros and ones. In a very general way, as shown here, we first extract some standard parameters, MFCCs or whatever you want,
and we use a binary key background model (KBM), which is basically a UBM but trained in a different way, to map this acoustic data. With the KBM we obtain one of these binary keys for each chunk of acoustic data, which could be all the data from one speaker or just a couple of seconds.
The KBM can be understood in different ways; it is basically a set of Gaussians positioned in a particular way in the multidimensional acoustic space. In the figure there is just one dimension, so that we can see the example.
So we first position the Gaussians in the space, and then we take the input data, the acoustic data of a speaker, and see which of these Gaussians are most present in, i.e. best represent, that data. From there we extract a binary fingerprint by putting zeros in the positions of the Gaussians that do not represent the data well, and ones in the positions of the Gaussians that are dominant in our data. And that is it.
How do we put all this together for an utterance? On the left side here we have the input signal, from which we compute the MFCC acoustic features, and on the right side we have the KBM. The vertical vectors have dimensionality N, the number of Gaussians in our background model.
For each input feature vector we select the top-scoring Gaussians: it could be the best one percent, two percent, the ten best, whatever we want to use. So, for each input feature vector from x_1 to x_n of our data,
we obtain what we call the cumulative vector, the sum of those top-N indicator vectors, which basically counts how many times each Gaussian has been selected as one of the best-representing Gaussians for the acoustic data. Then we binarize: the top M (again, whatever percentage) Gaussians most present in the data become ones, and the rest become zeros.
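The whole chain, from features to binary key, can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: for brevity the KBM is reduced to its Gaussian means with unit variances (an assumption), and `top_k` / `top_m_ratio` stand in for the "top N best Gaussians per frame" and "top M percent of the cumulative vector" just described.

```python
import numpy as np

def binary_fingerprint(features, kbm_means, top_k=5, top_m_ratio=0.2):
    """Illustrative binary-key extraction.

    features : (T, D) array of acoustic vectors (e.g. MFCCs).
    kbm_means: (N, D) array of KBM Gaussian means (unit variances
               assumed here; the real KBM stores full Gaussians).
    """
    # Score every frame against every KBM Gaussian. With unit-variance
    # Gaussians, ranking by likelihood is ranking by (negative) squared
    # Euclidean distance to each mean.
    dists = ((features[:, None, :] - kbm_means[None, :, :]) ** 2).sum(-1)

    # For each frame, keep the indices of the top_k closest Gaussians.
    topk = np.argsort(dists, axis=1)[:, :top_k]

    # Cumulative vector: how often each Gaussian entered a top-k list.
    counts = np.bincount(topk.ravel(), minlength=kbm_means.shape[0])

    # Binarize: the top_m most frequently selected Gaussians become 1.
    top_m = max(1, int(top_m_ratio * kbm_means.shape[0]))
    key = np.zeros(kbm_means.shape[0], dtype=np.uint8)
    key[np.argsort(counts)[::-1][:top_m]] = 1
    return key
```

With 5 Gaussians, `top_m_ratio=0.4` keeps the two most-visited Gaussians, so data concentrated near two of the means yields a key with exactly two ones.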
So, once we have a binary vector for each of two speakers, or for two sets of acoustic data, it is very fast and very easy to compare them, to compute how close they are.
Here is just an example of the kind of similarity one could use, in the form of a ratio. There are many possibilities; when working with binary keys you just need some way to compare two binary vectors.
The one we used in this paper is the following: the numerator adds one whenever the two vectors both have a one in the same position, and the denominator counts the positions where either of the two vectors has a one. This gives us a score from zero to one, where zero means the two keys have nothing in common and one means they are the same key. That is how we compare two binary speaker models.
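In code, that ratio (intersection over union of the set bits, as described) is essentially one line; a minimal sketch:

```python
import numpy as np

def binary_similarity(a, b):
    """Score in [0, 1]: positions where both keys have a one,
    divided by positions where at least one key has a one."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0
```

Identical keys score 1.0, disjoint keys score 0.0, and there is no Gaussian evaluation involved, which is where the speed comes from.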
As I said, we have a poster with more experiments on this, and you can also go back to our Interspeech paper. Let's see now how we apply it to speaker diarization.
This is basically the new system we put together. Even though it may look a bit different or strange, it is just an agglomerative bottom-up system: you can see the agglomerative clustering at the bottom, and a kind of stopping criterion, or rather a cluster selection, at the end.
Let me go over the blocks, starting from the bottom. First there is feature extraction: we extract MFCCs or whatever we want. Then comes the KBM training: in this case we train the KBM from the test data itself; we do not use any external data.
Then the initialization: we take the acoustic features and, as always in diarization, we need to initialize the system. As we are building a bottom-up system, we need many more clusters than there actually are speakers, so we need to somehow create those initial clusters. This is a preprocessing step, and it should ideally consume just a little bit of the computational time of the system.
After that comes the agglomerative clustering, where we keep joining together the clusters that are closest, all of it in the binary space. Finally, and this is one difference from a standard agglomerative clustering system, we go all the way from N clusters down to one, and once we have reached one cluster we use an algorithm to select how many clusters, that is, how many speakers, we actually have.
As I said, for the features we use standard MFCCs, computed every ten milliseconds over twenty-five-millisecond windows.
The KBM, as I said, is a background model trained in a special way. If you train a UBM with the standard EM-ML techniques, the Gaussians end up positioned at the average points, modeling the majority of the data, and therefore they are not present in the particular regions that carry the discriminative information between the speakers in your recording. So we try to do something different that can model those regions.
As for the number of Gaussians N, it can be anything above five hundred; we can go up to ten thousand and the performance neither improves nor degrades much.
How do we train it? In this paper we did it the following way. We take the input audio and first train a single Gaussian on every two seconds of speech, with some overlap. That gives us potentially a couple of thousand candidate Gaussians. Each of them was trained on a very small portion of the audio, so wherever there is a speaker, they represent that speaker very discriminatively.
Then we use a dissimilarity metric to iteratively select, from those candidates, the Gaussians that are most separated from the ones already chosen, so that the selected set optimally covers the whole acoustic space.
And that is it; this is actually much faster than training with iterative splitting and EM-ML.
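A minimal sketch of that training recipe, under simplifying assumptions: only the window means are kept, Euclidean distance between means stands in for the dissimilarity metric, and a farthest-point rule stands in for the iterative selection; the paper's exact choices may differ.

```python
import numpy as np

def train_kbm(features, n_gaussians, win=200, hop=100):
    """Illustrative KBM training: one Gaussian (here, just its mean)
    per short overlapping window, then greedily keep the candidates
    farthest from those already chosen, spreading the selected set
    over the whole acoustic space."""
    # Candidate pool: the mean of each overlapping window of frames.
    means = np.array([features[s:s + win].mean(0)
                      for s in range(0, len(features) - win + 1, hop)])
    chosen = [0]                                   # seed with the first candidate
    d = np.linalg.norm(means - means[0], axis=1)   # distance to the chosen set
    while len(chosen) < min(n_gaussians, len(means)):
        nxt = int(np.argmax(d))                    # farthest from all chosen so far
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(means - means[nxt], axis=1))
    return means[chosen]
```

On data made of a few well-separated regions, the selection picks one candidate per region, which is exactly the "cover the whole space" behavior the talk describes.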
Next, extracting the binary vectors from the acoustic data. We do it in two steps. The first step, in which we compute the K best Gaussians for each acoustic feature, we perform one time only. Then, in the second step, for every subset of features we want a fingerprint from, and that happens many times during the agglomerative clustering, we only need the cached information, so it is actually very fast.
In detail: we have the MFCC vectors at the top, and for each of them we compute its best Gaussians; that is the first step, and we store the resulting indices in memory. This is done one time only; it is a little expensive, because we are evaluating Gaussian mixture models, but it happens only once.
Then, every time we need a speaker model, we just gather the stored indices for its data, accumulate the counts, and from those counts get the binary vector. This part is blazingly fast.
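Schematically, the two-step split can be sketched like this (illustrative only; unit-variance Gaussians and squared Euclidean distances stand in for the real KBM likelihood evaluation):

```python
import numpy as np

class FingerprintExtractor:
    """Two-step scheme: the expensive part (ranking KBM Gaussians per
    frame) runs once per recording; after that, the key for any segment
    is just a count over cached indices."""

    def __init__(self, features, kbm_means, top_k=5):
        # Step 1 (one time, expensive): top-k Gaussian indices per frame.
        d = ((features[:, None, :] - kbm_means[None, :, :]) ** 2).sum(-1)
        self.topk = np.argsort(d, axis=1)[:, :top_k]
        self.n = kbm_means.shape[0]

    def key(self, start, end, top_m):
        # Step 2 (per segment, cheap): count cached indices and binarize.
        counts = np.bincount(self.topk[start:end].ravel(), minlength=self.n)
        key = np.zeros(self.n, dtype=np.uint8)
        key[np.argsort(counts)[::-1][:top_m]] = 1
        return key
```

Once the extractor is built, `key(start, end, top_m)` involves no Gaussian evaluation at all, which is why computing fingerprints for many candidate segments during clustering costs almost nothing.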
A few words about the initialization. We did something simple: we reuse the KBM itself. We perform a uniform segmentation and derive the initial clusters from the Gaussians that each segment selects first, assigning every segment to the cluster whose Gaussians it picks the most. From there on, everything runs in the binary domain.
Then the clustering. For us this is exactly the same as, for example, the ICSI agglomerative clustering system, except that now everything happens in the binary domain: the cluster models are fingerprints computed with our approach over the KBM Gaussians, the comparison between all the cluster models is completely binary, and we simply choose the two clusters that are closest and merge them.
For the reassignment, we take three seconds of data at a time, with a one-second step, compute a fingerprint for each chunk, and assign it to the best-matching speaker model.
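The bottom-up pass itself, stripped to its essentials, might look like this. It is a sketch: merging two clusters by OR-ing their keys is my simplification (the real system recomputes the key from the merged clusters' frames), and the per-second reassignment is omitted.

```python
import numpy as np

def cluster_to_one(keys):
    """Repeatedly merge the two closest cluster keys (Jaccard-style
    similarity) until one cluster remains; return the merge order."""
    clusters = [k.astype(bool) for k in keys]
    merges = []
    while len(clusters) > 1:
        best, pair = -1.0, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                u = np.logical_or(clusters[i], clusters[j]).sum()
                s = np.logical_and(clusters[i], clusters[j]).sum() / u if u else 0.0
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        merges.append(pair)
        clusters[i] = np.logical_or(clusters[i], clusters[j])  # simplified merge
        del clusters[j]
    return merges
```

Because every comparison is a couple of bitwise operations, the full N-to-1 pass stays cheap even with many initial clusters.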
The last part of the system: once we get down to one cluster, we have to choose how many clusters is our optimum number of clusters. For that we adapted a metric that other people presented at Interspeech 2008; in the interest of time I will not go over it here, it is all in the paper, but essentially we estimate a ratio between the intra- and inter-cluster distances of the candidate partitions, which allows us to select the optimal number of clusters.
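For concreteness, here is one plausible instantiation of such a selection rule (my own sketch, not the published criterion): assume that for each candidate partition along the merge path we have already computed the average within-cluster and between-cluster distances, and pick the partition that is at once compact and well separated.

```python
def select_partition(intra, inter):
    """Return the index of the candidate partition with the smallest
    intra/inter distance ratio: small within-cluster spread combined
    with large between-cluster separation."""
    ratios = [w / b if b else float("inf") for w, b in zip(intra, inter)]
    return ratios.index(min(ratios))
```

The index returned identifies the partition, and hence the estimated number of speakers, among those produced on the way from N clusters down to one.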
I have to say this is the part of the system I am least happy about, and we will have to improve it.
For the evaluation we of course use the diarization error rate, but we also use the real-time factor, because speed is what we are after. As data we decided to use the NIST Rich Transcription evaluation sets, which amount to about thirty-six shows, and I have to say the whole thing runs in just about an hour on a laptop PC, so it is pretty fast.
Now the results. The first line shows a baseline GMM system, just a standard implementation of the basic approach: it gets about twenty-three percent average diarization error rate with a running time of about 1.19 times real time. There is no optimization there; it is just a standard implementation.
The last two lines are two configurations of our binary system, depending on the number of Gaussians we take for the KBM. We can see that the diarization error rate is slightly higher than the baseline's, but the real-time factor is ten times faster, which is pretty good.
To show the importance of how the KBM is trained: if instead we use a standard GMM trained with EM, which is the second line of results, the system just breaks. Such a model does not reach the speaker-discriminant regions, and it simply does not work.
I also mentioned that the selection of the number of clusters still does not do a perfect job. If we pick the optimal number of clusters after running the system, we actually reach an error rate better than our baseline's.
This plot shows the diarization error rate as a function of the number of Gaussians; the black curve is the average. You can see that beyond about five hundred Gaussians in the KBM the results are more or less flat, so it does not matter much whether we use five hundred or six hundred; anything in that range is fine.
And this is a per-show bar plot over all the meetings, our proposed system versus the baseline. You can see that in most cases they are about the same; of course in some shows the baseline is better, by maybe two percent, and there are also a couple of shows where we are better.
So, to conclude. Speaker diarization research had become kind of stagnant: people were adding more and more things on top of a standard system to get these little gains in performance. By just restarting with a new kind of system, we believe we can gain even more.
As for next steps, we are going to improve the binary fingerprinting, we hope to find a better stopping criterion, and we also want to port the system to mobile platforms, maybe have it working on cell phones. Thank you very much.
[Session chair] Thank you; we have time for questions.
[Audience question, inaudible]
No, this is... oh, sorry: the speech/non-speech detection is done right at the beginning, at the very beginning.
It is just a standard speech detection system, nothing special; it sits next to the acoustic feature extraction at the beginning of the system, and we used a speech activity detection module that we obtained externally. Thanks for the question.
[Audience question, inaudible]
No, no, I just use the acoustic data; I do not merge channels. I use the MDM data, that is, the multiple-microphone condition, but instead of beamforming I just pick a single channel. There are many ideas there, but whether they work, I do not know; we have to try.
[Session chair] Okay, since we have run out of time, let's thank the speaker once again.