Alright, I'm going to talk about an utterance comparison model that we're proposing for speaker clustering using factor analysis.
First I'm going to define what exactly we mean by speaker clustering, because the term is used in different contexts with subtle variations. In our study we define speaker clustering as the task of clustering a set of speaker-homogeneous speech utterances such that each cluster corresponds to a unique speaker. When I say a speaker-homogeneous speech utterance, I mean that each utterance, which is a set of speech feature vectors, contains speech from only one speaker. The number of speakers is not known.
There are several applications of this. In speech recognition, for example, you may want to use a predefined set of speaker clusters to do robust speaker adaptation when the test data is very limited. It is also used in the classical method of speaker diarization, where you want to determine who spoke when.
Speaker diarization is the very classical setting where you are given an unlabeled recording of an unknown number of unknown speakers talking, and you have to determine the parts spoken by each person. In the example here, we have a sixty-second recording of a conversation. What you can do is divide it up into small chunks, assume each chunk is one utterance, meaning it only contains speech by one person, and do some kind of clustering of these chunks. Then you have your clusters: a first cluster here, a second one there, and so on. If the number of clusters equals the number of speakers, and each cluster actually contains speech by only one person, then you have perfect speaker diarization. Of course, in reality errors can occur, as you see illustrated here: sometimes there may actually be more speakers than clusters, or fewer speakers than clusters, and so on.
This is just the sort of classic speaker diarization method; of course, the more state-of-the-art methods don't use this, and instead use, for example, variational inference. But here, let's look at this class of methods. The speech signal is first segmented into these speaker-homogeneous utterances, and then you use some kind of distance measure to compute the distance between the utterances. You merge the closest utterances, check whether some stopping criterion is met, and if it's not, you loop back and continue clustering until you're done.
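The loop just described might be sketched as follows; the `distance` function and stopping threshold are placeholders for whichever measure and criterion a real system uses, so this is a minimal illustration rather than an actual diarization implementation:

```python
import numpy as np

def cluster_utterances(utterances, distance, stop_threshold):
    """Bottom-up clustering: merge the closest pair of clusters until the
    smallest inter-cluster distance exceeds the stopping threshold."""
    # Each cluster is a list of utterances (feature arrays, frames x dims).
    clusters = [[u] for u in utterances]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the given distance measure.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(np.vstack(clusters[i]), np.vstack(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_threshold:           # stopping criterion met
            break
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return clusters
```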
There are some popular distance measures for this task. Given two arbitrary speech utterances X_A and X_B, what is the distance between them? You have things like the generalized likelihood ratio (GLR), the cross likelihood ratio (CLR), and the Bayesian information criterion (BIC). For both the GLR and the CLR, you have to estimate some GMM parameters from each utterance, then compute likelihoods and use those to create some kind of ratio that determines how close the utterances are to each other.
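As a rough illustration of the likelihood-ratio idea, here is a GLR-style distance that models each side with a single full-covariance Gaussian. The measures discussed here use GMMs, so treat this single-Gaussian form as a simplified sketch, not the exact formula from any of the cited methods:

```python
import numpy as np

def glr_distance(x_a, x_b):
    """GLR-style distance: how much worse one shared Gaussian explains the
    two utterances than one Gaussian fit to each utterance separately.
    Larger values mean the utterances look less like the same speaker."""
    def logdet_cov(x):
        # log-determinant of the maximum-likelihood covariance estimate
        _, logdet = np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))
        return logdet
    n_a, n_b = len(x_a), len(x_b)
    pooled = np.vstack([x_a, x_b])
    # Constant terms cancel between the pooled and per-utterance likelihoods.
    return 0.5 * ((n_a + n_b) * logdet_cov(pooled)
                  - n_a * logdet_cov(x_a)
                  - n_b * logdet_cov(x_b))
```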
So why are we looking for a better distance measure? If you look at these, the GLR and the CLR are mostly mathematical constructs. There's no really rigorous justification of how they compare utterances based on a physical notion of speaker similarity, and there's no real statistical training involved. In that sense they're kind of ad hoc when you just drop them into a speaker clustering task.
To address these problems, trained distance metrics have been proposed, as well as eigenvoice-based methods. In particular, factor analysis with eigenvoices and eigenchannels provides a very elegant and robust framework for modeling inter-speaker and intra-speaker variability, and we wanted to use this to come up with what we think is a more reasonable distance measure, or method of comparing utterances.
The first thing we thought about was how to define a reasonable way to compare utterances, given what we're actually trying to do. When we cluster, if we have two speech utterances and we think they came from the same speaker, then we should put them in the same cluster; if we don't think they came from the same speaker, then we shouldn't. That's basically it. So we simply define our similarity metric as the probability that the two utterances were spoken by the same person.
So how do we define that probability? If you knew perfectly the posterior probability of each speaker given an arbitrary utterance, P(w_i | X), then you could simply write down this probability P(H_1), where H_1 is the hypothesis that X_A and X_B, two arbitrary utterances, are from the same speaker. You can set up the equation using just basic probability:

P(H_1) = sum over i of P(w_i | X_A) * P(w_i | X_B)

That is, the probability that, given X_A, your speaker is w_i, multiplied by the probability that, given X_B, your speaker is w_i; you multiply these two and sum over all the speakers in the world, so W is basically the population of the world.
We can do the same for the null hypothesis H_0, that X_A and X_B come from different speakers: you simply do the summation over all pairs (i, j) with i and j different. It's very easy to show that these two probabilities add up to one. These are exact; it's just very basic probability.
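To make the bookkeeping concrete, here is a tiny numeric check of the two hypothesis probabilities, using invented posteriors for a three-speaker world (in practice these posteriors are exactly what we cannot learn):

```python
import numpy as np

# Hypothetical speaker posteriors P(w_i | X) for three speakers; these
# numbers are made up purely for illustration.
p_a = np.array([0.7, 0.2, 0.1])   # posteriors given utterance X_A
p_b = np.array([0.6, 0.3, 0.1])   # posteriors given utterance X_B

p_h1 = np.sum(p_a * p_b)                  # same speaker: sum_i P(w_i|X_A) P(w_i|X_B)
p_h0 = np.sum(np.outer(p_a, p_b)) - p_h1  # different speakers: cross terms i != j

assert np.isclose(p_h1 + p_h0, 1.0)       # the two hypotheses are exhaustive
```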
One could question these equations, of course, but the bigger problem is that they're impractical: there's no way we can really learn these posteriors. This is where factor analysis comes in.
In factor analysis, a speaker-dependent GMM mean supervector s is modeled as the UBM mean supervector m, plus an eigenvoice matrix V multiplied by a speaker factor vector y, plus an eigenchannel matrix U multiplied by a channel factor vector z:

s = m + V y + U z

If we assume that each speaker in the world is mapped to a unique speaker factor vector y, then we can just take the previous equation and replace the w's with y's.
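The supervector decomposition can be written out directly. This sketch uses arbitrary toy dimensions and random matrices just to show the shapes involved; real eigenvoice and eigenchannel matrices are of course trained, not random:

```python
import numpy as np

rng = np.random.default_rng(0)

C, F = 4, 3        # toy sizes: C mixtures, F-dimensional features
R_v, R_u = 2, 2    # ranks of the eigenvoice and eigenchannel matrices

m = rng.normal(size=C * F)         # UBM mean supervector (stacked mixture means)
V = rng.normal(size=(C * F, R_v))  # eigenvoice matrix: inter-speaker variability
U = rng.normal(size=(C * F, R_u))  # eigenchannel matrix: intra-speaker variability

y = rng.normal(size=R_v)           # speaker factor vector (standard normal prior)
z = rng.normal(size=R_u)           # channel factor vector (standard normal prior)

s = m + V @ y + U @ z              # speaker- and session-dependent supervector
adapted_means = s.reshape(C, F)    # unstack into one adapted mean per mixture
```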
Of course, this still doesn't have any practical value by itself; what we want to do is work it into some kind of analytical form where we can introduce the priors that we have on y and z.
The first step is to turn the summation into an integral. The key thing to realize is that the summation is over speakers, not over the y's, whereas the integral is done over y. If you apply some really basic calculus, treating the sum as a Riemann sum, the probability comes down to an integral form, and you actually get the correct form for the probability that the two utterances are from the same speaker. This equation actually turns up in different contexts too, which is quite interesting. Here you can see that you have a W in the denominator, which means that if W goes to infinity, this probability goes to zero. That intuitively makes sense: you're trying to calculate the probability that the utterances came from the same speaker, but if you have an infinite number of speakers, then that probability should go to zero.
So now what we need are closed-form expressions for the prior P(X) and the conditional P(X | y). The first thing we did was simplify the problem by ignoring the intra-speaker variability: we set z to zero and just use s = m + V y, so we only have the eigenvoices, not the eigenchannels.
The second set of assumptions relies on two lemmas. First, a Gaussian in x can be rewritten as a Gaussian with respect to its mean. Second, the product of two Gaussians is also a Gaussian; all you really need to know is that it's an unnormalized Gaussian, with some scale factor out front, but it's essentially just a Gaussian.
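The second lemma is easy to check numerically in one dimension. This sketch verifies the standard product-of-Gaussians identity, where the leftover scale factor is itself a Gaussian evaluated at the two means:

```python
import math

def gauss(x, mu, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# N(x; a, A) * N(x; b, B) = N(a; b, A + B) * N(x; c, C)
# with C = 1 / (1/A + 1/B) and c = C * (a/A + b/B).
a, A, b, B = 0.5, 2.0, -1.0, 0.5
C = 1.0 / (1.0 / A + 1.0 / B)
c = C * (a / A + b / B)
scale = gauss(a, b, A + B)   # the unnormalized part: a Gaussian in the two means

for x in (-2.0, 0.0, 1.3):
    assert abs(gauss(x, a, A) * gauss(x, b, B) - scale * gauss(x, c, C)) < 1e-12
```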
Another assumption we make, to simplify the computation, is that each feature vector in each utterance was generated by only one Gaussian in the GMM, not the whole mixture, because if you use the whole mixture the computation becomes too complicated. So here the mixture summation is just replaced by a single Gaussian. How do we decide which mixture generated each frame? One way is to obtain the maximum likelihood estimate of y for each utterance, which then fully describes the parameters of the GMM, and then for each frame you just find the Gaussian with the maximum occupation probability.
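The hard-assignment step might look like the following sketch, assuming a diagonal-covariance GMM (the covariance structure of the actual system may differ):

```python
import numpy as np

def top_gaussian(frames, weights, means, variances):
    """Hard-assign each frame to the mixture with the highest occupation
    probability under a diagonal-covariance GMM."""
    diff = frames[:, None, :] - means[None, :, :]        # (frames, mixtures, dims)
    log_pdf = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=2)
    log_occ = np.log(weights) + log_pdf                  # unnormalized log responsibility
    return np.argmax(log_occ, axis=1)                    # one Gaussian index per frame
```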
Now you can see that this conditional is basically just a multiplication of Gaussians; all we have is a whole string of Gaussians multiplied together. We know that when you multiply Gaussians you get another Gaussian, although it's not normalized, so you just continuously apply that identity to pairs of Gaussians along the whole string of multiplications. Don't pay too much attention to the math up here; the point is that if you keep going, you basically end up with one remaining Gaussian multiplied by some complicated scale factor, which depends only on your observations, your eigenvoices, and your universal background model. The same approach also gets us to a closed-form solution for the prior: again, everything ends up being a product of Gaussians, and at the end you're left with one Gaussian that you integrate from negative infinity to infinity, so it just integrates to one.
So now you've basically destroyed your integral, and you're left with all these factors that are based only on your input observations, your model, and your pre-trained eigenvoices. Going through the same process for everything, this is the final closed form you get: for two arbitrary speech utterances X_A and X_B, you can actually compute the probability that they came from the same speaker. It doesn't matter which speaker that is, because we marginalize over all the speakers in the world.
If you look at this solution, you can see that for each utterance you just need a small set of sufficient statistics, and these are enough to compute your utterance comparison function, this probability. So in settings where you don't want to keep the input observation data, you can just extract the sufficient statistics and then discard the observations, which is useful if you're in a memory-constrained environment.
To test this as a distance measure, we just applied it in the classical clustering method of doing speaker diarization, on the CallHome data set. We used a measure of cluster purity, and a measure of how accurately we estimate the number of speakers. You actually have to use both of them in conjunction; it doesn't really make sense to use just one of them.
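One common way to compute cluster purity is sketched below; the exact definition and weighting used in the evaluation may differ, so treat this as illustrative:

```python
from collections import Counter

def cluster_purity(clusters):
    """Fraction of utterances whose true speaker matches the dominant
    speaker of their cluster; clusters are lists of true speaker labels."""
    total = sum(len(c) for c in clusters)
    dominant = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return dominant / total

print(cluster_purity([["A", "A", "B"], ["B", "B"]]))   # prints 0.8
```

Note that purity alone is easy to game (one cluster per utterance gives purity 1.0), which is why it has to be paired with a speaker-count measure.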
These are the optimal numbers we were able to get using the four different distance functions. We used telephone conversations with the number of speakers ranging from two to seven, just twelve MFCCs with energy, and we dropped the non-speech frames. The eigenvoices were trained using, I think it was, the Switchboard database. And here you can see that the proposed model has much better performance than the others that we tried.
This isn't really in the paper, but you can actually do an extension to the model. We originally dropped the eigenchannel matrix for simplicity, but we can include it and go through the same process. It's a lot more involved, but again you can get this kind of closed-form solution, now also involving the eigenchannels that model the intra-speaker variability. You can easily show that this simplifies to the previous solution if you set the eigenchannel matrix to zero. We actually tried this as an additional experiment, using eigenchannel matrices trained on, I think, a microphone database, and it improved the accuracy on the CallHome task by one or two percentage points. A further extension is that you can derive this equation for the general case of N utterances instead of just two.
So that's pretty much it. Thank you very much. I'm happy to take one or two questions.
Q: I have a question about the CallHome data: does it contain overlapping speech, and how did you deal with that?

A: There was some, but each channel was recorded separately, so when there was overlapping speech I basically just discarded one channel and used the other, to ensure that there was only one speaker talking in each utterance while doing the clustering task. I used the manual transcriptions to pre-segment the utterances, so the utterances were basically pure. But it would be interesting to try it with overlapping speech, just to see whether it comes out as a single new speaker or something.
Q: [Partly inaudible question about also trying the BIC.]

A: Yeah, I did actually try it with the BIC. The performance actually wasn't too great, so I just didn't mention it. For this task it just seemed like the GLR gave better results than the BIC.
Q: [Partly inaudible follow-up about whether the proposed measure still did better.]

A: Yeah, it actually did better. I mean, I wish I had the NIST database, but we don't have it.
Q: [Partly inaudible suggestion that the difference might be because one set is from phone calls and the other from recordings.]

A: Maybe it's because of the frequency range being considered. I don't remember whether it was 8 kHz or 16 kHz.

Okay, thank you very much.