I'm going to talk about an utterance comparison model that we're proposing for speaker clustering using factor analysis.

So I'm first going to define what exactly we mean by speaker clustering, because the term is used in different contexts with subtle variations.

In our study we define speaker clustering as the task of clustering a set of speaker-homogeneous speech utterances such that each cluster corresponds to a unique speaker.

When I say a speaker-homogeneous speech utterance, I mean that each utterance, which is a set of speech feature vectors, contains speech from only one speaker.

And the number of speakers is unknown.

So, the applications of this. One is speech recognition, for example when you want to use a predefined set of speaker clusters to do robust speaker adaptation when test data is very limited.

This is also used in a very classical method of speaker diarisation, where you want to solve the "who spoke when" problem.

So this is a very classical setting. Speaker diarisation is when you are given an unlabeled recording of an unknown number of unknown speakers talking, and you have to determine the parts spoken by each person.

So you have an example here: it's just a sixty-second recording of a conversation. What you can do is just divide it up into small chunks and assume each chunk is one utterance, meaning it only contains speech by one person.

Then you do some kind of clustering of these chunks, and then you have clusters: the first cluster here, a second, and a third there.

And if the number of clusters is the number of speakers, and each cluster actually contains speech by only one person, then you have perfect speaker diarisation.

Of course, in reality you may actually end up with something like this, where some utterances are misassigned.

Sometimes there may actually be more speakers than clusters, or fewer speakers than clusters. Those are the kinds of errors that can occur.

So this is just a sort of classic speaker diarisation method. Of course, the more state-of-the-art methods don't use this, for example those based on variational inference.

But here let's look at this classical method. We have a speech signal, which you segment into these speaker-homogeneous utterances.

Then you use some kind of distance measure to compute the distance between the utterances, you merge the closest utterances or clusters, and you check whether some stopping criterion is met.

If it is not met, you loop back, and you continue clustering until you're done.
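
The merge loop just described can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name, the `distance` callback and the `stop_threshold` stopping rule are all assumptions made for the example.

```python
from itertools import combinations

def agglomerative_cluster(utterances, distance, stop_threshold):
    """Greedy bottom-up clustering of speaker-homogeneous utterances.

    `distance(a, b)` is a hypothetical callback taking two clusters
    (lists of utterances) and returning a float; clustering stops when
    the closest pair is farther apart than `stop_threshold`.
    """
    clusters = [[u] for u in utterances]  # start with one cluster per utterance
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), best = min(
            (((i, j), distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > stop_threshold:  # stopping criterion met
            break
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters

def mean_gap(a, b):
    # Toy distance on scalar "utterances": gap between cluster means.
    return abs(sum(a) / len(a) - sum(b) / len(b))

result = agglomerative_cluster([0.0, 0.1, 5.0, 5.2], mean_gap, stop_threshold=1.0)
```

Any of the distance measures discussed in the talk could be plugged in as `distance`; the toy scalar distance is only there to make the sketch runnable.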

So there are some popular distance measures for this task.

Given two arbitrary speech utterances X_a and X_b, what is the distance between them?

You have things like the generalized likelihood ratio (GLR), the cross likelihood ratio (CLR), or the Bayesian information criterion (BIC).

And for both GLR and CLR you have to estimate some GMM parameters from each utterance, then you compute likelihoods, and then you use those to create some kind of ratio that determines how close these utterances are to each other.
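
As a rough illustration of a likelihood-ratio distance of this kind, here is a GLR-style sketch. It is a simplification under stated assumptions: each utterance is modelled by a single full-covariance Gaussian rather than a GMM, and the function names and the small regularisation constant are choices made for this example only.

```python
import numpy as np

def gaussian_loglik(X):
    """Log-likelihood of frames X (n x d) under a Gaussian fitted to X itself."""
    n, d = X.shape
    diff = X - X.mean(axis=0)
    cov = diff.T @ diff / n + 1e-6 * np.eye(d)  # ML covariance, lightly regularised
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,jk,ik->", diff, np.linalg.inv(cov), diff)
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + maha)

def glr_distance(Xa, Xb):
    """Separate-model log-likelihood minus pooled-model log-likelihood.

    Small when one model explains both utterances about as well as two
    separate models do, large when it does not.
    """
    pooled = np.vstack([Xa, Xb])
    return gaussian_loglik(Xa) + gaussian_loglik(Xb) - gaussian_loglik(pooled)

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(200, 3))
B = rng.normal(0.0, 1.0, size=(200, 3))  # same distribution as A
C = rng.normal(5.0, 1.0, size=(200, 3))  # shifted distribution
d_same, d_diff = glr_distance(A, B), glr_distance(A, C)
```

With synthetic data like this, `d_same` should come out much smaller than `d_diff`.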

So why do we think there can be a better measure? For example, if you look at these, GLR and CLR, they're mostly really mathematical constructs.

They don't really have a rigorous justification for how they compare utterances based on a physical notion of speaker similarity.

There's no real statistical training involved.

So in that sense they're kind of ad hoc when you just apply them in a speaker clustering task.

To address these problems, there have been trained distance metrics that have been proposed, and eigenvoice-based methods.

Especially eigenvoices and eigenchannels, and factor analysis, provide a very elegant framework for modeling inter-speaker and intra-speaker variability, and we want to try to use this to come up with something that we think is a more reasonable distance measure, or method of comparing utterances.

So the first thing we thought was: how do we define a way to compare utterances, and what exactly are we trying to do when we cluster?

If you have two speech utterances and we think that they came from the same speaker, then we should cluster them.

If we don't think they came from the same speaker, then we shouldn't cluster them.

That's basically what we're trying to do.

So we just define the probability that the two utterances were spoken by the same person, and that's our similarity metric.

So how do we define the probability? Well, suppose you perfectly knew the posterior probability of each speaker w_i given an arbitrary utterance X, that is P(w_i | X).

If you had that, then you could simply write this probability of H1, which is the hypothesis that X_a and X_b, two arbitrary utterances, are from the same speaker.

And I can simply set up the equation this way, just using basic probability.

Given X_a, what is the probability of your speaker being w_i?

And then, given X_b, what is the probability that your speaker is w_i? You just multiply these two, and then you sum over all the speakers in the world. So W is basically the population of the world.

And we can also define the null hypothesis H0, where X_a and X_b come from different speakers, and then you simply do the summation for the i's and j's that are different.

And then it's very easy to show that these two probabilities are going to add to one.
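
A tiny numerical check of the two hypotheses, assuming a made-up world of three speakers with known posteriors (the numbers and function names are purely illustrative):

```python
import numpy as np

def same_speaker_prob(post_a, post_b):
    """P(H1): sum over speakers i of P(w_i | X_a) * P(w_i | X_b)."""
    return float(np.dot(post_a, post_b))

def different_speaker_prob(post_a, post_b):
    """P(H0): sum over pairs i != j of P(w_i | X_a) * P(w_j | X_b)."""
    joint = np.outer(post_a, post_b)
    return float(joint.sum() - np.trace(joint))

# Hypothetical speaker posteriors for two utterances.
post_a = np.array([0.7, 0.2, 0.1])
post_b = np.array([0.6, 0.3, 0.1])

p_h1 = same_speaker_prob(post_a, post_b)
p_h0 = different_speaker_prob(post_a, post_b)
```

Because each posterior vector sums to one, the diagonal and off-diagonal parts of the outer product always add to one, which is the statement that P(H1) + P(H0) = 1.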

So these are exact. It's just very basic probability, and nobody can really question these.

But of course they are impractical. I mean, there's no way we can really compute these posteriors.

So this is where factor analysis comes in.

If you have a speaker-dependent GMM mean supervector, you can model that as a UBM mean supervector, plus an eigenvoice matrix multiplied by a speaker factor vector, plus an eigenchannel matrix multiplied by a channel factor vector.
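
Written out, the supervector model just described takes the following form. The symbols are assumptions chosen to match the spoken description: s is the speaker-dependent GMM mean supervector, m the UBM mean supervector, V the eigenvoice matrix, y the speaker factor vector, U the eigenchannel matrix and z the channel factor vector.

```latex
\begin{equation}
  s = m + V y + U z
\end{equation}
```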

And assume that each speaker in the world is mapped to a unique speaker factor vector y.

Then you can just change the previous equation we had: we just replace the w's with y's.

Of course this still doesn't have any practical use.

What we want to do is move to some kind of analytical form where we can introduce the priors that we have on y and z.

So the first step is to turn the summation over speakers into an integral.

To do this, first we have to realise that the summation is over speakers, not over the y's, whereas the integral is done over the y's.

You actually have to go through some really basic calculus, where the integral breaks down into Riemann summation forms.

And this is actually the correct form you get for the probability that the two utterances are from the same speaker.

And this form of equation actually turns up in other contexts too, which is quite interesting.

Here you see that you have a W in the denominator, which means that if W goes to infinity then this probability goes to zero.

That intuitively makes sense: you're trying to calculate the probability that they came from the same speaker, but if you have an infinite number of speakers, then that probability should go to zero.
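
The slide equation is not in the transcript, so the following is only a reconstruction consistent with the spoken description, under two assumptions: every speaker has the same prior probability P(w_i) = 1/W, and the speaker factors y_i of the W speakers are distributed according to a prior p(y).

```latex
\begin{align}
P(H_1 \mid X_a, X_b)
  &= \sum_{i=1}^{W} P(w_i \mid X_a)\, P(w_i \mid X_b) \\
  &= \sum_{i=1}^{W}
     \frac{p(X_a \mid y_i)\, P(w_i)}{p(X_a)} \cdot
     \frac{p(X_b \mid y_i)\, P(w_i)}{p(X_b)} \\
  &\approx \frac{1}{W} \cdot
     \frac{\int p(X_a \mid y)\, p(X_b \mid y)\, p(y)\, dy}{p(X_a)\, p(X_b)},
  \qquad
  p(X) = \int p(X \mid y)\, p(y)\, dy .
\end{align}
```

The last step replaces the average over speakers, (1/W) times the sum of f(y_i), by the expectation of f(y) under p(y). The remaining factor 1/W is what drives the probability to zero as W grows.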

So now all we need is closed-form expressions for the prior p(X) and the conditional p(X | y).

The first thing we did was simplify the problem by ignoring the intra-speaker variability.

So let's just set U to zero and just use s = m + V y, so we just have the eigenvoices, not the eigenchannels.

um

a and the second that assumption that we said

was that um

well

yep i i got into that

a a two add them use that we have to

a use um

a

just take just

but just of these these to have them is use that first

uh a in the house and that that

and i have to and can be written as a glass in with respect to the mean

a the second i'd we use that the product of two thousand is is also a gaussian that's all you

really need to know is that be a normalized gaussian there's gonna be some

some scale factors that at the beginning but

is essentially just gonna a gas
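
The product identity can be checked numerically in one dimension. This is a small sketch with hypothetical function names; the talk applies the multivariate version of the same identity.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gaussian_product(m1, v1, m2, v2):
    """Return (scale, mean, var) such that
    N(x; m1, v1) * N(x; m2, v2) == scale * N(x; mean, var) for every x."""
    var = 1.0 / (1.0 / v1 + 1.0 / v2)
    mean = var * (m1 / v1 + m2 / v2)
    scale = normal_pdf(m1, m2, v1 + v2)  # the factor that breaks normalisation
    return scale, mean, var

scale, mean, var = gaussian_product(0.0, 1.0, 2.0, 4.0)
lhs = [normal_pdf(x, 0.0, 1.0) * normal_pdf(x, 2.0, 4.0) for x in (-1.0, 0.5, 3.0)]
rhs = [scale * normal_pdf(x, mean, var) for x in (-1.0, 0.5, 3.0)]
```

The scale factor does not depend on x, which is why a long chain of such products still collapses to a single Gaussian times a constant.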

And then another assumption that we make, to simplify the computation, is that each vector in each utterance was generated by only one Gaussian in the GMM, not the whole mixture, because if you use the whole mixture then the computation becomes too complicated.

So now you can see here that the mixture summation is just replaced by a single Gaussian.

And how do we decide which mixture component generated each frame?

Well, one way is to just obtain the maximum likelihood estimate of y for each utterance, which then gives the parameters of the GMM.

And then for each frame you just find the Gaussian with the maximum occupation probability.
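
The per-frame hard assignment can be sketched like this. It assumes diagonal covariances and takes the GMM parameters as plain arrays; in the talk the means would come from the maximum likelihood estimate of y for the utterance, which is not reproduced here.

```python
import numpy as np

def hard_assign(frames, weights, means, variances):
    """Index of the Gaussian with the highest occupation probability per frame.

    frames: (T, D); weights: (K,); means, variances: (K, D), diagonal covariances.
    """
    diff = frames[:, None, :] - means[None, :, :]  # (T, K, D)
    log_gauss = -0.5 * (
        np.log(2 * np.pi * variances).sum(axis=1)[None, :]
        + (diff ** 2 / variances[None, :, :]).sum(axis=2)
    )
    # The posterior normaliser is shared by all components of a frame,
    # so the argmax of the unnormalised log posterior is enough.
    log_post = np.log(weights)[None, :] + log_gauss
    return log_post.argmax(axis=1)

weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
frames = np.array([[0.1, -0.2], [4.8, 5.1], [0.3, 0.2]])
assignment = hard_assign(frames, weights, means, variances)
```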

So now you can see that this conditional is basically just a multiplication of Gaussians. All we have is a whole string of Gaussians multiplied together.

We know that when you multiply Gaussians you get another Gaussian, although it's not normalized.

So you just repeatedly apply that identity to pairs of Gaussians along the whole string of multiplications.

You don't need to pay too much attention to the math up here, but the point is that if you keep going, you're basically just going to get one Gaussian multiplied by some complicated factor, which depends only on your observations, your eigenvoices, and your universal background model.

And that also allows us to get a closed-form solution for the prior as well.

And here again everything inside the integral can be reduced to a product of Gaussians.

At the end you're left with one Gaussian that is integrated from negative infinity to infinity, so it just integrates to one.

So now you've basically eliminated your integral, and you're just left with all these factors that are based on your input observations, your model parameters, and your pre-trained eigenvoices.

So for everything here we again pretty much go through the same process, and this is actually the final form that we get.

For any two arbitrary speech utterances X_a and X_b, you can actually compute the probability that they came from the same speaker.

It doesn't matter which speaker that is; we actually marginalize over all the speakers in the world.

And this is basically the closed-form solution that you can compute.

And if you look at this solution, you can actually see that for each utterance you just need a set of sufficient statistics, and these are sufficient to compute the utterance comparison function, this probability.

So in some settings where you don't want to keep the input observation data, you can just extract the sufficient statistics and then discard the observations, if you're in a resource-constrained environment.

So using this as the distance measure, we applied it to the classical clustering method of doing speaker diarisation, on the CallHome data set.

And we used a measure for cluster purity, and a measure for how accurately we estimate the number of speakers.

We actually have to use both of them in conjunction; it doesn't really make sense to just use one of them.
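
A small sketch of why the two measures only make sense together. The purity definition here (fraction of utterances that belong to the majority speaker of their cluster) is one common choice and is an assumption; the talk does not spell out its exact formula.

```python
from collections import Counter

def cluster_purity(cluster_ids, speaker_ids):
    """Fraction of utterances belonging to the majority speaker of their cluster."""
    by_cluster = {}
    for c, s in zip(cluster_ids, speaker_ids):
        by_cluster.setdefault(c, Counter())[s] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(speaker_ids)

speakers = ["A", "A", "B", "B"]  # ground truth: two speakers

singletons = [0, 1, 2, 3]         # every utterance in its own cluster
one_cluster = [0, 0, 0, 0]        # everything merged together
correct = [0, 0, 1, 1]

purity_singletons = cluster_purity(singletons, speakers)
purity_one = cluster_purity(one_cluster, speakers)
purity_correct = cluster_purity(correct, speakers)
estimated_speakers_singletons = len(set(singletons))
```

Leaving every utterance in its own cluster gives perfect purity but estimates four speakers instead of two, so purity alone would reward never merging anything.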

And these are just the optimal numbers that we were able to get using the different distance functions.

We used telephone conversations, where the number of speakers ranged from two to seven.

We used just twelve MFCCs with energy and deltas, and dropped the non-speech frames.

We used eigenvoices that were trained using, I think it was, the Switchboard database.

um

And here you can see that the proposed model has much better performance than the others that we tried.

um

And this isn't really in the paper, but you can actually do an extension to the model.

We originally dropped the eigenchannel matrix for simplicity, but we can actually include it and then go through the same process.

It's actually a lot more involved, but again you can get this kind of closed-form solution, now also involving the eigenchannels that model the intra-speaker variabilities.

And you can easily show that this simplifies to the previous one we had if you set the eigenchannel matrix to zero.

So we actually tried this as an additional experiment, using eigenchannel matrices that we trained on, I think, a microphone database.

And that actually improved the accuracy on the CallHome task by one or two percentage points.

And there are actually more extensions that you can do here. You can also derive this equation for the general case of N speakers instead of just two.

So that's pretty much it.

Thank you very much.

We have time for one or two questions.

So I have a question about the CallHome data: what did you do with the overlapping speech?

There was overlap, but each channel was recorded separately.

So when there was overlapping speech I basically just discarded one channel and just used one channel, so as to ensure that there's only one speaker talking for each utterance, since we're just doing the clustering task.

I just used the manual transcriptions to pre-segment the utterances, so the utterances were basically perfect.

And did you try to just see what happens when there is overlapping speech, just to see whether it's assigned to a single speaker, a new speaker, or something?

That would be interesting to try.

ooh

a question

oh

so we my vision in first or

yep

oh

that

a

Yeah, I did actually try with the BIC.

The performance actually wasn't too great, so I didn't mention it.

For this task it just seemed like the GLR gave better results than the BIC.

you know

i

use can be very well

oh

yeah yeah i actually did better

hmmm

Yeah, I mean, I wish I had the NIST database, but we don't have it.

so

hmmm

the movies because this a the simply greedily

it's from calls

hmmm

that's from goes that you recorded two

it to the at so own clue it's

um maybe it's because of the the range

frequency considering

hmmm

Yeah, I don't remember if it was 8k or 16k.

okay

Okay, thank you.