0:00:13 Hello, my name is [inaudible], and I'm going to talk about an utterance comparison model that we're proposing for speaker clustering using factor analysis.
0:00:23 First, I'm going to define what exactly we mean by speaker clustering, because the term is used in different contexts with subtle variations. 0:00:32 In our study, we define speaker clustering as the task of clustering a set of speaker-homogeneous speech utterances such that each cluster corresponds to a unique speaker. 0:00:44 When I say a speaker-homogeneous speech utterance, I mean that each utterance, which is a set of speech feature vectors, contains speech from only one speaker.
0:00:54 The number of speakers is not known.
0:00:57 As for the applications of this, one is speech recognition, for example when you want to use a predefined set of speaker clusters to do robust speaker adaptation when the test data is very limited. 0:01:08 It is also used in a very classical method of speaker diarization, which is the problem we look at here.
0:01:17 This is the very classical setting: speaker diarization is when you are given an unlabeled recording of an unknown number of unknown speakers talking, and you have to determine the parts spoken by each person. In the example here, we just have a sixty-second recording of a conversation. What we can do is divide it up into small chunks and assume each chunk is one utterance, meaning it contains speech by only one person. 0:01:42 You do some kind of clustering of these chunks, and you end up with clusters: the first cluster here, the second there, and so on. 0:01:51 If the number of clusters equals the number of speakers, and each cluster actually contains speech by only one person, then you have perfect speaker diarization.
0:01:57 Of course, in reality you may make mistakes, like the ones you see on the board: there may actually be more speakers than clusters, or there may actually be fewer speakers than clusters. Those are the kinds of errors that can occur.
0:02:12 This is just the sort of classic speaker diarization method; of course, more state-of-the-art methods don't use this procedure and use, for example, variational inference. 0:02:24 But here, let's look at this class of methods, where the speech signal is segmented into these speaker-homogeneous utterances, and then you use some kind of distance measure to compute the distance between the utterances. You merge the closest two utterances, check whether some stopping criterion is met, 0:02:39 and if it is not, you loop back and continue clustering until you're done.
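A minimal sketch of this bottom-up loop, assuming a generic pairwise distance function and a threshold-based stopping criterion (both `distance` and `stop_threshold` are hypothetical placeholders, not from the talk):

```python
import itertools

def agglomerative_diarization(utterances, distance, stop_threshold):
    """Bottom-up speaker clustering: repeatedly merge the closest
    pair of clusters until the stopping criterion is met."""
    # Start with every speaker-homogeneous chunk in its own cluster.
    clusters = [[u] for u in utterances]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the given distance.
        (i, j), d = min(
            (((i, j), distance(clusters[i], clusters[j]))
             for i, j in itertools.combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        # Stopping criterion: no remaining pair is close enough to merge.
        if d > stop_threshold:
            break
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters
```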
0:02:44 There are some popular distance measures for this task. 0:02:49 Given two arbitrary speech utterances X_a and X_b, what is the distance between them? You have things like the generalized likelihood ratio (GLR), the cross likelihood ratio (CLR), or the Bayesian information criterion (BIC). 0:03:02 For both the GLR and the CLR, you have to estimate some GMM parameters from each utterance, and then compute likelihoods and use those to create some kind of ratio that determines how close the utterances are to each other.
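For reference, the GLR and the BIC-based distance are commonly written along these lines (a sketch of the standard formulations, not necessarily the exact variants used in the talk):

```latex
% GLR: likelihood of modeling X_a and X_b jointly vs. separately,
% with GMMs \lambda_{ab}, \lambda_a, \lambda_b estimated on each data set
\[
\mathrm{GLR}(X_a, X_b) =
  \frac{p(X_a \cup X_b \mid \lambda_{ab})}
       {p(X_a \mid \lambda_a)\, p(X_b \mid \lambda_b)}
\]
% Delta-BIC: the same log ratio plus a model-complexity penalty,
% with P free parameters, N total frames, and tuning weight \lambda
\[
\Delta\mathrm{BIC}(X_a, X_b) =
  \log \mathrm{GLR}(X_a, X_b) + \lambda\, \tfrac{P}{2} \log N
\]
```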
0:03:20 So why are we trying to come up with a better distance measure? If you look at the GLR, the CLR, and the BIC, they are mostly just mathematical constructs; they don't really have a rigorous justification of how they compare utterances based on a physical notion of speaker similarity. 0:03:41 There is also no real statistical training involved. 0:03:45 So in that sense they are somewhat ad hoc when you just apply them to a speaker clustering task.
0:03:52 To address these problems, trained distance metrics have been proposed, as well as eigenvoice-based methods. 0:04:01 In particular, eigenvoice and eigenchannel factor analysis provides a very elegant and proven framework for modeling inter-speaker and intra-speaker variability, 0:04:14 and we wanted to use this to come up with something that we think is a more reasonable distance measure, or method of comparing utterances.
0:04:21 The first thing we asked was: how do we define a way to compare utterances? What exactly are we trying to do when we cluster? If we have two speech utterances and we think they came from the same speaker, then we should cluster them together; 0:04:37 if we don't think they came from the same speaker, then we should not. 0:04:41 That is basically what we are trying to do. 0:04:44 So we simply define the probability that the two utterances were spoken by the same person, 0:04:51 and that is our similarity metric.
0:04:54 So how do we define this probability? 0:04:58 If you knew perfectly the posterior probability of each speaker given an arbitrary utterance, that is, P(w_i | X), 0:05:07 then you could simply write the probability of H_1, which is the hypothesis that X_a and X_b, two arbitrary utterances, are from the same speaker. 0:05:20 You can set up the equation using just basic probability: take the posterior probability that, given X_a, your speaker is w_i, and the posterior probability that, given X_b, your speaker is w_i; you multiply these two, and then you sum over all the speakers in the world, where the set of w's is basically the population of the world. 0:05:53 We can also define the null hypothesis H_0, that X_a and X_b come from different speakers, by simply doing the same summation for all i and j that are different. 0:06:06 It is then very easy to show that these two probabilities add up to one. 0:06:10 So these are exact; they use just very basic probability.
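In symbols, writing P(w_i | X) for the posterior probability of speaker w_i given utterance X, the two hypotheses described above come out as (a reconstruction of the slide's equations from the spoken description):

```latex
\[
P(H_1 \mid X_a, X_b) = \sum_i P(w_i \mid X_a)\, P(w_i \mid X_b)
\]
\[
P(H_0 \mid X_a, X_b) = \sum_i \sum_{j \neq i} P(w_i \mid X_a)\, P(w_j \mid X_b)
\]
```

Since each posterior sums to one over speakers, the two expressions together exhaust the double sum over all (i, j) pairs, which equals one; this is why the two probabilities add up to one.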
0:06:14 One can question these, of course, but the real problem is that they are impractical: there is no way we can really evaluate these posteriors. 0:06:22 This is where factor analysis comes in.
0:06:26 If you have a speaker-dependent GMM mean supervector, you can model it as the UBM mean supervector, plus an eigenvoice matrix multiplied by a speaker factor vector, plus an eigenchannel matrix multiplied by a channel factor vector. 0:06:45 If we assume that each speaker in the world is mapped to a unique speaker factor vector y, then we can just change the previous equation by replacing the w's with y's. 0:06:56 Of course, this still doesn't have any practical value; what we want to do is mold it into some kind of analytical form where we can introduce the priors that we have on y and z.
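In the usual joint factor analysis notation (symbol names are mine, following common usage), the supervector model just described is:

```latex
\[
\mathbf{s} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{z}
\]
```

where s is the speaker-dependent GMM mean supervector, m the UBM mean supervector, V the eigenvoice matrix, y the speaker factor vector, U the eigenchannel matrix, and z the channel factor vector, with standard normal priors typically placed on y and z.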
0:07:09 The first step is to get rid of the summation over speakers: we replace the summation with an integral. 0:07:22 To do this, the first thing to realize is that the summation is over speakers, not over the y's, whereas the integral is done over the y's. You have to go through some really basic calculus, and the probability breaks down from the summation form into an integral form; this is actually the correct form you get for the probability that the two utterances are from the same speaker. 0:07:50 This equation actually turns up in different contexts too, which is quite interesting. Here you can see that you have the population size in the denominator, which means that if the population size goes to infinity, then this probability goes to zero. 0:08:06 That intuitively makes sense: you are trying to calculate the probability that the utterances came from the same speaker, but if you have an infinite number of speakers, then that probability should go to zero.
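One plausible reconstruction of this step, writing K for the assumed population size, applying Bayes' rule with a uniform speaker prior P(w_i) = 1/K, and approximating the sum over speakers by K times an expectation under the speaker-factor prior p(y):

```latex
\[
P(H_1 \mid X_a, X_b)
  = \sum_i P(w_i \mid X_a)\, P(w_i \mid X_b)
  = \frac{1}{K^2} \sum_i \frac{p(X_a \mid w_i)\, p(X_b \mid w_i)}{p(X_a)\, p(X_b)}
  \approx \frac{1}{K} \cdot
    \frac{\int p(X_a \mid \mathbf{y})\, p(X_b \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}}
         {p(X_a)\, p(X_b)}
\]
```

The leading 1/K is the factor mentioned in the talk: as the population size K goes to infinity, the probability of a same-speaker match goes to zero.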
0:08:17 So now what we need are closed-form expressions for the prior p(X) and the conditional p(X | y).
0:08:27 The first thing we did was simplify the problem by ignoring the intra-speaker variability: we just set z to zero and used s = m + Vy, so we just have the eigenvoices, not the eigenchannels. 0:08:42 Before getting to the second assumption that we made, there are two identities that we have to use, so let me just go over those two identities first. 0:08:57 The first identity is that a Gaussian in x can be rewritten as a Gaussian with respect to its mean. 0:09:02 The second identity we use is that the product of two Gaussians is also a Gaussian; all you really need to know is that it will be an unnormalized Gaussian. There will be some scale factors at the front, but it is essentially just going to be a Gaussian.
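Written out, these are the standard Gaussian results (in my notation):

```latex
% 1. A Gaussian is symmetric in its argument and its mean:
\[
\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \mathcal{N}(\boldsymbol{\mu}; \mathbf{x}, \boldsymbol{\Sigma})
\]
% 2. The product of two Gaussians is an unnormalized (scaled) Gaussian:
\[
\mathcal{N}(\mathbf{x}; \mathbf{a}, \mathbf{A})\,
\mathcal{N}(\mathbf{x}; \mathbf{b}, \mathbf{B})
  = \mathcal{N}(\mathbf{a}; \mathbf{b}, \mathbf{A} + \mathbf{B})\,
    \mathcal{N}(\mathbf{x}; \mathbf{c}, \mathbf{C}),
\]
\[
\mathbf{C} = \left(\mathbf{A}^{-1} + \mathbf{B}^{-1}\right)^{-1},
\qquad
\mathbf{c} = \mathbf{C}\left(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b}\right)
\]
```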
0:09:14 Another assumption that we made to simplify the computation is that each feature vector in each utterance was generated by only one Gaussian in the GMM, not by the whole mixture, because if you use the whole mixture the computation becomes too complicated. 0:09:32 So now you can see here that the mixture summation is just replaced by a single Gaussian. 0:09:39 How do we decide which mixture component generated each frame? 0:09:43 One way is to just obtain the maximum likelihood estimate of the y for each utterance, which then fully describes the parameters of the GMM, 0:09:52 and then, for each frame, you just find the Gaussian with the maximum occupation probability.
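A minimal sketch of that hard assignment, assuming the utterance's GMM (weights, means, diagonal covariances) has already been derived from the ML estimate of y; all names here are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

def hard_align(frames, weights, means, variances):
    """Assign each frame to the single GMM component with the highest
    occupation probability, instead of using the full mixture."""
    # Unnormalized log posterior log(w_m) + log N(x_t; mu_m, diag(var_m))
    # for every frame t and component m.
    log_post = np.stack(
        [np.log(w) + multivariate_normal(mean=mu, cov=np.diag(var)).logpdf(frames)
         for w, mu, var in zip(weights, means, variances)],
        axis=1)                      # shape: (num_frames, num_components)
    # The normalizer p(x_t) is shared by all components, so the argmax of
    # the unnormalized posterior is the max-occupation-probability component.
    return log_post.argmax(axis=1)   # one component index per frame
```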
0:09:58 Now you can see that this conditional is basically just a multiplication of Gaussians: all we have is a whole string of Gaussians multiplied together. We know that when you multiply Gaussians you get another Gaussian, although it is not normalized. 0:10:12 So you just continuously apply that identity to pairs of Gaussians and collapse the whole string of products. 0:10:20 You don't need to pay too much attention to the math up here; the point is that if you keep going, you basically just get one Gaussian multiplied by some complicated scale factor, which now depends only on your observations, your eigenvoices, and your universal background model.
0:10:44 This also allows us to obtain a closed-form solution for the prior as well. 0:10:50 Here again, everything in the integrand just becomes a product of Gaussians; at the end you are left with one Gaussian that is integrated from negative infinity to infinity, so it just integrates to one. 0:11:01 So now you have basically destroyed the integral, and you are just left with all these factors that are based only on your input observations, your model, and your pre-trained eigenvoices.
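Schematically, with z ignored and a standard normal prior on the speaker factors (my notation), the prior being computed is

```latex
\[
p(X) = \int p(X \mid \mathbf{y})\, \mathcal{N}(\mathbf{y}; \mathbf{0}, \mathbf{I})\, d\mathbf{y}
\]
```

and because p(X | y) collapses, via the identities above, into a single scaled Gaussian in y, the remaining normalized Gaussian integrates to one and only the scale factors survive.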
0:11:15 For everything here, you again pretty much go through the same process, and this is actually the final form that you get for two arbitrary speech utterances X_a and X_b: you can actually compute the probability that they came from the same speaker. 0:11:33 It doesn't matter which speaker that is; we actually marginalize over all the speakers in the world. 0:11:39 This is basically the closed-form solution that you can arrive at. 0:11:45 If you look at this solution, you can see that for each utterance you just need a set of sufficient statistics, and these are sufficient to compute your utterance comparison function, this probability. 0:12:03 So in some settings, if you don't want to keep the input observation data, you can just extract the sufficient statistics and then discard the observations, for example if you are in a memory-constrained environment.
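As an illustration of what such per-utterance statistics might look like, here is a sketch that accumulates the usual zeroth- and first-order statistics under the hard alignment above, after which the raw frames can be discarded (the exact statistics used in the paper may differ):

```python
import numpy as np

def utterance_statistics(frames, alignment, num_components):
    """Accumulate per-component frame counts (zeroth order) and
    feature sums (first order) for one utterance, so that the raw
    observations no longer need to be stored."""
    dim = frames.shape[1]
    counts = np.zeros(num_components)         # zeroth-order statistics
    sums = np.zeros((num_components, dim))    # first-order statistics
    for m in range(num_components):
        mask = (alignment == m)
        counts[m] = mask.sum()
        sums[m] = frames[mask].sum(axis=0)
    return counts, sums   # enough to evaluate the comparison function
```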
0:12:22 So, armed with this distance measure, we applied it to the classical clustering method of doing speaker diarization on the CallHome data set. 0:12:33 We used one measure for cluster purity and another measure for how accurately we estimate the number of speakers. We actually have to use both of them in conjunction; it doesn't really make sense to use just one of them. 0:12:50 These are just the optimal numbers that we were able to get using the four different distance functions. 0:12:57 We used [number inaudible] telephone conversations, with the number of speakers ranging from two to seven, and just twelve MFCCs with energy and deltas; we dropped the non-speech frames. 0:13:08 The eigenvoices were trained using, I think it was, the Switchboard database.
0:13:20 And here you can see that the proposed model has much better performance than the others that we tried.
0:13:31 This isn't really in the paper, but you can actually do an extension to the model. 0:13:37 We originally dropped the eigenchannel matrix for simplicity, but we can actually include it and then go through the same process. It's actually a lot more involved, but again you can get this kind of closed-form solution, now also involving the eigenchannels that model the intra-speaker variability. 0:13:57 You can actually easily show that this solution simplifies to the previous one we had if you set the eigenchannel matrix to zero. 0:14:09 So we tried this as an additional experiment, using eigenchannel matrices that were trained on, I think, a microphone database, 0:14:20 and it actually improved the accuracy on the CallHome task by one or two percentage points. 0:14:26 There are more extensions you can do here: you can also derive this equation for the general case of N speakers, instead of just two.
0:14:36 So, that's pretty much it. Thank you very much.
0:14:47 [Session chair] We have time for one or two questions.
0:14:56 [Audience] I have a question about the CallHome data: does it contain overlapping speech, and if so, how did you deal with it?
0:15:03 Um, there was overlap, but each channel was recorded separately, 0:15:09 so when there was overlapping speech, I basically just discarded one channel and used only the other, to ensure that there is only one speaker talking in each utterance when doing the clustering task. 0:15:22 I just used the manual transcriptions to pre-segment the utterances, so the utterances were basically pure. 0:15:31 It would be interesting to see what happens when there is overlap, to see whether it comes out as a single speaker or a new speaker or something; that would be interesting to try.
0:15:48 [Audience] [question inaudible]
0:16:07 Yeah, I did actually try it with the BIC. The performance actually wasn't too great, so I just didn't mention it. 0:16:16 For this task, it just seemed like the GLR gave better results than the BIC.
0:16:32 [Audience] [question partly inaudible] 0:16:40 Yeah, it actually did better.
0:16:44 Yeah, I mean, I wish I had the NIST database, but we don't have it.
0:16:53 [Audience] Maybe it's because this data is from phone calls; it's from calls that were recorded over the telephone. Maybe it's because of the frequency range being considered. 0:17:09 Yeah, I don't remember whether it was 8 kHz or 16 kHz.
0:17:15 Okay. 0:17:19 [Session chair] Okay, thank you. Let's thank the speaker again.