I'm going to talk about an utterance comparison model that we're proposing for speaker clustering using factor analysis. First, let me define what exactly we mean by speaker clustering, because the term is used in different contexts with subtle variations. In our study, we define speaker clustering as the task of clustering a set of speaker-homogeneous speech utterances such that each cluster corresponds to a unique speaker. When I say a speaker-homogeneous speech utterance, I mean an utterance, which is a set of speech feature vectors, that contains speech from only one speaker. The number of speakers is not known in advance.

As for applications, one is speech recognition, for example when you want to use a predefined set of speaker clusters to do robust speaker adaptation when test data is very limited. Speaker clustering is also used in the classical method of speaker diarization. Speaker diarization is this: given an unlabeled recording of an unknown number of unknown speakers talking, you have to determine the parts spoken by each person. In the example here, it's just a sixty-second recording of a conversation. What you can do is divide it up into small chunks and assume each chunk is one utterance, meaning it only contains speech by one person. You do some kind of clustering of these chunks, and then you have your clusters: a first cluster here, a second one there, and so on. If the number of clusters equals the number of speakers, and each cluster actually contains speech from only one person, then you have perfect speaker diarization. Of course, in reality errors can occur: sometimes there may be more speakers than clusters, sometimes there may be fewer, and so on.

So this is just the classical speaker diarization method; of course, more state-of-the-art methods exist that don't use this approach, for example ones based on variational inference, but here let's look at this class of methods. You have a speech signal that is segmented into these speaker-homogeneous utterances, you use some kind of distance measure to compute the distance between the utterances, you merge the closest pair, check whether some stopping criterion is met, and if it is not, you loop back and continue clustering until you're done.

There are some popular distance measures for this task. Given two arbitrary speech utterances X_a and X_b, what is the distance between them? You have things like the generalized likelihood ratio (GLR), the cross likelihood ratio (CLR), or the Bayesian information criterion (BIC). For all of these, you have to estimate some GMM parameters from each utterance, then compute likelihoods, and use those to create some kind of ratio that determines how close the utterances are to each other.

So why are we looking for a better distance measure? If you look at the GLR or BIC, they are mostly mathematical constructs: there is no rigorous justification of how they compare utterances based on a physical notion of speaker similarity, and there is no real statistical training involved. In that sense they are kind of ad hoc when you just drop them into a speaker clustering task. To address these problems, trained distance metrics have been proposed, as well as eigenvoice-based methods.
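The bottom-up loop described above can be sketched in a few lines. This is a minimal illustration rather than the system from the talk: the GLR here models each utterance with a single full-covariance Gaussian instead of a GMM, and the stopping threshold is a made-up parameter.

```python
import numpy as np

def glr_distance(xa, xb):
    """Generalized Likelihood Ratio between two utterances, each modeled
    by a single full-covariance Gaussian (a simplification of the
    GMM-based GLR mentioned in the talk)."""
    def loglik(x, mu, cov):
        d = x.shape[1]
        diff = x - mu
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        return -0.5 * (quad + d * np.log(2 * np.pi)
                       + np.linalg.slogdet(cov)[1]).sum()
    def fit(x):
        # Small ridge keeps the covariance invertible for short utterances.
        return x.mean(axis=0), np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    xab = np.vstack([xa, xb])
    # Large value => a joint model fits much worse => utterances look different.
    return loglik(xa, *fit(xa)) + loglik(xb, *fit(xb)) - loglik(xab, *fit(xab))

def agglomerative_cluster(utts, distance, stop_threshold):
    """Bottom-up clustering: repeatedly merge the closest pair of
    utterances until the smallest distance exceeds the threshold."""
    clusters = [np.asarray(u) for u in utts]
    while len(clusters) > 1:
        d, i, j = min((distance(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > stop_threshold:          # stopping criterion met
            break
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

With well-separated toy speakers, same-speaker pairs score a much smaller GLR than cross-speaker pairs, and the loop merges them first.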
In particular, eigenvoices, eigenchannels, and factor analysis provide a very elegant and rigorous framework for modeling inter-speaker and intra-speaker variability, and we wanted to use this to come up with something that we think is a more reasonable distance measure, or method of comparing utterances.

The first thing we thought about was how to define a meaningful way to compare utterances. What exactly are we trying to do when we cluster? If we have two speech utterances and we think they came from the same speaker, then we should cluster them together; if we don't think they came from the same speaker, then we shouldn't. That's what we basically want to capture, so we simply define our similarity metric as the probability that the two utterances were spoken by the same person.

How do we define that probability? If you knew the exact posterior probability of each speaker given an arbitrary utterance, P(W_i | X), then you could simply write down the probability P(H_1), where H_1 is the hypothesis that X_a and X_b, two arbitrary utterances, are from the same speaker. You can set this up using basic probability: given X_a, what is the probability that your speaker is W_i, and given X_b, what is the probability that your speaker is W_i? You multiply these two and sum over all the speakers in the world, where the set of W_i is basically the speaker population of the world:

P(H_1) = Σ_i P(W_i | X_a) P(W_i | X_b).

We can also define the null hypothesis H_0, that X_a and X_b come from different speakers, where you simply do the same summation over all pairs
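Under these definitions the two hypotheses partition all possibilities, which a tiny numeric check makes concrete. The posteriors below are made-up values over a toy four-speaker "population":

```python
import numpy as np

# Hypothetical posteriors P(W_i | X) over a toy population of 4 speakers.
post_a = np.array([0.7, 0.2, 0.05, 0.05])   # P(W_i | X_a)
post_b = np.array([0.1, 0.6, 0.2, 0.1])     # P(W_i | X_b)

# H1 (same speaker): sum_i P(W_i | X_a) * P(W_i | X_b)
p_h1 = np.dot(post_a, post_b)
# H0 (different speakers): sum over all pairs i != j
p_h0 = np.sum(np.outer(post_a, post_b)) - p_h1

print(p_h1)            # 0.205 for these toy posteriors
assert abs(p_h1 + p_h0 - 1.0) < 1e-12   # the two hypotheses exhaust all cases
```

Because each posterior sums to one, the full outer-product sum is one, so P(H_0) is just 1 minus P(H_1).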
i and j with i ≠ j. It's very easy to show that these two probabilities add up to one. So these are exact; you can derive them with very basic probability. Of course, they are impractical as stated: there is no way we can really estimate these posteriors over every speaker in the world. This is where factor analysis comes in. You can model a speaker-dependent GMM mean supervector M as the UBM mean supervector m, plus an eigenvoice matrix V multiplied by a speaker factor vector y, plus an eigenchannel matrix U multiplied by a channel factor vector z:

M = m + V y + U z.

If we assume that each speaker in the world is mapped to a unique speaker factor vector y, then you can take the previous equations and just replace the W_i's with y_i's. This still doesn't have any practical value; what we want to do is work it into some kind of analytical form where we can introduce the priors that we have on y and z.

The first step is to get rid of the estimation of the posteriors, so we turn the summation into an integral. The first thing to realize is that the summation is over speakers, whereas the integral is done over y, so you have to apply some really basic calculus, and the probability comes down from the summation form to an integral form. You actually get the correct closed expression for the probability that the two utterances are from the same speaker, and this equation has actually turned up in different contexts before, which is quite interesting. Here you see that you have the population size W in the denominator, which means that if W goes to infinity, this probability goes to zero. That intuitively makes sense: you are trying to
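The supervector model itself can be sketched directly. All of the sizes and matrices below are random toy values; in practice V and U are pretrained on large corpora and m comes from the universal background model.

```python
import numpy as np

rng = np.random.default_rng(0)
C, F = 8, 13          # GMM components and feature dimension (toy sizes)
D = C * F             # supervector dimension: stacked component means
Ry, Rz = 5, 3         # ranks of the eigenvoice / eigenchannel subspaces

m = rng.normal(size=D)            # UBM mean supervector (pretrained in practice)
V = rng.normal(size=(D, Ry))      # eigenvoice matrix (pretrained in practice)
U = rng.normal(size=(D, Rz))      # eigenchannel matrix (pretrained in practice)

y = rng.normal(size=Ry)           # speaker factors, standard normal prior
z = rng.normal(size=Rz)           # channel factors, standard normal prior

M = m + V @ y + U @ z             # speaker- and session-dependent supervector
means = M.reshape(C, F)           # unstack into per-component adapted GMM means
```

The low ranks Ry and Rz are what makes the model tractable: a speaker is described by a handful of factors rather than a full supervector.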
calculate the probability that two utterances came from the same speaker, and if you have an infinite number of speakers, that probability should indeed go to zero.

So now what we need are closed-form expressions for the prior P(X) and the conditional P(X | y). The first thing we did was simplify the problem by ignoring the intra-speaker variability: we set z to zero and just use M = m + V y, so we have only the eigenvoices, not the eigenchannels.

Then we make use of two identities for Gaussians. The first is that a Gaussian in x with mean μ can also be written as a Gaussian in μ. The second is that the product of two Gaussians is also a Gaussian; it won't be a normalized Gaussian, as there will be some scale factor in front, but it is essentially just a Gaussian.

Another assumption we make, to simplify the computation, is that each feature vector in each utterance was generated by only one Gaussian in the GMM, not the whole mixture, because if you use the whole mixture the computation becomes too complicated. So the mixture summation is replaced by a single Gaussian. How do we decide which mixture component generated each frame? One way is to obtain the maximum likelihood estimate of y for each utterance, which then fully describes the parameters of the GMM, and then for each frame you just find the Gaussian with the maximum occupation probability.

Now you can see that this conditional is basically just a multiplication of Gaussians: all we have is a whole string of Gaussians multiplied together.
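The second identity, that the product of two Gaussians is a scaled Gaussian, is easy to verify numerically in one dimension. The combined mean, variance, and scale factor below follow the standard formulas; all the concrete numbers are arbitrary.

```python
import math

def npdf(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu1, v1 = 1.0, 2.0     # first Gaussian (arbitrary values)
mu2, v2 = 3.0, 0.5     # second Gaussian

v = 1.0 / (1.0 / v1 + 1.0 / v2)       # variance of the product Gaussian
mu = v * (mu1 / v1 + mu2 / v2)        # mean of the product Gaussian
s = npdf(mu1, mu2, v1 + v2)           # scale factor in front

# N(x|mu1,v1) * N(x|mu2,v2) == s * N(x|mu,v) for every x
for x in (-1.0, 0.3, 2.5):
    assert abs(npdf(x, mu1, v1) * npdf(x, mu2, v2) - s * npdf(x, mu, v)) < 1e-12
```

Note that the scale factor s does not depend on x, so when the product appears inside an integral over x, the remaining normalized Gaussian integrates to one and only s survives; that is exactly the mechanism used in the derivation below.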
When you multiply Gaussians you get another Gaussian, although not a normalized one, so you just continuously apply that identity to pairs of Gaussians along the whole string of products. Don't pay too much attention to the math up here; the point is that if you keep going, you basically just end up with one Gaussian multiplied by some complicated scale factor, which now depends only on your observations, your eigenvoices, and your universal background model. The same process also leads to a closed-form solution for the prior: everything inside the integral again reduces to a multiple of a Gaussian, and at the end you are left with one normalized Gaussian integrated from negative infinity to infinity, which just equals one. So you have basically destroyed the integral, and you are left with all these factors that depend only on your input observations, your background model, and your pretrained eigenvoices.

Going through the same process for everything, this is the final form we get: for two arbitrary speech utterances X_a and X_b, you can actually compute in closed form the probability that they came from the same speaker. It doesn't matter which speaker that is; we marginalize over all the speakers in the world. If you look at this solution, you can see that for each utterance you just need a small set of sufficient statistics, and these are enough to compute the utterance comparison function, this probability. So in some settings, if you don't want to keep the input observation data, you can just extract the sufficient statistics and discard the observations, for example if you are operating in a memory-constrained environment.
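As a generic illustration of what "sufficient statistics" means here: the exact terms in the talk's closed form come from its own derivation, so the function below just computes the standard zeroth- and first-order occupancy statistics that eigenvoice computations are typically built from, using the talk's one-Gaussian-per-frame hard assignment.

```python
import numpy as np

def sufficient_stats(frames, weights, means, covs):
    """Zeroth- and first-order statistics of an utterance against a
    diagonal-covariance UBM, with each frame hard-assigned to its
    best-scoring component (the single-Gaussian-per-frame simplification)."""
    C, F = means.shape
    ll = np.empty((len(frames), C))
    for c in range(C):
        diff = frames - means[c]
        # log w_c + log N(x | mu_c, diag(cov_c)) for every frame
        ll[:, c] = (np.log(weights[c])
                    - 0.5 * np.sum(diff ** 2 / covs[c]
                                   + np.log(2 * np.pi * covs[c]), axis=1))
    top = ll.argmax(axis=1)                              # hard occupation decisions
    n = np.bincount(top, minlength=C).astype(float)      # zeroth-order counts
    f = np.zeros((C, F))
    for c in range(C):
        f[c] = frames[top == c].sum(axis=0)              # first-order sums
    return n, f   # keep these, discard the raw frames
```

Once n and f are stored per utterance, the raw frames can be thrown away and pairwise comparisons computed from the statistics alone.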
As a demonstration of this distance measure, we applied it to the classical clustering method of doing speaker diarization, on the CallHome data set. We used one measure for cluster purity and another measure for how accurately we estimate the number of speakers; you actually have to use both of them in conjunction, since it doesn't really make sense to use just one. These are the optimal numbers we were able to get using four different distance functions. We used telephone conversations with the number of speakers ranging from two to seven, twelve MFCCs with energy, and we dropped the non-speech frames. The eigenvoices were trained using, I think, the Switchboard database. Here you can see that the proposed model gives much better performance than the other measures we tried.

This isn't really in the paper, but you can actually derive an extension to the model. We originally dropped the eigenchannel matrix for simplicity, but we can include it and go through the same process; the derivation is actually a lot more involved, but again you can get this kind of closed-form solution, now also involving the eigenchannels, which model the intra-speaker variability. You can easily show that this simplifies to the previous solution if you set the eigenchannel matrix to zero. We tried this as an additional experiment, using eigenchannel matrices trained on, I think, a microphone database, and that gave a further improvement.
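Cluster purity can be computed in a few lines; the paper may define its purity and speaker-count measures differently, so this is just one common formulation, with made-up labels.

```python
from collections import Counter

def average_cluster_purity(clusters):
    """Item-weighted average purity: for each cluster, the fraction of its
    items that belong to the cluster's dominant speaker, averaged over all
    items. 1.0 means every cluster is speaker-homogeneous."""
    total = sum(len(c) for c in clusters)
    dominant = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return dominant / total

# Toy example: two clusters of reference speaker labels.
clusters = [["A", "A", "B"], ["B", "B"]]
print(average_cluster_purity(clusters))  # (2 + 2) / 5 = 0.8
```

Purity alone is not enough, as the talk notes: putting every utterance in its own cluster gives perfect purity, which is why it must be paired with a measure of how well the number of speakers is estimated.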
Specifically, it improved the accuracy on the CallHome task by one or two percentage points. There are more extensions you can pursue: you can also derive this equation for the general case of N utterances instead of just two. So that's pretty much it; thank you very much, and I can take one or two questions.

[Question: does the CallHome data contain overlapping speech, and how was it handled?] There was some, but each channel was recorded separately, so when there was overlapping speech I basically just discarded one channel and used the other, to ensure that there was only one speaker talking in each utterance. For the clustering task I just used the manual transcriptions to pre-segment the utterances, so the utterances were basically pure. [The questioner suggests it would be interesting to see what happens with overlapping speech, whether it comes out as a single speaker or as a new speaker.] Yes, that would be of interest to try.

[Question about BIC.] Yes, I did actually try it with BIC; the performance wasn't too great, so I just didn't mention it. For this task it just seemed like the GLR gave better results. [Follow-up.] Yes, the GLR actually did better. I wish I had the NIST database, but we don't have it. [The questioner suggests it may be because CallHome is recorded from phone calls.] Maybe it's because of the frequency range; I don't remember whether it was 8 kHz or 16 kHz. Okay, thank you.