0:00:13 | um, and i'm going to talk about

0:00:17 | an utterance comparison model that we're proposing for

0:00:20 | speaker clustering using factor analysis |

0:00:23 | so i'm first going to define what we exactly mean by speaker clustering, because the term is used in different

0:00:29 | contexts with subtle variations

0:00:32 | and in our study we define speaker clustering as the task of clustering a set of speaker-homogeneous speech

0:00:39 | utterances |

0:00:40 | such that each cluster corresponds to a unique speaker |

0:00:44 | and when i say a speaker-homogeneous speech utterance, i mean each utterance, which is a set of

0:00:48 | speech feature vectors, contains speech from only one speaker

0:00:54 | and the number of speakers is unknown

0:00:57 | so the applications of this include,

0:01:00 | for example, speech recognition, where you want to use a predefined set of speaker clusters to do robust

0:01:05 | speaker adaptation when the test data is very limited

0:01:08 | this is also used in the very classical method of speaker diarisation, where you want

0:01:14 | to solve the "who spoke when"

0:01:16 | problem

0:01:17 | so this is a very classical setting: speaker diarisation is when

0:01:22 | you're given an unlabeled recording of an unknown number of unknown speakers talking

0:01:27 | and you need to determine the parts spoken by each person; so in the example here, it's just a sixty

0:01:31 | second

0:01:32 | recording of a conversation, and what you can do is just divide it up into small chunks and

0:01:38 | assume each chunk is one utterance, meaning it only contains speech by one person

0:01:42 | then you do some kind of

0:01:43 | clustering of each of these chunks

0:01:46 | and then

0:01:47 | you have clusters: a first cluster here, a second one there, and so on

0:01:51 | and if the number of clusters equals the number of speakers, and each cluster actually contains speech by only one

0:01:55 | person, then you have perfect speaker diarisation

0:01:57 | of course in reality

0:01:59 | you may have actually done something like this, where the clusters are a little off:

0:02:03 | sometimes there may actually be more speakers than clusters, or there may actually be fewer speakers

0:02:09 | than clusters; those are the kinds of errors that can occur

0:02:12 | so this is just the sort of classic speaker diarisation method; of course the more state

0:02:17 | of the art methods don't use this procedure, using for example variational inference instead

0:02:24 | but here let's look at this class of methods, where we have a speech signal that's

0:02:28 | segmented into

0:02:29 | these speaker-homogeneous utterances

0:02:31 | and then you use some kind of distance measure to compute the distance between the utterances; you merge the

0:02:36 | closest utterances, then check whether some stopping criterion is met;

0:02:39 | if it's not met, you loop back and continue clustering until you're done
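[editor's note] the merge-check-repeat loop described above can be sketched as follows. This is a minimal agglomerative sketch, not the talk's actual system; the pairwise `distance` function and the distance-threshold stopping criterion are placeholders.

```python
def agglomerative_cluster(utterances, distance, stop_threshold):
    """Greedy bottom-up clustering: repeatedly merge the two closest
    clusters until the smallest inter-cluster distance exceeds a threshold."""
    clusters = [[u] for u in utterances]  # start with one cluster per utterance
    while len(clusters) > 1:
        # find the closest pair of clusters (single-linkage distance)
        best_pair, best_dist = None, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if d < best_dist:
                    best_pair, best_dist = (i, j), d
        if best_dist > stop_threshold:  # stopping criterion met
            break
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters
```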

0:02:44 | so here are some popular distance measures for this task

0:02:48 | um |

0:02:49 | given two arbitrary speech utterances X_a and X_b, what is the distance between them?

0:02:54 | you have things like the generalized likelihood ratio, or the cross likelihood ratio, or the

0:02:59 | bayesian information criterion

0:03:02 | and for both the GLR and the CLR you have to estimate

0:03:06 | some GMM parameters from each utterance

0:03:10 | and then you compute likelihoods and then use those to

0:03:14 | create some kind of ratio that determines, you know, how close these utterances

0:03:18 | are to each other
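[editor's note] to make the GLR idea concrete, here is a deliberately simplified 1-D, single-Gaussian sketch (real systems fit GMMs to multi-dimensional features): fit one model per utterance and one to the pooled data, and compare log-likelihoods.

```python
import math

def gauss_loglik(xs, mu, var):
    """Total log-likelihood of samples xs under N(mu, var)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var) for x in xs)

def ml_fit(xs):
    """Maximum-likelihood mean and (biased) variance of a 1-D sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def glr_distance(xa, xb):
    """Generalized likelihood ratio distance: how much worse a single
    shared model explains the data than two utterance-specific models."""
    mu_a, var_a = ml_fit(xa)
    mu_b, var_b = ml_fit(xb)
    mu_ab, var_ab = ml_fit(xa + xb)
    separate = gauss_loglik(xa, mu_a, var_a) + gauss_loglik(xb, mu_b, var_b)
    joint = gauss_loglik(xa + xb, mu_ab, var_ab)
    return separate - joint  # >= 0; larger means "less likely same speaker"
```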

0:03:20 | so |

0:03:20 | now, why do we think we can come up with a better distance measure? i mean, for

0:03:25 | example, if you look at these,

0:03:27 | the GLR, the CLR, and the BIC, they're mostly really mathematical constructs; i mean

0:03:32 | they don't really have a rigorous justification for how they compare utterances

0:03:37 | based on

0:03:38 | a physical notion of speaker similarity

0:03:41 | and there's no real statistical training involved

0:03:45 | um, so in that sense they're kind of ad hoc when you just throw them

0:03:49 | into,

0:03:50 | you know, a speaker clustering task

0:03:52 | and so, to address these problems, there have been trained distance metrics proposed, and eigen

0:03:57 | voice,

0:03:59 | eigenvoice-based

0:04:00 | methods

0:04:01 | especially the eigenvoice and eigenchannel factor analysis: this

0:04:06 | provides a very elegant and robust framework for modeling inter-speaker

0:04:11 | and intra-speaker variability, and we,

0:04:14 | we want to try to use this to come up with something that we think is a more reasonable

0:04:18 | distance measure, or method of comparing utterances

0:04:21 | so the first thing we thought was,

0:04:24 | how do we define a

0:04:27 | way to compare utterances? what exactly are we trying to do?

0:04:31 | when we cluster, if we have two speech utterances

0:04:34 | and we think that they came from the same speaker, then we should cluster them

0:04:37 | and if we don't think they came from the same speaker, then we shouldn't cluster them

0:04:41 | that's what we're doing,

0:04:42 | basically

0:04:43 | so,

0:04:44 | so we just define it as,

0:04:47 | you know, the probability that the two utterances were spoken by the same person

0:04:51 | and that's our similarity

0:04:53 | metric

0:04:54 | so how do we define this probability? well,

0:04:58 | if you

0:05:00 | knew perfectly the posterior probability

0:05:02 | of each speaker given an arbitrary utterance, this P of W_i given X,

0:05:07 | then you could simply write

0:05:10 | this,

0:05:11 | the probability of H1, which is the probability of

0:05:14 | the hypothesis that X_a and X_b, which are two arbitrary utterances,

0:05:19 | are from the same speaker

0:05:20 | and you can just simply set up the equation this way, just using basic probability:

0:05:25 | the probability, given X_a,

0:05:29 | of your speaker

0:05:33 | being W_i,

0:05:38 | and then the probability, given X_b, that your

0:05:41 | speaker is W_i; you just multiply these two, and then you just sum up over

0:05:46 | all the speakers in the world, so this set of W's

0:05:49 | is basically the population of the world

0:05:51 | so,

0:05:53 | and we can also, in a

0:05:55 | similar way, define

0:05:57 | the null hypothesis, where X_a and X_b come from different speakers, and then

0:06:02 | you simply do this summation

0:06:03 | over the i's and j's which are different

0:06:05 | and then

0:06:06 | it's very easy to show that these two probabilities are going to add up to one
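[editor's note] written out, the two hypothesis probabilities described here take this form; this is a reconstruction from the verbal description, with $P(w_i \mid X)$ the posterior of speaker $w_i$ given utterance $X$:

```latex
P(H_1 \mid X_a, X_b) = \sum_i P(w_i \mid X_a)\, P(w_i \mid X_b)
\qquad
P(H_0 \mid X_a, X_b) = \sum_i \sum_{j \neq i} P(w_i \mid X_a)\, P(w_j \mid X_b)
```

they add to one because $\sum_i P(w_i \mid X_a) \sum_j P(w_j \mid X_b) = 1$, and the two sums above partition that double sum into the $i = j$ and $i \neq j$ terms.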

0:06:10 | so these are,

0:06:11 | exact,

0:06:12 | just very basic probability

0:06:14 | one could compute these,

0:06:16 | of course, but

0:06:17 | they're impractical

0:06:19 | i mean, there's no way we can really

0:06:21 | measure these posteriors

0:06:22 | so this is where factor analysis comes in

0:06:26 | if you have a speaker-dependent GMM mean supervector,

0:06:31 | you can model that as a UBM mean supervector plus

0:06:35 | an eigenvoice matrix multiplied by a speaker factor vector,

0:06:39 | plus an eigenchannel matrix

0:06:40 | multiplied by a channel factor

0:06:42 | vector
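[editor's note] in symbols, the supervector decomposition just described is (reconstructed from the verbal description):

```latex
\mathbf{s} = \mathbf{m} + V\,\mathbf{y} + U\,\mathbf{z}
```

where $\mathbf{s}$ is the speaker-dependent GMM mean supervector, $\mathbf{m}$ the UBM mean supervector, $V$ the eigenvoice matrix with speaker factors $\mathbf{y}$, and $U$ the eigenchannel matrix with channel factors $\mathbf{z}$.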

0:06:43 | and if we

0:06:45 | assume that each speaker in the world is mapped to a unique speaker factor vector y,

0:06:50 | then you can just change the previous equation we had: we just replace the W's

0:06:54 | with y's

0:06:56 | of course this still doesn't have any

0:06:57 | practical value

0:06:59 | what we want to do is mold it into some kind of analytical form where we can

0:07:03 | introduce the priors that we have on y

0:07:07 | and z

0:07:09 | so,

0:07:11 | the first step is,

0:07:15 | we have this summation over the speakers,

0:07:16 | and we just change the summation to

0:07:18 | an integral

0:07:22 | so how do we do this?

0:07:23 | um |

0:07:24 | the first thing to realise is that the summation is over speakers, not over the y's,

0:07:28 | whereas the integral is done over the y's

0:07:31 | so you actually have to go through some really basic calculus, and the probability

0:07:37 | breaks down into the

0:07:39 | summation forms shown here

0:07:41 | and you actually get, this is actually the correct form you get

0:07:45 | for the probability that the two utterances are from the same speaker

0:07:50 | and this equation for P of H1 actually turns up in different contexts too,

0:07:56 | which is quite interesting; here you see that you have a W,

0:08:01 | the number of speakers in the world,

0:08:02 | which means that if W goes to infinity, then this probability goes to zero,

0:08:06 | which intuitively makes sense

0:08:08 | you're trying to calculate the probability that they came from the same speaker, but

0:08:12 | if you have an infinite number of speakers,

0:08:13 | then, yeah, that probability should go to zero

0:08:17 | so now,

0:08:18 | what we need is closed-form expressions for the prior P of X and

0:08:23 | the conditional P of X,

0:08:25 | given y

0:08:27 | so um |

0:08:28 | the first thing we did was, we simplified the problem by ignoring the intra-speaker variability

0:08:34 | so let's just set z to zero

0:08:36 | and just use s equals m plus V y, so we just have the eigenvoices,

0:08:40 | not the eigenchannels

0:08:41 | um |

0:08:42 | and the second assumption that we made,

0:08:46 | well,

0:08:47 | before i get into that,

0:08:48 | there are two identities that we have to

0:08:51 | use

0:08:54 | let's just look at these two identities first

0:08:57 | the first is that a gaussian of x with mean mu

0:08:59 | can be written as a gaussian of mu with x as the mean

0:09:02 | the second identity we use is that the product of two gaussians is also a gaussian; that's all you

0:09:06 | really need to know; it's going to be an unnormalized gaussian, there's going to be some

0:09:09 | scale factor at the beginning, but

0:09:11 | it's essentially just going to be a gaussian
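[editor's note] both identities can be checked numerically in 1-D. This is an illustrative sketch, not the talk's notation: the product of two gaussian densities in x is a scaled gaussian in x, and the scale factor is itself a gaussian in the two means.

```python
import math

def gauss(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gauss_product_params(mu1, var1, mu2, var2):
    """Parameters of the unnormalized Gaussian equal to the product
    N(x; mu1, var1) * N(x; mu2, var2) = scale * N(x; mu12, var12)."""
    var12 = 1.0 / (1.0 / var1 + 1.0 / var2)
    mu12 = var12 * (mu1 / var1 + mu2 / var2)
    # the scale factor is itself a Gaussian evaluated at one mean,
    # centered at the other, with the variances added
    scale = gauss(mu1, mu2, var1 + var2)
    return mu12, var12, scale
```

note that `gauss(x, mu, var) == gauss(mu, x, var)` is the first identity (symmetry of the gaussian in its argument and mean).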

0:09:14 | um |

0:09:15 | and then another assumption that we make,

0:09:17 | to simplify the computation,

0:09:20 | is that we just assume that each vector in each utterance was generated by

0:09:24 | only one gaussian in the GMM, not by the whole mixture, because if you use the whole

0:09:29 | mixture then the computation becomes

0:09:31 | too complicated

0:09:32 | so now you can see here that the mixture summation is just replaced by

0:09:37 | a single gaussian

0:09:39 | and how do we decide which mixture component

0:09:41 | generated each frame?

0:09:43 | well, one way is to just obtain the maximum likelihood estimate of the y

0:09:47 | for each utterance,

0:09:48 | which then fully describes the parameters of the GMM,

0:09:51 | and then,

0:09:52 | then for each frame you just find the gaussian with the maximum

0:09:56 | occupation probability
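[editor's note] that hard assignment step can be sketched in 1-D like this. The weights, means, and variances here are hypothetical, not the talk's model; for each frame we pick the mixture component with the highest occupation probability.

```python
import math

def gauss(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def hard_align(frames, weights, means, variances):
    """For each frame, return the index of the GMM component with the
    highest occupation probability, proportional to w_k * N(x; mu_k, var_k)."""
    out = []
    for x in frames:
        scores = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
        out.append(scores.index(max(scores)))
    return out
```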

0:09:58 | so uh |

0:09:59 | now you can see that this conditional is basically just a multiplication of gaussians; that's all

0:10:04 | we have, just a whole string of gaussians multiplied together

0:10:07 | and we know that when you multiply gaussians you get another gaussian, although it's not normalized

0:10:12 | so you just continuously apply that identity to pairs of gaussians

0:10:17 | in the whole string of multiplications

0:10:20 | and don't pay too much attention to the math here;

0:10:25 | the point is that if you keep going,

0:10:27 | you're basically just going to get one gaussian multiplied by some complicated

0:10:33 | scale factor, which is now dependent on just your observations and your

0:10:40 | eigenvoices

0:10:41 | and your universal background model

0:10:44 | and this also allows us to obtain a closed-form solution

0:10:48 | for the prior as well

0:10:50 | and here again, everything in the integrand is just going to be a multiplication of gaussians;

0:10:55 | at the end you're left with one gaussian that's integrated from negative infinity to infinity, so it just integrates to

0:11:00 | one

0:11:01 | so now,

0:11:02 | you've basically destroyed the integral

0:11:04 | and you're just left with all these

0:11:07 | factors that are just based on your

0:11:08 | input observations and your model,

0:11:11 | and your pre-trained eigenvoices

0:11:15 | so for everything here, again, you pretty much go through the same process, and

0:11:20 | this is actually the final form

0:11:23 | that you can get: for any two arbitrary speech utterances X_a and X_b,

0:11:28 | you can actually compute the probability that they came from the same speaker

0:11:33 | and it doesn't matter which speaker that is, because we actually marginalize over all the speakers in

0:11:38 | the world

0:11:39 | and this is basically the closed-form solution

0:11:43 | that you can put forward

0:11:45 | and if you look at this solution,

0:11:48 | you can actually see that

0:11:49 | for each utterance,

0:11:51 | you just need a set of sufficient statistics,

0:11:58 | and these are sufficient

0:11:59 | to compute your

0:12:00 | utterance comparison function, this probability; so

0:12:03 | in some settings, where maybe you don't want to keep

0:12:06 | the input observation data, you can just

0:12:10 | extract the sufficient statistics

0:12:13 | and then just

0:12:14 | discard

0:12:15 | the observations,

0:12:17 | if you're in a constrained memory environment
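[editor's note] the memory-saving pattern described here, extract per-utterance statistics once, discard the frames, then compare pairs using only the statistics, looks like this in outline. All names and the particular statistics are hypothetical; the real model's statistics are whatever its comparison function requires.

```python
def extract_stats(frames):
    """Hypothetical per-utterance sufficient statistics: here just the
    zeroth-, first-, and second-order sums of a 1-D feature stream."""
    n = len(frames)
    s1 = sum(frames)
    s2 = sum(x * x for x in frames)
    return {"n": n, "s1": s1, "s2": s2}

def compare(stats_a, stats_b):
    """Hypothetical similarity computed from statistics only (no frames
    needed): negative squared difference of the utterance means."""
    mean_a = stats_a["s1"] / stats_a["n"]
    mean_b = stats_b["s1"] / stats_b["n"]
    return -((mean_a - mean_b) ** 2)
```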

0:12:20 | so |

0:12:22 | so with this distance measure, we just applied it to

0:12:26 | the classical clustering method of doing speaker diarisation,

0:12:30 | on the CallHome

0:12:32 | data set

0:12:33 | and we just used a measure for

0:12:38 | cluster purity

0:12:39 | and then a measure for how accurately we estimate the number of speakers

0:12:44 | we actually have to use both of them in conjunction;

0:12:47 | it doesn't really make sense to just use one of them
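[editor's note] the talk doesn't give its exact purity formula, but a common definition is: each cluster votes for its majority speaker, and purity is the fraction of utterances covered by those majorities.

```python
from collections import Counter

def cluster_purity(clusters):
    """clusters: list of lists of true speaker labels, one list per cluster.
    Purity = (sum of each cluster's majority-label count) / total items."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total
```

note that purity alone is gamed by making every utterance its own cluster, which is why it must be paired with a speaker-count accuracy measure, as the speaker says.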

0:12:50 | and these are just the optimal numbers that we were able to get,

0:12:53 | using

0:12:54 | the four different

0:12:56 | distance functions

0:12:57 | we used telephone conversations where the number of speakers ranged from two to seven,

0:13:02 | just twelve MFCCs

0:13:04 | with energy and deltas,

0:13:05 | and dropped the non-speech frames

0:13:08 | and we used eigenvoices that

0:13:12 | we got trained using,

0:13:14 | i think it was,

0:13:16 | the switchboard database

0:13:19 | um |

0:13:20 | and here you can see that the proposed model has much better performance than

0:13:27 | the others that we tried

0:13:30 | um |

0:13:31 | and this isn't really in the paper, but you can actually do

0:13:36 | an extension to the model

0:13:37 | we originally dropped the eigenchannel matrix

0:13:42 | for, you know, simplicity, but now we can actually include it and then go through the same process;

0:13:46 | it's actually a lot more,

0:13:47 | it's actually more involved, but again you can actually get this kind of closed-form solution, where

0:13:53 | now it's also involving the eigenchannels that model

0:13:57 | the intra-speaker variability, and you can actually easily show that this

0:14:02 | closed form simplifies to the previous one we had,

0:14:05 | if you set the eigenchannel matrix to zero

0:14:09 | and so we actually tried this as an additional experiment, using

0:14:13 | eigenchannel matrices that we trained, i think, on

0:14:18 | a microphone database

0:14:20 | and that actually improved the accuracy on the same task by,

0:14:24 | i think, one or two percentage points

0:14:26 | and there are actually more extensions that you can do here: you can actually also derive this equation

0:14:31 | for the general case of n speakers, instead of just two

0:14:36 | so,

0:14:37 | that's

0:14:38 | pretty much it

0:14:40 | thanks very much

0:14:47 | and i think we have time for one or two questions

0:14:56 | so the question is about the data, whether it contains overlapping

0:15:02 | speech

0:15:03 | um, there was, but,

0:15:05 | there was overlap, but

0:15:07 | each channel was recorded separately

0:15:09 | so when there was overlapping speech, i basically just discarded

0:15:13 | one channel and then just,

0:15:15 | just used one channel, to ensure that there's only one speaker talking

0:15:19 | for each utterance while doing the clustering task

0:15:22 | i just used the manual transcriptions to,

0:15:26 | to pre-segment the utterances, so the utterances were basically pure

0:15:31 | and so it would be interesting to see what happens when there is overlap, just to see whether

0:15:35 | it shows up as,

0:15:37 | a new speaker or something

0:15:41 | yeah, that would be of interest to try

0:15:48 | (question from the audience, largely inaudible)

0:16:07 | yeah, i did actually try it with the BIC

0:16:09 | um, the performance actually wasn't too great,

0:16:12 | so

0:16:13 | i just didn't mention it

0:16:16 | yeah, for this task,

0:16:19 | it just seemed like the GLR gave better results

0:16:25 | than the BIC,

0:16:26 | you know

0:16:34 | (comment from the audience, partly inaudible)

0:16:42 | yeah, it actually did better

0:16:44 | yeah, i mean, i wish i had the NIST database,

0:16:48 | but we don't have it

0:16:53 | maybe it's because this data is simply,

0:16:56 | it's phone calls,

0:16:59 | phone calls that were recorded,

0:17:05 | um, maybe it's because of the range of

0:17:08 | frequencies considered

0:17:09 | yeah, i don't remember if it was eight K or sixteen K

0:17:15 | okay |

0:17:19 | let's thank the speaker again