Yesterday it was mentioned that it might be necessary to do something to wake up the audience, and the suggestion was to do a cartwheel, which didn't happen; I'm also not going to attempt that today.

I was assisted in this work by my colleague, and we're from South Africa. This is the second paper out of three at this conference on essentially the same topic, so I'll start by defining that topic, then say what is different in our paper from the other two, and then we'll get to the rest.

So, defining what I'm calling the partitioning problem: we're all very familiar with the canonical detection problem, where you're given two speech segments and need to decide whether we have one or two speakers. If we want to generalise that, which is essentially my main interest here, how do we generalise the canonical problem? We allow more than two input segments, and the natural questions are then: how many speakers, and how do we divide the segments between speakers?

Here's an example, the simplest generalisation, where we go from two segments to three. The task is then to partition the set of input segments into subsets; this is why we're calling it a partition, because that's just the set-theoretic way of stating the problem. Immediately we get five possibilities. It should be quite obvious that there could be one, two, or three speakers, and in the two-speaker case there are three ways to partition.

Partitioning is the most general problem of this kind, if you assume there's a single speaker in each segment. I say it is the most general because if you have the answer about how the segments are partitioned, you can answer any other kind of detection, verification, identification (open- or closed-set), or clustering problem that you can define over this set of segments. The connection with diarization has already been mentioned; there you also need a segmentation, whereas we're presupposing the segmentation is given.

So the problem is general, which is good, but you have to be careful, because the complexity explodes. Here's a little table that shows that the number of possible partitions can become very large very quickly. Again, for the canonical problem there are just two solutions; for our example, which we'll discuss some more, that's five; and we're not going to discuss the last row in full today.
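As an aside that is not from the paper, here's a quick way to see that growth for yourself. The counts in such a table are the Bell numbers, and a few lines of Python (all names below are mine, purely for illustration) both enumerate the partitions for small sets and count them for larger ones:

```python
def partitions(items):
    """Enumerate all set partitions of `items`; each block is one hypothesised speaker."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for sub in partitions(rest):
        # put `first` into each existing block in turn ...
        for i in range(len(sub)):
            yield sub[:i] + [[first] + sub[i]] + sub[i + 1:]
        # ... or give it a new block (a new speaker) of its own
        yield sub + [[first]]

# the five ways to partition three segments, as in the example above
for p in partitions(['A', 'B', 'C']):
    print(p)

def bell(n):
    """Count the partitions of an n-set via the Bell-triangle recurrence."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]            # each new row starts with the last entry of the old one
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]

print([bell(n) for n in (2, 3, 10, 20)])  # 2, 5, 115975, 51724158235372
```

Twenty segments already give about 5 × 10^13 possible partitions, which is the blow-up referred to above.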
So, to get to what's new: the material in the paper is too much to present in a single talk; we were struggling to fit everything in, up to the last line of the allowed pages. So I'll just highlight what is new, and why you might want to go and read the full paper.

As mentioned, the problem is identical to what has been presented before this talk and will probably be presented after it, so the problem itself is not new. I'm stressing the generality I've just mentioned, but then I also have to mention that we're focusing on problems with a small number of input segments per trial, whereas, for example, one of the other papers treats the case of a large number of inputs. Further, in the background we emphasise solutions that deliver probabilistic output, in other words calibrated likelihoods, and we also propose an associated evaluation criterion, which I'm not going to discuss further here. And then, something which I am going to discuss further: our paper gives a closed-form solution to this very general problem, by using a simple additive Gaussian generative model in i-vector space. That's exactly what Patrick explained, except we're not doing heavy-tailed, just plain Gaussian. This model gives us likelihoods as output, and it is tractable, even fast, when we don't do too many segments at once. Yesterday in his keynote, Patrick mentioned that with this kind of modelling you can calculate the likelihood for any type of speaker recognition problem; that is exactly what we show in our paper: the formulas that you can use.

Again, very briefly: the model treats speaker and channel effects as independent, multivariate-Gaussian and additive in i-vector space. I-vectors I don't need to explain; every speech segment gets mapped to an i-vector. The reason it's called an i-vector is simply because it's of intermediate size, "i" for intermediate: larger than an acoustic feature vector, smaller than a supervector. I might also mention that, "total variability" notwithstanding, these i-vectors cannot be reconstructed to give you the original speech, so in my opinion they don't reflect the total variability in the signal.

So, the i-vector solution. The hyperparameters of the generative model, in other words the covariance matrices that explain all the variability, have to be trained with an EM algorithm similar to JFA; there's some detail in the paper. I'm going to concentrate on the scoring, because it's nice and simple with this very simple model. We're given a set of segments, each represented by an i-vector: A, B, C. The symbol I've got here on the slide represents a subset of that set. Given my generative model, we can calculate the likelihood that all of the segments in the subset belong to the same speaker; the details of how to calculate the likelihood of a subset are in the paper. I'm showing now how to go from the subset likelihood to the likelihood of a full partitioning of the full set. Again, for the three inputs, here is one of the possibilities: the model is simple, so the subset likelihoods just multiply. That's very nice, very comfortable to use. This is all you need to solve all of those problems, to get a closed-form solution; of course it's not always going to be a good solution, but you get a solution. So for the three-input example, the three inputs represent a trial, and the output is the five different likelihoods for the five possible partitions. This solution is neat and tractable, but, as already mentioned, it blows up if you try to use too many input segments.
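To make that scoring recipe concrete, here is a minimal sketch, assuming a scalar "two-covariance" flavour of the model: per dimension, an i-vector is speaker plus channel, x = y + z with y ~ N(0, b) and z ~ N(0, w), with diagonal covariances so each dimension can be treated independently. The paper's formulas are for the full multivariate model; the variances b and w and all function names below are my own toy choices, not the paper's.

```python
import numpy as np

def log_subset_likelihood(X, b, w):
    """log p(X | all rows of X are from ONE speaker) under the scalar
    two-covariance model x = y + z, y ~ N(0, b), z ~ N(0, w), applied
    independently per dimension. X: (n_segments, dim) array of i-vectors."""
    n, d = X.shape
    s = X.sum(axis=0)            # per-dimension sum over the subset
    ss = (X ** 2).sum(axis=0)    # per-dimension sum of squares
    # per dimension the marginal covariance is w*I + b*ones(n, n), so
    # |Sigma| = w^(n-1) * (w + n*b) and the quadratic form has a closed form
    logdet = (n - 1) * np.log(w) + np.log(w + n * b)
    quad = ss / w - b * s ** 2 / (w * (w + n * b))
    return -0.5 * np.sum(n * np.log(2 * np.pi) + logdet + quad)

def log_partition_likelihood(X, partition, b, w):
    """Different speakers are independent under the model, so the subset
    likelihoods of the blocks of a partition multiply (add in the log)."""
    return sum(log_subset_likelihood(X[list(block)], b, w)
               for block in partition)

# three synthetic i-vectors: segments 0 and 2 from one speaker, 1 from another
rng = np.random.default_rng(0)
b, w, dim = 2.0, 1.0, 50
y1, y2 = rng.normal(0, np.sqrt(b), dim), rng.normal(0, np.sqrt(b), dim)
X = np.stack([y1 + rng.normal(0, np.sqrt(w), dim),
              y2 + rng.normal(0, np.sqrt(w), dim),
              y1 + rng.normal(0, np.sqrt(w), dim)])

# the five likelihoods for the five possible partitions of a 3-input trial
for p in ([(0, 1, 2)], [(0, 1), (2,)], [(0, 2), (1,)],
          [(1, 2), (0,)], [(0,), (1,), (2,)]):
    print(p, log_partition_likelihood(X, p, b, w))
```

With these toy parameters, the partition {0, 2},{1} should usually come out on top, matching how the data was generated.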
Moving to experimental results: results on real NIST data are available in the full paper, but in the rest of the talk we're going to use an experiment with synthetic data. The reason I didn't put that in the paper is that reviewers, especially the anonymous ones, tend not to like synthetic data; but everybody here is wearing name tags, so the anonymous reviewers are not here, and I'm going to proceed with my synthetic data experiments. This takes the form of a little tutorial, I think, in probability theory. The generality of the partitioning problem and the simplicity of the i-vector model are very handy tools to examine a few basic questions one might have about speaker recognition, and I'd like to show you how this works.

The example we're going to discuss is NIST's unsupervised adaptation task; as I promised at some point yesterday, this would be discussed. We're going to analyse it by making it a special case of the partitioning problem. The basic issue is that you need more prior information than that which was provided in the original definition of the task, and the next several slides are going to be about that.

The input: we're looking at the simplest case. You're given a train segment, which is known to be of the target speaker. You're also given an adaptation segment, which may or may not be from the target speaker, and you're allowed to use it. And finally there's a test segment, and your job is to decide: was this the target speaker or not? So, three inputs; as mentioned, there are five possible ways to partition these three inputs. We can group the first two as belonging to the target hypothesis, and the last three as instances of non-target partitions, non-target because the test has a different speaker from the train.

So we need a prior. NIST provided the target prior. We don't need a prior for the train segment, since we already know it's of the target speaker, but what about the adaptation segment? That prior was not stated in the original problem. We can assemble these two priors, just in the obvious way, to give a full probability distribution over the five possibilities. I've somewhat arbitrarily set the last one to zero, to simplify matters: you're assuming that if the test segment is not from the target, the adaptation segment is not going to be from that same speaker either.

So the whole thing assembles like this: the generative model supplies the five likelihoods for the five partitioning possibilities, and then you use, as Patrick said, the basic rules of probability theory, the sum rule and the product rule, but you need those extra priors. This prior, which has not been mentioned before, is needed to properly express the likelihood ratio between the target and the non-target hypotheses.
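Here's how that combination might look in code, reusing log_partition_likelihood from the earlier sketch. This is my own hedged reading of the setup, not the paper's code: rows are ordered train, adaptation, test; the train | adapt+test partition gets zero prior, as just described; and p_adapt stands for the assumed adaptation prior. The NIST target prior enters only at the decision stage, since it cancels in the likelihood ratio.

```python
import numpy as np
from scipy.special import logsumexp

def adaptation_log_lr(X, b, w, p_adapt):
    """Target-vs-non-target log-likelihood-ratio for a (train, adapt, test)
    trial, marginalising over whether the adaptation segment is the target.
    Assumes 0 < p_adapt < 1."""
    T, A, E = 0, 1, 2   # row indices: train, adaptation, test
    ll_tae   = log_partition_likelihood(X, [(T, A, E)], b, w)         # target, adapt = target
    ll_te_a  = log_partition_likelihood(X, [(T, E), (A,)], b, w)      # target, adapt = impostor
    ll_ta_e  = log_partition_likelihood(X, [(T, A), (E,)], b, w)      # non-target, adapt = target
    ll_t_a_e = log_partition_likelihood(X, [(T,), (A,), (E,)], b, w)  # non-target, all different
    la, lna = np.log(p_adapt), np.log1p(-p_adapt)
    # sum rule within each hypothesis, weighted by the adaptation prior
    log_target = logsumexp([la + ll_tae, lna + ll_te_a])
    log_nontar = logsumexp([la + ll_ta_e, lna + ll_t_a_e])
    return log_target - log_nontar
```

Sweeping p_adapt over a grid against the true proportion of target adaptation segments in a batch of synthetic trials, and computing the equal error rate at each grid point, reproduces the kind of surface described next.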
The experiment that we did was to demonstrate what role this prior plays: what might happen if you assume a bad prior, and how closely it should match the actual proportion in the data you're working with. We used synthetic i-vectors because we're not interested in examining the data, or indeed the PLDA model itself; when you make synthetic data with the model, the data is a perfect match to the model, and that focuses the experiment on the role of the prior. Back to the system diagram: we adjust two things independently. One is the proportion of target adaptation segments in the data; the other is the assumed prior for what that proportion might be. And we evaluate the whole thing via equal error rate.

The results look something like this. On this horizontal axis we have the assumed prior, increasing in this direction; the other horizontal axis is the actual proportion; and the vertical axis is of course the equal error rate. This corner here is the best situation to be in: you know there are many target adaptation segments, there in fact are many, and adaptation works. In the back corner over there you're saying, okay, I'm not expecting any targets in the adaptation data, so I'm not adapting. The bad place to be is there, where you're assuming you'll find many target adaptation segments, but there aren't any. The important thing to realise is that it's not so bad to assume there aren't any target adaptation segments, because then you're just back to what you would have done without adaptation; but it is bad to have the mismatch the other way. So the prior is important. You might choose to ignore the prior, but it's not going to go away: it's there even if you ignore it.

That brings me to the conclusion of the talk. Back in the real world, we've already applied this partitioning software to help us find mislabelled segments in the data we needed for development for this evaluation. We've only started on this work; at the workshop that's starting next week here, we'll be exploring this problem some more. Okay, that's all, thank you.

Q: Usually, in a real application, you will have impostors trying to get into the system to cheat it, and only a few target speakers coming in, so the proportions will change over time.

A: I agree, and this framework allows for that, because the prior that you plug in to get the final score can be made trial-dependent. So if you know something about this, you can modify the prior as time progresses.

Q: Do you see this as a one-step speaker diarization system?

A: If I understand you correctly, you're asking whether this is a one-step diarization system. No: I'm assuming here that the segmentation is given, so I assume there are no segments which have two speakers in them.

Q: But if we apply a system for segmentation first and get all the boundaries, could this then be used to do the diarization of the segments?

A: I wouldn't recommend it, because, as I pointed out, if you have a thousand segments, these formulas are not designed for the large-scale case. But there are other, approximate methods: you could still start from the same Gaussian PLDA model, but then you would need something like variational Bayes to handle a large number of segments. We're going to play with that at the workshop as well.

Okay, thank you.