Thank you. Today I am going to talk about speaker diarization using binary keys. This is joint work with the University of Avignon; I myself come from Telefonica Research.

Here is the outline. I will first quickly review what speaker diarization is, at least for those of you who do not remember it from the previous talks. Then I will talk about binary speaker modeling, and about how we turned these two things into the binary speaker diarization system that we have developed. Finally I will present experiments and conclude with future work.

First, speaker diarization. Given an audio recording, we split it by speaker: we determine who spoke when, without knowing in advance who the speakers are or how many speakers there are. Where is the state of the art? For broadcast news, over the last years people got down to around seven to ten percent diarization error rate; broadcast news has not been part of the NIST evaluations since 2004, and I bet nowadays the error would be even lower than that. For meetings we have gotten to twelve to fourteen percent, maybe even nine percent now. These are great results, and they mean diarization should be usable as a building block for other applications, for example speaker ID when there are multiple speakers in the recording.

But we still have a problem: it is too slow. Let me give some numbers. If you develop a standard diarization system and do not do anything about speed, it will most probably run above one time real time. Two systems that I am aware of have tried to do something about it. The first one, a couple of years ago, went down to 0.97 times real time on a single core by applying some tweaks to the GMM-based algorithms of a hierarchical bottom-up system.
So they were getting to just under real time. Later on they said, okay, let's go parallel so we can use sixteen cores, or a GPU, or whatever hardware we have, and they went down to around 0.07 times real time. Nowadays this is probably even faster, but the catch is that it ties you to a particular platform: it will not run on a mobile phone, and whether it works depends on the architecture you have. What we want is a system that is really, really fast no matter what architecture you run it on. We achieve this by adapting a recently proposed technique called binary speaker modeling. We also have a poster here on using this technique for speaker identification; in this talk we adapt it to diarization, and I will tell you how we did it. To follow that, you need the basics of binary speaker modeling, so let me explain it a little more now.

Here is a sketch of it. We have some input acoustic data, and at the end we obtain a vector of J zeros and ones. In a very general way, as outlined here, we extract some acoustic parameters, MFCCs or whatever you want, and use them to train a binary key background model, the KBM, which is basically a UBM but trained in a different way. Then, with this KBM, we obtain a binary key for each chunk of acoustic data, which could be all the data from one speaker or just a couple of seconds of audio. The KBM can be understood in different ways, but it is basically a set of Gaussians positioned in a particular way in the acoustic space; in the figure the multidimensional acoustic space is drawn with just one dimension so that we can see the example. We first position the Gaussians in the space, and then we take the input data, a chunk of speaker data, and see which of these Gaussians are most present in it, which ones best represent the data.
From there we extract a binary fingerprint: zeros at the positions of the Gaussians that do not represent the data well, and ones at the Gaussians that are dominant in the data.

How do we do it in practice, and how does it all fit together? On the left side we have the input signal, from which we compute the MFCC acoustic features; on the right side we have the KBM. The vertical vectors here have dimensionality N, the number of Gaussians in our background model. For each input feature vector we select the best Gaussians; we could take the top one percent, the top two percent, the top ten, whatever we want to use. Doing this for every feature vector X1 through Xn of our data given the model, we obtain the cumulative vector, the sum of the per-frame selections, which basically counts how many times each Gaussian has been selected as one of the best-representing Gaussians for the acoustic data. Then we just say, okay, the top M Gaussians by count are the ones present in the data, so they get a one, and all the rest get a zero.
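As a minimal sketch of the cumulative-vote binarization just described (the function name, the array layout, and keeping a fixed number of top positions are my own illustration, not the authors' code):

```python
import numpy as np

def binary_key(top_gaussians_per_frame, kbm_size, n_ones):
    """Count how often each KBM Gaussian appears among a frame's
    top-N selections, then keep the n_ones most-voted components
    as the '1' positions of the binary key."""
    counts = np.zeros(kbm_size, dtype=int)
    for best in top_gaussians_per_frame:  # indices of the top-N Gaussians for one frame
        counts[best] += 1
    key = np.zeros(kbm_size, dtype=bool)
    key[np.argsort(counts)[-n_ones:]] = True
    return key

# e.g. three frames, a 5-Gaussian KBM, keep the top 2 positions
print(binary_key([[0, 3], [3, 1], [3, 0]], 5, 2).astype(int))  # → [1 0 0 1 0]
```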
Once we have a binary key for each of two speakers, for two sets of acoustic data, it is very fast and very easy to compare them, to compute how close they are. Here is just an example; at the top you see the keys of the two models. The metric we use is one possibility among many: when working with binary data you just need some way to compare two binary signals. What we use in this paper is a ratio: the numerator adds one whenever both vectors have a one at the same position, and the denominator counts the positions where either of the two vectors has a one. This gives a score from zero to one, where zero means the two keys have nothing in common and one means they are exactly the same binary speaker model. As I said, we have a poster with more experiments on this, so you can go to the poster and ask about it.

Let's see now how we apply this to speaker diarization. This is the new system we developed, and even if it looks a bit different or strange, it is basically a standard agglomerative bottom-up system: there is a bottom-up clustering loop, and there is a kind of stopping criterion, or rather a cluster selection step, as we will see in a moment. First, at the bottom, feature extraction: we extract MFCCs or whatever. Next we train the KBM; in this case we train it from the test data itself, we do not use any external data. Then comes the initialization: as anyone working on diarization knows, with a bottom-up system we always need to initialize with many more clusters than there are actual speakers, so we need some way to create those clusters; this preprocessing takes just a little bit of the computational time of the system. After that comes the agglomerative clustering, which keeps joining together the clusters that are closest to each other, with everything happening in the binary space. Finally, once we have reached one cluster, and this is one difference from a standard agglomerative system, we go all the way from N clusters down to one, we use an algorithm to select how many clusters to keep. For the acoustic features, as I said, we use standard MFCCs.
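The comparison metric just described, ones in common over ones in either key, can be sketched like this (a hedged illustration; the paper's exact normalization may differ):

```python
import numpy as np

def key_similarity(key_a, key_b):
    """Score in [0, 1]: 0 when the keys share no active positions,
    1 when they are identical."""
    common = np.logical_and(key_a, key_b).sum()
    either = np.logical_or(key_a, key_b).sum()
    return common / either if either else 1.0

a = np.array([1, 0, 1, 0], dtype=bool)
b = np.array([1, 1, 0, 0], dtype=bool)
print(key_similarity(a, b))  # 1 shared position out of 3 active → ~0.333
```

Because the keys are just bit vectors, this comparison is a couple of bitwise operations and two popcounts, which is where the speed of the whole approach comes from.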
For the KBM, as I said, it is a model like a UBM but trained in a special way. If you use a UBM trained with the standard EM-ML techniques, the Gaussians end up positioned around the average points, modeling the data on average, and the fine details that carry the discriminative information between the speakers in your audio are not well represented. So we try to do something different that can model those details. Regarding the size, it can be anything above five hundred Gaussians; we can go up to ten thousand and the performance neither improves nor degrades.

How do we train it? In this paper we do it in the following way. We take the input audio and first train a single Gaussian on every short segment, I believe it is two seconds of speech, with some overlap, so that we end up with a big pool of Gaussians, on the order of a couple of thousand. Each of these Gaussians models a very small portion of the audio, so wherever there is a speaker it represents that speaker very discriminatively. Then we use a greedy metric to iteratively select Gaussians that are maximally separated from the ones already chosen, so that the final set optimally covers the whole acoustic space. And that's it; this is actually much faster than training with iterative splitting and EM-ML.

Next is the extraction of the binary vectors from the acoustic data, which we do in two steps. The first step, selecting the K best Gaussians for each acoustic feature vector, we have to do one time only. The second step, computing a fingerprint for a given subset of features, is done only whenever we need it during clustering, and that step is very fast.
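A rough sketch of the two-stage KBM construction just described. The window and hop lengths (in frames), the use of diagonal Gaussians, and Euclidean distance between means as the greedy selection criterion are all assumptions standing in for whatever the authors actually use:

```python
import numpy as np

def train_kbm(features, win=200, hop=100, target_size=500):
    """1) Fit one diagonal Gaussian per short, overlapping window.
    2) Greedily pick Gaussians far from those already selected,
       so the final pool spreads over the acoustic space."""
    pool = []
    for start in range(0, len(features) - win + 1, hop):
        chunk = features[start:start + win]
        pool.append((chunk.mean(axis=0), chunk.var(axis=0) + 1e-6))
    means = np.stack([m for m, _ in pool])
    chosen = [0]  # seed with the first window's Gaussian
    while len(chosen) < min(target_size, len(pool)):
        # distance from every pool mean to its nearest already-selected mean
        d = np.linalg.norm(means[:, None, :] - means[chosen][None, :, :], axis=-1)
        chosen.append(int(d.min(axis=1).argmax()))  # farthest-from-selected next
    return [pool[i] for i in chosen]
```

In the real system the pool would be built from roughly two-second speech windows and the selection run until a few hundred to a few thousand Gaussians are kept.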
So we have the MFCC vectors, and for each of them we get its N best Gaussians; we do this once for the whole recording and store the result in memory, and that's it. This part is a little expensive, because we are evaluating big Gaussian mixtures, but it is done one time only. Then, every time we need a speaker model, we just gather the counts for the relevant frames and from those counts get the binary vector, and this is blazingly fast.

I also have to talk about the initialization; here we did something simple. We just reuse the KBM: we take the Gaussians that were chosen first during the KBM selection, use them to produce an initial segmentation, and from those segments we build the initial clusters, which we then model in the binary domain.

The clustering itself is, for us, exactly the same as, for example, the ICSI agglomerative clustering system, except that now everything happens in the binary domain. For every cluster we compute its fingerprint from the K best Gaussians, the comparison between all the cluster models is a purely binary comparison, and we choose the two clusters that are closest and merge them. For the reassignment, we take three seconds of data at one-second steps, compute a fingerprint for each segment, and assign it to the closest speaker model.

The last part of the system: once we have gone from N clusters down to one, we have to choose what our optimum number of clusters is. For that we use a metric that was presented by other people at Interspeech 2008; in the interest of time I refer you to the paper, but essentially we estimate a relation between the intra-cluster and inter-cluster distances, which allows us to select the optimal number of clusters.
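The bottom-up loop just described, merging the closest pair until one cluster remains while recording the merge order, could look like this sketch (self-contained, with minimal re-implementations of the key and similarity computations; all names and data layouts are illustrative, not the authors' code):

```python
import numpy as np

def key_of(frames_top, kbm_size, n_ones):
    # cumulative-vote binary key for one cluster's frames
    counts = np.zeros(kbm_size, dtype=int)
    for best in frames_top:
        counts[best] += 1
    key = np.zeros(kbm_size, dtype=bool)
    key[np.argsort(counts)[-n_ones:]] = True
    return key

def similarity(a, b):
    either = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / either if either else 1.0

def binary_ahc(clusters, kbm_size, n_ones):
    """clusters: dict id -> list of per-frame top-Gaussian index lists.
    Repeatedly merge the two closest clusters in the binary domain;
    the recorded history lets a separate criterion (e.g. the
    intra/inter-distance relation mentioned in the talk) pick the
    final number of clusters afterwards."""
    clusters = {c: list(f) for c, f in clusters.items()}
    history = []
    while len(clusters) > 1:
        keys = {c: key_of(f, kbm_size, n_ones) for c, f in clusters.items()}
        ids = sorted(keys)
        a, b = max(((i, j) for i in ids for j in ids if i < j),
                   key=lambda p: similarity(keys[p[0]], keys[p[1]]))
        clusters[a] += clusters.pop(b)  # merge the closest pair
        history.append((a, b))
    return history
```

Note that the expensive per-frame top-Gaussian lists are computed once up front; inside the loop only counting, sorting, and bitwise comparisons remain, which is why the whole clustering stays so cheap.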
I have to say this is the part of the system I am less happy about, and we will have to improve it.

Now the evaluation. Of course we use the diarization error rate, but we also report the real-time factor. For the data we decided to use the NIST Rich Transcription evaluations, about thirty-six shows, and I have to say the whole set runs in just over an hour on a laptop PC, so it is pretty fast. Here are the results. The first line shows the results of a classical agglomerative GMM system, just an implementation of the basic one: it gets about twenty-three percent average diarization error rate with a running time of about 1.19 times real time. There is no optimization there; it is just a standard implementation. The last two lines show two configurations of the binary system, depending on the number of Gaussians we take for the KBM. We can see that the diarization error rate is slightly higher than the baseline's, but the real-time factor is ten times lower, which is pretty good.

To show the importance of how the KBM is trained: if instead we use a standard EM-trained GMM, that is the second line of results, the system just breaks; the model never reaches the speaker-discriminative power it needs, and it simply does not work. I also said that the selection of the number of clusters still does not do a good job: if we pick the optimal number of clusters after running the system, we actually get down to an error rate that is better than our baseline's.

The next plot shows how the diarization error rate depends on the number of Gaussians in the KBM; the black line is the average. Below five hundred Gaussians the error moves around, but from five hundred up the results are more or less flat, so it does not matter much whether we use five hundred or six hundred.
Finally, this is a per-show comparison over all the meetings between our proposed system and the baseline. In most cases they perform about the same; in some shows the binary system is up to two percent worse, but there are also a couple of shows where it is better. I would say that the standard systems are kind of all-star systems, with refinements that people have put on top over the years to get these little gains in performance, while with a system that we have only just started we can already get close. As for future work, we want to improve the binary fingerprinting, we hope to find a better stopping criterion, and we want to keep the system platform-independent, maybe getting it to work on cell phones. Thank you very much.

[Question about where speech/non-speech detection happens]

No, no, sorry: the speech/non-speech detection is right at the beginning, at the very beginning. It is just a standard speech activity detection step; it sits with the acoustic feature extraction at the start of the system, and we take the speech/non-speech labels from there.

[Question about multiple microphones]

No, I do not merge channels. The meeting data has multiple distant microphones, but I just use a single channel. There are many ideas there, but we have not tried them yet.

OK, since we have run out of time, let's thank the speaker again.