So, this is the second talk about speaker diarization. What we are trying to focus on here is a multistream approach. The baseline technique we are using is the same as in the previous talk, the Information Bottleneck system, and, as was already said, we are trying to look at the combination of the outputs, or actually the combination of different streams, at different levels. These are only acoustic streams, so there is no prior information from any other source. This work was done together with my colleagues.

The motivation is the same as before, or very close: again we assume that the recordings we are working with are recorded with multiple distant microphones. As features we are using two kinds of acoustic features: the MFCC features, which are kind of the standard, and then the time delay of arrival, TDOA, features. These are pretty complementary to the MFCCs, and people nowadays use them quite a lot for diarization; this winning acoustic feature combination, used with the Information Bottleneck technique, achieves state-of-the-art results on meeting diarization.

So, back to our motivation. Usually the feature streams are combined at the model level: there are separate models, for example GMM models, for the different feature streams, and the likelihoods they produce are combined with some weighting. There are also some other approaches, like voting schemes between the diarization systems, or the initialization of one system being done on the output of the other system, and similar cascaded approaches. Our question, then, is the following.
Can these two kinds of different acoustic features be integrated using independent diarization systems, rather than independent models? In other words, is there actually some advantage in combining systems rather than combining models? What we mean by a system combination versus a model combination will, I hope, become clear in a slide or two.

So, maybe one last slide about the outline of the talk. First let me say a few words about this Information Bottleneck principle which we use, here applied to single-stream diarization, so with no combination of features yet; then a few words about the model-based combination, about the system-based combination, and about a hybrid combination; and then the experiments and results. Again, we are getting state-of-the-art results with such a system, using this agglomerative Information Bottleneck technique, and without too much computational complexity.

So, how does this Information Bottleneck principle work? It is a rather intuitive approach which has been borrowed from document clustering. Suppose at the beginning we have some documents that we want to cluster into, in our terminology, C clusters. What is added as side information is some variable Y, which we call the relevance variable, and which tells us something about the clustering. In document clustering this relevance variable Y can be the words of the vocabulary, which of course tell us something about the documents and carry information about the topics. So what is actually needed is a conditional distribution P(Y|X); as long as Y given X is available, we can proceed. Going back to our problem of speaker diarization, our X is a set of elements of the speech.
So again, our X are the speech segments obtained from a uniform segmentation, and these need to be clustered into C clusters. The Information Bottleneck principle states that the clustering should preserve as much information as possible between the clusters C and the relevance variable Y, while minimizing a distortion term. This distortion we can see as a kind of compression, or, the way we use it, as a regularization: without this distortion term, which in our formulation is the mutual information I(X, C) between X and C, the clustering would probably collapse into one global cluster, which is of course not the desired case.

So that is the intuitive view, but it can actually be proved that if you maximize this objective function, the mutual information I(C, Y) minus a term (1/β) I(X, C), the problem moves to a space where the distances between the distributions P(Y|X) are measured with a simple divergence. The point is that we do not need to look for some special distance to decide which segments should be merged together; if we do the derivation, we find that it is the Jensen-Shannon divergence that should be used for the clustering. In the end the approach is pretty simple, an agglomerative one: in each iteration we merge clusters based on this divergence, so we take the two clusters with the smallest divergence and merge them, and we repeat this until some stopping criterion is reached. The stopping criterion is again pretty simple: it is based on the normalized mutual information.
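To make the merging step concrete, here is a minimal Python sketch of the agglomerative procedure just described. This is my own illustration, not the authors' code: clusters are greedily merged by the smallest prior-weighted Jensen-Shannon divergence between their relevance-variable distributions P(Y|c), and for simplicity the stopping rule is a target cluster count rather than the normalized mutual information mentioned in the talk.

```python
import numpy as np

def js_divergence(p, q, w_p=0.5, w_q=0.5):
    """Weighted Jensen-Shannon divergence between two distributions."""
    m = w_p * p + w_q * q
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return w_p * kl(p, m) + w_q * kl(q, m)

def agglomerative_ib(p_y_given_x, n_clusters):
    """Greedy agglomerative clustering: repeatedly merge the pair of
    clusters whose merge loses the least information about Y, i.e. the
    pair with the smallest prior-weighted Jensen-Shannon divergence."""
    n = len(p_y_given_x)
    clusters = [[i] for i in range(n)]            # one cluster per segment
    dists = [np.asarray(p, dtype=float) for p in p_y_given_x]  # P(Y|c)
    priors = [1.0 / n] * n                        # uniform cluster priors P(c)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                w = priors[i] + priors[j]
                cost = w * js_divergence(dists[i], dists[j],
                                         priors[i] / w, priors[j] / w)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        w = priors[i] + priors[j]
        merged = (priors[i] * dists[i] + priors[j] * dists[j]) / w
        clusters[i] += clusters[j]                # merge the two clusters
        dists[i], priors[i] = merged, w
        del clusters[j], dists[j], priors[j]
    return clusters
```

For instance, four segments whose P(Y|X) rows fall into two obvious groups end up grouped into two clusters.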
So again, the stopping criterion is the normalized mutual information between C and Y. To finalize the description of the approach: it is agglomerative, we have a stopping criterion, and we have a way to measure the similarity between clusters, so it is pretty simple to code. Just a few words about the quantities involved here. First, the probability of C given X, where C is a cluster and X is an input segment, is going to be a hard partition, meaning each segment belongs to only one cluster; there is no soft weighting across several clusters. Second, the probability of Y given C, the relevance-variable distribution, is what is actually used to do the merging.

Everything should be clearer from this picture. Suppose we have input speech which is uniformly segmented, with for example MFCC features extracted. In this single-stream approach, the elements of the relevance variable, and I still have not said what they are, are in our case simply the components of a universal background model, a GMM trained on the entire speech; this is what defines the relevance variable used to do the clustering. The matrix which you see in the middle contains the vectors P(Y|X), the probabilities of the relevance variables given the input segments. The clustering, which is again the agglomerative technique, gives some initial segmentation, and finally we do a refinement by training a GMM and doing Viterbi decoding.

Now let us go to the feature combination. In the case of a feature combination based on the background models, suppose we have the two feature streams, again MFCC and TDOA, and two background models, each trained on one of the feature streams. What we can then simply do is linearly weight the two streams' P(Y|X) vectors, the probabilities, with some weights, and this gives us a new matrix of these segment probabilities.
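The model-level combination just described is essentially a per-segment linear interpolation of the two streams' relevance posteriors. A small sketch, with an illustrative weight (in the talk the actual weights are tuned on development data, and the function name is mine):

```python
import numpy as np

def combine_streams(p_y_x_mfcc, p_y_x_tdoa, w_mfcc=0.7):
    """Model-level combination: linearly interpolate the per-segment
    relevance-variable posteriors P(Y|X) of the two feature streams
    before any clustering is run. The 0.7/0.3 split is illustrative."""
    p1 = np.asarray(p_y_x_mfcc, dtype=float)
    p2 = np.asarray(p_y_x_tdoa, dtype=float)
    combined = w_mfcc * p1 + (1.0 - w_mfcc) * p2
    # renormalize each row so it stays a proper distribution
    return combined / combined.sum(axis=1, keepdims=True)
```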
These weights we of course have to obtain somehow: we train, or estimate, them on development data, so later we are diarizing different data. Once we have these combined P(Y|X), the rest of the diarization system is the same; the only thing we change is at the beginning, where we combine the relevance-variable posteriors, and then we just run the same iterative clustering. This is actually not new; it has already been published at last year's Interspeech. The diagram shows again how it is done: there is a matrix holding the P(Y|X) probability vectors for each stream, they are simply weighted and summed, and then there is the clustering operation and the refinement.

Now, what is actually new, and what we are trying in this paper, is multiple-system combination. So, instead of doing the combination before clustering, what would happen if we do the combination after clustering? Again, there are two background models trained on the different features, and so there are two complete diarization systems in the end. Each system iteratively produces some clusters; the stopping criteria can differ, meaning the two systems can end up with different numbers of clusters per feature stream. At the end, each system gives us the distributions P(Y|C), and to go back from these clusters to the initial segmentation, that is, to P(Y|X), all we have to do is a simple Bayes operation. So again, the picture shows how this is done: the two diarization systems each do a complete clustering, in the end we get some clusters, and to get back to the initial segment distributions P(Y|X) we just apply this simple operation.
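The system-level combination can be sketched the same way. Because the partition is hard, the Bayes step that maps the cluster distributions P(Y|C) back to segment posteriors P(Y|X) reduces to each segment inheriting its cluster's distribution. The function and variable names below are mine, and the weight is again illustrative:

```python
import numpy as np

def clusters_to_segment_posteriors(p_y_given_c, assignment):
    """Map cluster-level distributions P(Y|c) back to segment-level
    posteriors P(Y|x). With a hard partition each segment simply
    inherits the distribution of its cluster: P(Y|x) = P(Y | c(x))."""
    p_y_given_c = np.asarray(p_y_given_c, dtype=float)
    return p_y_given_c[np.asarray(assignment)]

def system_level_combination(p_y_c_mfcc, assign_mfcc,
                             p_y_c_tdoa, assign_tdoa, w_mfcc=0.8):
    """System-level combination: run a full diarization per stream,
    project each system's cluster distributions back onto the initial
    uniform segments, then interpolate the two resulting matrices
    (the combined matrix is then re-clustered as before)."""
    p1 = clusters_to_segment_posteriors(p_y_c_mfcc, assign_mfcc)
    p2 = clusters_to_segment_posteriors(p_y_c_tdoa, assign_tdoa)
    combined = w_mfcc * p1 + (1.0 - w_mfcc) * p2
    return combined / combined.sum(axis=1, keepdims=True)
```

Note that the two systems may hand over different numbers of clusters; the back-projection puts both on the common grid of initial segments, which is what makes the interpolation well defined.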
So we integrate over all the clusters C. Why should this actually work? It is again pretty intuitive: these P(Y|X) after the combination are estimated on a large amount of data, so they are not estimated on short segments, as in the case of the combination before clustering; each P(Y|C) is estimated on a lot of data, because you have just a few clusters in the end.

The third approach is a hybrid system, which is just the combination of the two, both before and after clustering. In one case we take a single stream and do the system combination on it alone, that is, cluster it and map back, and then we combine that output with the other stream before the clustering. Maybe it is easier to see here, with the two streams: in one case we do the system combination on one stream, so we do the clustering and from the P(Y|C) probabilities we go back to P(Y|X), to get the initial probabilities for the segmentation; and in the other case we take the TDOA stream and try to do the combination before the clustering. Then the two streams are simply combined as before: we have some P(Y|X) coming directly from the features and some P(Y|X) coming back from a clustered system, and we just weight and sum them. Of course there are two possible cases for which kind of processing is done on which stream, and this is going to be seen in the results table.

So let me say a few words about the experiments. We are using the same Rich Transcription data, seventeen meetings, so no other meeting data, only the Rich Transcription evaluations, with the MFCC features and the TDOA features, and the speech is a single enhanced signal coming from the beamformed distant microphones. Again, there are the stream weights to be estimated.
The weights are estimated on a separate development set, and as before we are only measuring the diarization error rate with respect to speaker error, so not the speech/nonspeech errors.

Here are the results. If you remember from the previous talk, the baseline was around fifteen, fifteen point five percent, using the single-stream technique, so just MFCC features. If you do the combination of MFCC and TDOA features, both in the case of the Information Bottleneck technique and in the case of the HMM/GMM system, you may see that we get down to around twelve percent. The second part shows the weights used for weighting the different features. These are different quantities: in our case we are combining posterior probabilities of the relevance variables, while in the HMM/GMM case they are log-likelihoods, and that is why the weights are also different. And as you can see, our combination performs comparably to the HMM/GMM system.

Then there are the results for the combination done after clustering, the system-level combination as we call it. The baseline of twelve point six percent comes from the previous table; if you do the system combination, meaning you combine the outputs after clustering, you may see we are getting a pretty high improvement, almost forty percent relative. Then there are of course the two possible hybrid combinations of system-level and model-level weighting, and it is pretty clear from the results that it is better to do the system combination, the system-level weighting, with the TDOA features, because they are usually more noisy and they probably need more data to be well estimated, or at least their relevance variables benefit from having more data; with the MFCC features, it looks like the model-level side works much better.
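For readers checking the quoted numbers, the relative DER improvements are computed the usual way, (baseline − new) / baseline; for example, going from 15.5% to 12% DER is roughly a 23% relative improvement (the figures here are just the ones quoted in the talk, used for illustration):

```python
def relative_improvement(baseline_der, new_der):
    """Relative diarization-error-rate improvement, in percent."""
    return 100.0 * (baseline_der - new_der) / baseline_der

# e.g. single-stream baseline 15.5% DER down to 12% after combination
print(round(relative_improvement(15.5, 12.0), 1))  # prints 22.6
```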
This is also the reason why, if you look at the weights in the table, the estimated weights move closer toward the system-combination side: instead of zero point seven versus zero point three, we go to zero point eight. They were estimated on different data, but they generalize to this case.

Just a bit more to explain why we are possibly getting such an improvement. If you look at the single-stream results for each meeting, seventeen meetings in this case, for both the model combination and the system combination, and you look at the bottom rows, which are just the simple MFCC and TDOA Information Bottleneck techniques with no combination of features, you may see that most of the improvement comes in the cases where there is a big gap between those two single-stream techniques. Where the gap is small, you of course do not get much improvement, but where there is a big gap between the MFCC and TDOA single streams, the system combination works especially well for such a meeting.

So, to conclude the paper: we have presented a new technique, a new way of combining the acoustic streams. Rather than, as we did before, combining before the clustering by weighting the acoustic features, the technique presented here does the combination after the clustering. The reason is simple: the relevance variables which are then used to merge the different clusters, or different segments, are estimated on more data, not just on short segments. And as was seen in the results, we are getting a pretty good improvement with such a technique, around forty percent relative, over all seventeen meetings. I think I am done. Thank you.