Thank you for the introduction. This talk is going to be about implementing, or rather investigating, role n-gram models in speaker diarization. I'm the third author of this paper. The main author, who is mainly known for his speaker diarization work, is not here, so it's me who is presenting. All of us come from the Idiap Research Institute.

Let me say a few words about diarization. Basically, speaker diarization tries to segment the input speech according to the speaker. Recent applications focus a lot on meeting data, where the speech is recorded by multiple distant microphones, and also look at spontaneous conversations and so on. Recently the main work has concentrated on using different kinds of acoustic features, such as time delay of arrival, and on their combination, which we also do in our work. But most speaker diarization systems ignore the fact that the data are instances of human conversation, so statistical information that can be estimated from, let's say, conversation analysis is not usually used. Here is just a simple plot that explains the output of speaker diarization: the input speech and the segments for each speaker.

Going back to the motivation of the work: conversational speech usually appears to be unconstrained and spontaneous, but it is also governed by principles, and there are some laws from which we can predict some behaviour. One example is roles. Each
speaker in a conversation, or in a meeting, usually has some role, and based on that people take their turns in the conversation. Just for clarity, roles can usually be put into two or three classes: formal roles, functional roles or social roles.

To continue the motivation: usually, to perform analysis such as conversation analysis or role recognition, people use statistics from the conversation, like the turn-taking patterns and the turn duration, that is, the length of the speaker segments. There are many corpora you probably know, such as the AMI meetings. The AMI meetings, for example, come mainly from the Idiap Research Institute, and we are using them here. These statistics are often extracted from the output of speaker diarization, of course. But what we present in this paper is that we try to take the information from the analysis of conversations as prior information and use it back in speaker diarization.

Here is a diagram of our technique. You can see the acoustic features going into speaker diarization, which does the clustering into speaker turns or segments. From the output of speaker diarization you can then estimate statistics based on the turn-taking, for example statistics about the roles, and this information can be used back in speaker diarization. So it is a kind of prior information that can complement the acoustic features.

Let me say a few words about the data set. As I said at the beginning, in this work we are using the AMI meeting database, with the data divided, of course, into training, development and test sets.
Also, the scenario is kind of static, so it doesn't change much over the recordings or across different meetings. There are always four participants, each of whom has a given role. The four roles are project manager, user interface expert, marketing expert and industrial designer, and these people are talking about developing, I think, a remote control device or something like that. We also assume that there is one supervisor, the project manager, who is directing the meeting. You can see the details about how many meetings were used for training, testing and so on; twenty recordings, I think, were used in the end.

Going back to the technique: we have tried to simplify the work a little, in the sense that speech regions may contain pauses shorter than a fixed number of milliseconds, and we don't care much about cross-talk, which we assign to the previous speaker. Here is just a plot with those speaker segments, where each speaker segment has some beginning time and duration, and each segment is associated with a speaker who has some role in the meeting.

Going back to those n-grams: how do we use this prior information? Again, we have a sequence of speakers, and we also have a one-to-one mapping between the speakers and the roles that the speakers have in each meeting. Based on that we of course have the corresponding sequence of roles. What we are doing is trying to estimate the probability of the role of the current speaker given the roles of the previous speakers. So we are simply applying something like a
language model, as in automatic speech recognition. Of course we are trying simple unigrams, bigrams and trigrams to begin with. Here you can see the traditional equation for training a language model, and there is the equation for perplexity. There is also a table with the perplexities obtained on the test data. Of course, for unigrams the perplexity is going to be four, but you may see that for bigrams and trigrams the perplexity is decreasing, so there is hope that such information can be complementary to the acoustic information.

Now a little about the diarization system. I don't want to say much about it, because there is a second talk where I have one or two slides about the diarization technique itself. Here we are again combining acoustic scores with language model scores, which are those role n-grams. Our diarization method is based on the information bottleneck principle, which was already discussed last year in this paper. As input we assume a speech/non-speech segmentation, and the speech is cut into initial segments; we do a uniform segmentation, so nothing too fancy. Then there is an agglomerative clustering operation, where we cluster the input speech into speaker segments. In the end we have some estimate of the clustering, and we refine it with an HMM/GMM system trained on the speech data, meaning that for each speaker we train one GMM and then run a simple Viterbi decoding, which gives us the sequence of speakers. So far, of course, we have not combined in any prior information from the roles, but this can simply be done during Viterbi decoding.
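The role n-gram model and the perplexity comparison described above can be illustrated with a small sketch. It is not the authors' implementation: the role labels, the toy role sequence and the add-alpha smoothing are all assumptions made for the example, and only the bigram case is shown.

```python
import math
from collections import Counter

# Hypothetical short labels for the four meeting roles named in the talk.
ROLES = ["PM", "UI", "ME", "ID"]

def train_bigram(sequences, roles=ROLES, alpha=1.0):
    """Estimate P(role_t | role_{t-1}) from role sequences.

    Add-alpha smoothing is an assumption of this sketch; the talk does not
    say which smoothing, if any, was used.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab = len(roles)
    return {
        (prev, cur): (pair_counts[(prev, cur)] + alpha)
        / (context_counts[prev] + alpha * vocab)
        for prev in roles
        for cur in roles
    }

def perplexity(model, sequences):
    """Perplexity of a bigram model over all role transitions in the data."""
    log_sum, n = 0.0, 0
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            log_sum += math.log(model[(prev, cur)])
            n += 1
    return math.exp(-log_sum / n)

# Toy role sequence (invented): the project manager alternates with the others.
TRAIN = [["PM", "UI", "PM", "ME", "PM", "ID", "PM", "UI"]]

bigram = train_bigram(TRAIN)
uniform = {(p, c): 1.0 / len(ROLES) for p in ROLES for c in ROLES}

print("uniform perplexity:", perplexity(uniform, TRAIN))
print("bigram perplexity: ", perplexity(bigram, TRAIN))
```

A model that knows nothing about turn-taking spreads its probability evenly over the four roles, which gives a perplexity of four. Any lower value means the role history carries information about who speaks next, which is the property the talk relies on.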
I'm just mentioning that there was already some work on using such information, for example this paper from 2009, but they were using meeting-dependent information about the interaction between speakers.

For the experiments and testing, we want to show that what we are doing works well, or improves on the baseline. We consider two cases. In the first case we know the one-to-one mapping between speakers and roles. Of course this is kind of cheating, but suppose that we know this information beforehand: we know that speaker one is, for example, the project manager, and so on. Applying this in the diarization system is then, as I said, pretty simple. We use Viterbi decoding where not only the acoustic likelihoods are used but also these prior probabilities from the role n-grams. So it is just the classical equation for Viterbi decoding, and we also use a scaling factor and an insertion penalty to tune the output. Of course these factors are tuned on the development set, which also comes from the AMI meetings.

The second case is more difficult, but it is more realistic: we suppose that we don't know the mapping between speakers and roles, so we need to estimate it somehow. We do it from the estimated speaker segments that we have before this Viterbi decoding. What we do is pretty simple, although it takes some time and computation: we do an exhaustive search over all possible mappings, and we just look for the maximum likelihood. In the end we get an estimate of which role each speaker has in the meeting, and we can then apply this prior information from the statistics.
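The two cases described above can be sketched in code. This is a sketch under stated assumptions, not the system from the talk: the acoustic scores and the role bigram table are invented toy values, the function names are hypothetical, and the sketch assumes that staying with the same speaker costs nothing while a speaker change pays the scaled role bigram plus the insertion penalty. The likelihood maximised in the mapping search is taken to be the role bigram likelihood of the mapped turn sequence, which is also an assumption.

```python
import math
from itertools import permutations

def change_score(prev_role, role, bigram, scale, penalty):
    """Score added on a speaker change: scaled role bigram plus insertion penalty."""
    return scale * math.log(bigram[(prev_role, role)]) + penalty

def viterbi_with_roles(acoustic_ll, roles_of, bigram, scale=1.0, penalty=0.0):
    """Case 1 (mapping known): best speaker index for each segment.

    acoustic_ll[t][k] is the log-likelihood of segment t under speaker k's
    model; roles_of[k] is the role assigned to speaker k.
    """
    n_spk = len(roles_of)
    score = list(acoustic_ll[0])
    backpointers = []
    for frame in acoustic_ll[1:]:
        new_score, pointers = [], []
        for k in range(n_spk):
            # Staying with speaker k is free; coming from another speaker
            # pays the role prior (an assumption of this sketch).
            candidates = [
                score[j] if j == k
                else score[j] + change_score(roles_of[j], roles_of[k], bigram, scale, penalty)
                for j in range(n_spk)
            ]
            j_best = max(range(n_spk), key=candidates.__getitem__)
            new_score.append(candidates[j_best] + frame[k])
            pointers.append(j_best)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final speaker.
    k = max(range(n_spk), key=score.__getitem__)
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return path[::-1]

def best_role_mapping(turns, roles, bigram):
    """Case 2 (mapping unknown): try every speaker-to-role assignment and
    keep the one whose role sequence is most likely under the bigram."""
    speakers = sorted(set(turns))

    def log_likelihood(mapping):
        seq = [mapping[s] for s in turns]
        return sum(math.log(bigram[(p, c)]) for p, c in zip(seq, seq[1:]))

    candidates = (dict(zip(speakers, perm)) for perm in permutations(roles, len(speakers)))
    return max(candidates, key=log_likelihood)

# Toy data (invented). Three roles only, to keep the table small.
ROLES = ["PM", "UI", "ME"]
BIGRAM = {
    ("PM", "UI"): 0.5, ("PM", "ME"): 0.5,
    ("UI", "PM"): 0.8, ("UI", "ME"): 0.2,
    ("ME", "PM"): 0.8, ("ME", "UI"): 0.2,
}
# Segment 0 is clearly speaker 1; segment 1 is acoustically ambiguous
# between speaker 0 and speaker 2.
ACOUSTIC = [[-9.0, -1.0, -9.0], [-5.0, -9.0, -4.9]]
# Turn sequence from an initial clustering; consecutive turns differ.
TURNS = ["spk1", "spk0", "spk2", "spk0", "spk1", "spk0"]

print(viterbi_with_roles(ACOUSTIC, ROLES, BIGRAM, scale=0.0))
print(viterbi_with_roles(ACOUSTIC, ROLES, BIGRAM, scale=1.0))
print(best_role_mapping(TURNS, ROLES, BIGRAM))
```

With the scaling factor set to zero the decoder follows the acoustic scores alone; with the role prior switched on, the ambiguous segment goes to the speaker whose role is more likely to follow the previous one. The exhaustive search grows factorially with the number of speakers, which matches the remark that it takes some time and computation.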
Then in the end the decoding is again pretty simple: we just run Viterbi.

These are the plots for that. The first case, where we know the one-to-one mapping, is the bottom plot, and the second case, where we need to estimate the mapping, is the first one. We try to estimate it from the speakers that come from the clustering, directly before the GMM modeling.

Now the results. Again these are AMI meetings, in this case with four speakers and four roles, so we force the clustering to converge to four speakers. The results are given as diarization error rates. We don't count the speech/non-speech errors, because they are the same for all the systems, so we are only accounting for the speaker error in this case. For case one, with no prior we get a fourteen percent error rate, and using the n-grams we may see that there is a decrease in the error rate. For case two, the realistic one where we don't know the mapping between speakers and roles, we still get a gain, with the best result for trigrams.

Some more results: if we break the results down over the four speakers, we may see that for the project manager, who is kind of directing each meeting, we are getting most of the gain compared to the other three participants in the meeting. Also, from the analysis we have seen that the proposed method outperforms the baseline especially for short segments, where the acoustic scores are probably not well estimated and this prior information can help.

The question is how this generalises to other data, because everything so far was done on AMI meetings, which all follow the same scenario. For this we are using Rich Transcription data from
2006 and 2007, seventeen meetings. There are multiple distant microphones with beamforming, so we get one single speech signal at the end. There are not only four participants here; there are up to nine participants in a meeting. But we always assume that one of them is a project manager, or the person who is leading the meeting, and the others can be any of the other three roles.

If you look at the results, we may again see the performance of the baseline, and by applying the n-grams there is a decrease in speaker error, down to twelve percent. This plot shows the speaker error rate for each meeting, in this case not for AMI but for Rich Transcription. We may see that in most of the meetings there is really some gain. I think for just two of those Rich Transcription meetings we don't get an improvement, but the other fifteen are improving with this information.

So that's it. To conclude: what we are presenting here is a speaker diarization system in which we are attempting to use prior information from roles, or from conversation analysis, back in speaker diarization. We will also show in the second talk that the technique can be improved by combining different sorts of features, but in that case we still use this prior information. Mainly, we are getting an improvement on the AMI data, and what is also important is that the prior information, this language model which we estimated from AMI data, also generalises to different data. This means that the technique has the potential to be used for
different data. And, as it says in the last item, in our case we just considered four very simple roles in a meeting, and of course this can be improved by exploiting other information. I think that's all from me.

Thanks to the speaker. Any questions?

(Audience question, mostly inaudible, concerning the meetings where the error rate was very high and how the language model was trained.)

No, the language model is trained on development data, over several meetings of course; I think it was like twenty meetings. Then we applied such a language model, or n-gram, to the test data. So it's not meeting-specific.

(Second audience question, inaudible.)

We have been using unigrams, bigrams and trigrams, so kind of traditional n-grams. If you look at the results, even for the perplexity, and at the results which we achieve with the technique, it looks like it works for trigrams as well.