0:00:18 Okay, so this talk is going to be about actually implementing, or investigating, role n-gram models in speaker diarization. My name is Petr, and as you may check, I am the third author of this paper; the main author is the first one, who is mainly known for his speaker diarization work. He is not here, so it is me presenting. All of us are coming from the Idiap Research Institute.
0:01:00 Let me say a few words about the research direction. The speaker diarization task tries to segment input speech according to the speaker. Recent applications are focusing a lot on meeting data, on recordings of speech actually captured by multiple distant microphones, and also on more spontaneous conversations.
0:01:29 The main work so far has concentrated on using different kinds of acoustic features, like MFCCs or the time delay of arrival, or their combination, which we also do in our work. But most speaker diarization systems ignore the fact that a conversation is a sequence of human interactions, so statistics and information that can be estimated from, let's say, the conversation itself are not usually used. Here is just a simple plot which shows the output of speaker diarization: the input speech and the segments for each speaker.
0:02:27 So again, going back to the motivation of the work: if you look at conversational speech, it usually appears to be unconstrained and spontaneous, yet it is also governed by principles, and there are actually some regularities we can exploit. One of those is, for example, turn-taking: each speaker in a meeting or a conversation has some role, and based on that, people take their turns in the conversation. Just for terminology, roles can be of three classes: formal roles, functional roles, or social roles.
0:03:19 So the motivation to perform conversation analysis like this, for example role recognition, is that people are already using information and statistics of conversations such as turn-taking, turn durations, or the lengths of the speaker segments. There are many corpora for this, like, as you probably know, the ICSI meetings or the AMI meetings; the AMI meetings, for example, come with annotations well suited for this purpose. All these statistics can often be extracted from speaker diarization output. But in our case, what we are going to present in this paper is that we tried to use the information from the analysis of conversations as prior information, and to use it back in the speaker diarization.
0:04:20 So here is a diagram of our technique. You can see that the acoustic features enter the speaker diarization, which performs clustering into speaker turns or segments. With the output from the speaker diarization you can then compute, or estimate, statistics based on this turn-taking, for example statistics about the roles, and this information can then be used back in the speaker diarization. So it is a kind of prior information which can enhance the acoustic information.
0:05:03 Let me say a few words about the data set. As I said at the beginning, in this work we are again using the AMI meeting database, approximately a hundred hours of data, of course split into training and test sets. The scenario is kind of fixed, so it doesn't change much over the recordings of different meetings: there are always four participants, each with a given role, so four roles: a project manager, a user interface expert, a marketing expert, and an industrial designer. Those people are talking about developing something like a remote control device. We also assume that there is one supervisor, the project manager, who is actually directing the meeting. You can see here the data about how many meetings were used for training, testing, and so on; around twenty recordings in the end were used.
0:06:12 Going back to the technique itself, to outline it: we have tried to simplify the task in the sense that the speech regions can be decomposed at pauses longer than a certain number of milliseconds, and we don't care much about cross-talk, so we are actually not handling the overlapped speech. Here is again a plot with those speaker segments, where each speaker segment has some beginning and duration, and each of these speech segments is associated with a speaker, which has some role in the meeting.
0:06:59 Again, going back to those n-grams, and how to use this prior information. We have a sequence of speakers, which is denoted S here. We also have a one-to-one mapping between the speakers and the roles, that is, which role each speaker had in each meeting, and from that we of course get the corresponding sequence of roles, which is denoted R here. What we are doing is estimating the probability of the current speaker, depending on its role, given the previous speakers, so we are simply applying something like a language model in automatic speech recognition. Of course, to begin with, we tried simple bigrams and trigrams.
0:08:01 Here you can see the traditional equations for training the language model, the equation for the perplexity, and a table with the perplexities obtained on the test data. Of course, for the unigram the perplexity is going to be four, but then you may see that for the bigrams and trigrams the perplexity is decreasing, so there is hope that such information can actually be used to augment the acoustic information.
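The role language model just described (estimate n-gram probabilities over the role sequence, then check perplexity on held-out data) can be sketched roughly as follows. This is a minimal illustration assuming the four AMI roles; the function names and the add-one smoothing are my assumptions, not necessarily what the paper uses.

```python
from collections import Counter
import math

# The four AMI roles assumed in the talk: project manager, user-interface
# expert, marketing expert, industrial designer.
ROLES = ["PM", "UI", "ME", "ID"]

def train_bigram(role_sequences, smoothing=1.0):
    """Estimate P(r_i | r_{i-1}) from training meetings, with add-one smoothing
    (the paper's actual smoothing scheme is not stated in the talk)."""
    pair_counts, context_counts = Counter(), Counter()
    for seq in role_sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + smoothing) / \
               (context_counts[prev] + smoothing * len(ROLES))
    return prob

def perplexity(prob, seq):
    """Perplexity of a role sequence under the bigram; a unigram over four
    equiprobable roles would give exactly 4."""
    log_p = sum(math.log(prob(p, c)) for p, c in zip(seq, seq[1:]))
    return math.exp(-log_p / (len(seq) - 1))

# Toy role sequence: the project manager tends to take every other turn.
train = [["PM", "UI", "PM", "ME", "PM", "ID", "PM", "UI", "PM"]]
prob = train_bigram(train)
print(perplexity(prob, ["PM", "ME", "PM", "UI"]))  # below 4: turn-taking is predictable
```

A perplexity below the unigram value of four on held-out meetings is exactly the hope expressed here: the role sequence carries information beyond chance.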
0:08:41 Now let me turn to the diarization system. I don't want to say too much about it; there is a second talk where I have one or two slides about the technique itself. So we are again combining the acoustic scores with the language model scores, which are actually those role n-grams. Our diarization method is based on the information bottleneck principle, which we were discussing last year in this paper. As input we assume the speech/non-speech segmentation, and a segmentation of the speech into initial segments; we actually do uniform segmentation, so nothing too fancy. Then there is an agglomerative clustering operation, so we are trying to cluster the input speech into the speaker segments. In the end we have some estimate of the clustering, and we then refine it by a Viterbi realignment system: we retrain on the speaker speech data, meaning that for each speaker we train one GMM, and then we just run a Viterbi decoding, which is going to give us the sequence of the speakers. At this point we of course didn't combine in any prior information from the roles, but this can simply be done during the Viterbi decoding.
0:10:21 Let me just mention that there have already been some attempts at using such information; for example this paper from two thousand nine, where they were using meeting-dependent turn statistics between the speakers.
0:10:41 Now for the experiments, the testing, where we want to prove that what we are doing works well, or improves the baseline. We consider two cases. In the first case we assume we know the one-to-one mapping between the speakers and the roles. This is of course kind of cheating, since we know this information beforehand: we know that speaker one is, for example, the project manager, and so on. To obtain the output, the decoding in the diarization system is, as I said, a pretty simple Viterbi decoding, where not only the acoustic likelihoods are used, but also these prior probabilities coming from the roles. So this is the classical equation for the Viterbi decoding, and we also use some scaling factor and an insertion penalty to tune the output. Of course, these factors are tuned on a development set, which we split off from the AMI meeting data.
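The decoding just described, a Viterbi pass over speakers where the transition score mixes a scaled role-bigram prior with an insertion penalty on speaker changes, might be sketched like this. It is a simplified stand-in for the system's realignment step, working on per-segment GMM log-likelihoods; all names, the penalty form, and the default values are assumptions.

```python
import math

def viterbi_with_role_prior(acoustic_ll, role_of, role_bigram,
                            lm_scale=1.0, ins_penalty=-0.5):
    """acoustic_ll[t][s]: log-likelihood of segment t under speaker s's GMM.
    role_of[s]: role of speaker s (assumed known, as in case one).
    role_bigram(prev_role, role): probability from the role language model."""
    n_seg, n_spk = len(acoustic_ll), len(acoustic_ll[0])
    delta = [list(acoustic_ll[0])]   # best partial score ending in speaker s
    back = []                        # backpointers for the traceback
    for t in range(1, n_seg):
        row, bp = [], []
        for s in range(n_spk):
            best, arg = -math.inf, 0
            for p in range(n_spk):
                trans = lm_scale * math.log(role_bigram(role_of[p], role_of[s]))
                if p != s:
                    trans += ins_penalty   # discourage spurious speaker changes
                if delta[-1][p] + trans > best:
                    best, arg = delta[-1][p] + trans, p
            row.append(best + acoustic_ll[t][s])
            bp.append(arg)
        delta.append(row)
        back.append(bp)
    s = max(range(n_spk), key=lambda i: delta[-1][i])
    path = [s]
    for bp in reversed(back):        # trace the best speaker sequence back
        s = bp[s]
        path.append(s)
    return path[::-1]
```

Here `lm_scale` and `ins_penalty` play the part of the scaling factor and insertion penalty that the talk says are tuned on the development set.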
0:11:57 The second case is more difficult, but it is more realistic: we suppose that we don't know the mapping between the speakers and the roles. So we need to somehow estimate it, and we do that from the estimated speaker segments which come before this Viterbi decoding. What we do is pretty simple, although it takes some time and computation: we run an exhaustive search over all possible combinations of the mapping, and we just look for the one with the maximum likelihood. So in the end we get some estimate of which role each speaker has in the meeting, and we can then apply this prior information from the role statistics. And then the decoding is again pretty simple: we just run the Viterbi as before.
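The exhaustive search over speaker-to-role assignments that keeps the maximum-likelihood mapping could look roughly as follows; an illustrative sketch assuming the four roles and a role-bigram scorer, with hypothetical names throughout.

```python
from itertools import permutations
import math

ROLES = ["PM", "UI", "ME", "ID"]

def best_role_mapping(speaker_seq, speakers, role_bigram):
    """Try every assignment of roles to speakers and keep the one under which
    the decoded speaker sequence has the highest role-LM log-likelihood."""
    best_map, best_ll = None, -math.inf
    for perm in permutations(ROLES, len(speakers)):
        mapping = dict(zip(speakers, perm))
        roles = [mapping[s] for s in speaker_seq]
        ll = sum(math.log(role_bigram(p, c)) for p, c in zip(roles, roles[1:]))
        if ll > best_ll:
            best_ll, best_map = ll, mapping
    return best_map
```

With four speakers this is only 4! = 24 permutations, which is why the brute-force search is affordable despite the remark that it takes some computation.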
0:13:07 So here are the plots for that. The first case, the easier one, where the one-to-one mapping is known, is the bottom plot; the second case, where we need to estimate this mapping from those speakers which come from the clustering directly, before the GMM modeling, is the first one.
0:13:34 Now to the results. These are the AMI meetings, in this case with four speakers and four roles, so we actually force the clustering to converge to four speakers. The results are diarization error rates; we don't count the speech/non-speech errors, because they are the same for all the systems, so we are just accounting for the speaker error in this case. For case one, with the baseline we get around a fourteen percent error rate, but using the role n-grams you may see that there is a decrease in the error rate, and we gain about two percent. For case two, which is the realistic one, where we don't know the mapping between the speakers and the roles, we are still getting a gain for the trigram.
0:14:27 Some more results: if we look at the modeled speaker time, the activity attributed to each of the four speakers, you may see that for the project manager, which is the role kind of directing the meeting, we are getting most of the gains compared to the other three participants in the meeting. Also, from the analysis we have seen that the proposed method outperforms the previous one, the baseline, especially for short segments, where the acoustic scores are not properly estimated and this prior information can help.
0:15:03 The question is how this can generalize to other data, because everything so far was done on AMI meeting speech alone. For this we are using the NIST Rich Transcription data from two thousand six and two thousand seven: seventeen meetings. The distant microphones are beamformed, so we get one single enhanced speech signal at the input. These data do not contain only four participants; there are up to nine participants in a meeting, but in all of them we somehow assume that one participant is the project manager, the person who was leading the meeting, and the rest can be any of those other three roles. If you look at the results, we may again see an improvement: by applying the n-grams there is a decrease in the speaker error rate, down to twelve percent.
0:16:05 This is kind of a breakdown: for each meeting we see the speaker error rate, in this case for each Rich Transcription meeting. We may see that in most of the meetings there is really some gain; I think for only two of those Rich Transcription meetings we don't get an improvement, but the other fifteen are actually improving with this information.
0:16:36 So that's it; the conclusions. What we are presenting here is a speaker diarization system where we are attempting to use the prior information from the roles, that is, from the conversation analysis, back in the speaker diarization. Also, as I will show in the second talk, the technique can be improved further by combining different sets of features, and in that case we can still use this prior information. Roughly, we are getting about a two percent absolute improvement on the AMI data. What is also important is that the prior information, this language model which we estimated from the AMI data, generalizes also to different data, which means that this technique has the potential to be used on other data as well. And, as the last item says, in our case we just considered a very simple set of four roles in a meeting, and of course this can be improved by exploiting other information about the participants. I think that's all for the talk, so thank you.
0:18:25 [Session chair] Are there any questions?
0:18:37 [Question from the audience, largely inaudible: asking how the language model was used, and about some of the meetings with a very high error rate where no gain is obtained.]
0:19:46 No, the language model is trained on the development data, held out from the test meetings of course; I think it was something like twenty meetings. And then we applied such a language model, the n-gram, on the test data, so it is not meeting-specific.
0:20:10 We have been using unigrams, bigrams, and trigrams, so kind of traditional n-grams, and even the perplexities looked promising; if you look at the results we were able to achieve with the technique, it appears that also for the trigrams it still works.
0:20:35 [Session chair] Are there any further questions? [Inaudible closing exchange.]