| 0:00:13 | Hello everyone, and thank you for attending my presentation. |
|---|
| 0:00:21 | My name is [inaudible], and I am going to present to you our work |
|---|
| 0:00:28 | on linguistically aided speaker diarization using speaker role information. |
|---|
| 0:00:35 | First of all, in a few words, our task is speaker diarization, |
|---|
| 0:00:43 | which aims to answer the question "who spoke when?". |
|---|
| 0:00:47 | So, |
|---|
| 0:00:50 | given as input a raw speech signal, |
|---|
| 0:00:53 | what we want is to partition the signal into speaker-homogeneous segments, |
|---|
| 0:00:59 | without having any prior information about the speakers present in the session. |
|---|
| 0:01:07 | Conceptually and traditionally, |
|---|
| 0:01:10 | this task involves two steps. |
|---|
| 0:01:14 | First, |
|---|
| 0:01:15 | we want to segment the signal |
|---|
| 0:01:19 | into speaker-homogeneous segments, and this can be done either in a uniform way or according |
|---|
| 0:01:25 | to some speaker change detection. |
|---|
| 0:01:28 | And then, given those speaker segments, we want to cluster them into speaker groups. |
|---|
| 0:01:36 | But |
|---|
| 0:01:37 | there are specific problems connected to |
|---|
| 0:01:42 | this step of clustering. |
|---|
| 0:01:44 | In particular, if there are |
|---|
| 0:01:50 | speakers within the conversation |
|---|
| 0:01:54 | who are very similar in terms of their acoustic characteristics, |
|---|
| 0:01:59 | then there is the risk of merging |
|---|
| 0:02:03 | the corresponding clusters together. |
|---|
| 0:02:07 | Also, |
|---|
| 0:02:08 | if there is too much noise or silence |
|---|
| 0:02:10 | within the speech signal, |
|---|
| 0:02:13 | which has possibly not been caught by voice activity detection, |
|---|
| 0:02:20 | then we may construct additional clusters |
|---|
| 0:02:24 | modeling those nuisances. |
|---|
| 0:02:28 | And as a result, |
|---|
| 0:02:30 | this can in fact degrade |
|---|
| 0:02:32 | the performance of the system, |
|---|
| 0:02:35 | even if |
|---|
| 0:02:36 | we knew in advance the number of speakers |
|---|
| 0:02:39 | in the conversation. |
|---|
| 0:02:44 | In this work, we focus on scenarios where the speakers assume |
|---|
| 0:02:50 | specific roles. |
|---|
| 0:02:52 | For example, we may think of a doctor-patient interaction, a classroom interaction where we have |
|---|
| 0:02:59 | the teacher and the students, |
|---|
| 0:03:02 | an interview where we have the interviewer and the interviewee, and so on. |
|---|
| 0:03:08 | And the interesting feature of those scenarios |
|---|
| 0:03:12 | is that different roles are usually associated |
|---|
| 0:03:16 | with distinctive |
|---|
| 0:03:18 | linguistic behaviors. |
|---|
| 0:03:21 | For example, in an interview, we expect that the interviewer will ask most of the questions and |
|---|
| 0:03:25 | the interviewee will answer those questions. |
|---|
| 0:03:29 | Or, in a medical conversation, we expect that the patient will describe their symptoms |
|---|
| 0:03:37 | and the doctor will |
|---|
| 0:03:39 | give medical instructions. |
|---|
| 0:03:42 | So the question now is: can we leverage language, and more specifically |
|---|
| 0:03:47 | those linguistic patterns, |
|---|
| 0:03:49 | to aid |
|---|
| 0:03:50 | diarization? |
|---|
| 0:03:54 | So, |
|---|
| 0:03:56 | if we recall the problem formulation for |
|---|
| 0:04:00 | diarization in the traditional approach, |
|---|
| 0:04:04 | what we |
|---|
| 0:04:05 | do is, given the audio signal, |
|---|
| 0:04:08 | we first segment it, we extract speaker embeddings, and then we cluster them. |
|---|
| 0:04:16 | Instead, |
|---|
| 0:04:18 | what we propose is to also |
|---|
| 0:04:22 | process the lexical information, which can be derived |
|---|
| 0:04:26 | from an ASR system, |
|---|
| 0:04:32 | and to use |
|---|
| 0:04:33 | some external knowledge about the roles within the conversation, |
|---|
| 0:04:40 | and use this knowledge to estimate the role profiles. |
|---|
| 0:04:45 | By profiles we mean the acoustic |
|---|
| 0:04:47 | signatures |
|---|
| 0:04:49 | of the speakers in the conversation. |
|---|
| 0:04:51 | And now, since we have those role profiles, we can convert the clustering problem |
|---|
| 0:04:57 | into a classification one, |
|---|
| 0:04:59 | and thus |
|---|
| 0:05:00 | we can avoid the potential problems connected with clustering |
|---|
| 0:05:05 | that we mentioned earlier. |
|---|
| 0:05:08 | Now, in the next few slides, I want to go into more detail |
|---|
| 0:05:14 | on the specific |
|---|
| 0:05:16 | modules we use |
|---|
| 0:05:18 | and how we have implemented them. |
|---|
| 0:05:22 | So notice here that in the first |
|---|
| 0:05:25 | couple of steps of our system, |
|---|
| 0:05:28 | we only process the textual information. |
|---|
| 0:05:31 | So, given the text, the first step is that we want to segment this stream of text |
|---|
| 0:05:38 | in such a way that, |
|---|
| 0:05:39 | after this segmentation step, |
|---|
| 0:05:44 | we expect |
|---|
| 0:05:45 | every segment |
|---|
| 0:05:46 | to be uttered by a single speaker. |
|---|
| 0:05:50 | Now, ideally, we would want a system |
|---|
| 0:05:52 | that |
|---|
| 0:05:54 | would know exactly |
|---|
| 0:05:58 | where there is |
|---|
| 0:06:00 | a speaker change in the conversation. |
|---|
| 0:06:04 | Instead, |
|---|
| 0:06:06 | what is feasible is to assume |
|---|
| 0:06:07 | that there is a single speaker per sentence, |
|---|
| 0:06:11 | so we will segment at the sentence level. |
|---|
| 0:06:15 | And to do so, we view this problem as a sequence labeling, or sequence tagging, |
|---|
| 0:06:21 | problem. |
|---|
| 0:06:24 | Following a common architecture here, we initially construct |
|---|
| 0:06:33 | a character-level representation for each word; |
|---|
| 0:06:39 | we concatenate |
|---|
| 0:06:42 | this representation with the |
|---|
| 0:06:46 | word embedding of the corresponding word, and then this |
|---|
| 0:06:50 | sequence of word representations is fed into a bi-LSTM network, |
|---|
| 0:06:58 | which predicts a sequence of labels. |
|---|
| 0:07:02 | The labels here are two: |
|---|
| 0:07:04 | "b" denotes that the word is at the beginning of a sentence, and "m" denotes that the |
|---|
| 0:07:10 | word is in the middle of a sentence, |
|---|
| 0:07:13 | which essentially means |
|---|
| 0:07:14 | every word which is not at the beginning. |
|---|
| 0:07:17 | So our sentences here are each one of those sequences of |
|---|
| 0:07:22 | words, from one "b" |
|---|
| 0:07:27 | until the next one. |
|---|
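The decoding step described above, where a new sentence starts at every "b" tag and runs until the next one, can be sketched as follows. This is an illustrative post-processing snippet with made-up example words, not the bi-LSTM tagger itself:

```python
def tags_to_sentences(words, tags):
    """Group a tagged word stream into sentences.

    Each word carries a label 'b' (beginning of a sentence) or 'm'
    (middle, i.e. any word that is not sentence-initial): a sentence
    runs from one 'b' until the next one.
    """
    sentences = []
    for word, tag in zip(words, tags):
        if tag == "b" or not sentences:  # start a new sentence
            sentences.append([word])
        else:                            # continue the current one
            sentences[-1].append(word)
    return [" ".join(s) for s in sentences]

words = ["how", "are", "you", "i", "am", "fine"]
tags  = ["b",  "m",  "m",  "b", "m",  "m"]
print(tags_to_sentences(words, tags))  # ['how are you', 'i am fine']
```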
| 0:07:31 | Now, given those segments, we want to assign a role |
|---|
| 0:07:35 | to each one of those. |
|---|
| 0:07:37 | So, |
|---|
| 0:07:39 | in the domain we are working on, we assume that we know a priori |
|---|
| 0:07:45 | the roles appearing in this domain. |
|---|
| 0:07:49 | So we build |
|---|
| 0:07:50 | role-specific |
|---|
| 0:07:52 | language models, trained for each role, and we also have a generic language model, |
|---|
| 0:07:58 | and from those we construct the corresponding role models. |
|---|
| 0:08:04 | Specifically, we interpolate the role-specific and the generic language models, |
|---|
| 0:08:13 | and all the weights of this interpolation |
|---|
| 0:08:17 | are optimized on a development set. |
|---|
| 0:08:21 | So, once we have interpolated the language models, |
|---|
| 0:08:24 | we can simply assign |
|---|
| 0:08:26 | to each text segment the role that minimizes the corresponding perplexity. |
|---|
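As a toy illustration of this perplexity-based role assignment, here is a sketch using add-alpha-smoothed unigram language models in place of the models trained in the talk; the corpora, vocabulary, and interpolation weight below are all invented for the example:

```python
import math
from collections import Counter

def unigram_lm(corpus_tokens, vocab, alpha=0.1):
    """Add-alpha-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

def interpolate(role_lm, generic_lm, lam):
    """Linear interpolation of a role-specific and a generic LM;
    the weight lam would be tuned on a development set."""
    return {w: lam * role_lm[w] + (1 - lam) * generic_lm[w] for w in role_lm}

def perplexity(lm, tokens):
    logp = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-logp / len(tokens))

vocab = {"how", "are", "you", "feeling", "i", "feel", "sad", "today"}
therapist_lm = unigram_lm("how are you feeling today".split(), vocab)
patient_lm   = unigram_lm("i feel sad".split(), vocab)
generic_lm   = unigram_lm("you i today feeling".split(), vocab)

lms = {"therapist": interpolate(therapist_lm, generic_lm, 0.7),
       "patient":   interpolate(patient_lm,   generic_lm, 0.7)}

segment = "i feel sad today".split()
# assign the role whose interpolated LM gives the lowest perplexity
role = min(lms, key=lambda r: perplexity(lms[r], segment))
print(role)  # patient
```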
| 0:08:35 | Now, notice that so far we have only operated on the text. |
|---|
| 0:08:40 | In the next step, to estimate the acoustic identities of the speakers |
|---|
| 0:08:45 | appearing in the conversation, |
|---|
| 0:08:47 | we also need the audio. |
|---|
| 0:08:50 | So here we need to align the text and the audio. |
|---|
| 0:08:56 | And the |
|---|
| 0:08:57 | textual information comes from an ASR system, which means that in a real-world application |
|---|
| 0:09:03 | this alignment information is already available as a by-product of recognition. |
|---|
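Since ASR output typically carries word-level timestamps, mapping each text segment back to its audio span reduces to a lookup. A minimal sketch, with hypothetical segment indices and times (not data from the talk):

```python
def segment_times(segments, word_times):
    """Map each text segment to an audio interval.

    segments   : list of lists of word indices, in utterance order
    word_times : list of (start_sec, end_sec) per word, from the ASR
    Returns (start, end) per segment: the start time of its first
    word and the end time of its last word.
    """
    return [(word_times[seg[0]][0], word_times[seg[-1]][1]) for seg in segments]

# two segments over a six-word utterance, with hypothetical timestamps
word_times = [(0.0, 0.4), (0.4, 0.7), (0.7, 1.1), (1.5, 1.8), (1.8, 2.2), (2.2, 2.6)]
segments = [[0, 1, 2], [3, 4, 5]]
print(segment_times(segments, word_times))  # [(0.0, 1.1), (1.5, 2.6)]
```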
| 0:09:10 | So, having those audio-aligned segments, |
|---|
| 0:09:13 | we extract a speaker embedding, with a pre-trained embedding extractor, |
|---|
| 0:09:21 | for each segment |
|---|
| 0:09:23 | assigned to a specific role, |
|---|
| 0:09:26 | and we can now define, as the role's |
|---|
| 0:09:31 | overall acoustic identity, |
|---|
| 0:09:33 | the average of all those |
|---|
| 0:09:36 | speaker embeddings drawn from that role. |
|---|
| 0:09:41 | By doing so, however, |
|---|
| 0:09:45 | we assume that |
|---|
| 0:09:47 | we can rely on all the |
|---|
| 0:09:51 | segments being correctly role-assigned. |
|---|
| 0:09:54 | However, |
|---|
| 0:09:56 | we cannot be equally confident about all the role assignments, and the reason is that, |
|---|
| 0:10:03 | since we have conversational interactions, |
|---|
| 0:10:07 | after the oversegmentation we may have |
|---|
| 0:10:10 | some very short segments, for example |
|---|
| 0:10:14 | even one-word utterances like |
|---|
| 0:10:17 | "mm-hmm", which do not contain sufficient information |
|---|
| 0:10:20 | for text-based role recognition. |
|---|
| 0:10:25 | So what we are doing instead is that we |
|---|
| 0:10:28 | assign a confidence measure |
|---|
| 0:10:30 | to each one of those segments, |
|---|
| 0:10:32 | and this confidence measure is the absolute difference |
|---|
| 0:10:35 | between the best perplexity we have |
|---|
| 0:10:40 | from a role model and the second best one. |
|---|
| 0:10:45 | And now we can define a refined |
|---|
| 0:10:52 | role profile |
|---|
| 0:10:58 | as an average, but for this average we only take into account the |
|---|
| 0:11:00 | segments |
|---|
| 0:11:03 | for which the confidence |
|---|
| 0:11:05 | is above |
|---|
| 0:11:06 | some threshold, |
|---|
| 0:11:09 | and this is decided by a tunable parameter of our system. |
|---|
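A small sketch of this confidence-filtered profile estimation. The embeddings, role labels, and confidence values are invented for illustration; in the system described, the confidence is the perplexity gap between the best and second-best role models:

```python
def role_profiles(embeddings, roles, confidences, threshold):
    """Average each role's segment embeddings, keeping only the
    segments whose confidence (best-vs-second-best perplexity gap)
    is above a tunable threshold."""
    profiles = {}
    for role in set(roles):
        kept = [e for e, r, c in zip(embeddings, roles, confidences)
                if r == role and c >= threshold]
        if kept:  # element-wise mean of the surviving embeddings
            dim = len(kept[0])
            profiles[role] = [sum(e[i] for e in kept) / len(kept) for i in range(dim)]
    return profiles

emb   = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.5, 0.5]]
roles = ["therapist", "therapist", "patient", "patient"]
conf  = [0.9, 0.2, 0.8, 0.7]
profiles = role_profiles(emb, roles, conf, threshold=0.5)
print(profiles["therapist"])  # [1.0, 0.0]  (only the first segment survives)
print(profiles["patient"])    # [0.25, 0.75]
```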
| 0:11:16 | So, now that we have estimated the role profiles, we are ready to |
|---|
| 0:11:22 | perform |
|---|
| 0:11:23 | the diarization, |
|---|
| 0:11:25 | where instead of clustering we can have a classification approach. |
|---|
| 0:11:32 | Here, we follow the traditional approach for diarization, where first we segment |
|---|
| 0:11:38 | the speech signal uniformly with a sliding window, |
|---|
| 0:11:42 | we extract |
|---|
| 0:11:44 | a speaker embedding for each resulting segment, |
|---|
| 0:11:49 | and we compute |
|---|
| 0:11:51 | the cosine similarity |
|---|
| 0:11:54 | of each segment embedding |
|---|
| 0:11:57 | with all the role profiles we have just estimated. |
|---|
| 0:12:03 | And the role that we assign to each segment |
|---|
| 0:12:07 | is the one |
|---|
| 0:12:09 | whose profile is most similar to the segment, |
|---|
| 0:12:11 | that is, the one that maximizes |
|---|
| 0:12:13 | this cosine similarity score. |
|---|
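This final classification stage reduces to an argmax over cosine similarities. A minimal sketch with two-dimensional toy embeddings (real speaker embeddings are much higher-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_roles(window_embeddings, profiles):
    """Label each sliding-window embedding with the role whose
    profile gives the maximum cosine similarity."""
    return [max(profiles, key=lambda role: cosine(e, profiles[role]))
            for e in window_embeddings]

profiles = {"therapist": [1.0, 0.0], "patient": [0.0, 1.0]}
windows  = [[0.9, 0.1], [0.2, 0.8]]
print(assign_roles(windows, profiles))  # ['therapist', 'patient']
```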
| 0:12:21 | So this is the system we are proposing, and we are going |
|---|
| 0:12:26 | to evaluate the system on dyadic psychotherapy interactions, where we have two roles, namely |
|---|
| 0:12:32 | the therapist and the patient. |
|---|
| 0:12:38 | And we are also going to use a mix of corpora |
|---|
| 0:12:41 | in order to train our sentence tagger and our language models. |
|---|
| 0:12:47 | Here you can see the datasets and the sizes of the corpora |
|---|
| 0:12:53 | we are using. |
|---|
| 0:12:57 | I am not going to go into detail |
|---|
| 0:13:00 | on the specific parameters that we used for the system and the several subsystems. |
|---|
| 0:13:07 | I will just mention that the F1 score of our sentence tagger was about |
|---|
| 0:13:14 | 0.8, after |
|---|
| 0:13:18 | evaluating it on a held-out test set, |
|---|
| 0:13:21 | and that the word error rate of the ASR system we are using |
|---|
| 0:13:26 | was about forty percent for this dataset, which is admittedly high, but actually |
|---|
| 0:13:33 | is quite common when it comes to such spontaneous clinical conversations. |
|---|
| 0:13:40 | And |
|---|
| 0:13:41 | as baselines, we will use an audio-only and a language-only baseline. |
|---|
| 0:13:46 | For the audio-only baseline, our comparison system |
|---|
| 0:13:50 | is the one that we have |
|---|
| 0:13:52 | already mentioned: the traditional system, where we have a uniform segmentation and then |
|---|
| 0:13:58 | the clustering. |
|---|
| 0:14:01 | And for the language-only baseline, |
|---|
| 0:14:04 | we essentially follow the first steps |
|---|
| 0:14:07 | of our text-based system: |
|---|
| 0:14:09 | we take the text, we segment it with our |
|---|
| 0:14:14 | sentence tagger, |
|---|
| 0:14:16 | and we assign each |
|---|
| 0:14:20 | segment to a role, |
|---|
| 0:14:22 | and the only thing that we need to do in order to evaluate the |
|---|
| 0:14:25 | diarization is to |
|---|
| 0:14:28 | align here |
|---|
| 0:14:31 | the audio and the text; and, |
|---|
| 0:14:34 | as I have already mentioned, |
|---|
| 0:14:36 | since the text comes from an ASR, the alignment information is |
|---|
| 0:14:40 | already available. |
|---|
| 0:14:44 | Here are our results on the psychotherapy data we have tested on, |
|---|
| 0:14:51 | depending on whether we have used the reference transcripts or the ASR transcripts, |
|---|
| 0:14:58 | and on whether we are using our sentence tagger or an oracle text segmentation. |
|---|
| 0:15:02 | Here, our |
|---|
| 0:15:05 | unimodal |
|---|
| 0:15:06 | baselines are the same as the systems that we have just |
|---|
| 0:15:10 | introduced, and by looking at the numbers we can make some |
|---|
| 0:15:15 | interesting observations and |
|---|
| 0:15:18 | draw some interesting conclusions. |
|---|
| 0:15:21 | First of all, |
|---|
| 0:15:22 | if we compare the two baselines we have, |
|---|
| 0:15:27 | we see that the results are |
|---|
| 0:15:31 | better when we use the audio. |
|---|
| 0:15:33 | That suggests |
|---|
| 0:15:34 | that the acoustic stream, as expected, contains more information for the task of |
|---|
| 0:15:40 | speaker diarization, |
|---|
| 0:15:42 | and this is why |
|---|
| 0:15:44 | we propose using the linguistic information only as a supplementary cue. |
|---|
| 0:15:51 | Now, what is interesting to notice is that, |
|---|
| 0:15:56 | if we compare the language-only system with our sentence tagger for the |
|---|
| 0:16:01 | segmentation |
|---|
| 0:16:02 | against the oracle-based one, |
|---|
| 0:16:04 | there is a big |
|---|
| 0:16:05 | performance gap. |
|---|
| 0:16:09 | And the reason for that is that |
|---|
| 0:16:11 | the tagger oversegments and, as I also mentioned, |
|---|
| 0:16:15 | we may have very short segments that |
|---|
| 0:16:18 | do not contain sufficient information for role recognition. |
|---|
| 0:16:23 | However, in our system, we use this linguistic information only |
|---|
| 0:16:27 | where it would be useful: in order to aggregate |
|---|
| 0:16:30 | all the |
|---|
| 0:16:34 | role segments to get the acoustic identity |
|---|
| 0:16:38 | of each role. |
|---|
| 0:16:41 | So |
|---|
| 0:16:42 | such inaccuracies are largely canceled out in our system after this |
|---|
| 0:16:48 | profile estimation step. |
|---|
| 0:16:50 | A similar effect |
|---|
| 0:16:52 | is observed |
|---|
| 0:16:54 | when we compare the |
|---|
| 0:16:57 | results using the reference or the ASR transcripts. |
|---|
| 0:17:01 | Since in the ASR condition we have a pretty high word error rate, |
|---|
| 0:17:06 | we have a severe degradation in performance for the language-only system |
|---|
| 0:17:10 | when using the ASR |
|---|
| 0:17:13 | results. |
|---|
| 0:17:15 | However, |
|---|
| 0:17:16 | when the transcripts are only used for the profile estimation, as we are doing in our |
|---|
| 0:17:21 | proposed system, |
|---|
| 0:17:22 | then the performance degradation |
|---|
| 0:17:24 | is substantially smaller. |
|---|
| 0:17:29 | Finally, |
|---|
| 0:17:31 | what we see here is that, if we estimate the profiles |
|---|
| 0:17:35 | using |
|---|
| 0:17:36 | not all of the |
|---|
| 0:17:39 | role-relevant segments, but only |
|---|
| 0:17:42 | the segments that we are most confident about, then we have a further performance improvement. |
|---|
| 0:17:50 | And instead of the threshold parameter that we introduced |
|---|
| 0:17:53 | earlier, |
|---|
| 0:17:54 | here |
|---|
| 0:17:55 | we are using a certain percentage of the |
|---|
| 0:18:00 | best segments |
|---|
| 0:18:01 | per session (by best segments I mean the segments that we are most confident about), |
|---|
| 0:18:06 | and this is a parameter optimized on the development set. |
|---|
| 0:18:11 | A first observation to be made from this figure, |
|---|
| 0:18:16 | where we have illustrated the |
|---|
| 0:18:19 | diarization error rate as a function |
|---|
| 0:18:23 | of the number of segments per session kept for |
|---|
| 0:18:28 | the profile estimation, is that, |
|---|
| 0:18:30 | unless we use |
|---|
| 0:18:33 | a very small number of segments per session, most of the time |
|---|
| 0:18:38 | the performance is better than |
|---|
| 0:18:41 | the audio-only baseline, which is illustrated by the dashed line we see here. |
|---|
| 0:18:48 | Also, |
|---|
| 0:18:49 | if we compare the |
|---|
| 0:18:52 | blue and red lines, |
|---|
| 0:18:55 | what we see is that, even though |
|---|
| 0:18:57 | when we are using |
|---|
| 0:19:00 | the sentence tagger |
|---|
| 0:19:04 | (which is the red line) |
|---|
| 0:19:09 | we have a slightly worse performance than with an oracle |
|---|
| 0:19:14 | segmentation, we observe that, if we choose |
|---|
| 0:19:18 | suitably the number of segments to use, |
|---|
| 0:19:21 | then the tagger performance approaches the oracle |
|---|
| 0:19:26 | segmentation performance. |
|---|
| 0:19:30 | To sum up my presentation: today we proposed a system for speaker diarization |
|---|
| 0:19:36 | in scenarios where the speakers assume specific roles, |
|---|
| 0:19:40 | and we use the lexical information associated |
|---|
| 0:19:44 | with those roles |
|---|
| 0:19:45 | in order to estimate the acoustic identities of the speakers, |
|---|
| 0:19:49 | which in turn gives us the ability to follow a classification approach |
|---|
| 0:19:54 | instead of the clustering |
|---|
| 0:19:56 | approaches usually employed in diarization. |
|---|
| 0:20:01 | We evaluated our system on dyadic psychotherapy interactions, |
|---|
| 0:20:05 | and we achieved a relative improvement of about |
|---|
| 0:20:09 | thirty percent |
|---|
| 0:20:10 | compared to the audio-only baseline. |
|---|
| 0:20:14 | So, |
|---|
| 0:20:16 | this was my presentation; |
|---|
| 0:20:18 | thank you very much for your attention. |
|---|