| 0:00:15 | Welcome to my paper, "Improving Diarization Robustness Using Diversification, Randomization, |
|---|
| 0:00:22 | and the DOVER Algorithm." |
|---|
| 0:00:25 | Here is a brief overview: we will start with a review of the DOVER algorithm, |
|---|
| 0:00:30 | something that we developed recently to combine the outputs of multiple diarization systems. |
|---|
| 0:00:37 | The natural use of that is for information fusion, |
|---|
| 0:00:40 | but in this paper we're going to focus on another application: using it to achieve more robustness |
|---|
| 0:00:45 | in diarization. |
|---|
| 0:00:48 | We then describe our experiments and results, and conclude with a summary and an outlook. |
|---|
| 0:00:55 | I'm sure everybody is familiar with the speaker diarization task: it answers the question "who |
|---|
| 0:01:00 | spoke when?" |
|---|
| 0:01:02 | Given an audio input, you label it according to speaker identity without having any prior |
|---|
| 0:01:08 | knowledge of the |
|---|
| 0:01:09 | speakers, so the |
|---|
| 0:01:10 | labels are anonymous labels such as Speaker 1, Speaker 2, and so on. Diarization is used |
|---|
| 0:01:16 | in order to track the interaction among multiple speakers in a conversation or meeting. |
|---|
| 0:01:23 | It is also critical for attributing speakers to the output of a speech recognition |
|---|
| 0:01:28 | system, to produce a readable transcript. |
|---|
| 0:01:32 | And you can use it for things like speaker retrieval, where you need to |
|---|
| 0:01:35 | identify all the speech that comes from the same |
|---|
| 0:01:39 | speaker source. |
|---|
| 0:01:42 | The diarization error rate is the metric most of us are familiar with. |
|---|
| 0:01:46 | It is the ratio of the total duration of missed speech, false-alarm speech, and speaker-confusion |
|---|
| 0:01:53 | speech, that is, speech that is mislabeled according to who it was spoken by, |
|---|
| 0:01:57 | normalized by the total duration of the speech. |
|---|
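As a quick aside, the error rate just described can be written out as follows (a standard formulation; each $T$ is a total duration over the recording):

$$
\mathrm{DER} \;=\; \frac{T_{\mathrm{miss}} \,+\, T_{\mathrm{false\ alarm}} \,+\, T_{\mathrm{speaker\ confusion}}}{T_{\mathrm{speech}}}
$$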
| 0:02:01 | The critical |
|---|
| 0:02:04 | thing in the diarization error computation, |
|---|
| 0:02:07 | which will be important later on, is actually the mapping of the speaker |
|---|
| 0:02:14 | labels that occur in the reference versus the hypothesis. |
|---|
| 0:02:19 | The labels in the reference have nothing to do with the labels of the clusters, so |
|---|
| 0:02:23 | we need to construct a mapping |
|---|
| 0:02:26 | that minimizes the error rate. |
|---|
| 0:02:29 | So in this case we map Speaker 1 to Speaker A, |
|---|
| 0:02:33 | and Speaker 2 to Speaker B, |
|---|
| 0:02:35 | and leave Speaker 3 unmapped, because it is in fact an extra speaker relative |
|---|
| 0:02:41 | to the reference. |
|---|
| 0:02:42 | Once we've done the mapping, |
|---|
| 0:02:45 | we can compute missed speech, false-alarm speech, |
|---|
| 0:02:48 | and speaker error. |
|---|
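To make the mapping step concrete, here is a minimal sketch of how such an error-minimizing label mapping could be computed. It assumes SciPy is available; the `overlap` matrix and the function name are illustrative, not the actual scoring tool:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_speaker_mapping(overlap):
    """Find the reference-to-hypothesis speaker mapping that maximizes
    the total correctly attributed speech time (and hence minimizes the
    speaker error).  overlap[r][h] = seconds during which reference
    speaker r and hypothesis speaker h are both active."""
    overlap = np.asarray(overlap, dtype=float)
    ref_idx, hyp_idx = linear_sum_assignment(-overlap)  # maximize total overlap
    # Hypothesis speakers left unassigned (e.g., an extra Speaker 3)
    # simply stay unmapped and count toward the speaker error.
    return {h: r for r, h in zip(ref_idx, hyp_idx) if overlap[r, h] > 0}
```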
| 0:02:52 | Now, system combination, ensemble methods, or voting methods are very popular in machine learning |
|---|
| 0:03:00 | applications, |
|---|
| 0:03:02 | because it is very powerful to combine multiple classifiers |
|---|
| 0:03:07 | to achieve a better result. |
|---|
| 0:03:09 | Voting can be hard voting, |
|---|
| 0:03:10 | which is just letting the majority determine the output, or soft voting, such as combining different scores in |
|---|
| 0:03:17 | some mathematically meaningful way, |
|---|
| 0:03:20 | for example combining posterior outputs by interpolation in order to achieve |
|---|
| 0:03:26 | a more accurate estimate of the |
|---|
| 0:03:28 | posterior probability and therefore of the output |
|---|
| 0:03:31 | labels. |
|---|
| 0:03:32 | Now, this can be done in a weighted or unweighted manner, so if you have |
|---|
| 0:03:37 | a |
|---|
| 0:03:37 | reason to attribute more trust to some of the inputs, |
|---|
| 0:03:42 | you can encode that in the voting algorithm. |
|---|
| 0:03:45 | A popular version of this for speech recognition is the ROVER algorithm, |
|---|
| 0:03:51 | and also confusion network combination. These serve the purpose of aligning the word labels |
|---|
| 0:03:59 | from multiple ASR systems |
|---|
| 0:04:02 | and performing a vote at each of the different positions. |
|---|
| 0:04:06 | Usually this gives you a win when the input systems are about equally good |
|---|
| 0:04:11 | but have different error distributions, |
|---|
| 0:04:14 | that is, independent errors. |
|---|
| 0:04:17 | Now, how can we use this idea for diarization? |
|---|
| 0:04:20 | There is a problem, because the labels coming from different |
|---|
| 0:04:24 | diarization hypotheses |
|---|
| 0:04:26 | are not inherently related. |
|---|
| 0:04:28 | They are anonymous, as we said before, |
|---|
| 0:04:31 | so it is not clear how to vote among them. |
|---|
| 0:04:35 | We can solve this problem by |
|---|
| 0:04:38 | constructing mappings between the different label sets |
|---|
| 0:04:41 | and then performing the vote. By doing so, |
|---|
| 0:04:43 | we map the labels via what is in fact a kind of alignment |
|---|
| 0:04:47 | in label space, a label alignment. |
|---|
| 0:04:50 | We do it incrementally, much as in ROVER: we start |
|---|
| 0:04:55 | with the first hypothesis |
|---|
| 0:04:59 | and use it as our initial alignment, |
|---|
| 0:05:01 | and we iterate over all the remaining outputs, constructing a mapping to the previously |
|---|
| 0:05:07 | processed outputs |
|---|
| 0:05:08 | such that the diarization error between the labels is minimized. |
|---|
| 0:05:14 | Once all outputs are aligned, |
|---|
| 0:05:16 | we can |
|---|
| 0:05:18 | simply perform the voting |
|---|
| 0:05:20 | among the labels for all time instants. |
|---|
| 0:05:26 | This is what was described in our |
|---|
| 0:05:28 | DOVER paper last year at ASRU. |
|---|
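A rough, self-contained sketch of this incremental alignment follows; it uses a greedy overlap-based approximation of the minimum-error mapping and a frame-based representation of the hypotheses, both of which are simplifying assumptions on my part, not the reference DOVER implementation:

```python
from collections import Counter

def dover_align(hypotheses):
    """Map frame-labeled diarization hypotheses into a common label space.
    Each hypothesis is a dict frame_index -> speaker label (absent frames
    are nonspeech); label names are assumed distinct across systems."""
    aligned = [dict(hypotheses[0])]        # first output fixes the label space
    for hyp in hypotheses[1:]:
        overlap = Counter()                # co-occurrence with all prior outputs
        for prev in aligned:
            for t, lab in hyp.items():
                if t in prev:
                    overlap[lab, prev[t]] += 1
        mapping, used = {}, set()
        for (h_lab, a_lab), _ in overlap.most_common():   # best overlap first
            if h_lab not in mapping and a_lab not in used:
                mapping[h_lab] = a_lab
                used.add(a_lab)
        # Unmatched (extra) labels simply keep their own names.
        aligned.append({t: mapping.get(lab, lab) for t, lab in hyp.items()})
    return aligned
```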
| 0:05:32 | Okay, here's an example. |
|---|
| 0:05:33 | We have three systems, A, B, and C; |
|---|
| 0:05:36 | their label sets are disjoint. |
|---|
| 0:05:39 | We |
|---|
| 0:05:42 | first start with system A and then compute the best mapping |
|---|
| 0:05:47 | of the second system's labels to the labels of the first system. |
|---|
| 0:05:51 | So in this case we map |
|---|
| 0:05:54 | B1 to A1 |
|---|
| 0:05:56 | and B2 to A2, while B3 is an extra speaker label, so it remains. |
|---|
| 0:06:02 | We relabel everything, so now we have system A and system B in the |
|---|
| 0:06:07 | same label space. |
|---|
| 0:06:08 | We do the same thing again with system C. |
|---|
| 0:06:11 | We can see here that C1 should |
|---|
| 0:06:13 | map to A1, |
|---|
| 0:06:15 | and C3 should be mapped to |
|---|
| 0:06:18 | A2, |
|---|
| 0:06:20 | while the remaining C label stays unmapped and gets the next free label, |
|---|
| 0:06:23 | since it doesn't have a correspondence. |
|---|
| 0:06:29 | So here we now have all three hypotheses in the same label space, |
|---|
| 0:06:35 | and we can perform the voting |
|---|
| 0:06:38 | for each time instant. The majority winner is A1 up to this point. |
|---|
| 0:06:44 | Then we enter a region where there is actually a tie between A1 |
|---|
| 0:06:49 | and A2. |
|---|
| 0:06:52 | No matter: we can break the tie, for example in favor of |
|---|
| 0:06:58 | the first input, or, if there are weights attached to the inputs, in favor of |
|---|
| 0:07:02 | the one with the highest weight. |
|---|
| 0:07:05 | Then we have A2 as the consensus, and later we return to A1. |
|---|
| 0:07:11 | The extra label never wins, because it is always in the minority. |
|---|
| 0:07:16 | And we can use the same idea to decide on speech versus nonspeech: |
|---|
| 0:07:21 | so |
|---|
| 0:07:23 | we will output speech only in those regions |
|---|
| 0:07:26 | where at least half of the inputs think there is speech. |
|---|
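Putting the two voting rules together, the frame-wise vote might look like this sketch, which reuses the frame-dict representation assumed in `dover_align` above:

```python
def dover_vote(aligned, weights):
    """Weighted frame-wise vote over aligned hypotheses (each a dict
    frame_index -> label).  Returns a dict frame_index -> winning label;
    frames absent from the result are nonspeech."""
    total = sum(weights)
    out = {}
    for t in sorted(set().union(*aligned)):
        tally = {}
        for hyp, w in zip(aligned, weights):
            if t in hyp:                       # this input votes "speech" at t
                tally[hyp[t]] = tally.get(hyp[t], 0.0) + w
        # Speech/nonspeech rule: at least half the total weight must vote speech.
        if sum(tally.values()) < 0.5 * total:
            continue
        # Highest-weight label wins; on exact ties max() keeps the label
        # encountered first, i.e., the one from the earlier-listed input.
        out[t] = max(tally, key=tally.get)
    return out
```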
| 0:07:32 | Now again, the natural |
|---|
| 0:07:34 | use of this is for information fusion: |
|---|
| 0:07:38 | that is, we run diarization on multiple inputs independently. For example, if we |
|---|
| 0:07:43 | have multiple microphones, we can diarize each channel independently |
|---|
| 0:07:46 | and fuse the outputs using DOVER. |
|---|
| 0:07:49 | Or we could have a single input with different feature streams: |
|---|
| 0:07:52 | we can diarize these independently and fuse. |
|---|
| 0:07:56 | We used DOVER for multiple microphones in the original paper. |
|---|
| 0:08:01 | We had meeting recordings with seven microphones, |
|---|
| 0:08:05 | and you can see here that running a clustering-based diarization on the individual channels |
|---|
| 0:08:10 | gives a wide range of results, depending on which channel you choose. |
|---|
| 0:08:16 | DOVER actually |
|---|
| 0:08:18 | gives you a result that is slightly better than the best single channel, |
|---|
| 0:08:23 | so you are freed from having to figure out which is |
|---|
| 0:08:26 | the best channel. |
|---|
| 0:08:29 | If you do the diarization using speaker ID, because your speakers are actually all known to |
|---|
| 0:08:34 | the system, |
|---|
| 0:08:35 | you get the same effect, of course at a much lower diarization error rate overall. |
|---|
| 0:08:39 | Again, |
|---|
| 0:08:42 | you have the average single channel, the best single channel, and the worst single channel, |
|---|
| 0:08:46 | and the DOVER combination of all these outputs gives a result that actually |
|---|
| 0:08:50 | is better than |
|---|
| 0:08:51 | the minimum over |
|---|
| 0:08:53 | all the individual channels. |
|---|
| 0:08:57 | Now, for this paper we are going to look into a different application of DOVER. |
|---|
| 0:09:02 | It starts with the observation that diarization algorithms are often quite sensitive to the choice of |
|---|
| 0:09:07 | hyperparameters. |
|---|
| 0:09:09 | I will give some examples later, but it is basically because, when you do clustering, |
|---|
| 0:09:14 | you make hard decisions based on comparing real values, |
|---|
| 0:09:18 | and small differences in the inputs can actually yield large differences in the output. |
|---|
| 0:09:24 | Also, the clustering is often greedy |
|---|
| 0:09:26 | and iterative, so small |
|---|
| 0:09:29 | perturbations somewhere early on can cause very large differences later on. |
|---|
| 0:09:35 | So |
|---|
| 0:09:36 | this can be remedied essentially by averaging over different runs: |
|---|
| 0:09:42 | you run with different hyperparameters and average the results |
|---|
| 0:09:47 | using DOVER. In other words, you can use DOVER to combine the outputs of multiple |
|---|
| 0:09:51 | different |
|---|
| 0:09:53 | clustering solutions. |
|---|
| 0:09:58 | To experiment with this, we used an older speaker clustering algorithm for diarization developed |
|---|
| 0:10:04 | at ICSI. |
|---|
| 0:10:05 | You start with an equal-length segmentation of the input into |
|---|
| 0:10:10 | segments; |
|---|
| 0:10:11 | then each segment is modeled by a mixture of Gaussians, |
|---|
| 0:10:16 | and the similarity between different segments can be evaluated by asking whether merging two |
|---|
| 0:10:25 | GMMs yields a higher overall likelihood or not. |
|---|
| 0:10:29 | Each |
|---|
| 0:10:31 | iteration proceeds by merging the two best clusters, then resegmenting |
|---|
| 0:10:38 | and re-estimating the GMMs. |
|---|
| 0:10:42 | You keep doing this until the Bayesian Information Criterion tells you to stop clustering. |
|---|
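A simplified, self-contained sketch of such a GMM/BIC merging loop is shown below; it uses scikit-learn's `GaussianMixture`, omits the resegmentation step for brevity, and is my own illustration, not the ICSI implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bic(data, n_components=3):
    """BIC of a diagonal-covariance GMM fit to the data (lower = better)."""
    g = GaussianMixture(n_components, covariance_type="diag",
                        random_state=0).fit(data)
    return g.bic(data)

def merge_gain(x, y):
    """Positive if, by BIC, one pooled GMM models the two clusters'
    frames better than two separate GMMs."""
    return (bic(x) + bic(y)) - bic(np.vstack([x, y]))

def agglomerate(clusters):
    """Greedy best-first merging of frame clusters (each a 2-D array of
    feature vectors) until no merge improves the BIC."""
    while len(clusters) > 1:
        gain, i, j = max((merge_gain(clusters[a], clusters[b]), a, b)
                         for a in range(len(clusters))
                         for b in range(a + 1, len(clusters)))
        if gain <= 0:                  # the information criterion says: stop
            break
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters
```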
| 0:10:52 | We applied this algorithm to a collection of |
|---|
| 0:10:56 | meeting recordings, |
|---|
| 0:10:58 | from which we extracted two feature streams. The first is MFCCs after beamforming: we |
|---|
| 0:11:04 | had multiple |
|---|
| 0:11:06 | microphone channels, but we merged them with beamforming at the signal level |
|---|
| 0:11:09 | and then extracted MFCCs. |
|---|
| 0:11:11 | The beamformer also gives us the time delays of arrival, which are an |
|---|
| 0:11:15 | important feature |
|---|
| 0:11:16 | because they indicate where the speakers are situated. |
|---|
| 0:11:21 | Now, |
|---|
| 0:11:22 | there are two ways to generate more hypotheses from a single input |
|---|
| 0:11:27 | in this case. |
|---|
| 0:11:28 | One is what I call hyperparameter diversification, meaning we vary a hyperparameter |
|---|
| 0:11:35 | over some range |
|---|
| 0:11:36 | around a single value. |
|---|
| 0:11:39 | For example, I can vary the relative weight of the feature streams, |
|---|
| 0:11:44 | or I can vary the initial number |
|---|
| 0:11:46 | of clusters in the clustering algorithm. |
|---|
| 0:11:50 | These are the ones we discuss here; others are given in the paper, skipped in the interest |
|---|
| 0:11:54 | of time. |
|---|
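To illustrate hyperparameter diversification, a sweep followed by DOVER fusion could be scripted roughly as follows, tying together the earlier sketches; `diarize` is a hypothetical wrapper around the clustering system, and the grid values are made up for illustration:

```python
# Run the same diarizer over a grid of feature-stream weights,
# then fuse the resulting hypotheses with equal-weight DOVER.
stream_weights = [0.6, 0.7, 0.8, 0.9, 1.0]        # illustrative grid only
hyps = [diarize(audio, mfcc_weight=w, tdoa_weight=1.0 - w)  # hypothetical wrapper
        for w in stream_weights]
aligned = dover_align(hyps)
consensus = dover_vote(aligned, weights=[1.0] * len(hyps))  # equal weights
```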
| 0:11:55 | The other way is to randomize: I can manipulate the clustering algorithm so that |
|---|
| 0:12:00 | it will not always pick the first-best pair |
|---|
| 0:12:03 | of clusters to merge, but will sometimes take the second-best pair of clusters, |
|---|
| 0:12:09 | deciding by coin flip at each decision point. In this way it can generate multiple |
|---|
| 0:12:13 | clusterings, |
|---|
| 0:12:15 | and of course we then use DOVER to fuse the outputs with equal weights. |
|---|
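A minimal sketch of the randomized merge choice; the default coin probability and all names here are illustrative assumptions:

```python
import random

def pick_merge_pair(ranked_pairs, p_second=0.3, rng=random):
    """Randomized merge choice: with probability p_second take the
    second-best cluster pair instead of the best one.  Rerunning the
    clustering with different random seeds then yields the multiple
    clusterings that DOVER combines."""
    if len(ranked_pairs) > 1 and rng.random() < p_second:
        return ranked_pairs[1]
    return ranked_pairs[0]
```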
| 0:12:21 | All the outputs use the same speech/nonspeech classifier, so they only differ in the |
|---|
| 0:12:26 | speaker labels, not in the speech/nonspeech decisions, |
|---|
| 0:12:30 | and the only difference in the diarization error is in fact in the speaker error rate. |
|---|
| 0:12:38 | The test data was from the NIST meeting Rich Transcription evaluations, from |
|---|
| 0:12:44 | 2007 and 2009, |
|---|
| 0:12:46 | and we used all of the microphone channels, but combined them with beamforming. |
|---|
| 0:12:53 | The variety in this data is actually quite considerable: |
|---|
| 0:12:58 | there are different recording sites, |
|---|
| 0:12:59 | and different numbers of speakers, from as few as three or four, |
|---|
| 0:13:04 | with 16 and 21 in total for the two sets respectively. So the data is quite heterogeneous, and that's why it's a challenge |
|---|
| 0:13:11 | to actually |
|---|
| 0:13:13 | optimize the hyperparameters on one of the |
|---|
| 0:13:17 | sets and carry them over |
|---|
| 0:13:19 | to the other set used for testing. |
|---|
| 0:13:23 | Here is what happens when you vary the TDOA stream weight, one of the hyperparameters. |
|---|
| 0:13:28 | You can see that |
|---|
| 0:13:31 | varying it along a grid produces a not-so-small variation in the output; what is shown here is |
|---|
| 0:13:38 | the speaker error rate. |
|---|
| 0:13:40 | More importantly, |
|---|
| 0:13:42 | the best value on one set is not the best value on the eval set; |
|---|
| 0:13:48 | conversely, the best value on the eval set is a worse choice for the test set. |
|---|
| 0:13:54 | So this is what I mean by a robustness problem. |
|---|
| 0:13:59 | However, when we do a DOVER combination over all the different |
|---|
| 0:14:03 | results, |
|---|
| 0:14:04 | we actually get a good result: |
|---|
| 0:14:08 | it is either better than the single best result for the test set, or very close |
|---|
| 0:14:12 | to the single best result on the other set. |
|---|
| 0:14:18 | Similarly, when we vary the initial number of clusters in the algorithm, |
|---|
| 0:14:23 | we also see |
|---|
| 0:14:27 | the speaker error rate |
|---|
| 0:14:29 | vary according to the variation of the cluster number, |
|---|
| 0:14:34 | and |
|---|
| 0:14:36 | the best choice for the test set is not the best choice for the eval |
|---|
| 0:14:41 | set. |
|---|
| 0:14:42 | Again, when you do the DOVER combination you get a good result; in fact it is |
|---|
| 0:14:47 | always better than the second-best choice |
|---|
| 0:14:49 | on the data, for either set. |
|---|
| 0:14:53 | Finally, we do the randomization of the clustering: specifically, we flip a coin, and with |
|---|
| 0:14:59 | probability 0.3 we use the second-best cluster pair at each merging step. |
|---|
| 0:15:06 | Surprisingly, the result can sometimes be better than with the best-first clustering. |
|---|
| 0:15:14 | You see here that with different random seeds we get a range of |
|---|
| 0:15:18 | results, |
|---|
| 0:15:19 | sometimes worse, but often better than with the best-first clustering. |
|---|
| 0:15:25 | The same is true for the other set. |
|---|
| 0:15:27 | Of course, we cannot |
|---|
| 0:15:29 | expect the best seed on one dataset to also be the best on the eval set; instead we need |
|---|
| 0:15:34 | to do the combination in order to get a good result. |
|---|
| 0:15:37 | So we actually improve on the best-first clustering consistently by doing DOVER combination over |
|---|
| 0:15:43 | the different |
|---|
| 0:15:44 | randomized results. |
|---|
| 0:15:48 | In summary: |
|---|
| 0:15:49 | the DOVER algorithm allows us to vote among multiple diarization hypotheses. |
|---|
| 0:15:56 | We can use this to achieve |
|---|
| 0:15:59 | robustness in diarization |
|---|
| 0:16:01 | by |
|---|
| 0:16:04 | combining multiple hypotheses obtained from a single input. |
|---|
| 0:16:08 | The two ways that we do this are by varying hyperparameters, |
|---|
| 0:16:12 | or otherwise introducing diversity, if you will, into the results. |
|---|
| 0:16:17 | We find that hyperparameter diversification plus DOVER essentially frees us from the need |
|---|
| 0:16:24 | to do that optimization, |
|---|
| 0:16:26 | and adds robustness that way. |
|---|
| 0:16:28 | The clustering can also be randomized to overcome the limitations of best-first |
|---|
| 0:16:35 | search in clustering, |
|---|
| 0:16:37 | and the combination of the randomized results actually achieves |
|---|
| 0:16:41 | higher accuracy than the single |
|---|
| 0:16:45 | best-first result. |
|---|
| 0:16:50 | Finally, there are many more things we can do here. We can try to |
|---|
| 0:16:55 | combine |
|---|
| 0:16:56 | the different techniques: for example, hyperparameter variation |
|---|
| 0:17:01 | along multiple dimensions, |
|---|
| 0:17:02 | or combining that with randomization, all in one big combination. |
|---|
| 0:17:09 | We can also try this with different |
|---|
| 0:17:12 | diarization algorithms, since DOVER is agnostic to the |
|---|
| 0:17:16 | actual |
|---|
| 0:17:17 | form of the diarization algorithm. |
|---|
| 0:17:19 | So we can try it with x-vector-based spectral clustering or neural network systems, |
|---|
| 0:17:26 | provided, of course, that we |
|---|
| 0:17:28 | can produce multiple outputs to feed |
|---|
| 0:17:33 | into the algorithm. |
|---|
| 0:17:35 | Two other things we are currently working on are combining different diarization algorithms, |
|---|
| 0:17:41 | as well as generalizing DOVER to handle overlapping speech. |
|---|
| 0:17:47 | Thank you very much for your time. |
|---|
| 0:17:50 | Please post your questions on the conference website, |
|---|
| 0:17:54 | and I look forward to the discussion. |
|---|