0:00:15 Welcome to my paper, "Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm".
0:00:25 As a brief overview, we will start with a review of the DOVER algorithm, something that we introduced recently to combine the outputs of multiple diarization systems. The natural use of that is for information fusion, but in this paper we are going to focus on another application: using it to achieve more robustness in diarization. We then describe our experiments and results, and conclude with a summary and outlook.
0:00:55 I'm sure everybody is familiar with the speaker diarization task: it answers the question "who spoke when". Given an input, you label it according to speaker identity, without having any prior knowledge of the speakers. So the labels are anonymous labels, such as "speaker 1", "speaker 2", and so on. Diarization is used in order to track the interaction among multiple speakers in a conversation or meeting. It is also critical for attributing speakers to the output of a speech recognition system, to produce a readable transcript. And you can use it for things like speaker retrieval, where you need to identify all the speech coming from the same speaker source.
0:01:42 The diarization error rate is the measure most commonly used. It is the ratio of the total duration of missed speech, false alarm speech, and speaker-confusion speech (speech that is mislabeled according to who it was spoken by), normalized by the total duration of speech; in other words, DER = (missed speech + false alarm speech + speaker-confusion time) / total reference speech duration. The critical thing in the diarization error computation, which will be important later on, is the mapping between the speaker labels that occur in the reference versus the hypothesis. The labels in the reference have nothing to do with the labels of the clusters, so we need to construct a mapping that minimizes the error rate. In this case we would map speaker 1 to speaker A and speaker 2 to speaker B, and leave speaker 3 unmapped, because it is in fact an extra speaker relative to the reference. Once we have done the mapping, we can compute the missed speech, false alarm speech, and speaker error.
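To make that mapping step concrete, here is a minimal sketch in Python, assuming frame-level label sequences and using SciPy's Hungarian-algorithm solver; the function and variable names are illustrative, not taken from the paper.

```python
# Sketch of the optimal speaker-label mapping used in diarization scoring.
# Frame-level labels are assumed; all names here are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_speaker_mapping(ref, hyp):
    """Map hyp labels onto ref labels so that their total overlap is maximal.

    ref, hyp: equal-length label sequences (None marks non-speech).
    Returns {hyp_label: ref_label}; hypothesis labels left out of the dict
    are "extra" speakers, like speaker 3 in the example above.
    """
    ref_ids = sorted({r for r in ref if r is not None})
    hyp_ids = sorted({h for h in hyp if h is not None})
    overlap = np.zeros((len(ref_ids), len(hyp_ids)))
    for r, h in zip(ref, hyp):
        if r is not None and h is not None:
            overlap[ref_ids.index(r), hyp_ids.index(h)] += 1
    # The Hungarian algorithm finds the assignment with maximum total overlap,
    # which is the mapping that minimizes the resulting speaker error.
    rows, cols = linear_sum_assignment(-overlap)
    return {hyp_ids[c]: ref_ids[r]
            for r, c in zip(rows, cols) if overlap[r, c] > 0}
```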
0:02:52 Now, system combination, or ensemble methods, or voting methods, are very popular in machine learning applications, because it is a very powerful technique to combine multiple classifiers to achieve a better result. Voting can be hard voting, which is just letting the majority determine the output, or soft voting, such as combining different scores in some normalized, weighted manner, or combining posterior outputs by interpolation, for example, in order to achieve a more accurate estimate of the posterior probability and therefore of the output labels. This can be done in an unweighted or a weighted manner: if you have a reason to attribute more trust to some of the systems, you can give them a higher weight in the voting algorithm.
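As a toy illustration of the weighted hard vote just described (all labels and weights invented for this sketch):

```python
# Toy weighted hard vote: each system casts its label with its trust weight.
from collections import defaultdict

def weighted_vote(labels, weights):
    score = defaultdict(float)
    for label, weight in zip(labels, weights):
        score[label] += weight
    return max(score, key=score.get)

# Two equally trusted systems outvote one slightly more trusted system:
assert weighted_vote(["spk1", "spk2", "spk1"], [1.0, 1.5, 1.0]) == "spk1"
```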
0:03:45 A popular version of this for speech recognition is the ROVER algorithm, and also confusion network combination. These align the word labels from multiple ASR systems and then perform voting among the different hypotheses. Usually this gives an improvement when the input systems are about equally good, but have different error distributions, that is, independent errors.
0:04:17 Now, how can we use this idea for diarization? There is a problem, because the labels coming from different diarization hypotheses are not inherently related; they are anonymous, as we said. So it is not clear how to vote among them. We can solve this problem by extracting a mapping between the different labels and then performing the voting. By doing so, we map the labels; this is in fact a kind of alignment in label space, a label alignment.
0:04:50 We do it incrementally, just like ROVER does, for example. We start with the first hypothesis, the first system output, and use it as our initial alignment. Then we iterate over all the remaining outputs: we construct a mapping to the already-processed outputs so that the diarization error between the labels is minimized. Once we have mapped everything into a common label space, we can simply perform the voting, picking the majority label for all time instants. This is what was described in our ASRU paper last year.
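Here is a compact sketch of that procedure, reusing `best_speaker_mapping()` from the earlier snippet. For brevity it maps each new output only against the first hypothesis, whereas the talk describes mapping against all previously processed outputs; it also includes the speech/non-speech vote described in a moment. The names are illustrative, and this is a sketch of the idea, not the published implementation.

```python
# DOVER-style combination: map every output into a common label space,
# then vote per time instant.
from collections import Counter
from itertools import count

def dover_combine(hypotheses):
    """hypotheses: equal-length frame-label sequences; None marks non-speech."""
    fresh = count()                      # fresh ids for unmapped extra speakers
    anchor = list(hypotheses[0])         # first output defines the label space
    aligned = [anchor]
    for hyp in hypotheses[1:]:
        mapping = best_speaker_mapping(anchor, hyp)
        extra = {}
        aligned.append([
            None if h is None
            else mapping[h] if h in mapping
            else extra.setdefault(h, f"extra{next(fresh)}")
            for h in hyp
        ])
    combined = []
    for frames in zip(*aligned):
        votes = Counter(f for f in frames if f is not None)
        # Output speech only where at least half of the inputs say speech;
        # label ties are broken by first occurrence here (or by weight).
        if 2 * sum(votes.values()) >= len(aligned):
            combined.append(votes.most_common(1)[0][0])
        else:
            combined.append(None)
    return combined
```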
0:05:32 OK, here's an example. We have three systems, A, B, and C, and their label sets are disjoint. We first start with system A, and then compute the best mapping of the second system's labels to the labels of the first system. So in this case we map b1 to a1 and b2 to a2, while b3 would be an extra speaker label, so it remains as it is. We relabel everything, so now we have system A and system B in the same label space. We do the same thing again with system C: we can see here that c1 should be mapped to a1 and c3 should be mapped to a2, while c2 remains unmapped and gets the next available label, because it doesn't have a correspondence.
0:06:29 So here we now have all three outputs in the same label space, and we can perform the voting for each time instant. At first the only winner is a1, up to this point. Then we enter a region where there is actually a tie between a1 and a2. So no matter what, we can break the tie in some way, for example taking the first one, or, if there are weights attached to the inputs, taking the one with the highest weight. Then we have a2 again as the consensus, and then we return to a1. The extra label never wins, because it is always in the minority. And we can use the same idea to decide on speech versus non-speech: we will output speech only in those regions where at least half of the inputs say there is speech.
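Running the earlier sketch on a tiny invented example illustrates both behaviors, the majority vote and the speech/non-speech rule:

```python
# Tiny invented example in the spirit of the A/B/C walkthrough above:
A = ["a1", "a1", "a2", "a2", None]
B = ["b1", "b1", "b2", None, None]
C = ["c1", "c2", "c2", "c2", "c2"]
print(dover_combine([A, B, C]))
# -> ['a1', 'a1', 'a2', 'a2', None]
# b1/c1 map to a1 and b2/c2 map to a2; the final frame becomes non-speech
# because only one of the three inputs marks it as speech.
```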
0:07:32 Now again, the natural use of this is for information fusion. That is, we run diarization on multiple inputs: for example, if we have multiple microphones, we can diarize them individually and fuse the outputs using DOVER. Or we could have a single input but different feature streams, and diarize them individually and combine. We used this for multiple microphones in the ASRU paper. We had meeting recordings with seven microphones, and you can see that if you do a clustering-based diarization, there is a wide range of results depending on which channel you choose. DOVER actually gives you a result that is slightly better than the best single channel, so you are freed from having to figure out which is the best channel. If you do the diarization using speaker ID, because your speakers are actually all known to the system, you get the same effect, of course at a much lower diarization error rate overall. Again, you have the average single channel and you have the worst single channel, and the DOVER combination of all these outputs gives you a result that is actually better than the minimum over all the individual channels.
0:08:57 Now, for this paper we are going to look into a different application of DOVER. It starts with the observation that diarization algorithms are often quite sensitive to the choice of hyperparameters. I'll give some examples later, but it is basically because, when you do clustering, you make hard decisions based on comparing real values, and small differences in the inputs can actually yield large differences in the output. Also, the clustering is often greedy and iterative, so small variations somewhere early on can yield very large differences later on. This can be remedied by averaging over different runs, essentially: you do multiple runs with different hyperparameters and average the results using DOVER. So you can use DOVER to combine the outputs of multiple different clustering solutions.
0:09:58 To experiment with this, we used an old speaker clustering algorithm for diarization, developed at ICSI. You start with an equal-length segmentation of the input into segments. Then each segment is modeled by a mixture of Gaussians, and the similarity between different segments can be evaluated by asking whether merging two GMMs yields a higher overall likelihood or not. The iteration happens by merging the two best clusters, then resegmenting and re-estimating the GMMs. You do this until a BIC-like information criterion tells you to stop the clustering.
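The loop can be sketched as follows, using scikit-learn GMMs as a stand-in for the original ICSI models. The component counts and the simplified merge score are assumptions for illustration, and the resegmentation step is omitted.

```python
# Sketch of the agglomerative clustering loop just described.
import numpy as np
from sklearn.mixture import GaussianMixture

def total_loglik(X, n_components):
    gmm = GaussianMixture(n_components, covariance_type="diag").fit(X)
    return gmm.score(X) * len(X)   # score() is per-frame average log-likelihood

def merge_score(X_i, X_j, n_comp=5):
    """Positive if one GMM over the union beats two separate GMMs. Giving the
    merged model the summed component count keeps the parameter counts equal,
    which stands in for the explicit BIC penalty."""
    merged = np.vstack([X_i, X_j])
    return total_loglik(merged, 2 * n_comp) - (
        total_loglik(X_i, n_comp) + total_loglik(X_j, n_comp))

def agglomerative_diarization(segments):
    """segments: per-segment feature matrices from an initial uniform split."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > 1:
        scores = [(merge_score(clusters[i], clusters[j]), i, j)
                  for i in range(len(clusters))
                  for j in range(i + 1, len(clusters))]
        best, i, j = max(scores)
        if best <= 0:                  # the information criterion says: stop
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]                # greedy best-first merge
        # (A full system would also resegment and re-estimate here.)
    return clusters
```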
0:10:52 We applied this algorithm to a collection of recordings of meetings, from which we extracted two feature streams. One is MFCCs, extracted after beamforming: we had multiple microphone channels, but we merged them by beamforming at the signal level and then extracted MFCCs. And the beamformer also gives us the time delays of arrival (TDOA), which are an important feature because they indicate where the speakers are situated.
0:11:21 Now, there are two ways to generate more hypotheses from a single input in this case. One is what I call diversification, meaning we vary a hyperparameter over some range around a single value. For example, I can vary the relative weight of the feature streams, or I can vary the initial number of clusters in the clustering algorithm. These are the variations we discuss here; whatever else is given in the paper is left out in the interest of time.
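In code, diversification is just a loop over settings feeding DOVER; `diarize()` and its `tdoa_weight` argument are placeholders for whichever system and hyperparameter are being varied, and the grid values are invented:

```python
# Hyperparameter diversification: rerun one diarizer over a small grid of
# settings, then let DOVER vote over the outputs.
def diversify_and_combine(diarize, audio, weights=(0.05, 0.1, 0.2, 0.3, 0.5)):
    hypotheses = [diarize(audio, tdoa_weight=w) for w in weights]
    return dover_combine(hypotheses)   # equal-weight vote over all runs
```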
0:11:55 The other way is to randomize. I can manipulate the clustering algorithm so that it will not always pick the first-best pair of clusters to merge, but will sometimes pick the second-best pair of clusters. By flipping a coin at each point where it makes these decisions, it can generate multiple clusterings, and of course we use DOVER to combine them in the end, with equal weights. Note that all of the outputs use the same speech/non-speech classifier, so they only differ in the speaker labels, not in the speech/non-speech decisions, and the only difference in the diarization error is in fact in the speaker error rate.
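A minimal sketch of that randomized merge choice might look like this; the coin-flip probability and the names are assumptions for illustration, not taken from the paper:

```python
# Randomized variant: occasionally merge the second-best cluster pair.
import random

def pick_merge_pair(scored_pairs, rng, p_second=0.3):
    """scored_pairs: [(score, i, j), ...] as in the clustering sketch above."""
    ranked = sorted(scored_pairs, reverse=True)
    if len(ranked) > 1 and rng.random() < p_second:
        return ranked[1]   # coin flip says: take the runner-up merge
    return ranked[0]       # otherwise the usual best-first choice

# Each seed induces a different clustering; DOVER then votes over the runs,
# e.g.: dover_combine([run_clustering(rng=random.Random(s)) for s in range(10)])
```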
0:12:38 The test set was from the NIST meeting Rich Transcription evaluations, the NIST 2007 and 2009 evaluations. We used all of the microphone channels, but combined them with beamforming. The variety is actually quite considerable in this data: there are different recording sites, and different numbers of speakers, from as few as three or four up to sixteen and twenty-one in the two sets, respectively. So it was quite heterogeneous, and that's why it is a challenge to actually optimize the hyperparameters for this data, and to port them from one of the sets to the other.
0:13:23 Here's what happens when you vary the stream weight, one of the hyperparameters. You can see that varying it along the given range causes quite some variation in the output; what's plotted here is the speaker error rate. More importantly, the best value on the test set is not the best value on the eval set; conversely, the best value on the eval set is a poor choice for the test set. So this is what I mean by the robustness problem. However, when we do a DOVER combination over all the different results, we actually get a nicely good result: it is either better than the single best result, for the test set, or very close to the single best result, on the eval set.
0:14:18 Similarly, when we vary the initial number of clusters of the algorithm, we also see a range of speaker error rates according to the variation of the cluster number. And again, the best choice for the test set is not the best choice for the eval set. When you do the DOVER combination, you get a good result; in fact, it is always better than the second-best choice on the data for either set.
0:14:53 Finally, we did the randomization of the clustering: specifically, we flip a coin, and with a certain probability we use the second-best cluster pair in each iteration of merging. And the result, surprisingly, is sometimes better than with the best-first clustering. You see here that with different random seeds we obtain a range of results, sometimes worse, but often better than with the best-first clustering, and the same is true for the eval set. But of course we cannot expect the best seed on the test data to also be the best on the eval set; instead, we need to do the combination in order to get a robust result. And we actually improve on the best-first clustering consistently by doing DOVER combination over the different randomized results.
0:15:48 In summary: we have shown that the DOVER algorithm allows us to do voting among multiple diarization hypotheses. We can use this to achieve robustness in diarization by combining multiple hypotheses obtained from a single input. The two ways that we do this are by varying hyperparameters, introducing diversity, if you will, into the results; we find that this hyperparameter diversification, paired with DOVER, essentially frees us from the need to do hyperparameter optimization, and adds robustness that way. And the clustering can also be randomized, to overcome the limitations of best-first search in clustering; the combination of the randomized results actually achieves higher accuracy than the single best-first clustering output.
0:16:50 Finally, there are many more things we can do with this. We can try to combine the different techniques, for example hyperparameter variation along multiple dimensions, or combining that with randomization, all in one big combination. We can also try this with different kinds of diarization, since the algorithm is agnostic to the actual form of the diarization algorithm; so we could try it with x-vector or spectral clustering, or neural end-to-end systems. Of course, whichever system we use, we need multiple hypotheses to incorporate in order for the combination algorithm to work. Two other things we are currently working on are combining different diarization algorithms, as well as generalizing DOVER to handle overlapping speech.
0:17:47 Thank you very much for your time. If you are interested in asking questions, please refer to the conference website, and I hope you enjoy the rest of the conference.