Speech Transcript - On Early-stop Clustering for Speaker Diarization

0:00:14	hi everyone this is needed region problematical stopped today i'm going to present early-stop
0:00:20	clustering for speaker diarization the course of this outcome i like a ramp
0:00:27	in the beginning i am going to give a brief introduction to the task of
0:00:30	speaker diarization use the no i think or from results
0:00:35	as we all know diarization is one of the tasks in speaker recognition
0:00:40	together with identification and verification
0:00:43	at the bottom of this figure it shows the scenario of speaker diarization two speakers
0:00:49	are talking with each other based on the recording the task of speaker diarization is
0:00:55	to
0:00:56	identify when each speaker is speaking
0:00:59	technically diarization can be decomposed into two steps segmentation and clustering
0:01:07	in this slide i will go through the most commonly used framework of speaker diarization
0:01:12	that is the agglomerative hierarchical clustering ahc as the figure shows
0:01:18	in the nineteen one which composition we bust two cameras that imitation only
0:01:24	there are always two methods of segmentation the uniform segmentation and the segmentation
0:01:30	based on speaker change point detection
0:01:34	after obtaining the speech segments
0:01:38	ahc will group the speech segments from the same speaker to the same cluster
0:01:43	in ahc with respect to whether the number of clusters is given or not
0:01:50	we have different operations
0:01:52	when the number of speakers is given to be n
0:01:56	the clustering always
0:01:58	stops when the number of clusters reaches n
0:02:03	then each of the n clusters will be used as a representation of a speaker in
0:02:07	the conversation
0:02:10	when the number of speakers is not given we will apply a threshold to the
0:02:15	speaker similarity of the merging clusters you know we you
0:02:20	know here that when you know t then i feel and stick to
0:02:24	when the
0:02:26	speaker similarity of the merging clusters
0:02:30	reaches the threshold
0:02:32	the clustering will stop
0:02:35	yes
0:02:36	this results in the number of clusters which is the estimated number of speakers n
0:02:41	hat
0:02:42	and each of the clusters will be used to represent a specific speaker in the
0:02:47	conversation
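
The two stopping rules described above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes each segment is represented by a fixed-length embedding, that a cluster is scored by the cosine similarity of its mean embedding, and the names `ahc`, `cosine`, `num_speakers` and `threshold` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ahc(embeddings, num_speakers=None, threshold=None):
    """Greedy bottom-up clustering of segment embeddings.

    Stops at `num_speakers` clusters when that is given; otherwise stops
    once the best available merge scores below `threshold`.
    Returns a list of clusters, each a list of segment indices.
    """
    if num_speakers is None and threshold is None:
        raise ValueError("give either num_speakers or threshold")
    X = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(X))]
    target = 1 if num_speakers is None else num_speakers
    while len(clusters) > target:
        # Find the most similar pair of clusters (assumed scoring: cosine of means).
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(X[clusters[i]].mean(axis=0), X[clusters[j]].mean(axis=0))
                if s > best:
                    best, pair = s, (i, j)
        # Threshold rule: only applies when the speaker count is unknown.
        if num_speakers is None and best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With a known speaker count one would call `ahc(embeddings, num_speakers=2)`; with an unknown count, `ahc(embeddings, threshold=0.5)`, where the number of returned clusters serves as the estimated number of speakers. A stricter (higher) threshold stops merging sooner and leaves more, smaller clusters.
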
0:02:49	after ahc
0:02:51	viterbi re-segmentation is always used
0:02:54	in viterbi re-segmentation we first represent each speaker with a gmm
0:03:00	after that we will build an hmm based on the gmms by adding
0:03:05	transition probabilities
0:03:07	finally we will align speech frames to the speaker gmms by viterbi decoding
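
The decoding step of the re-segmentation can be illustrated with a small sketch. It assumes the per-frame log-likelihoods under each speaker model have already been computed (GMM training is left out), and it uses a single self-transition probability shared by all speakers; the function name `viterbi_resegment` and the parameter `stay_prob` are hypothetical and not taken from the talk.

```python
import numpy as np

def viterbi_resegment(loglik, stay_prob=0.9):
    """Assign each frame to a speaker by Viterbi decoding.

    loglik: array of shape (T, N), log-likelihood of frame t under speaker n.
    stay_prob: assumed probability of staying with the same speaker.
    Returns an integer array of length T with the decoded speaker per frame.
    """
    loglik = np.asarray(loglik, dtype=float)
    T, N = loglik.shape
    if N == 1:
        return np.zeros(T, dtype=int)
    # Log transition matrix: high self-loop, the rest shared among other speakers.
    log_trans = np.full((N, N), np.log((1.0 - stay_prob) / (N - 1)))
    np.fill_diagonal(log_trans, np.log(stay_prob))
    delta = loglik[0] - np.log(N)          # uniform initial distribution
    back = np.zeros((T, N), dtype=int)     # best predecessor for each state
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: move from i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

Compared with picking the best speaker independently per frame, the transition probabilities discourage rapid speaker switches, so a single frame that weakly favours another speaker is absorbed into the surrounding segment.
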
0:03:18	although ahc has been widely used
0:03:21	and the performance of it has been acknowledged
0:03:24	it still has some shortcomings in our work we
0:03:28	cope with the well as in
0:03:30	now he's the clusters and probably the orange speakers they can watch
0:03:35	in this slide
0:03:36	we take the diarization of two speakers as an example
0:03:40	speaker a in blue and speaker b in red
0:03:43	during clustering we will have a cluster of speaker a and a segment consisting of speech from
0:03:49	both speakers a and b
0:03:52	but unknown speaker not only and is because similarity of the on and the statement
0:03:56	of mixed each
0:03:58	they didn't manage to a custom speaker i
0:04:02	another scenario
0:04:03	the segments from speaker b may also be merged into the cluster of speaker a as shown in the second picture
0:04:09	in both cases the cluster of speaker a will be biased to speaker b
0:04:15	with a clustering going on the speech
0:04:18	the speech of speakers
0:04:20	a and b may now present already
0:04:22	that means those
0:04:24	i mean
0:04:25	no doubt addition they lost
0:04:28	future studies of the original those that can be into the statistically
0:04:35	in the in the that is composed of sailors from a only
0:04:41	with the battery go the system is composed of these statements from
0:04:47	see for me getting worse it to a
0:04:51	the clusters if a is composed of speech signals from both the a and b
0:04:57	our strategy to this problem is to stop the clustering early
0:05:01	go to be able to determine rose because they get in most states in this
0:05:06	way we have to us to clean the you really use the way
0:05:13	okay the clusters the issues that is like to be known as what is that
0:05:18	it should be large enough to provide us it organisations people a i that it
0:05:25	should be clean i allowing for
0:05:29	involved in this one c d can be as we have
0:05:32	so the action a
0:05:36	will be a tradeoff between the two factors
0:05:38	we propose early-stop clustering by setting a strict threshold with ahc ideally
0:05:44	we will get more clusters than n where n is the
0:05:49	number of speakers
0:05:51	in the early-stop clustering the clustering stops at a
0:05:57	strict threshold resulting in k clusters where k is larger than the anticipated number of
0:06:03	speakers n
0:06:05	with respect to whether n is given or not we have different implementations
0:06:11	when the number of speakers is not given we will first estimate it to be
0:06:17	n hat
0:06:19	then based on the given or estimated number of speakers n or n hat we will
0:06:25	do the cluster selection to select n or n hat clusters from the k clusters
0:06:30	each of the selected clusters will represent a specific speaker in the speech conversation
0:06:36	in the last step we will apply viterbi re-segmentation to align the frames of the
0:06:42	whole conversation to the selected clusters
0:06:47	in this slide and the following
0:06:49	we will describe how the estimation of the number of speakers works
0:06:54	first we will compute the speaker similarity score matrix s
0:06:58	each element of s is the speaker similarity score between two clusters
0:07:04	for example s j k is the speaker similarity score between the j-th and k-th
0:07:10	clusters
0:07:11	finally s will be a similarity matrix of size k by k
0:07:18	on the score matrix s we will do an eigenvalue decomposition and sort
0:07:24	the eigenvalues in order from the first to the k-th
0:07:29	after that we will compute the ratio between the adjacent eigenvalues
0:07:33	after that k
0:07:35	finally the number of speakers n hat will be estimated at the point with the
0:07:40	maximum eigenvalue ratio
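
The estimation procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the speaker's exact recipe: the eigenvalues are assumed to be sorted in descending order, the ratio is taken as each eigenvalue divided by the next one, and a small floor guards against division by zero. The function name `estimate_num_speakers` and the parameter `floor` are hypothetical.

```python
import numpy as np

def estimate_num_speakers(S, floor=1e-10):
    """Estimate the speaker count from a K x K cluster similarity matrix.

    Returns the index (1-based) of the largest ratio between adjacent
    eigenvalues, with eigenvalues sorted in descending order.
    """
    S = np.asarray(S, dtype=float)
    S = (S + S.T) / 2.0                       # enforce symmetry before eigvalsh
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    K = len(eigvals)
    if K < 2:
        return K
    lam = np.maximum(eigvals, floor)          # assumed floor, avoids division by zero
    ratios = lam[:-1] / lam[1:]               # ratios[i] = lam[i] / lam[i + 1]
    return int(np.argmax(ratios)) + 1
```

For a matrix with two tight groups of clusters, the first two eigenvalues are large and the rest are small, so the largest ratio sits between the second and third eigenvalues and the estimate is two. Very small trailing eigenvalues can produce spuriously large ratios, so a practical system would probably bound the search range.
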
0:07:45	with the given or the estimated number of speakers in this slide and the following
0:07:50	we will show how the cluster selection works in but we with the
0:07:55	latter selecting is this and after of probability clusters of i wonder what i
0:08:00	no we were achieved this to find out all of the company combination and after
0:08:05	in these is to be the index set i one to
0:08:10	after that we will compute the sub score matrix for each combination by extracting
0:08:15	the corresponding rows and columns from s
0:08:18	the sub score matrix will be of the dimension
0:08:22	n hat by n hat
0:08:27	on the sub score matrices we will then do the eigenvalue decomposition on each
0:08:32	of them and find the eigenvalues
0:08:37	finally the combination with the maximum eigenvalue summation will be
0:08:42	used as the
0:08:43	selected clusters
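
A literal reading of the selection procedure above is an exhaustive search over all combinations of clusters, sketched below. The function name `select_clusters` is hypothetical, and the sketch takes the score matrix as given. Note that for a symmetric matrix the sum of the eigenvalues equals the trace, so under this literal reading the ranking depends on the diagonal (self-similarity) scores; how the score matrix in the talk is computed or normalised cannot be recovered from the transcript.

```python
from itertools import combinations

import numpy as np

def select_clusters(S, n):
    """Pick n of the K clusters by exhaustive search.

    For every combination of n cluster indices, extract the matching rows
    and columns of S, compute the eigenvalues of that sub-matrix, and keep
    the combination whose eigenvalue sum is largest.
    """
    S = np.asarray(S, dtype=float)
    S = (S + S.T) / 2.0                       # enforce symmetry before eigvalsh
    best_score, best_combo = -np.inf, None
    for combo in combinations(range(S.shape[0]), n):
        idx = list(combo)
        sub = S[np.ix_(idx, idx)]             # n x n sub score matrix
        score = float(np.linalg.eigvalsh(sub).sum())
        if score > best_score:
            best_score, best_combo = score, combo
    return best_combo
```

The number of combinations grows quickly with K, so this brute-force search is only practical when the early-stop clustering leaves a modest number of clusters.
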
0:08:47	so that was the description of the algorithm next we will move to the experiments
0:08:53	all experiments was having a i had use the money is being the data set
0:08:59	consisting of two sets the development set and the
0:09:04	evaluation set
0:09:05	the duration of conversation varies from three hundred two hundred seconds
0:09:11	the number of speakers per conversation varies from one to nine
0:09:15	in our evaluation we used the overall diarization error rate der as
0:09:20	the metric
0:09:22	what use the pen the ground truth segmentation
0:09:25	as a temporal segmentation
0:09:28	it has to be noted that if in the reference euclidean speaker b
0:09:34	hyper
0:09:35	overlaps
0:09:36	the overlapped segments will be used as individual segments
0:09:43	in our experiments we have to model as opposed by being a bottleneck feature extractor
0:09:49	with a given model no is an expensive extractor with the rest of the model
0:09:54	for most of the models
0:09:56	we used at an additional advantage as input feature and of course of the and
0:10:02	one change how about static y and into
0:10:05	in the model the acoustic input layer of the year is the carriage real compatibility
0:10:12	with ease contextual between you really both that and the right size
0:10:17	you has i hate enables the was well hidden layers well that one thousand and
0:10:24	it will give for the dimension of the not hidden layer wise lda and he's
0:10:28	being a output was used
0:10:31	it can only be sure
0:10:33	in our is known model there were nine convolutional class
0:10:38	only that we had no we'll collection they are five thousand and to you or
0:10:42	than the ones we may go after that to well collection labels were used up
0:10:48	to this green a the are one of the may five one hundred and twenty
0:10:53	eight
0:10:53	well use i x
0:10:55	in both models a five of the classification a while the number of training speakers
0:11:01	at least eleven thousand three hundred and if we
0:11:08	we used the conventional ahc as the baseline in both the conventional
0:11:14	clustering and the early-stop clustering we used the egg expect when combined with
0:11:20	cosine distance as the speaker modeling and speaker similarity on a on then
0:11:26	in the number of speakers estimation and cluster selection in our early-stop clustering framework we used
0:11:34	the bic score unspeakable individual
0:11:39	in the re-segmentation phase
0:11:41	way used a speaker pair of each point we duration
0:11:49	well the name
0:11:49	we first did the experiments in the scenario where the number of speakers was given
0:11:55	this table shows the performance comparison between the conventional ahc and the proposed
0:12:00	early-stop clustering on development and evaluation sets respectively
0:12:06	from the comparison we have seen that the early-stop clustering
0:12:12	can provide better performance than the conventional ahc
0:12:17	to understand the reason for the computer there already
0:12:21	we have a purity after the whole clustering process of the two systems that control
0:12:28	case is given by
0:12:30	in the evaluation
0:12:33	to be the same page speech that's was required
0:12:37	to be in those in speaker at the reference from the comparison to
0:12:42	we have seen that the superiority of early-stop clustering i know how
0:12:47	high-level speaker correctly
0:12:50	that it can provide a better initialization for re-segmentation
0:12:57	then we continued our experiments in the scenario where the number of speakers was not
0:13:02	given
0:13:03	this table shows the performance comparison between the conventional ahc and the proposed
0:13:09	early-stop clustering
0:13:10	on development and evaluation sets respectively
0:13:14	from the comparison we can see that the early-stop clustering can achieve better performance than
0:13:19	ahc
0:13:22	or address l when used the
0:13:25	a report the results reported by different schemes
0:13:32	to have a family known database of various clustering with a number of speakers right
0:13:38	now again by the way how does advantage of speech in the development set was
0:13:43	estimated numbers of speakers were more than or equal to the ground truth actually you
0:13:49	this paper
0:13:51	this shows that the early-stop clustering estimates the number of speakers more accurately well
0:13:57	as the number of estimated on us because it was not ground truth
0:14:03	combined with those because right here as you know strangely enough three people this can
0:14:09	help us to understand
0:14:11	the database of the audience to a jury
0:14:18	asked but experiments you know only those a threshold in both systems
0:14:23	right
0:14:24	results actually you don't actually got we evaluate the threshold zero point one to the
0:14:30	row of table one the paper we have seen that the early-stop clustering
0:14:34	provided that statistically problem is not age they
0:14:39	well
0:14:40	the early-stop clustering bad rich mess that interesting that means that the early-stop
0:14:45	clustering is less sensitive to the threshold
0:14:48	and more robust pitch there
0:14:53	finally we will come to the conclusion
0:14:55	in this paper we propose an early-stop strategy to ahc based speaker diarization
0:15:00	consisting of two steps
0:15:03	second the number of initial clusters natural and he's anything man phenomenon that's because then
0:15:09	we combine no extraneous
0:15:11	after into the have a few number of speakers
0:15:14	the superiority of the proposed method was demonstrated from two aspects
0:15:19	first it performs better than ahc based speaker diarization whether the
0:15:25	number of speakers is given or not
0:15:28	the second one is that the proposed similarity matrix based estimation of the
0:15:34	number of speakers and the resultant of speaker and a half of context of threshold
0:15:38	setting process relatively simple and robust
0:15:44	that's all of my presentation thank you

On Early-stop Clustering for Speaker Diarization

Diarization

Liping Chen, Kongaik Lee, Lei He, Frank Soong