0:00:14hi everyone this is [name unclear] today i'm going to present early-stop
0:00:20clustering for speaker diarization the outline of this talk is as follows
0:00:27in the beginning i am going to give a brief introduction to the task of speaker
0:00:30diarization followed by the method the experiments and the results
0:00:35as we all know speaker diarization is one of the tasks in speaker recognition
0:00:40together with identification and verification
0:00:43at the bottom of this picture it shows the scenario of speaker diarization two speakers
0:00:49are talking with each other based on the recording the task of speaker diarization is
0:00:55to
0:00:56identify when each speaker is speaking
0:00:59technically speaker diarization can be decomposed into two steps segmentation and clustering
0:01:07in this slide i will go through the most commonly used framework of speaker diarization
0:01:12that is the agglomerative hierarchical clustering we use ahc for short
0:01:18in the framework there are two components segmentation and ahc
0:01:24there are two methods of segmentation the uniform segmentation and the segmentation
0:01:30based on speaker change point detection
0:01:34after that given the speech segments
0:01:38ahc will group the speech segments from the same speaker into the same cluster
0:01:43in ahc with respect to whether the number of speakers is given or not
0:01:50we have different implementations
0:01:52when the number of speakers is given to be n
0:01:56the clustering always
0:01:58stops when the number of clusters reaches n
0:02:03then each of the n clusters will be used as a representation of a speaker in
0:02:07the conversation
0:02:10when the number of speakers is not given we will apply a threshold to the
0:02:15speaker similarity of the merging clusters
0:02:20[unclear]
0:02:24when the
0:02:26speaker similarity of the merging clusters
0:02:30reaches the threshold
0:02:32ahc will stop
0:02:35yes
0:02:36resulting in a number of clusters which is the estimated number of speakers n
0:02:41hat
0:02:42and each of the casters will be used to represent a specific speaker in the
0:02:47conversation
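The two stopping rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: each segment is assumed to be already represented by an embedding vector, clusters are compared by the cosine similarity of their mean embeddings, and the names `ahc`, `num_speakers` and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(embeddings, members):
    """Mean embedding of the segments listed in `members`."""
    dims = zip(*(embeddings[i] for i in members))
    return [sum(d) / len(members) for d in dims]


def ahc(embeddings, num_speakers=None, threshold=None):
    """Greedy agglomerative clustering; returns lists of segment indices.

    Assumption: clusters are compared by the cosine similarity of their
    centroids. Stops at `num_speakers` clusters if that is given, otherwise
    when the best available merge scores below `threshold`.
    """
    if num_speakers is None and threshold is None:
        raise ValueError("give num_speakers or threshold")
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        if num_speakers is not None and len(clusters) <= num_speakers:
            break  # known speaker count reached
        cents = [centroid(embeddings, c) for c in clusters]
        # Find the most similar pair of clusters.
        best_score, best_a, best_b = -2.0, 0, 1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = cosine(cents[a], cents[b])
                if score > best_score:
                    best_score, best_a, best_b = score, a, b
        if num_speakers is None and best_score < threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return clusters


# Toy embeddings: segments 0-2 point one way, segments 3-4 another.
segments = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]
print(ahc(segments, num_speakers=2))   # [[0, 1, 2], [3, 4]]
print(ahc(segments, threshold=0.98))   # [[0, 1], [2], [3, 4]]
```

With `num_speakers` set, merging runs until that many clusters remain. With only `threshold` set, it halts at the first merge whose similarity falls below the threshold, so a stricter (higher) threshold leaves more clusters.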
0:02:49after ahc
0:02:51viterbi re-segmentation is always used
0:02:54in viterbi re-segmentation we first represent each speaker with a gmm
0:03:00after that we will build an hmm based on the gmms by adding
0:03:05transition probabilities
0:03:07finally we will align speech frames to the speaker gmms by viterbi decoding
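The decoding step of the re-segmentation can be sketched as below. This is only an illustration under assumptions: the per-frame log-likelihoods are taken as given (training the speaker GMMs is omitted), the HMM has one state per speaker, and the self-transition probability `stay_prob` and the function name `viterbi_resegment` are hypothetical.

```python
import math


def viterbi_resegment(frame_loglik, stay_prob=0.9):
    """Assign each frame to a speaker by Viterbi decoding.

    frame_loglik[t][s] is the log-likelihood of frame t under the model of
    speaker s (for example a GMM). Assumption: every state keeps its speaker
    with probability `stay_prob` and switches uniformly otherwise.
    """
    num_frames = len(frame_loglik)
    num_spk = len(frame_loglik[0])
    if num_spk == 1:
        return [0] * num_frames
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (num_spk - 1))

    score = list(frame_loglik[0])  # uniform prior over speakers
    backptr = []
    for t in range(1, num_frames):
        new_score, ptr = [], []
        for s in range(num_spk):
            # Best previous speaker for ending in speaker s at frame t.
            best_prev, best_val = 0, -math.inf
            for p in range(num_spk):
                val = score[p] + (log_stay if p == s else log_switch)
                if val > best_val:
                    best_prev, best_val = p, val
            new_score.append(best_val + frame_loglik[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)

    # Trace the best path backwards from the best final speaker.
    state = max(range(num_spk), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Frame 2 slightly prefers speaker 1, but the transition penalty keeps it
# with speaker 0; the real change happens at frame 4.
loglik = [[-1, -5], [-1, -5], [-2.5, -2], [-1, -5], [-5, -1], [-5, -1], [-5, -1]]
print(viterbi_resegment(loglik))  # [0, 0, 0, 0, 1, 1, 1]
```

The transition term is what separates this from a frame-by-frame argmax: isolated frames that weakly favour another speaker are absorbed into the surrounding segment.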
0:03:18although ahc has been widely used
0:03:21and the performance of ahc has been acknowledged
0:03:24it has some shortcomings in our work we
0:03:28cope with one of them
0:03:30that is the clusters can be polluted by other speakers during clustering
0:03:35in this slide
0:03:36we take the diarization of a two speaker conversation as an example
0:03:40speaker a in blue and speaker b in red
0:03:43during clustering we will have a cluster of speaker a and a segment consisting of speech from
0:03:49both speakers a and b
0:03:52depending on the speaker similarity of the cluster and the segment
0:03:56of mixed speech
0:03:58the segment may be merged into the cluster of speaker a
0:04:02another scenario
0:04:03those segments from speaker b may also be merged into speaker a as shown in the second picture
0:04:09in both cases the cluster of speaker a will be biased towards speaker b
0:04:15with the clustering going on
0:04:18the speech of speakers
0:04:20a and b may not be separated anymore
0:04:22that means those
0:04:24i mean
0:04:25no doubt addition they lost
0:04:28future studies of the original those that can be into the statistically
0:04:35in the beginning the cluster is composed of segments from a only
0:04:41with the clustering going on the cluster is composed of segments from
0:04:47[unclear] getting worse
0:04:51in the end the cluster of a is composed of speech signals from both a and b
0:04:57our strategy for this problem is to stop the clustering early
0:05:01before the clusters are polluted by other speakers in this
0:05:06way we are able to get clean clusters
0:05:13for the clusters there are two issues that need to be considered one is that
0:05:18it should be large enough to provide a robust representation of a speaker the other is that it
0:05:25should be clean allowing for
0:05:29involved in this one c d can be as we have
0:05:32so the action a
0:05:36there will be a tradeoff between the two factors
0:05:38we propose early-stop clustering by setting a strict threshold in ahc ideally
0:05:44there will be k clusters where k is larger than n and n is the
0:05:49number of speakers
0:05:51in the early-stop clustering the clustering stops with a
0:05:57strict threshold resulting in k clusters where k is larger than the anticipated number of
0:06:03speakers n
0:06:05depending on whether n is given or not we have different implementations
0:06:11when the number of speakers is not given we will first estimate it to be
0:06:17n hat
0:06:19then based on the given or estimated number of speakers n or n hat we will
0:06:25do the cluster selection to select n or n hat clusters from the k clusters
0:06:30each of the selected clusters will represent a specific speaker in the speech conversation
0:06:36in the last step we will apply viterbi re-segmentation to align the frames of the
0:06:42whole conversation to the selected clusters
0:06:47in this slide and the following
0:06:49we will describe how the number of speakers estimation works
0:06:54first we will compute the speaker similarity score matrix s
0:06:58each element of s is the speaker similarity score between two clusters
0:07:04for example s j k is the speaker similarity score between clusters j and k
0:07:10after that
0:07:11s will be a symmetric matrix of size k by k
0:07:18on the score matrix s we will do an eigenvalue decomposition and sort the eigenvalues
0:07:24in a monotonically decreasing order u one to u k
0:07:29after that we will compute the eigen ratio between the adjacent eigenvalues
0:07:33[unclear]
0:07:35finally the number of speakers n hat will be estimated at the point with the
0:07:40maximum eigen ratio
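A rough sketch of this estimation step is given below. It assumes NumPy is available and rests on one reading of the recording: the ratio is taken as u_k divided by u_(k+1) over eigenvalues sorted in decreasing order, and the estimate is the position of the largest ratio. Restricting the ratios to positive eigenvalues via `tol`, and the function name `estimate_num_speakers`, are assumptions added for the example.

```python
import numpy as np


def estimate_num_speakers(score_matrix, tol=1e-8):
    """Estimate the speaker count from a K x K cluster similarity matrix.

    Assumption: the eigen ratio is u_k / u_(k+1) over eigenvalues sorted in
    decreasing order, and only eigenvalues above `tol` are considered.
    """
    s = np.asarray(score_matrix, dtype=float)
    s = 0.5 * (s + s.T)  # enforce symmetry before the decomposition
    eigvals = np.sort(np.linalg.eigvalsh(s))[::-1]  # decreasing order
    eigvals = eigvals[eigvals > tol]
    if eigvals.size < 2:
        return 1
    ratios = eigvals[:-1] / eigvals[1:]
    # The largest drop between adjacent eigenvalues marks the estimate.
    return int(np.argmax(ratios)) + 1


# Four clusters forming two tight pairs, i.e. two underlying speakers.
scores = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
print(estimate_num_speakers(scores))  # 2
```

For this toy matrix the eigenvalues are about 2.1, 1.7, 0.1 and 0.1, so the largest ratio sits between the second and third eigenvalue and the estimate is two speakers.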
0:07:45with the given or the estimated number of speakers in this slide and the following
0:07:50we will show how the cluster selection works in brief the
0:07:55cluster selection is selecting n or n hat out of the k clusters
0:08:00we achieve this by first finding all of the combinations of n hat
0:08:05clusters denoted by the index set i one to
0:08:10after that we will compute the score sub-matrix for each combination by extracting
0:08:15the corresponding rows and columns from s
0:08:18the score sub-matrix will be of the dimension
0:08:22n hat by n hat
0:08:27on the score sub-matrix we will then do the eigenvalue decomposition
0:08:32and find the eigenvalues
0:08:37finally the combination with the maximum eigenvalue summation will be
0:08:42used as the
0:08:43selected clusters
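The selection step can be sketched as an exhaustive search over combinations. This follows the spoken description literally and should be read as an assumption-laden illustration: it assumes NumPy, takes the criterion to be the plain sum of the eigenvalues of each sub-matrix, and the function name `select_clusters` is hypothetical.

```python
from itertools import combinations

import numpy as np


def select_clusters(score_matrix, n):
    """Pick n of the K clusters by the maximum eigenvalue summation.

    Assumption: the criterion is the plain sum of the eigenvalues of the
    n x n sub-matrix taken from the rows and columns of each combination.
    """
    s = np.asarray(score_matrix, dtype=float)
    s = 0.5 * (s + s.T)  # enforce symmetry before the decomposition
    best_idx, best_sum = None, -np.inf
    for idx in combinations(range(s.shape[0]), n):
        sub = s[np.ix_(idx, idx)]  # rows and columns of this combination
        total = float(np.linalg.eigvalsh(sub).sum())
        if total > best_sum:
            best_idx, best_sum = idx, total
    return best_idx


# Toy 4 x 4 score matrix with unequal self-scores on the diagonal.
scores = [
    [5.0, 1.0, 0.5, 0.2],
    [1.0, 2.0, 0.3, 0.1],
    [0.5, 0.3, 4.0, 0.4],
    [0.2, 0.1, 0.4, 1.0],
]
print(select_clusters(scores, 2))  # (0, 2)
```

Because the eigenvalues of a symmetric matrix sum to its trace, this literal reading favours the combination with the largest self-scores. The recording does not make clear how the scores are prepared before this step, so the exact criterion used in the talk may differ. The search visits every combination, which is affordable only while K stays small.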
0:08:47so that was a description of the algorithm next we will go to the experiments
0:08:53our experiments were conducted on the [name unclear] data set
0:08:59consisting of two sets the development set and the evaluation set
0:09:04you made mistakes
0:09:05the duration of the conversations varies from three hundred two hundred seconds
0:09:11the number of speakers per conversation ranges from one to nine
0:09:15in our evaluation we used the diarization error rate der as
0:09:20the metric
0:09:22we used the ground truth segmentation
0:09:25as the temporal segmentation
0:09:28it has to be noted that if in the reference speaker a and speaker b
0:09:34have
0:09:35overlaps
0:09:36the overlapped segments will be used as individual segments
0:09:43in our experiments we have two models [unclear] a bottleneck feature extractor
0:09:49[unclear] an x-vector extractor with the resnet model
0:09:54for both of the models
0:09:56we used at an additional advantage as input feature and of course of the and
0:10:02one change how about static y and into
0:10:05in the model the acoustic input layer of the year is the carriage real compatibility
0:10:12with ease contextual between you really both that and the right size
0:10:17you has i hate enables the was well hidden layers well that one thousand and
0:10:24it will give for the dimension of the not hidden layer wise lda and he's
0:10:28being a output was used
0:10:31it can only be sure
0:10:33in our resnet model there were nine convolutional layers
0:10:38only that we had no we'll collection they are five thousand and to you or
0:10:42than the ones we may go after that to well collection labels were used up
0:10:48to this green a the are one of the may five one hundred and twenty
0:10:53eight
0:10:53well use i x
0:10:55in both models [unclear] classification and the number of training speakers
0:11:01was eleven thousand three hundred and [unclear]
0:11:08we used the conventional ahc as the baseline in both the conventional
0:11:14clustering and the early-stop clustering we used the x-vector combined with
0:11:20cosine distance as the speaker modeling and the speaker similarity measurement
0:11:26in the number of speakers estimation and the cluster selection in our early-stop clustering framework we used
0:11:34the bic score as the speaker similarity measurement
0:11:39in the re-segmentation phase
0:11:41way used a speaker pair of each point we duration
0:11:49well the name
0:11:49we had experiments in the scenario where the number of speakers was given
0:11:55this table shows the performance comparison between the conventional ahc and the proposed
0:12:00early-stop clustering on the development and evaluation sets respectively
0:12:06from the comparison we have seen that the early-stop clustering
0:12:12can provide better performance than the conventional ahc
0:12:17to understand the reason for the improvement
0:12:21we evaluated the purity after the whole clustering process of the two systems [unclear]
0:12:28case is given by
0:12:30in the evaluation
0:12:33to be the same page speech that's was required
0:12:37to be in those in speaker at the reference from the comparison to
0:12:42we have seen the superiority of early-stop clustering in
0:12:47cluster purity
0:12:50so that it can provide a better initialization for re-segmentation
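The exact purity definition is not recoverable from the recording. The sketch below uses one common definition as an assumption: each cluster is credited with its most frequent reference speaker, and purity is the fraction of segments that fall under their cluster's dominant speaker. Counting segments rather than durations, and the function name `cluster_purity`, are choices made for the example.

```python
from collections import Counter


def cluster_purity(cluster_ids, speaker_ids):
    """Fraction of segments that belong to their cluster's dominant speaker.

    cluster_ids[i] is the cluster assigned to segment i and speaker_ids[i]
    is its reference speaker. Assumption: segments are weighted equally.
    """
    if not cluster_ids or len(cluster_ids) != len(speaker_ids):
        raise ValueError("inputs must be non-empty and of equal length")
    counts = {}
    for cluster, speaker in zip(cluster_ids, speaker_ids):
        counts.setdefault(cluster, Counter())[speaker] += 1
    # Credit each cluster with the count of its most frequent speaker.
    dominant = sum(max(c.values()) for c in counts.values())
    return dominant / len(cluster_ids)


# Cluster 0 holds two segments of speaker a and one of speaker b.
clusters = [0, 0, 0, 1, 1, 2]
speakers = ["a", "a", "b", "b", "b", "a"]
print(cluster_purity(clusters, speakers))
```

A value of 1.0 means no cluster mixes speakers. Lower values indicate clusters polluted by segments from other speakers.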
0:12:57then we continued our experiments in the scenario where the number of speakers was not
0:13:02given
0:13:03this table shows the performance comparison between the conventional ahc and the proposed
0:13:09early-stop clustering
0:13:10on the development and the evaluation sets respectively
0:13:14from the comparison we can see that the early-stop clustering can achieve better performance than
0:13:19ahc
0:13:22or address l when used the
0:13:25a report the results reported by different schemes
0:13:32to have a further understanding of early-stop clustering with the number of speakers not
0:13:38given [unclear] the percentage of speech in the development set whose
0:13:43estimated numbers of speakers were more than or equal to the ground truth [unclear]
0:13:49this paper
0:13:51[unclear] shows that early-stop clustering estimates the number of speakers more accurately as well
0:13:57as the number of estimated speakers [unclear] ground truth
0:14:03combined with those because right here as you know strangely enough three people this can
0:14:09help us to understand
0:14:11the advantage of the early-stop clustering
0:14:18in our last experiments we [unclear] the threshold in both systems
0:14:23right
0:14:24results actually you don't actually got we evaluate the threshold zero point one to the
0:14:30row of table one in the paper we have seen that the early-stop clustering
0:14:34provided [unclear] than ahc
0:14:39well
0:14:40the early-stop clustering [unclear] that means that the early-stop
0:14:45clustering is less sensitive to the threshold
0:14:48and more robust than ahc
0:14:53finally we will come to the conclusion
0:14:55in this paper we proposed an early-stop strategy for ahc based speaker diarization
0:15:00consisting of two steps
0:15:03second the number of initial clusters natural and he's anything man phenomenon that's because then
0:15:09we combine no extraneous
0:15:11after into the have a few number of speakers
0:15:14the advantage of the proposed method was demonstrated from two aspects
0:15:19first it performs better than ahc based speaker diarization whether the
0:15:25number of speakers was given or not
0:15:28the second one is that the proposed similarity matrix based estimation of the
0:15:34number of speakers and the cluster selection make the threshold
0:15:38setting process relatively simple and robust
0:15:44that's all of my presentation thank you