Hi everyone. Today I am going to present our work on early-stop clustering for speaker diarization. The outline of my talk is as follows: in the beginning, I will give a brief introduction to the task of speaker diarization; then I will describe the proposed method, followed by the experiments and results.

As we all know, speaker diarization is one of the tasks of speaker recognition, together with speaker identification and speaker verification. The bottom of this figure shows the scenario of speaker diarization: two speakers are talking with each other, and based on the recording, the task of speaker diarization is to determine when each speaker is speaking.

Technically, speaker diarization can be decomposed into two steps: segmentation and clustering.
In this slide, I will go through the most commonly used framework in speaker diarization, that is, agglomerative hierarchical clustering (AHC), as this figure shows. In the beginning, the conversation passes through voice activity detection. Two methods of segmentation are commonly used: uniform segmentation, and segmentation based on speaker change point detection.

After that, clustering is applied to the speech segments, which aims to group the segments from the same speaker into the same cluster. With respect to whether the number of clusters is known beforehand or not, there are two different operations.

When the number of speakers is given to be N, the clustering stops when the number of clusters reaches N, and then each of the N clusters will be used as the representation of one speaker in the conversation.

When the number of speakers is unknown, we adopt a threshold to stop the clustering indirectly. The threshold is compared with the similarity of the merging clusters: when the similarity of the closest pair of clusters reaches the threshold, the clustering stops.

This results in K clusters, where K is the estimated number of speakers, and each of the clusters will be used to represent a specific speaker in the conversation.
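The two stopping rules can be put together in a small sketch. This is an illustrative toy implementation, not the actual system: average-linkage cosine similarity over segment embeddings is an assumption here, and the real linkage and similarity measure may differ.

```python
import numpy as np

def ahc(embeddings, num_speakers=None, threshold=0.5):
    """Toy average-linkage AHC over segment embeddings (a sketch).
    Stops at `num_speakers` clusters when the speaker count is known,
    otherwise when the best pairwise similarity drops below `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    while len(clusters) > 1:
        # average-linkage cosine similarity between every cluster pair
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean(X[clusters[a]] @ X[clusters[b]].T)
                if s > best:
                    best, pair = s, (a, b)
        if num_speakers is not None:
            if len(clusters) <= num_speakers:
                break                      # known speaker count reached
        elif best < threshold:
            break                          # unknown count: threshold stop
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```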

After AHC, Viterbi re-segmentation is usually applied. In the re-segmentation, we first represent each speaker with a GMM. After that, we build an HMM on top of the GMMs by adding transition probabilities. Finally, we align the speech frames to the speaker GMMs by Viterbi decoding.
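As an illustration, a minimal Viterbi alignment over per-frame speaker log-likelihoods could look like this. The frame scores would come from the speaker GMMs; the self-transition probability `stay` is an assumed value.

```python
import numpy as np

def viterbi_resegment(frame_loglik, stay=0.99):
    """Viterbi re-alignment sketch: `frame_loglik[t, k]` is the log-likelihood
    of frame t under speaker k's model (e.g. a per-speaker GMM).  A large
    self-transition probability `stay` discourages rapid speaker switching."""
    T, K = frame_loglik.shape
    log_trans = np.full((K, K), np.log((1.0 - stay) / (K - 1)))
    np.fill_diagonal(log_trans, np.log(stay))
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = frame_loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # [prev, cur]
        back[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + frame_loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path  # one speaker label per frame
```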

Although AHC has been widely used and its performance has been acknowledged, it still has some shortcomings. In our work, we try to cope with two of them: the clusters may be impure, and the number of speakers may be wrongly estimated.

In this slide, we take a two-speaker diarization task as an example, with speaker A in blue and speaker B in red. During clustering, we may have a cluster of speaker A and a segment consisting of speech from both speakers A and B. Although the segment is not pure, the similarity between the cluster and the segment may still be high, so the segment may be merged into the cluster of speaker A.

In another scenario, segments from speaker B may also be wrongly merged into the cluster, as the second picture shows. In both cases, the cluster of speaker A will be biased, and as the clustering goes on, the speech of speakers A and B may no longer be separated correctly.

That means that after clustering, a cluster can fall into one of several categories: a cluster composed of segments from speaker A only, a cluster composed of segments from speaker B only, or, even worse, a cluster composed of speech segments from both speakers A and B.

Our strategy for this problem is to stop early, that is, to terminate the merging before impure clusters are generated. In this way, the clusters we obtain are relatively clean. For a cluster to represent a speaker, there are two desirable properties: it should be large enough to provide a reliable representation of the speaker, and it should be clean, containing as little speech from other speakers as possible. So the stopping point of the clustering will be a trade-off between the two factors.

We propose early-stop clustering by setting a strict threshold within AHC. Ideally, this will produce clean clusters, with the number of clusters K larger than N, the number of speakers. This is the flowchart of early-stop clustering: the clustering starts with AHC under a strict threshold, and the resulting number of clusters K is larger than the anticipated number of speakers N.

Depending on whether N is given or not, we have different implementations. When the number of speakers is unknown, we will first estimate it as N-hat. Then, based on the given or estimated number of speakers, N or N-hat, we perform cluster selection to select that many clusters from the K resulting clusters. Each of the selected clusters represents a specific speaker in the conversation. In the final step, we apply Viterbi re-segmentation to align the frames of the whole conversation to the selected clusters.

In this slide and the following, we will describe how the number of speakers is estimated. We first construct the cluster similarity score matrix S, where each element of S is the speaker similarity score between two clusters; for example, S_jk is the speaker similarity score between the j-th and the k-th clusters. Naturally, S will be a symmetric matrix of size K by K.

On the score matrix S, we do eigenvalue decomposition and sort the eigenvalues in descending order, denoted lambda_1 to lambda_K. After that, we compute the ratio between each pair of adjacent eigenvalues. Finally, the number of speakers N-hat will be estimated at the point with the maximum eigenvalue ratio, that is, the maximum eigen-gap.
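The eigen-gap estimate described above can be sketched as follows. The construction and normalization of S are assumptions of this toy example, and negative eigenvalues are only crudely guarded against.

```python
import numpy as np

def estimate_num_speakers(S, max_speakers=None):
    """Eigen-gap sketch of the speaker-count estimate: sort the eigenvalues
    of the K x K cluster-similarity matrix S in descending order and return
    the index with the largest adjacent-eigenvalue ratio."""
    eigvals = np.linalg.eigvalsh(S)[::-1]            # descending order
    K = len(eigvals) if max_speakers is None else min(max_speakers, len(eigvals))
    eps = 1e-10
    # ratio between adjacent eigenvalues; the largest gap marks the estimate
    ratios = [eigvals[k] / max(eigvals[k + 1], eps) for k in range(K - 1)]
    return int(np.argmax(ratios)) + 1
```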

With the given or estimated number of speakers, in this slide and the following we will show how the cluster selection works, by which we select N-hat clusters out of the K initial clusters. To achieve this, we first find all of the possible combinations of N-hat clusters, indexed over the set {1, ..., K}. After that, we construct the sub score matrix for each combination by extracting the corresponding rows and columns from S, so each sub score matrix is of size N-hat by N-hat. On each sub score matrix, we then do eigenvalue decomposition and sum the eigenvalues lambda_1 to lambda_N-hat. Finally, the combination with the maximum eigenvalue summation will be used, and its clusters are taken as the selected clusters.
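The selection step can be sketched with an exhaustive search over combinations. Note that for a real symmetric sub-matrix the eigenvalue sum equals its trace, so this sketch effectively favors combinations of clusters with high diagonal self-similarity scores; the actual scoring details may differ.

```python
import numpy as np
from itertools import combinations

def select_clusters(S, n_speakers):
    """Cluster-selection sketch: score every size-`n_speakers` combination of
    the K clusters by the eigenvalue sum of its sub similarity matrix and
    keep the best one (an illustrative assumption, not the exact criterion)."""
    K = S.shape[0]
    best_score, best_combo = -np.inf, None
    for combo in combinations(range(K), n_speakers):
        idx = list(combo)
        sub = S[np.ix_(idx, idx)]                   # rows and columns of the combination
        score = np.sum(np.linalg.eigvalsh(sub))     # eigenvalue summation
        if score > best_score:
            best_score, best_combo = score, combo
    return best_combo
```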

So much for the description of the algorithm; next we come to the experiments. All experiments were carried out on a dataset consisting of two subsets, a development set and an evaluation set. The duration of the conversations is on the order of a few hundred seconds, and the number of speakers per conversation ranges from one to nine.

In our evaluation, we used the diarization error rate (DER) as the metric, and we used the ground truth segmentation as the temporal segmentation. It has to be noted that, when the reference includes overlapped speech, the overlapped segments are treated as individual segments.
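For illustration, a toy frame-level DER with the best speaker mapping found by brute force might look like this. Real DER scoring also counts missed and false-alarm speech and applies a collar, which this sketch omits; it also assumes the hypothesis has no more clusters than the reference has speakers.

```python
import numpy as np
from itertools import permutations

def frame_der(ref, hyp):
    """Frame-level DER sketch: find the hypothesis-to-reference speaker
    mapping that minimizes the error, then report the fraction of
    misattributed frames.  Both sequences must cover the same frames."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speakers = np.unique(hyp)
    best_err = len(ref)
    for perm in permutations(np.unique(ref), len(speakers)):
        mapping = dict(zip(speakers, perm))
        err = np.sum(np.array([mapping[h] for h in hyp]) != ref)
        best_err = min(best_err, err)
    return best_err / len(ref)
```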

In our experiments, we have two models: one is a DNN bottleneck feature extractor, and the other is an x-vector extractor based on a ResNet model. For both of the models, we used spectral acoustic features as the input.

In the DNN model, the acoustic input is spliced with contextual frames on both the left and the right side. The DNN has several wide hidden layers, while the dimension of the last hidden layer is small, and its linear output is used as the bottleneck feature, as the figure shows.

In our ResNet model, there are nine convolutional layers; after the convolutional part, fully-connected layers are used, from which the speaker embedding is extracted. Both models were trained with a speaker classification objective, with around eleven thousand three hundred and fifty training speakers.

We used the conventional AHC as the baseline. Both systems used the x-vector combined with cosine distance for speaker modeling and speaker similarity scoring. For the speaker number estimation and cluster selection in our early-stop clustering framework, we used the BIC score as the speaker similarity. In the re-segmentation phase, we represented each selected cluster with a GMM and aligned the frames by Viterbi decoding.

We began our experiments with the scenario where the number of speakers is given. This table shows the performance comparison between the conventional AHC and the proposed early-stop clustering on the development and evaluation sets, respectively. From the comparison, we can see that early-stop clustering provides better performance than the conventional AHC.

To understand the reason for the performance improvement, we computed the cluster purity after the whole clustering process of the two systems; the cluster purity is given by the formula on the slide. In the evaluation, speech frames within the same cluster are expected to belong to the same speaker in the reference. From the comparison, we can see the superiority of early-stop clustering in modeling speakers correctly, so that it can provide a better initialization for the re-segmentation.
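A simple version of the purity computation, assuming frame-level cluster and reference labels, could be written as follows; the exact purity definition on the slide may differ.

```python
import numpy as np

def cluster_purity(labels_pred, labels_ref):
    """Cluster-purity sketch: for each predicted cluster, count the frames of
    its dominant reference speaker, then divide by the total frame count."""
    labels_pred = np.asarray(labels_pred)
    labels_ref = np.asarray(labels_ref)
    total, dominant = len(labels_pred), 0
    for c in np.unique(labels_pred):
        ref = labels_ref[labels_pred == c]
        # frames of the most frequent reference speaker in this cluster
        dominant += np.bincount(ref).max()
    return dominant / total
```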

Then we continued our experiments with the scenario where the number of speakers is unknown. This table shows the performance comparison between the conventional AHC and the proposed early-stop clustering on the development and evaluation sets, respectively. From the comparison, we can see that early-stop clustering again achieves better performance than AHC.

We also list the results reported by other schemes for reference. To have a further understanding of early-stop clustering when the number of speakers is unknown, we analyzed how often the estimated number of speakers in the development set was more than, equal to, or less than the ground truth, as shown in this table. It shows that early-stop clustering not only estimates the number of speakers more accurately, but also behaves well when the estimated number does not equal the ground truth. Combined with the diarization error rates, this helps us understand the advantage of early-stop clustering.

In the last experiment, we varied the threshold in both systems. Evaluating a range of thresholds, we can see that early-stop clustering consistently provides better performance than AHC. Moreover, its performance varies less across thresholds, which means that early-stop clustering is less sensitive to the threshold setting and more robust.

Finally, we come to the conclusion. In this paper, we proposed an early-stop extension to AHC-based speaker diarization consisting of two steps: first, the clustering is stopped early so that the number of initial clusters is larger than the number of speakers; then, clusters are selected to fit the given or estimated number of speakers. The advantage of the proposed method was justified from two aspects. The first is that it performs better than AHC-based speaker diarization, whether the number of speakers is known or not. The second is that the proposed similarity-matrix-based estimation of the number of speakers, together with the resulting cluster selection, makes the threshold setting process relatively simple and robust.

That's all of my presentation. Thank you.