| 0:00:14 | hi everyone this is needed region problematical stopped today i'm going to present early-stop | 
|---|
| 0:00:20 | clustering for speaker diarization the course of this outcome i like a ramp | 
|---|
| 0:00:27 | in the beginning i am going to give a brief introduction to the task of | 
|---|
| 0:00:30 | speaker diarization use the no i think or from results | 
|---|
| 0:00:35 | as we all know speaker diarization is one of the tasks in speaker recognition | 
|---|
| 0:00:40 | together with identification and verification | 
|---|
| 0:00:43 | at the bottom of this figure it shows the scenario of speaker diarization two speakers | 
|---|
| 0:00:49 | are talking with each other based on the recording the task of speaker diarization is | 
|---|
| 0:00:55 | to | 
|---|
| 0:00:56 | identify when each speaker is speaking | 
|---|
| 0:00:59 | technically speaker diarization can be decomposed into two steps segmentation and clustering | 
|---|
| 0:01:07 | in this slide i will go through the most commonly used framework of speaker diarization | 
|---|
| 0:01:12 | that is agglomerative hierarchical clustering or ahc as shown | 
|---|
| 0:01:18 | in the nineteen one which composition we bust two cameras that imitation only | 
|---|
| 0:01:24 | there are usually two methods of segmentation uniform segmentation and the segmentation | 
|---|
| 0:01:30 | based on speaker change point detection | 
|---|
| 0:01:34 | after getting the speech segments | 
|---|
| 0:01:38 | ahc will group the speech segments from the same speaker to the same cluster | 
|---|
| 0:01:43 | in ahc with respect to whether the number of clusters is given or not | 
|---|
| 0:01:50 | we have different implementations | 
|---|
| 0:01:52 | when the number of speakers is given to be n | 
|---|
| 0:01:56 | the clustering always | 
|---|
| 0:01:58 | stops when the number of clusters reaches n | 
|---|
| 0:02:03 | then each of the n clusters will be used as the representation of a speaker in | 
|---|
| 0:02:07 | the conversation | 
|---|
| 0:02:10 | when the number of speakers is not given we will set a threshold on the | 
|---|
| 0:02:15 | speaker similarity of the merging clusters you know we you | 
|---|
| 0:02:20 | know here that when you know t then i feel and stick to | 
|---|
| 0:02:24 | when the | 
|---|
| 0:02:26 | speakers in the idea of them o g p c speaker thing one thing to | 
|---|
| 0:02:30 | reaches the threshold | 
|---|
| 0:02:32 | ahc will stop | 
|---|
| 0:02:35 | yes | 
|---|
| 0:02:36 | the resulting number of clusters will be the estimated number of speakers n | 
|---|
| 0:02:41 | hat | 
|---|
| 0:02:42 | and each of the clusters will be used to represent a specific speaker in the | 
|---|
| 0:02:47 | conversation | 
|---|
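
The two stopping rules described above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name `ahc`, the average-linkage merge rule, and the toy similarity matrix are all assumptions made for the example.

```python
import numpy as np

def ahc(similarity, num_clusters=None, threshold=None):
    """Agglomerative clustering on a symmetric segment similarity matrix.

    Stops when `num_clusters` is reached (number of speakers given) or when
    the best pairwise similarity drops below `threshold` (number not given).
    Average linkage is an assumption made for this sketch.
    """
    sim = np.asarray(similarity, dtype=float)
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        if num_clusters is not None and len(clusters) <= num_clusters:
            break
        # Find the most similar pair of clusters.
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = sim[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if threshold is not None and best_score < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy example: segments 0 and 1 are alike, segments 2 and 3 are alike.
toy = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.8],
       [0.1, 0.1, 0.8, 1.0]]
print(ahc(toy, num_clusters=2))
print(ahc(toy, threshold=0.85))
```

With the stricter threshold of 0.85 the merging stops sooner and leaves three clusters instead of two, which is the behaviour the talk later exploits.
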
| 0:02:49 | after ahc | 
|---|
| 0:02:51 | viterbi re-segmentation is always used | 
|---|
| 0:02:54 | in viterbi re-segmentation we first represent each speaker with a gmm | 
|---|
| 0:03:00 | after that we will build an hmm based on the gmms by adding | 
|---|
| 0:03:05 | transition probabilities | 
|---|
| 0:03:07 | finally we will align speech frames to the speaker gmms by viterbi decoding | 
|---|
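
A minimal sketch of the Viterbi alignment step, assuming the per-frame log-likelihoods under each speaker GMM have already been computed. The function name `viterbi_resegment`, the single `stay_prob` self-loop probability, and the uniform initial state prior are assumptions for illustration; the talk does not give the actual transition values.

```python
import numpy as np

def viterbi_resegment(loglik, stay_prob=0.9):
    """Assign each frame to a speaker state by Viterbi decoding.

    loglik: array of shape (num_frames, num_speakers) holding the
            log-likelihood of each frame under each speaker model.
    stay_prob: assumed self-loop probability of the HMM.
    """
    loglik = np.asarray(loglik, dtype=float)
    num_frames, num_states = loglik.shape
    if num_states == 1:
        return np.zeros(num_frames, dtype=int)
    # Transition matrix: stay with stay_prob, otherwise switch uniformly.
    switch_prob = (1.0 - stay_prob) / (num_states - 1)
    log_trans = np.full((num_states, num_states), np.log(switch_prob))
    np.fill_diagonal(log_trans, np.log(stay_prob))

    delta = loglik[0].copy()  # uniform initial prior is a constant, so omitted
    back = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        cand = delta[:, None] + log_trans  # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + loglik[t]

    # Trace the best path backwards.
    path = np.empty(num_frames, dtype=int)
    path[-1] = delta.argmax()
    for t in range(num_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Frame 2 weakly favours speaker 1, but the transition cost smooths it away.
frames = [[0, -5], [0, -5], [-1, 0], [0, -5], [-5, 0], [-5, 0], [-5, 0]]
print(viterbi_resegment(frames).tolist())
```
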
| 0:03:18 | although ahc has been widely used | 
|---|
| 0:03:21 | and the performance of ahc has been acknowledged | 
|---|
| 0:03:24 | it also has some shortcomings in our work we | 
|---|
| 0:03:28 | cope with the well as in | 
|---|
| 0:03:30 | now he's the clusters and probably the orange speakers they can watch | 
|---|
| 0:03:35 | in this slide | 
|---|
| 0:03:36 | we take the diarization on two speakers as an example | 
|---|
| 0:03:40 | speaker a in blue and speaker b in red | 
|---|
| 0:03:43 | during clustering we will have a cluster of speaker a and a segment consisting of speech from | 
|---|
| 0:03:49 | both speakers a and b | 
|---|
| 0:03:52 | but unknown speaker not only and is because similarity of the on and the statement | 
|---|
| 0:03:56 | of mixed speech | 
|---|
| 0:03:58 | will be merged into the cluster of speaker a | 
|---|
| 0:04:02 | another scenario | 
|---|
| 0:04:03 | the segments from speaker b may also be merged into the cluster of speaker a as shown in the second picture | 
|---|
| 0:04:09 | in both cases the cluster of speaker a will be biased towards speaker b | 
|---|
| 0:04:15 | with a clustering going on the speech | 
|---|
| 0:04:18 | the speech of speakers | 
|---|
| 0:04:20 | a and b may now present already | 
|---|
| 0:04:22 | that means those | 
|---|
| 0:04:24 | i mean | 
|---|
| 0:04:25 | no doubt addition they lost | 
|---|
| 0:04:28 | future studies of the original those that can be into the statistically | 
|---|
| 0:04:35 | in the in the that is composed of sailors from a only | 
|---|
| 0:04:41 | with the battery go the system is composed of these statements from | 
|---|
| 0:04:47 | see for me getting worse it to a | 
|---|
| 0:04:51 | the cluster of a is composed of speech segments from both a and b | 
|---|
| 0:04:57 | our strategy for this problem is to stop early | 
|---|
| 0:05:01 | go to be able to determine rose because they get in most states in this | 
|---|
| 0:05:06 | way we have to us to clean the you really use the way | 
|---|
| 0:05:13 | okay the clusters the issues that is like to be known as what is that | 
|---|
| 0:05:18 | it should be large enough to provide us it organisations people a i that it | 
|---|
| 0:05:25 | should be clean i allowing for | 
|---|
| 0:05:29 | involved in this one c d can be as we have | 
|---|
| 0:05:32 | so the action a | 
|---|
| 0:05:36 | will be a tradeoff between the two factors | 
|---|
| 0:05:38 | we propose early-stop clustering by setting a strict threshold for ahc the idea is | 
|---|
| 0:05:44 | that we will stop ahc early and get more clusters than n where n is the | 
|---|
| 0:05:49 | number of speakers | 
|---|
| 0:05:51 | in early-stop clustering the clustering with a | 
|---|
| 0:05:57 | strict threshold results in k clusters where k is larger than the anticipated number of | 
|---|
| 0:06:03 | speakers n | 
|---|
| 0:06:05 | depending on whether n is given or not we have different implementations | 
|---|
| 0:06:11 | when the number of speakers is not given we will first estimate it to be | 
|---|
| 0:06:17 | n hat | 
|---|
| 0:06:19 | then based on the given or estimated number of speakers n or n hat we | 
|---|
| 0:06:25 | do the cluster selection to select n or n hat clusters from the k clusters | 
|---|
| 0:06:30 | each of the selected clusters will represent a specific speaker in the speech conversation | 
|---|
| 0:06:36 | in the last step we will apply viterbi re-segmentation to align the frames of the | 
|---|
| 0:06:42 | whole conversation to the selected clusters | 
|---|
| 0:06:47 | in this slide and the following | 
|---|
| 0:06:49 | we will describe how the number of speakers estimation works | 
|---|
| 0:06:54 | first we will construct the speaker similarity score matrix s | 
|---|
| 0:06:58 | each element of s is the speaker similarity score between two clusters | 
|---|
| 0:07:04 | for example s j k is the speaker similarity score between the j-th and k-th | 
|---|
| 0:07:10 | clusters | 
|---|
| 0:07:11 | finally s will be a symmetric matrix of size k by k | 
|---|
| 0:07:18 | on the score matrix s we will do an eigenvalue decomposition on it and sort | 
|---|
| 0:07:24 | the eigenvalues in descending order u one to u k | 
|---|
| 0:07:29 | after that we will compute the eigen ratio between the adjacent eigenvalues | 
|---|
| 0:07:33 | after that k | 
|---|
| 0:07:35 | finally the number of speakers n hat will be estimated at the point with the | 
|---|
| 0:07:40 | maximum eigen ratio | 
|---|
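
The estimation step, as far as it can be made out from the recording, can be sketched like this. It is a hedged reading rather than the speaker's code: the function name `estimate_num_speakers`, the descending sort, the ratio of each eigenvalue to the next one, and the small floor used to avoid division by zero are all assumptions.

```python
import numpy as np

def estimate_num_speakers(score_matrix, floor=1e-10):
    """Estimate the speaker count from a symmetric K x K similarity matrix.

    Eigenvalues are sorted in descending order and the estimate is the
    position where the ratio between adjacent eigenvalues is largest.
    """
    s = np.asarray(score_matrix, dtype=float)
    eigvals = np.sort(np.linalg.eigvalsh(s))[::-1]
    eigvals = np.maximum(eigvals, floor)   # assumed guard for tiny/negative values
    ratios = eigvals[:-1] / eigvals[1:]    # ratio of each eigenvalue to the next
    return int(np.argmax(ratios)) + 1

# Four clusters that fall into two clearly separated groups.
s_toy = np.array([[1.0, 0.9, 0.0, 0.0],
                  [0.9, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.9],
                  [0.0, 0.0, 0.9, 1.0]])
print(estimate_num_speakers(s_toy))
```

In the toy matrix the eigenvalues are 1.9, 1.9, 0.1 and 0.1, so the largest gap sits after the second eigenvalue and the estimate is two speakers.
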
| 0:07:45 | with the given or the estimated number of speakers in this slide and the following | 
|---|
| 0:07:50 | we will show how the cluster selection works in but we with the | 
|---|
| 0:07:55 | latter selecting is this and after of probability clusters of i wonder what i | 
|---|
| 0:08:00 | now we achieve this by first finding out all of the combinations and after | 
|---|
| 0:08:05 | in these is to be the index set i one to | 
|---|
| 0:08:10 | after that we will compute the sub score matrix for each combination by extracting | 
|---|
| 0:08:15 | the corresponding rows and columns from s | 
|---|
| 0:08:18 | the sub score matrix will be of the dimension | 
|---|
| 0:08:22 | now takes a factor and i | 
|---|
| 0:08:27 | on the sub score matrix we will then do the eigenvalue decomposition and each | 
|---|
| 0:08:32 | of the in and found that the eigenvalues to be in one three | 
|---|
| 0:08:37 | finally the combination with the maximum eigenvalue summation will be | 
|---|
| 0:08:42 | used as the | 
|---|
| 0:08:43 | selected clusters | 
|---|
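
A sketch of the cluster selection step under a literal reading of the recording: enumerate every combination of n clusters out of k, extract the matching rows and columns of s, and keep the combination whose sub-matrix has the largest eigenvalue sum. The function name `select_clusters` and the toy matrix are hypothetical. Note that the sum of a symmetric matrix's eigenvalues equals its trace, so under this reading the result depends on the diagonal of s; the exact criterion used in the talk may differ.

```python
from itertools import combinations

import numpy as np

def select_clusters(score_matrix, n):
    """Pick n of the K clusters by exhaustive search over combinations."""
    s = np.asarray(score_matrix, dtype=float)
    best_idx, best_sum = None, -np.inf
    for idx in combinations(range(s.shape[0]), n):
        sub = s[np.ix_(idx, idx)]                # rows and columns of this combination
        total = np.linalg.eigvalsh(sub).sum()    # eigenvalue summation of the sub-matrix
        if total > best_sum:
            best_idx, best_sum = idx, total
    return best_idx

# Toy 3-cluster score matrix with differing diagonal entries.
s_sel = np.array([[3.0, 0.1, 0.1],
                  [0.1, 1.0, 0.1],
                  [0.1, 0.1, 2.0]])
print(select_clusters(s_sel, 2))
```

The exhaustive search grows combinatorially with k, which is workable only because early-stop clustering leaves a modest number of clusters.
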
| 0:08:47 | so that's all for the description of the algorithm next we will move to the experiments | 
|---|
| 0:08:53 | all experiments was having a i had use the money is being the data set | 
|---|
| 0:08:59 | consisting of two sets the development set and the evaluation | 
|---|
| 0:09:04 | set | 
|---|
| 0:09:05 | the duration of conversation varies from three hundred two hundred seconds | 
|---|
| 0:09:11 | the number of speakers in a conversation ranges from one to nine | 
|---|
| 0:09:15 | in our evaluation we used the diarization error rate der as | 
|---|
| 0:09:20 | the metric | 
|---|
| 0:09:22 | we used the ground truth segmentation | 
|---|
| 0:09:25 | as a temporal segmentation | 
|---|
| 0:09:28 | it has to be noted that if in the reference speaker a and speaker b | 
|---|
| 0:09:34 | have | 
|---|
| 0:09:35 | overlaps | 
|---|
| 0:09:36 | the overlapped segments will be used as individual segments | 
|---|
| 0:09:43 | in our experiments we have to model as opposed by being a bottleneck feature extractor | 
|---|
| 0:09:49 | with a given model no is an expensive extractor with the rest of the model | 
|---|
| 0:09:54 | for most of the models | 
|---|
| 0:09:56 | we used at an additional advantage as input feature and of course of the and | 
|---|
| 0:10:02 | one change how about static y and into | 
|---|
| 0:10:05 | in the model the acoustic input layer of the year is the carriage real compatibility | 
|---|
| 0:10:12 | with ease contextual between you really both that and the right size | 
|---|
| 0:10:17 | you has i hate enables the was well hidden layers well that one thousand and | 
|---|
| 0:10:24 | it will give for the dimension of the not hidden layer wise lda and he's | 
|---|
| 0:10:28 | being a output was used | 
|---|
| 0:10:31 | it can only be sure | 
|---|
| 0:10:33 | in our is known model there were nine convolutional class | 
|---|
| 0:10:38 | only that we had no we'll collection they are five thousand and to you or | 
|---|
| 0:10:42 | than the ones we may go after that to well collection labels were used up | 
|---|
| 0:10:48 | to this green a the are one of the may five one hundred and twenty | 
|---|
| 0:10:53 | eight | 
|---|
| 0:10:53 | well use i x | 
|---|
| 0:10:55 | in both models a five of the classification a while the number of training speakers | 
|---|
| 0:11:01 | at least eleven thousand three hundred and if we | 
|---|
| 0:11:08 | we used the conventional ahc as the baseline in both the conventional | 
|---|
| 0:11:14 | clustering and the early-stop clustering we used the x-vector combined with | 
|---|
| 0:11:20 | cosine distance as the speaker modeling and speaker similarity measurement | 
|---|
| 0:11:26 | in the number of speakers estimation and cluster selection in our early-stop clustering framework we used | 
|---|
| 0:11:34 | the bic score as speaker similarity | 
|---|
| 0:11:39 | in the re-segmentation phase | 
|---|
| 0:11:41 | way used a speaker pair of each point we duration | 
|---|
| 0:11:49 | well the name | 
|---|
| 0:11:49 | we first carried out experiments in the scenario where the number of speakers was given | 
|---|
| 0:11:55 | this table shows the performance comparison between the conventional ahc and the proposed early-stop | 
|---|
| 0:12:00 | clustering on development and evaluation sets respectively | 
|---|
| 0:12:06 | from the comparison we can see that the early-stop clustering | 
|---|
| 0:12:12 | can provide better performance than the conventional ahc | 
|---|
| 0:12:17 | to understand the reason for the computer there already | 
|---|
| 0:12:21 | we have a purity after the whole clustering process of the two systems that control | 
|---|
| 0:12:28 | case is given by | 
|---|
| 0:12:30 | in the evaluation | 
|---|
| 0:12:33 | to be the same page speech that's was required | 
|---|
| 0:12:37 | to be in those in speaker at the reference from the comparison to | 
|---|
| 0:12:42 | we have seen the superiority of early-stop clustering i know how | 
|---|
| 0:12:47 | high-level speaker correctly | 
|---|
| 0:12:50 | that it can provide a better initialization for the re-segmentation phase | 
|---|
| 0:12:57 | then we continued our experiments in the scenario where the number of speakers was not | 
|---|
| 0:13:02 | given | 
|---|
| 0:13:03 | this table shows the performance comparison between the conventional ahc and the proposed early-stop | 
|---|
| 0:13:09 | clustering | 
|---|
| 0:13:10 | on development and the evaluation sets respectively | 
|---|
| 0:13:14 | from the comparison we can see that the early-stop clustering can achieve better performance than | 
|---|
| 0:13:19 | ahc | 
|---|
| 0:13:22 | or address l when used the | 
|---|
| 0:13:25 | a report the results reported by different schemes | 
|---|
| 0:13:32 | to have a family known database of various clustering with a number of speakers right | 
|---|
| 0:13:38 | now again by the way how does advantage of speech in the development set was | 
|---|
| 0:13:43 | estimated numbers of because what's more than or equal to the ground truth actually you | 
|---|
| 0:13:49 | this paper | 
|---|
| 0:13:51 | this shows that early-stop clustering estimates the number of speakers more accurately well | 
|---|
| 0:13:57 | as the number of estimated on us because it was not ground truth | 
|---|
| 0:14:03 | combined with those because right here as you know strangely enough three people this can | 
|---|
| 0:14:09 | help us to understand | 
|---|
| 0:14:11 | the database of the audience to a jury | 
|---|
| 0:14:18 | asked but experiments you know only those a threshold in both systems | 
|---|
| 0:14:23 | right | 
|---|
| 0:14:24 | results actually you don't actually got we evaluate the threshold zero point one to the | 
|---|
| 0:14:30 | row of table one the paper we have seen that the early-stop clustering | 
|---|
| 0:14:34 | provided statistically better performance than ahc | 
|---|
| 0:14:39 | well | 
|---|
| 0:14:40 | the early-stop clustering bad rich mess that interesting that means that the early-stop | 
|---|
| 0:14:45 | clustering is less sensitive to the threshold | 
|---|
| 0:14:48 | and more robust than ahc | 
|---|
| 0:14:53 | finally we will come to a conclusion | 
|---|
| 0:14:55 | in this paper we proposed an early-stop strategy for ahc based speaker diarization | 
|---|
| 0:15:00 | consisting of two steps | 
|---|
| 0:15:03 | second the number of initial clusters natural and he's anything man phenomenon that's because then | 
|---|
| 0:15:09 | we combine no extraneous | 
|---|
| 0:15:11 | after into the have a few number of speakers | 
|---|
| 0:15:14 | the database of the proposed method was just a better from two aspects | 
|---|
| 0:15:19 | better performance than ahc based speaker diarization whether the | 
|---|
| 0:15:25 | number of speakers was given or not | 
|---|
| 0:15:28 | the second one is the proposed similarity matrix based estimate of the | 
|---|
| 0:15:34 | number of speakers and the resultant of speaker and a half of context of threshold | 
|---|
| 0:15:38 | setting process relatively simple and robust | 
|---|
| 0:15:44 | that's all of my presentation thank you | 
|---|