0:00:15 | Hi. First I want to thank you all for staying here, because I thought that only me, my colleagues and the cameraman would be here.

0:00:26 | The work was done by my colleague, and he should present this work, but unfortunately about ten days ago he got married.

0:00:46 | So he preferred to go to Costa Rica, or somewhere else, instead of coming here to present the work. He also used to tease me, and now I am stuck with this work, so you will have to suffer me for about ten minutes.

0:01:09 | After that: most of this workshop was also about DNNs, and especially after some very good talks I felt that I also want to say something about it, so I will briefly talk about it a little bit. Then I give a motivation for the clustering problem, present the basic mean shift algorithm and the modification we made, and then I present the clustering system, some experiments and a summary.

0:01:45 | So, a few words about the DNNs... okay, next.

0:01:57 | So, our problem is as follows. We have a taxi station with many taxis; each taxi has not one driver, and the drivers also change. We have recordings for quite a while, two days, three days. The drivers push to talk, so we know exactly the start and the end of each segment, and we collected these recordings.

0:02:36 | And at the end of the day we want to know which segments were said by one speaker and which by another speaker. Each one can talk at any time; we don't know where or when. One speaker can talk now, and then the next time he speaks after two or three hours, maybe tomorrow, from a different car.

0:03:04 | So what we have is a bag of segments which are unlabeled, and we want to cluster them. Mostly these segments are very short, on average one and a half to two seconds. So we want to cluster short segments, and usually the population will be thirty speakers, forty speakers and so on.

0:03:37 | So this is our problem: given many short segments, we want to cluster them into homogeneous groups. This means that we want to have good cluster purity, so that each cluster is occupied mostly by one speaker only; but we also want to have good speaker purity, so that the same speaker is not spread between ten clusters.

0:04:19 | The basic mean shift algorithm is as follows. We have many vectors; we choose a vector and find its neighborhood, the set of close vectors: we take all the vectors whose distance is below some threshold, and then shift the point to the weighted mean of these vectors. We take that mean as the new reference point, again look for the neighbours below the threshold, calculate the new mean, and so on until we converge to some point. And this is what we do with each point, with each vector.

0:05:13 | This algorithm was presented here many times, so for more details please refer to the papers.

0:05:21 | and |

0:05:23 | After we find the stable point of each vector, we group all the points whose modes are close to each other according to some threshold. The number of groups we get is the number of clusters, and the points in each group are one cluster. But we know that the Euclidean distance is not very good for the purpose of speaker clustering.
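As a rough illustration of the basic procedure just described, here is a minimal sketch with a flat kernel and Euclidean distance; all function names and parameter values are illustrative, not taken from the talk:

```python
import numpy as np

def mean_shift(X, bandwidth, tol=1e-5, max_iter=100):
    """Shift every vector to its local density mode (flat kernel).

    X: (n, d) array.  Neighbours within `bandwidth` of the current
    point are averaged; the point moves to that mean and the process
    repeats until the shift is smaller than `tol`."""
    modes = X.astype(float).copy()
    for i in range(len(modes)):
        m = modes[i]
        for _ in range(max_iter):
            dist = np.linalg.norm(X - m, axis=1)
            neigh = X[dist < bandwidth]       # all vectors below the threshold
            if len(neigh) == 0:
                break
            new_m = neigh.mean(axis=0)        # shift to their mean
            done = np.linalg.norm(new_m - m) < tol
            m = new_m
            if done:
                break
        modes[i] = m
    return modes

def group_modes(modes, merge_thresh):
    """Group converged points that are close to each other; the number
    of groups is the number of clusters."""
    labels = np.full(len(modes), -1, dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for c, ctr in enumerate(centers):
            if np.linalg.norm(m - ctr) < merge_thresh:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels
```

With two well-separated groups of points, every vector converges to its own group's centre and `group_modes` reports two clusters.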

0:05:57 | So, where previous work used the cosine distance, we now use PLDA scoring instead of cosine. Instead of looking for the closest vectors in the sense of Euclidean distance, we look for the highest PLDA scores, and calculate the new mean, where the function g is the weight, and the weight is basically the PLDA score. The other difference we made is that we do not use a threshold to look for close vectors below the threshold; instead we set some k and take the k nearest vectors, those which have the highest PLDA score.

0:07:07 | So basically these are the equations. In this case the threshold is not fixed like in the original algorithm; it depends on the distance of the k-th nearest vector. But we run the same mean shift algorithm: we calculate the mean according to these k nearest vectors, shift the mean, and again continue the process.
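The k-nearest-neighbor update just described can be sketched as follows. The PLDA score is replaced here by a simple stand-in similarity, since the trained PLDA model itself is not reproduced in this sketch, and exponentiating the scores into positive weights is an implementation detail of the sketch, not necessarily the talk's exact weighting:

```python
import numpy as np

def knn_mean_shift(X, score_fn, k, tol=1e-5, max_iter=100):
    """kNN variant of mean shift: instead of a fixed distance
    threshold, each shift uses the k vectors with the highest
    similarity score, and those scores act as the weights g(.)."""
    modes = X.astype(float).copy()
    for i in range(len(modes)):
        m = modes[i]
        for _ in range(max_iter):
            s = score_fn(m, X)                 # higher score = closer
            idx = np.argsort(s)[-k:]           # k nearest under the score
            w = np.exp(s[idx] - s[idx].max())  # scores -> positive weights
            new_m = (w[:, None] * X[idx]).sum(axis=0) / w.sum()
            done = np.linalg.norm(new_m - m) < tol
            m = new_m
            if done:
                break
        modes[i] = m
    return modes

# Stand-in similarity for illustration only: negative squared distance.
# The system in the talk uses the PLDA score here instead.
def neg_sqdist(m, X):
    return -((X - m) ** 2).sum(axis=1)
```

Because the k-th highest score plays the role of the threshold, the effective neighborhood size adapts to the local data instead of being fixed in advance.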

0:07:48 | Here we have i-vectors, but slightly modified i-vectors, because, as I will explain, we make a small modification to them, and we apply the mean shift algorithm according to the PLDA score, and we get the results. As I mentioned, we compare to the previous work, in which the threshold was fixed, the cosine distance was used, and random mean shift was used, meaning that we don't go across all the points but only randomly choose them.

0:08:32 | So before clustering we of course need to train a UBM and a total variability matrix. But before using the PLDA score we found that it is better to do PCA on the data, which is just another projection. So we reduce our i-vectors from dimension four hundred to two hundred and fifty. We tried to compare this with simply extracting i-vectors of size two hundred and fifty, and the PCA version worked better. We are not sure why, but in fact it was better. Next we do whitening, and apply the PLDA score on these vectors.

0:09:26 | Okay, so this was explained before. The experimental setup was that we used NIST 2008 data, which we cut into short segments, on average two and a half seconds, and we have on average about five segments per speaker.

0:09:57 | Now, we evaluate the results according to the average speaker purity, the average cluster purity, and the K parameter; the other important measure is how many clusters we have at the end compared to the true number of clusters.
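The measures just listed can be sketched as follows, using the standard definitions of average cluster purity (ACP), average speaker purity (ASP) and the combined K = sqrt(ACP x ASP) that comes up again in the questions; the exact weighting used in the talk's evaluation is assumed here, not confirmed:

```python
import numpy as np

def purity_metrics(ref, hyp):
    """ACP, ASP and K = sqrt(ACP * ASP) for a clustering.

    ref: true speaker label per segment; hyp: cluster label per
    segment.  Each purity is a segment-weighted average of squared
    occupancy fractions (per cluster, resp. per speaker)."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    n = len(ref)
    speakers, clusters = np.unique(ref), np.unique(hyp)
    # n_ij = number of segments of speaker i placed in cluster j
    nij = np.array([[np.sum((ref == s) & (hyp == c)) for c in clusters]
                    for s in speakers], dtype=float)
    acp = np.sum((nij ** 2).sum(axis=0) / nij.sum(axis=0)) / n
    asp = np.sum((nij ** 2).sum(axis=1) / nij.sum(axis=1)) / n
    return acp, asp, float(np.sqrt(acp * asp))
```

A perfect clustering gives ACP = ASP = K = 1; collapsing everything into one cluster keeps ASP at 1 while ACP drops, which is exactly the trade-off the two purities are meant to expose.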

0:10:22 | So, starting from the beginning: we start with the previous system of cosine distance with a fixed threshold; this is the red line. We see that the curve is very peaky, so we have to know exactly what the best threshold is to make the clustering work. When we use the k-nearest neighbors instead, we see that we have a plateau, so it makes little difference whether we choose, say, fourteen or seventeen; it is much more robust when we use the k-nearest neighborhood. All these results are for thirty speakers.

0:11:22 | Next, instead of using random mean shift, we use full mean shift, which is much more expensive computationally, but we see that we get some gain; this is still with the cosine distance. Then we switch from the cosine distance to the PLDA score, and we get a little bit more gain.

0:11:51 | I have to say that both the PLDA training and the WCCN training in the cosine system were done on short segments, not on long segments; shortly we will see why we did it on short segments. When we train the PLDA on long segments we get much worse results, but on short segments we improve the results dramatically. The total variability matrix was trained on long segments only; we didn't use short segments there, because it was very bad.

0:12:41 | All these results deal only with thirty speakers. And this is a summary of all the results: we see that it is better to move from a fixed threshold to the k-nearest neighbors, to go from random mean shift to full mean shift, and to move to PLDA, and this combination gives the best results.

0:13:11 | A no less important problem is how many clusters we have after the clustering process, compared to the actual number of clusters. The red line again corresponds to the fixed threshold, and if you look at the result it is not so nice: there are truly forty-six speakers, but it was estimated as about one hundred and eighty clusters. This means we have many small clusters; they are very pure, but small, and there are too many of them. But when we use the k-nearest neighbors, we can see that we are within about a factor of two of the true number of clusters. So we have a better K, better clustering performance, with far fewer clusters.

0:14:23 | These are the results when we use the cosine distance with a fixed threshold, for different numbers of speakers, from three to one hundred and eighty-eight. When we compare with the proposed algorithm, we will see that in this case the cluster purity is better; it is understandable: there are many small clusters and they are all pure, of one segment or two segments. But the K, the overall result, is better with our algorithm. As for the average number of clusters, you can see that for three speakers it is okay, but as we scale up to one hundred and eighty-eight speakers we get almost ten times more clusters than the true number of speakers.

0:15:25 | When we go to the PLDA with the k-nearest neighbors, we get better speaker purity and far fewer clusters, off by a factor of only about one and a half to two. And this summarizes the results: for three and seven speakers we get slightly better results with cosine and a fixed threshold, but from fifteen speakers and up we prefer the PLDA score with the k-nearest neighbors; we see this both in the K results and in the number of clusters.

0:16:22 | Okay, so to conclude: we proposed a new system which gives better clustering performance with a much smaller number of clusters. We pay for this a little bit computationally, because we moved from random mean shift to full mean shift. And that's all I have to say, thank you.

0:16:58 | Do we have questions?

0:17:11 | Thank you. A short remark and question: for short-utterance clustering you get better results when training the PLDA on short utterances, while training on the longer utterances was, somewhat surprisingly, disappointing. Could you explain this, and would it be possible to improve the result?

0:17:43 | We think it is because, if you train it on the long segments, there would be a big mismatch between the training condition and the testing condition: if we train the PLDA on long segments and then calculate scores on i-vectors from short segments, it would be somewhat inappropriate.

0:18:07 | But maybe with longer segments the speaker subspace would be estimated more correctly?

0:18:17 | Basically it would be much more accurate, but not for our problem. So yes and no; I think that there should be some trade-off between the accuracy of the PLDA training and matching the true problem.

0:18:37 | Yes. (brief inaudible exchange)

0:18:54 | Okay, thank you for your presentation. Can you please go back to the results section where you showed the values of k and the number of speakers? Maybe this one, yes. Here you are increasing the number of speakers while the value of k is fixed, and the results are going down. Did you try different values of k for different numbers of speakers? To be clear, this K is the square root of the product of ASP and ACP, not the k of the k-nearest neighbors; I mean, is the k shown here the best one, or is the result the one with the best k?

0:19:50 | As you can see in the rows, there is no big difference whether you use fourteen, fifteen or seventeen for each number of speakers. For each number of speakers k is fixed, and we can use these values almost interchangeably for any number of speakers that we tested; it reaches a plateau and stays there. I assume that if we increased k to fifty or seventy the results would start to go down at some point, but for a reasonable k size you get almost the same results.

0:20:45 | What data did you use to train your PLDA when you used short segments?

0:20:54 | The same data that we used for the UBM; I don't remember exactly, you would have to go to Costa Rica to ask him.

0:21:03 | Anyway, it is not from the test set; let's say it is the same development set used for training the UBM, and we took part of it, cut it into short segments, and trained the PLDA on them.

0:21:21 | But to get the short segments you are taking multiple short segments per telephone call, right?

0:21:28 | We take a phone call and make multiple segments out of it, yes, but we choose them randomly, so they come from different sessions for the same speaker.

0:21:37 | Okay, so could it be that several of them use the same short segments from the same phone call?

0:21:42 | It could be that several of them are, but we just choose randomly.

0:21:48 | I just asked because, jumping back to the earlier question: what we have seen, not for clustering so maybe it is a different thing, but in terms of the PLDA parameters, is that you do better training those on the longer ones even when doing short-duration tests. That was for speaker recognition, so it may not carry over. I was only asking about the data because, if you did a random selection, then yes, it is very unlikely that it was concentrated from the same call.

0:22:20 | That was my point too. And about the results: all the segments for the clustering on the test set were also chosen randomly, and we ran each experiment ten times, except for the last one with one hundred and eighty-eight speakers, because there are only one hundred and eighty-eight speakers in the dataset, so we couldn't choose randomly.


0:23:02 | First of all, one of the things that I liked in the original mean shift algorithm was its probabilistic interpretation, in the fact that the analysis starts with a non-parametric density estimation, meaning that at each point you create a small pdf, either Gaussian or triangular, say. With a triangular kernel you end up with the kind of threshold which is the uniform window; with a Gaussian kernel you end up again with a Gaussian, because of the differentiation. And therefore the update rule is derived by simple differentiation, in order to find the mode, which is the point of convergence. I am wondering, if you choose a PLDA score instead, so that you use neither the cosine distance nor a standard squared distance, which was the initial choice: the question is whether this update rule still comes naturally from the same mechanism, from a non-parametric density estimation, as you explained it.

0:24:28 | As you have stated it, the answer is no. We haven't derived one; it is more heuristic, but it works, so it is useful.
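For reference, the classical connection the questioner describes, where the update rule falls out of differentiating a kernel density estimate, can be written as follows (this is the standard textbook mean shift derivation, not taken from the talk's slides):

```latex
\hat{f}(x) \;=\; \frac{c_{k,d}}{n h^{d}} \sum_{i=1}^{n}
  k\!\left( \left\lVert \frac{x - x_i}{h} \right\rVert^{2} \right)

% Differentiating, with profile g(t) = -k'(t):
\nabla \hat{f}(x) \;\propto\;
  \Bigl[ \sum_{i=1}^{n} g_i \Bigr]
  \Biggl[ \frac{\sum_{i=1}^{n} x_i\, g_i}{\sum_{i=1}^{n} g_i} \;-\; x \Biggr],
  \qquad
  g_i \;=\; g\!\left( \left\lVert \tfrac{x - x_i}{h} \right\rVert^{2} \right)
```

The second bracket is the mean shift vector, so the update x &larr; &Sigma;&nbsp;x_i g_i / &Sigma;&nbsp;g_i ascends the estimated density toward a mode. Replacing g with a PLDA score, as in the talk, keeps the same update formula but gives up this density-gradient guarantee, which matches the presenter's answer.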

0:24:41 | Any other questions? Okay, thank you again.

0:24:48 | So, the next speaker, please.