0:00:15 Hi. First I want to thank you all for staying here, because I thought that only my colleagues and the cameraman would be here. The work was done by my colleague, and he was the one who should have presented it, but unfortunately about ten days ago he got married, so he preferred to go to Costa Rica, or somewhere else, rather than come here and present the work. So I am stuck with it, and you will have to suffer me for about ten minutes. Also, since most of this workshop is on DNNs, I felt I should say something about that too, so I will briefly mention it. Then I will give the motivation for the clustering problem, describe the basic mean shift algorithm and the modifications we made, and then present the clustering system, some experiments, and a summary.

So, our problem. Imagine a taxi station: there are many taxis, each car has more than one driver, the drivers change, and we have recordings over quite a long time, two or three days of the drivers talking. We know exactly the start and the end of each segment. We collect these recordings, and at the end of the day we want to know which segments were said by which speaker. Each time someone talks we don't know who it is; one speaker talks now, and the next time he speaks may be after two or three hours, maybe from a different car. So what we have is a bag of unlabeled segments, and mostly these segments are very short, on average one and a half to two seconds. We want to cluster these short segments, and usually the speaker population is quite wide: thirty speakers, forty speakers, and so on.

So this is our problem: given many short segments, we want to cluster them into homogeneous groups. That means we want high cluster purity, so that each cluster is occupied mostly by one speaker only; but we also want speaker purity, so that the same speaker is not spread over ten clusters.

The basic mean shift algorithm is this: we have many vectors; we choose a vector, look at the set of vectors close to it, take all the vectors whose distance is below some threshold, and shift to the weighted mean of these vectors. We take that mean as the new reference point, again look for the neighbors below the threshold, calculate the mean, and so on until we converge to some point. We do this for each vector. Themos has talked about this algorithm many times, so for more details please refer to his papers. After we find the stable point of each vector, we group all the stable points which are close to each other according to some threshold; the number of groups is the number of clusters, and the points in each group are the members of that cluster.

But we know that Euclidean distance is not very good for the purpose of speaker clustering, so previously this was presented with cosine distance, and now we use PLDA scoring instead of cosine. Instead of looking for the closest vectors in the sense of Euclidean distance, we look for the highest PLDA scores and calculate the new mean, where the function g is the weight, and the weight is basically the PLDA score. The other difference we made is that we do not use a threshold to find the close vectors; instead we use k-nearest-neighbors.
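To make the modified update concrete, here is a minimal sketch of the k-nearest-neighbor mean shift just described, assuming a generic `score(m, X)` callable standing in for the PLDA scorer; the function name and the clipping of negative scores are my assumptions, not the authors' code. Plugging in a Gaussian kernel over Euclidean distance instead recovers the classic algorithm.

```python
import numpy as np

def knn_mean_shift(X, score, k=15, n_iter=50, tol=1e-6):
    """Shift each vector to the weighted mean of its k highest-scoring
    neighbours until convergence. `score(m, X)` returns a similarity of
    the current mean m against every row of X (a stand-in for the PLDA
    score; a Gaussian kernel on Euclidean distance gives the classic
    algorithm)."""
    stable = np.empty_like(X)
    for i, x in enumerate(X):
        m = x.copy()
        for _ in range(n_iter):
            s = score(m, X)                  # similarity of m to all points
            nn = np.argsort(s)[-k:]          # k nearest = k highest scores
            w = np.maximum(s[nn], 0) + 1e-9  # scores act as the weights g(.)
            m_new = (w[:, None] * X[nn]).sum(0) / w.sum()
            if np.linalg.norm(m_new - m) < tol:
                break
            m = m_new
        stable[i] = m
    return stable
```

After this loop, the `stable` points are what the grouping step mentioned a moment ago merges into clusters.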
We take the k vectors which have the highest PLDA score. So basically, with these equations, the threshold is no longer fixed as in the original algorithm; it depends on the distance to the k-th nearest vector. Otherwise we calculate the same mean shift: we compute the mean according to these k nearest vectors, shift the mean, and continue the process. The inputs are i-vectors, or rather modified i-vectors, because, as I will explain, we apply a small modification to them; then we apply the mean shift algorithm according to the PLDA score and obtain the results. As I mentioned, in the previous work we compare against, the threshold was fixed, the cosine similarity was used, and random mean shift was used, meaning the algorithm does not go over all the points but only over a randomly chosen subset.

Before clustering we of course need to train the UBM and the total variability matrix. Also, before using the PLDA score, we found that it is better to do PCA on the data: with PCA we reduce the i-vectors from a dimension of four hundred down to two hundred fifty. We compared this to simply extracting i-vectors of size two hundred fifty, and the PCA version worked better. We are not sure why, but the fact is that it was better. Next we do whitening, and then apply the PLDA scoring on these vectors.

The experimental setup: we used NIST 2008 data, which we cut into short-duration segments, on average around two and a half seconds, with an average of about five segments per speaker. We evaluate the results according to average speaker purity (ASP), average cluster purity (ACP), and the K parameter; another important measure is how many clusters we have at the end compared to the true number of clusters.

Starting from the beginning: the baseline system uses cosine distance with a fixed threshold; this is the red line, and you can see a sharp peak, meaning we must know the best threshold quite exactly for the clustering to work. When we use k-nearest-neighbors instead of thresholding, we see a plateau, so it makes little difference whether we choose fourteen or seventeen; it is much more robust. All these results are for thirty speakers. Next, instead of random mean shift we use full mean shift, which is much more expensive computationally, but we get some gain, still with cosine distance. Then we switch from the cosine distance to the PLDA score and get a bit more gain.

I have to say that both the PLDA training and the LDA and WCCN training in the cosine system were done on short segments, not on long segments; shortly we will see why. When we trained the PLDA on long segments we got worse results; training on short segments improved the results dramatically. The total variability matrix, however, was trained on long segments only; using short segments there was much worse. All the results so far are with thirty speakers only.

This is a summary of all the results. We see that it is better to move from a fixed threshold to k-nearest-neighbors, to go from random mean shift to full mean shift, and to move to PLDA. Another important issue is how many clusters we get after the clustering process compared to the actual number of clusters. The red line in this plot is again the cosine system with a fixed threshold.
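Two small sketches may help here. First, the front-end just described: PCA from four hundred down to two hundred fifty dimensions followed by whitening, before PLDA scoring. Only the dimensions come from the talk; the rest is a generic implementation, not the authors' code.

```python
import numpy as np

def pca_whiten(ivectors, out_dim=250):
    """Reduce 400-dim i-vectors to 250 dims with PCA, then whiten,
    before PLDA scoring (a sketch of the front-end described in the
    talk; dimensions follow the talk, the rest is generic)."""
    mu = ivectors.mean(axis=0)
    Xc = ivectors - mu
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
    W = Vt[:out_dim].T                 # projection to top components
    Z = Xc @ W
    Z /= Z.std(axis=0, keepdims=True) + 1e-12  # unit variance per dim
    return Z, mu, W
```

Second, the evaluation measures used in these plots: average cluster purity, average speaker purity, and K, which (as clarified in the questions later) is the square root of their product. These follow the standard purity definitions; the names are mine.

```python
import numpy as np

def purity_scores(speakers, clusters):
    """ACP, ASP and K = sqrt(ACP * ASP) from two integer label arrays,
    one entry per segment."""
    speakers, clusters = np.asarray(speakers), np.asarray(clusters)
    N = len(speakers)
    _, spk_idx = np.unique(speakers, return_inverse=True)
    _, clu_idx = np.unique(clusters, return_inverse=True)
    # n[i, j] = segments of speaker i assigned to cluster j
    n = np.zeros((spk_idx.max() + 1, clu_idx.max() + 1))
    np.add.at(n, (spk_idx, clu_idx), 1)
    acp = ((n ** 2).sum(axis=0) / n.sum(axis=0)).sum() / N
    asp = ((n ** 2).sum(axis=1) / n.sum(axis=1)).sum() / N
    return acp, asp, np.sqrt(acp * asp)
```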
If you look at that result, it is not so nice: the true number of speakers is forty-six, but it was estimated at about one hundred eighty clusters. That means many small clusters; they are very pure, but small and far too many. When we use k-nearest-neighbors instead, we are off by only about a factor of two, around sixty-two clusters, so we get a better K, better clustering performance, with far fewer clusters.

These are the results when we use the cosine distance with a fixed threshold for different numbers of speakers, from three up to one hundred eighty-eight. When we compare with the proposed algorithm, we see that the cluster purity of the baseline is better, which is understandable: it makes many small clusters of one or two segments each, and they are all pure. But the K, the overall result, is better with our algorithm. As for the average number of clusters, you can see that for three speakers it is okay, but as we scale up to one hundred eighty-eight speakers it is off by almost a factor of ten; there are far more clusters than the true number of speakers. When we go to the PLDA we again get better speaker purity and far fewer clusters, off by a factor of about one and a half to two.

This summarizes the results. For three and seven speakers we get slightly better results with cosine and a fixed threshold, but from fifteen speakers and up, the PLDA score with k-nearest-neighbors is preferable; we see it both in the K results and in the number of clusters. So, to conclude: we proposed a new system which gives better clustering performance with a much smaller number of clusters. We pay for this computationally, because we moved from random mean shift to full mean shift. That is all I have to say. Do we have questions?

Q: A short remark: it is interesting that for short-utterance clustering you got better results when the PLDA was trained on short utterances, removing the longer ones. That is a bit surprising. Can you explain it?

A: If you train it on the long segments, there is a big mismatch between the training condition and the test condition: training on long segments and then scoring i-vectors extracted from short segments would not be appropriate.

Q: But the speaker subspace estimated from longer utterances should basically be much more accurate.

A: More accurate in general, yes, but not for our problem. I think there is a trade-off between the accuracy of the trained model and how well it matches the actual problem.

Q: Thank you for your presentation. Can you please go back to the results slide where you showed the values of k against the number of speakers? Here, you increase the number of speakers while the value of k stays fixed, and the results go down. Did you try different values of k for different numbers of speakers?

A: The capital K is the square root of the product of the average speaker purity and the average cluster purity; the small k is the k of the k-nearest-neighbors. These are the results with the best k, but as you saw before, there is no big difference whether you use fourteen, fifteen, or seventeen. For each number of speakers k is fixed, and we can use almost the same value for any number of speakers we tested; it reaches a plateau and stays there. I assume that if we increased k to fifty or seventy the results would begin to go down at some point, but for a reasonable k you get almost the same results.
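Since the estimated number of clusters keeps coming up in these comparisons, here is a minimal sketch of the grouping step mentioned at the start: converged mean shift points that fall within a merge threshold of each other form one cluster, and the number of groups is the estimated number of clusters. The greedy Euclidean merge is my simplification; the scores discussed above could be used instead.

```python
import numpy as np

def group_modes(stable, merge_threshold):
    """Greedily merge converged mean shift points lying within
    `merge_threshold` of each other; each group is one cluster."""
    centers = []                           # one representative per cluster
    labels = np.empty(len(stable), dtype=int)
    for i, p in enumerate(stable):
        for c, center in enumerate(centers):
            if np.linalg.norm(p - center) < merge_threshold:
                labels[i] = c
                break
        else:                              # no close center: new cluster
            labels[i] = len(centers)
            centers.append(p)
    return labels, len(centers)            # estimated number of clusters
```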
Q: What data did you use to train your PLDA when you used short segments?

A: The same data that we used for the UBM; I don't remember exactly, you would have to go to Costa Rica and ask my colleague. Anyway, it is not from the test set; let's say we took the same development set used for training the UBM, took part of it, cut it into short segments, and trained the PLDA on those.

Q: For the short segments, you are taking multiple short segments per telephone call, right?

A: We take a phone call and make multiple segments out of it, yes, but they are chosen randomly, so they come from different sessions of the same speaker.

Q: So could it happen that several of them come from the same phone call?

A: It could happen that several of them do, but we just choose randomly.

Q: I ask because, going back to the earlier question, what we have seen, not for clustering, so it may be a different thing, is that for the PLDA parameters you do better training them on the longer utterances even when testing on short durations. That was for speaker recognition, so it may not carry over. That is why I asked whether the data selection was random.

A: Yes, it was a random selection, so it is very unlikely that it was concentrated in the same call. Also, all the segments for the clustering on the test set were chosen randomly, and we repeated each experiment ten times, except for the last one with one hundred eighty-eight speakers, because there are only one hundred eighty-eight speakers in the dataset, so we could not sample randomly there.

Q: First of all, one of the things I like about the original mean shift algorithm is its probabilistic interpretation: the analysis starts with non-parametric density estimation, meaning that at each point you place a small kernel, Gaussian or triangular, or a uniform kernel, which gives the threshold variant. Because these kernels are differentiable, the update rule is derived by simple differentiation in order to find the modes, the points of convergence. I am wondering, when you use a PLDA score instead of the cosine distance or the standard squared distance of the original formulation, does the update rule still come naturally from the same mechanism, from a non-parametric density estimate?

A: No, we do not have that non-parametric density estimation interpretation here; but it works well in practice, which is what matters for our problem.

Okay, thank you. Next speaker.
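As a footnote to the data-preparation discussion in the questions: below is a sketch of the kind of random short-segment selection described, where each development call contributes multiple segments cut at random positions. The function and its parameters are illustrative assumptions, not the actual protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_short_segments(call_duration, n_segments, seg_len=2.0):
    """Pick `n_segments` random (start, end) windows of `seg_len` seconds
    from one call of `call_duration` seconds. Random placement means
    overlaps within a call are possible, but across a whole development
    set the training segments are unlikely to concentrate in one call."""
    starts = rng.uniform(0.0, call_duration - seg_len, size=n_segments)
    return [(s, s + seg_len) for s in np.sort(starts)]

# e.g. five random two-second cuts from a five-minute call
print(sample_short_segments(300.0, 5))
```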