0:00:15 Hi. First I want to thank you all for staying here, because I thought that only me, my colleagues and the cameraman would be here.
0:00:26 The work was done by my co-author, and he should have presented it himself, but unfortunately, about ten days ago he got married. So he preferred to go to Costa Rica, or somewhere like that, rather than come here and present the work; he asked me instead, so I am stuck with this work, and you will have to suffer me for about ten minutes.
0:01:09 After that... most of this workshop has also been about DNNs, so I felt that I also want to say something about that. So I will briefly talk about it a little bit, then I will give the motivation for the clustering problem, describe the basic mean shift algorithm and the modifications we made, and then present the clustering system, some experiments, and a summary.
0:01:45 So, about the DNNs... okay, next.
0:01:57 Our problem is the following. We have a taxi station with many taxis; each car has not just one driver, the drivers change, and we have recordings over quite a long time, two or three days. They speak using push-to-talk, so we know exactly where each segment starts and where it ends, and we collected all of this from the recording devices.
0:02:36 At the end of the day we want to know which segments were said by one speaker and which by another. Each time someone talks we don't know who it is; it can be that one speaker talks now and the next time he speaks is after two or three hours, maybe from a different car.
0:03:04 What we have is a bag of unlabeled segments, and we want to cluster them. Mostly these segments are very short, on average one and a half to two seconds, so we want to cluster short segments, and usually the speaker population is quite wide: thirty speakers, forty speakers, and so on.
0:03:37 So this is our problem: given many short segments, we want to cluster them into homogeneous groups. That means we want good cluster purity, so that each cluster is occupied mostly by one speaker only, but we also want speaker purity, so that the same speaker is not spread over ten clusters.
0:04:19 The basic mean shift algorithm works like this: we have many vectors, we pick one vector, find the set of vectors close to it, that is, take all the vectors whose distance is below some threshold, and shift the point to the weighted mean of these vectors. We take that mean as the new reference point, again look for the neighbours below the threshold, calculate the mean, and so on, until we converge to some point. We do this for each point, for each vector. This algorithm has been discussed here many times, so for more details please refer to the papers.
0:05:23 After we find the stable point of each vector, we group together all the stable points that are close to each other according to some threshold. The number of groups we get is the number of clusters, and the points in each group are the members of that cluster. But we know that the Euclidean distance is not very good for the purpose of speaker clustering.
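As an illustration, here is a minimal Python sketch of the fixed-threshold mean shift and the final grouping step described above; it uses a flat (unweighted) window for simplicity, whereas the talk mentions a weighted mean, so treat the details as assumptions rather than the exact system.

import numpy as np

def basic_mean_shift(X, window, merge_thr, n_iter=100, tol=1e-6):
    # Shift every vector to the mean of its neighbours inside a fixed
    # Euclidean window until it converges to a stable point, then group
    # stable points that are close to each other: one group = one cluster.
    modes = X.copy()
    for i in range(len(X)):
        m = X[i].copy()
        for _ in range(n_iter):
            dist = np.linalg.norm(X - m, axis=1)
            neighbours = X[dist < window]          # all vectors below the threshold
            if len(neighbours) == 0:
                break
            new_m = neighbours.mean(axis=0)        # shift to the mean of the window
            if np.linalg.norm(new_m - m) < tol:    # converged to a stable point
                break
            m = new_m
        modes[i] = m
    labels = np.full(len(X), -1, dtype=int)
    centers = []                                   # one representative per group
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < merge_thr:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, centers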
0:05:57 So, what was presented before used the cosine distance; here we use PLDA scoring instead of the cosine. Instead of looking for the closest vectors in the sense of Euclidean distance, we look for the vectors with the highest PLDA score and calculate the new mean, where the function g is the weight, and the weight is basically the PLDA score. The other difference is that we do not use a threshold to find the vectors below it; instead we use k-nearest-neighbours: we set some k and take the k nearest vectors, meaning the ones with the highest PLDA score.
0:07:07 So basically we keep all these equations, but in this case the threshold is not fixed as in the original algorithm; it depends on the distance of the k-th nearest vector. We then run the same mean shift algorithm: we calculate the mean according to these k nearest vectors, shift the mean, and continue the process again.
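As a rough sketch of this kNN variant, assuming a function plda_score(a, b) that returns the PLDA score between two i-vectors; using the scores directly as the weights, and clipping them to stay positive, are my own simplifications, not necessarily what the authors did.

import numpy as np

def knn_plda_shift(X, m, k, plda_score):
    # One shift: instead of a fixed window, take the k vectors with the
    # highest PLDA score against the current point m and move m to their
    # score-weighted mean.
    scores = np.array([plda_score(m, x) for x in X])
    knn_idx = np.argsort(scores)[-k:]                # k "nearest" = k highest PLDA scores
    weights = np.maximum(scores[knn_idx], 0) + 1e-8  # assumption: keep weights positive
    return np.average(X[knn_idx], axis=0, weights=weights)

def knn_plda_mean_shift(X, k, plda_score, n_iter=100, tol=1e-6):
    # Run the shift for every i-vector until it converges to a stable point;
    # the stable points are then grouped into clusters as in the basic version.
    modes = X.copy()
    for i in range(len(X)):
        m = X[i].copy()
        for _ in range(n_iter):
            new_m = knn_plda_shift(X, m, k, plda_score)
            if np.linalg.norm(new_m - m) < tol:
                break
            m = new_m
        modes[i] = m
    return modes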
0:07:48 So here we have the i-vectors, though I will explain shortly that we apply a small modification to them, and then we apply the mean shift algorithm according to the PLDA score, and we get the results I just mentioned. We compare against the previous work, where the threshold was fixed, the cosine score was used as the distance, and a random mean shift was used, meaning that we do not go over all the points but only over a randomly chosen subset of them.
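The difference between the random and the full mean shift is only in which points serve as starting seeds; a minimal sketch of that choice (the 10% sampling fraction is purely illustrative):

import numpy as np

# Full mean shift: every segment's i-vector is shifted to its own stable point.
# Random mean shift: only a random subset is shifted, which is much cheaper
# computationally but, as reported in the talk, somewhat less accurate.
def pick_random_seeds(X, fraction=0.1, seed=None):
    rng = np.random.default_rng(seed)
    n_seeds = max(1, int(fraction * len(X)))
    idx = rng.choice(len(X), size=n_seeds, replace=False)
    return X[idx]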
0:08:32 So before clustering we of course need to train a UBM and the total variability matrix. But before using the PLDA score, we found that it is better to first apply PCA to the data, just an additional PCA step: we reduce our i-vectors from a dimension of four hundred down to two hundred fifty. We tried to compare this with simply extracting i-vectors of size two hundred fifty, and the PCA version worked better. We are not sure why, but in fact it was better. Next we do whitening, and then apply the PLDA scoring on these vectors.
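A minimal sketch of that preprocessing chain as described (PCA from 400 down to 250 dimensions, whitening, then PLDA scoring); the scikit-learn PCA and the placeholder plda_score call are assumptions for illustration only.

import numpy as np
from sklearn.decomposition import PCA

def preprocess_ivectors(ivectors_400d, target_dim=250):
    # Reduce the 400-dimensional i-vectors to 250 dimensions with PCA;
    # whiten=True also scales each retained component to unit variance.
    pca = PCA(n_components=target_dim, whiten=True)
    reduced = pca.fit_transform(ivectors_400d)
    return reduced, pca

# The reduced, whitened vectors are then compared with a PLDA model, e.g.
#   score = plda_score(reduced[i], reduced[j])
# where plda_score stands for whatever PLDA implementation is used.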
0:09:26 Okay, so this was explained before.
0:09:34 The experimental setup was that we took NIST 2008 data and cut it into short segments, on average about two and a half seconds, with an average of about five segments per speaker. We evaluate the results according to the average speaker purity, the average cluster purity and the K parameter, and another important measure is how many clusters we end up with compared to the true number of clusters.
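For reference, these purity measures can be computed along the following lines; the formulas for average cluster purity (ACP), average speaker purity (ASP) and K = sqrt(ACP * ASP) follow the definitions commonly used in speaker clustering work, so this is a sketch rather than the exact scoring script used for the talk.

import numpy as np

def purity_measures(cluster_ids, speaker_ids):
    # counts[i, j] = number of segments from speaker j placed in cluster i
    clusters = list(np.unique(cluster_ids))
    speakers = list(np.unique(speaker_ids))
    counts = np.zeros((len(clusters), len(speakers)))
    for c, s in zip(cluster_ids, speaker_ids):
        counts[clusters.index(c), speakers.index(s)] += 1
    n = counts.sum()
    n_per_cluster = counts.sum(axis=1)
    n_per_speaker = counts.sum(axis=0)
    acp = np.sum((counts ** 2).sum(axis=1) / n_per_cluster) / n  # average cluster purity
    asp = np.sum((counts ** 2).sum(axis=0) / n_per_speaker) / n  # average speaker purity
    k = np.sqrt(acp * asp)                                       # the K parameter
    return acp, asp, k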
0:10:22 So, starting from the beginning: we begin with the previous system, cosine distance with a fixed threshold; that is the red line. We see that we have a problem: we need to know exactly what the best threshold is for the clustering to work. When we use k-nearest-neighbours instead of the threshold, we see that there is a plateau, so it does not make much difference whether we choose, say, thirteen or seventeen; it is much more robust when we use k-nearest-neighbours. All these results are for thirty speakers.
0:11:22 Next, instead of using the random mean shift, we use the full mean shift, which is much more expensive computationally, but we see that we get some gain; this is still with the cosine distance. Then we switch from the cosine distance to the PLDA score, and we get a little bit more gain.
0:11:51 I have to say that both the PLDA training and the WCCN training in the cosine system were done on short segments, not on long segments; shortly we will see why we did it on short segments. When we train the PLDA on long segments we get certain results, but training it on short segments improves the results dramatically. The total variability matrix, on the other hand, was trained on long segments only; we did not use short segments for it, because that was very bad.
0:12:41 All these results are still only with thirty speakers. Here is a summary of all the results: we see that it is better to move from a fixed threshold to k-nearest-neighbours, to go from the random mean shift to the full mean shift, and to move to PLDA, and this gives the best overall results.
0:13:11 A completely non-trivial problem is how many clusters we end up with after the clustering process compared to the actual number of clusters. The red line on the plot is the fixed-threshold system, and if you look at the result it is not so nice: there are really forty-six speakers, but the number of clusters was estimated at about one hundred eighty. That means we get many small clusters; they are very pure, but small, and there are far too many of them. But when we use k-nearest-neighbours, we can see that we are within about a factor of two, about sixty-two clusters, so we get a better K, better clustering performance, with far fewer clusters.
0:14:23 These are the results when we use the cosine distance with a fixed threshold, for different numbers of speakers, from three up to one hundred eighty-eight. When we compare with the proposed algorithm, we will see that in this case the cluster purity is better, which is understandable: there are many small clusters and they are all pure, with one or two segments each. But the K, the overall result, is better in our case, with our algorithm. And for the average number of clusters, you can see that for three speakers it is okay, but by one hundred eighty-eight speakers it is off by almost a factor of ten; we get far more clusters than the true number of speakers.
0:15:25 When we go to the PLDA with k-nearest-neighbours, we get better speaker purity and far fewer clusters, within about a factor of one and a half to two.
0:15:46 This summarizes the results. For three and seven speakers we get slightly better results with the cosine and a fixed threshold, but from fifteen speakers onwards we prefer the PLDA score with k-nearest-neighbours; we see this both in the K results and in the number of clusters.
0:16:22 Okay, so we proposed a new system which gives better clustering performance with a much smaller number of clusters. We pay for this a little bit computationally, because we moved from a random mean shift to a full mean shift. And that is all I have to say. Thank you.
0:16:58 Do we have any questions?
0:17:11 Thank you. You remarked that for short-utterance clustering you got better results when training on the short segments, but removing the longer utterances from the training seems a bit disappointing; would it not be possible to improve the model by also training it on the longer utterances?
0:17:43 The thing is that if you train it on the long segments, there will be a big mismatch between the training condition and the testing: if we train on long segments and then calculate the scores on i-vectors from short segments, it would be somehow inappropriate.
0:18:07 But maybe the speaker subspace would be estimated more correctly with the longer utterances?
0:18:17 Basically it would be much more accurate, but not for our problem. So yes, okay.
0:18:24 So the answer is yes and no: I think there has to be some trade-off between the accuracy of the PLDA training and matching the true problem.
0:18:37 Yes.
0:18:38 ...perhaps as an extension of the proposed method.
0:18:46 Right.
0:18:52 Okay, thank you.
0:18:54 Okay, thank you for your presentation. Can you please go back to the results section where you showed the values of k and the number of speakers? Maybe that one... go further... here. You are increasing the number of speakers while the value of k stays fixed, and the results are going down. Did you try different values of k for the different numbers of speakers?
0:19:27 First, the capital K is the square root of the product of the average speaker purity and the average cluster purity, while the small k is the k of the k-nearest-neighbours. The k shown here is the best one, that is, the result with the best k.
0:19:50 But as you saw before, across the rows there is no big difference whether you use fourteen, fifteen or seventeen: for each fixed number of speakers you can use these different values and get almost the same results, and this holds for any number of speakers that we tested; it reaches a plateau and stays there. I assume that if we increased k to fifty or seventy, the results would start to go down at some point, but for a reasonable k you get almost the same results.
0:20:45 What data did you use to train your PLDA when you used the short segments?
0:20:54 The same data that we used for the UBM; I don't remember exactly, you would have to go to Costa Rica and ask. Anyway, it is not from the test set: let's say we used the same development set as for training the UBM, took part of it, cut it into short segments, and trained the PLDA on those.
0:21:21 Right, but when you cut the short segments, you are taking multiple short segments per telephone call? You take one phone call and make multiple segments out of it?
0:21:32 Yes, but they are chosen randomly, so they come from different sessions for the same speaker.
0:21:37 Okay, but could it still happen that you use several short segments from the same phone call?
0:21:42 It could happen that several of them are, but we just choose them randomly.
0:21:48 I only ask because, jumping back to the earlier question, what we have seen, not for clustering, so maybe it is a different thing, is that in terms of the PLDA parameters you do better training them on the longer utterances, even when doing short-duration tests. That was for speaker recognition, so it may not carry over; it is the only reason I was asking about the data. Since you did a random selection, yes, it is very unlikely that it was concentrated in the same call.
0:22:20 That was my main point. Also, all the segments for the clustering on the test set were chosen randomly, and we ran each experiment ten times, except for the last one with one hundred eighty-eight speakers, because there are only one hundred eighty-eight speakers in the dataset, so we could not sample them randomly.
0:23:02 First of all, one of the things that I liked in the original mean shift algorithm was its probabilistic interpretation, in the fact that the analysis starts with a non-parametric density estimation, meaning that at each point you place a small PDF, either Gaussian or, say, triangular. With a triangular kernel you end up with the kind of threshold window, which is uniform; with a Gaussian kernel you end up again with a Gaussian, because of the differentiation; and so the update rule is derived by simple differentiation in order to find the mode, the point to which it converges. I am wondering, when you choose PLDA, so that you use neither the cosine distance nor the standard squared Euclidean distance that was the initial choice, can you tell us whether this update rule still comes naturally from the same mechanism, from a non-parametric density estimation as you explained?
0:24:23 Okay, so have we done such a density estimation? The answer is no, we haven't; it is more heuristic, but it works, it is useful.
0:24:41 Any more questions? Okay, thank you. So, on to the next speaker.