0:00:15 | Hi everyone, my name is Gaël and I'm working at Orange Labs in France.

0:00:24 | I'm going to talk about the concept of self-trained speaker diarization.

0:00:31 | So the application we are working on is

0:00:35 | the task of cross-recording speaker diarization, applied to French TV archives,

0:00:42 | and the goal is to index the speakers of collections of multiple recordings,

0:00:48 | in order, for example, to provide new means of dataset exploration by creating links between different episodes.

0:00:57 | So our system is based on a two-pass approach: we first

0:01:04 | process each recording separately, applying speaker segmentation and clustering,

0:01:10 | and then we perform cross-recording speaker linking, trying to link all within-recording clusters across the whole collection.
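For readers, here is a minimal structural sketch of that two-pass flow. The function bodies are placeholders and the helper names are hypothetical; only the control flow reflects what is described above, not the actual Orange Labs implementation.

```python
from typing import List, Tuple

# A within-recording cluster: (label, list of (start_sec, end_sec) segments).
Cluster = Tuple[str, List[Tuple[float, float]]]

def within_recording_diarization(wav_path: str) -> List[Cluster]:
    """Pass 1 (placeholder): segmentation + clustering of a single recording."""
    raise NotImplementedError  # stands in for SAD, BIC clustering, i-vector HAC

def speaker_linking(clusters: List[Cluster]) -> List[Cluster]:
    """Pass 2 (placeholder): link within-recording clusters collection-wide."""
    raise NotImplementedError  # stands in for collection-wide i-vector clustering

def diarize_collection(wav_paths: List[str]) -> List[Cluster]:
    per_recording: List[Cluster] = []
    for path in wav_paths:
        per_recording.extend(within_recording_diarization(path))
    return speaker_linking(per_recording)
```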

0:01:22 | Our framework is based on a state-of-the-art speaker recognition framework:

0:01:30 | we use i-vector PLDA modeling, and for clustering we use hierarchical agglomerative clustering.

0:01:39 | We know that the goal of PLDA is to maximize the between-speaker variability while minimizing the within-speaker variability.
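For reference, here is the common two-covariance formulation of PLDA; this is a standard textbook form, not necessarily the exact variant used in the talk. An i-vector w from speaker s is modeled as:

```latex
\begin{aligned}
  w &= m + y_s + \epsilon, \\
  y_s &\sim \mathcal{N}(0,\, \Sigma_b) \quad \text{(between-speaker variability)}, \\
  \epsilon &\sim \mathcal{N}(0,\, \Sigma_w) \quad \text{(within-speaker variability)},
\end{aligned}
```

so estimating the PLDA parameters amounts to estimating m, Sigma_b and Sigma_w from i-vectors labeled by speaker, which is why labeled multi-session data matters later in the talk.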

0:01:50 | So what we want to investigate in our paper is: can we use the target data as training material, and how well could we estimate the speaker variability?

0:02:07 | So first, I'm going to briefly present our framework. Let's take an audio file from the target data.

0:02:17 | Our target data is unlabeled, so we just have audio files.

0:02:21 | First we extract some features: we use MFCC features with delta and delta-delta coefficients.
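As an illustration of that front end, here is a hedged sketch using librosa; the file name and sample rate are assumptions, and the actual system uses the SIDEKIT front end mentioned later in the talk.

```python
import librosa
import numpy as np

# Sketch: 13 MFCCs plus delta and delta-delta, stacked into 39-dim features.
y, sr = librosa.load("episode.wav", sr=16000)        # hypothetical file name/rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
d1 = librosa.feature.delta(mfcc)                     # first-order derivatives
d2 = librosa.feature.delta(mfcc, order=2)            # second-order derivatives
features = np.vstack([mfcc, d1, d2])                 # shape (39, n_frames)
```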

0:02:29 | Then we perform a combination of speech activity detection and BIC clustering to extract speaker segments.

0:02:38 | On top of those segments we can extract i-vectors, using a pre-trained UBM and total variability matrix.

0:02:49 | Once we obtain the i-vectors, we are able to score all i-vectors against each other and compute a similarity score matrix.

0:02:59 | For that we use the PLDA likelihood ratio; the PLDA parameters are estimated separately.
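As an illustration, here is a hedged sketch of pairwise PLDA scoring in the two-covariance formulation sketched earlier. It assumes centered i-vectors and uses toy placeholder covariances; this is a generic textbook scoring, not the toolkit's exact implementation.

```python
import numpy as np

def plda_llr(w1, w2, Sb, Sw):
    """Log-likelihood ratio that centered i-vectors w1 and w2 share a speaker,
    under two-covariance PLDA with between/within covariances Sb and Sw."""
    d = len(w1)
    x = np.concatenate([w1, w2])
    Stot = Sb + Sw
    same = np.block([[Stot, Sb], [Sb, Stot]])            # shared speaker factor
    diff = np.block([[Stot, np.zeros((d, d))],
                     [np.zeros((d, d)), Stot]])          # independent speakers
    return _log_gauss(x, same) - _log_gauss(x, diff)

def _log_gauss(x, cov):
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (x @ np.linalg.solve(cov, x) + logdet + len(x) * np.log(2 * np.pi))

# Toy similarity matrix over a handful of i-vectors:
rng = np.random.default_rng(0)
ivecs = rng.standard_normal((8, 50))                     # 8 toy i-vectors, dim 50
Sb, Sw = 0.6 * np.eye(50), 0.4 * np.eye(50)              # placeholder covariances
scores = np.array([[plda_llr(wi, wj, Sb, Sw) for wj in ivecs] for wi in ivecs])
```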

0:03:09 | Once we have our similarity matrix, we can apply speaker clustering,

0:03:15 | and the result of the diarization is a set of speaker clusters.

0:03:21 | We can repeat the process for each of the recordings.

0:03:27 | Once we've done that, we can compute a collection-wide similarity matrix and repeat the clustering process.

0:03:36 | This time I call it speaker linking, because the goal is to link the within-recording clusters across the whole collection.
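A hedged sketch of this linking step: hierarchical agglomerative clustering on a precomputed score matrix, here with SciPy. The toy score matrix and the threshold are placeholders; the Q&A later mentions that the tuned threshold on PLDA scores ends up around zero.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
scores = rng.standard_normal((8, 8))
scores = (scores + scores.T) / 2            # toy symmetric collection-wide scores

plda_threshold = 0.0                        # stop merging below this PLDA score
dist = scores.max() - scores                # turn similarities into distances
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=scores.max() - plda_threshold, criterion="distance")
print(labels)                               # linked-speaker index per cluster
```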

0:03:45 | After the linking part, we obtain the cross-recording diarization.

0:03:54 | The usual way of training the UBM and TV matrix and estimating the PLDA parameters is to use a training dataset which is labeled by speaker; the training procedure is then pretty straightforward.

0:04:11 | The problem when we apply this technique is that we have some kind of mismatch between the target and training data: first, we don't have the same acoustic conditions,

0:04:25 | and second, we don't necessarily have the same speakers in the target and training data.

0:04:32 | So if we could use information about the target data, maybe we could get better results.

0:04:38 | What we want to investigate is the concept of self-trained diarization, meaning we would like to use only the target data itself to estimate the parameters.

0:04:51 | Then we are going to compare the results with a combination of target and training data.

0:05:00 | So the goal of self-trained diarization is to avoid the acoustic mismatch between the training and target data.

0:05:09 | So what do we need to train an i-vector PLDA system? To train the UBM and the TV matrix, we only need clean speech segments; the training is then straightforward.

0:05:22 | As for the PLDA parameter estimation, we need several sessions per speaker, in various acoustic conditions.

0:05:29 | So what we need to investigate is: do we have several speakers appearing in different episodes of the target data? And assuming we know how to effectively cluster the target data in terms of speakers, can we estimate PLDA parameters with those clusters?

0:05:48 | So let's have a look at the data. We have around two hundred hours of French broadcast news, drawn from previous French evaluation campaigns.

0:05:59 | So it's a combination of TV and radio data.

0:06:04 | Out of these two hundred hours, we selected two shows as target corpora: LCP Info and BFM Story.

0:06:15 | We took all the other available recordings to build what we call the train corpus.

0:06:24 | If we take a look at the data, we see that we have more than forty episodes for each show, and what we can note is the speech proportion of what I call the recurring speakers, which is above fifty percent for both corpora.

0:06:47 | A recurring speaker is a speaker who appears in more than one episode, as opposed to a one-time speaker, who appears in only one episode.
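For illustration, this is how such statistics can be computed from reference labels; the turn list here is toy data, not the actual LCP or BFM annotations.

```python
from collections import defaultdict

# Toy list of (episode_id, speaker_id, speech_duration_seconds).
turns = [("ep01", "anchor_A", 620.0), ("ep02", "anchor_A", 580.0),
         ("ep01", "guest_B", 310.0)]

episodes_per_speaker = defaultdict(set)
speech_per_speaker = defaultdict(float)
for ep, spk, dur in turns:
    episodes_per_speaker[spk].add(ep)
    speech_per_speaker[spk] += dur

recurring = {s for s, eps in episodes_per_speaker.items() if len(eps) > 1}
total = sum(speech_per_speaker.values())
proportion = sum(speech_per_speaker[s] for s in recurring) / total
print(f"{len(recurring)} recurring speakers, {proportion:.0%} of the speech")
```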

0:06:56 | So the answer to the first question is: yes, we have several speakers appearing in different episodes of the target data.

0:07:07 | So now, we decided to train an oracle system, meaning we suppose we know how to cluster the target data.

0:07:21 | We used the target data labels; in real life we do not have those labels, but for these experiments we decided to use them.

0:07:32 | To train the UBM and the TV matrix and estimate the PLDA parameters, we proceed in the same way as with the training data: we just replace the labeled training data with the labeled target data.

0:07:46 | What we see is that for the LCP show, we are able to obtain a result.

0:07:52 | The results are presented in terms of the cross-recording diarization error rate.

0:08:02 | So for the LCP show we got results, whereas for the BFM show we were not able to estimate the PLDA parameters. We suppose we don't have enough data to do so; we're going to investigate that.

0:08:18 | If we compare with the baseline results, we see that if we use the information about the speakers in the target data, we should be able to improve on the baseline system.

0:08:33 | What we want to investigate is the minimum amount of data we need to estimate the PLDA parameters, because we saw that for the BFM show we were not able to train the PLDA, while for the LCP show we were.

0:08:51 | We decided to find out the minimum number of episodes we could take from the LCP corpus to estimate suitable PLDA parameters.

0:09:01 | The graph you see here is the DER on the LCP show as a function of the number of episodes taken to estimate the PLDA parameters.

0:09:16 | The total number of episodes is forty-five, and we started the experiments with thirty episodes, because below that the estimation fails.

0:09:27 | What's interesting to see is that we need around thirty-seven episodes to be able to improve on the baseline results, and when we have thirty-seven episodes, we have forty recurring speakers.

0:09:44 | What's also interesting to see is that we have the same number of speakers here and here, with a different number of episodes, but the resulting DERs are really close.

0:10:10 | So we have the same speakers; what's happening here is just that more and more data is gathered for each speaker.

0:10:15 | And we need a minimum amount of data for each speaker: if we take a look at the average number of sessions per speaker, it's around seven when we have thirty-seven episodes.

0:10:31 | As for the BFM show, when we take all the episodes, we only have thirty-five recurring speakers, appearing in five episodes on average. That's far less than for the LCP corpus, and that's why we are not able to train the PLDA parameters.

0:10:50 | So now let's place ourselves in the real case: we are no longer allowed to use the target data labels.

0:11:00 | First, to train the UBM and TV matrix, we need clean speech, so we just decided to take the output of the speaker segmentation and compute the UBM and TV matrix from it.

0:11:14 | But we don't have any information about the speakers, so we are not able to estimate the PLDA parameters.

0:11:21 | So we just replace the PLDA likelihood scoring with cosine-based scoring.
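A minimal sketch of that fallback scoring, assuming i-vectors are centered on the collection and length-normalized before taking cosine similarities:

```python
import numpy as np

rng = np.random.default_rng(2)
ivecs = rng.standard_normal((10, 200))                   # toy i-vectors
ivecs = ivecs - ivecs.mean(axis=0)                       # center on the collection
ivecs /= np.linalg.norm(ivecs, axis=1, keepdims=True)    # unit length
cosine_scores = ivecs @ ivecs.T                          # (n x n) similarity matrix
```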

0:11:28 | Then we have a working system. When we look at the results, it is worse than when using PLDA, but that's not a surprise; we expected that.

0:11:43 | Now we obtain speaker clusters. So the idea is to use those speaker clusters and try to estimate the PLDA parameters with them.

0:11:55 | But when we do so, the training procedure doesn't succeed.

0:12:04 | Well, we saw in the oracle experiment that the amount of data was limited, and we also suspect that the purity of the clusters we use is too bad to allow us to estimate the PLDA parameters.

0:12:21 | So, to summarize the self-training experiment:

0:12:25 | for the UBM and TV training, we selected segments produced by the speaker segmentation. We only keep the segments with a duration above ten seconds,

0:12:37 | and we also tuned the BIC parameters so that the segments are pure, because to train the TV matrix we need clean segments: we need only one speaker in each segment for training.

0:12:53 | As for the PLDA, we need several sessions per speaker, from various episodes.

0:12:57 | So first we perform an i-vector clustering-based diarization, and use the output speaker clusters to perform i-vector normalization and estimate the PLDA parameters.

0:13:08 | We just select the output speaker clusters with i-vectors coming from more than three episodes.
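A hedged sketch of that selection step; `assignments` is a hypothetical list of (cluster_id, episode_id, i-vector) triples from the cosine-based diarization, and only clusters spanning more than three episodes are kept for PLDA estimation.

```python
from collections import defaultdict

assignments = [("c1", "ep01", None), ("c1", "ep02", None),
               ("c1", "ep03", None), ("c1", "ep04", None),
               ("c2", "ep01", None)]                 # toy data; None = i-vector

episodes_in_cluster = defaultdict(set)
for cluster_id, episode_id, _ in assignments:
    episodes_in_cluster[cluster_id].add(episode_id)

kept = {c for c, eps in episodes_in_cluster.items() if len(eps) > 3}
plda_training_set = [(c, iv) for c, ep, iv in assignments if c in kept]
print(kept)                                          # here: {'c1'}
```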

0:13:22 | So we saw that we are not able to train a sufficient system with only the target data, so we decided to add some training data into the mix.

0:13:36 | It's the classic idea of domain adaptation.

0:13:41 | The main difference in this system compared with the baseline is that, in this experiment, the UBM and TV matrix are trained on the target data instead of the training data. Then we extract i-vectors from the training data and estimate the PLDA parameters on the training data.

0:14:05 | When replacing the UBM and TV matrix, we are able to improve by around one percent absolute in terms of DER.

0:14:18 | Now, why not try to apply the same process as in the self-training experiments, and take the speaker clusters to estimate new PLDA parameters?

0:14:30 | As before, the estimation of the PLDA parameters fails; we think we really don't have enough data to do so.

0:14:40 | So we just decided to combine the training data and the target data to update the PLDA parameters, the classic domain adaptation scenario, but we don't use any weighting parameter to balance the influence of the train and target data.

0:15:01 | We just take the i-vectors from the training data and the i-vectors from the output speaker clusters, combine them, and train new PLDA parameters.
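A hedged sketch of that unweighted pooling; the sizes are toy values, `train_plda` is a stand-in for the toolkit's PLDA training call, and the only real point is that the target clusters get labels in a space disjoint from the training speakers.

```python
import numpy as np

rng = np.random.default_rng(3)
train_ivecs = rng.standard_normal((5000, 200))       # labeled training i-vectors
train_labels = rng.integers(0, 500, size=5000)       # 500 training speakers (toy)

target_ivecs = rng.standard_normal((300, 200))       # i-vectors from output clusters
target_labels = 500 + rng.integers(0, 40, size=300)  # disjoint cluster labels (toy)

pooled_ivecs = np.vstack([train_ivecs, target_ivecs])
pooled_labels = np.concatenate([train_labels, target_labels])
# plda = train_plda(pooled_ivecs, pooled_labels)     # hypothetical toolkit call
```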

0:15:13 | When we combine the data, we again improve on the baseline system, gaining around one percent in terms of DER.

0:15:28 | Now that we've done that, why not try to iterate? As long as we obtain speaker clusters, we can always use them and try to improve the estimation of the PLDA parameters.

0:15:43 | Well, it doesn't work. If you iterate, it doesn't improve the system; we tried two to four iterations, but it doesn't help.

0:15:58 | Let's have a look at the system parameters. We use the S4D speaker diarization toolkit, which is a package built on top of the SIDEKIT library.

0:16:12 | For the front end, we use thirteen MFCCs with delta and delta-delta.

0:16:18 | We use two hundred and fifty-six components to train the UBM; the covariance matrix is diagonal.

0:16:27 | The dimension of the TV matrix is two hundred, and the dimension of the PLDA eigenvoice matrix is one hundred. We don't use any eigenchannel matrix.

0:16:38 | For the speaker clustering task, we use a combination of connected-components clustering and hierarchical agglomerative clustering.

0:16:48 | As I said before, the metric is the diarization error rate, and we use a two-hundred-and-fifty-millisecond collar.
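For illustration, this is how such a DER can be computed with pyannote.metrics; the reference and hypothesis here are toy annotations, not the campaign's scoring setup, and the collar convention should be checked against the library's documentation.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk1"
reference[Segment(10.0, 20.0)] = "spk2"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "A"
hypothesis[Segment(11.0, 20.0)] = "B"

metric = DiarizationErrorRate(collar=0.25)   # 250 ms forgiveness collar
print(f"DER = {metric(reference, hypothesis):.2%}")
```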

0:17:01 | So, to summarize, we compared several different systems. First, we performed supervised training using only external data.

0:17:12 | Then we used the same training process but replaced the training data with the labeled target data: this is the oracle experiment.

0:17:22 | Then we focused on unsupervised training using only the target data, and we saw that it's not good enough when compared with the baseline system.

0:17:34 | So we decided to take back some training data, applying some kind of unsupervised domain adaptation that combines train and target data.

0:17:46 | To conclude, we can say that if we don't have enough data, we absolutely need to use external data to bootstrap the system.

0:17:57 | But even using unlabeled target data, which is imperfectly clustered, with some kind of domain adaptation we are able to improve the system.

0:18:09 | So in our future work, we want to focus on the adaptation framework, where we would like to introduce a weighting between the train and target data.

0:18:27 | We would also like to work on the iterative procedure, because we think that if we are able to better estimate the PLDA parameters after one adaptation iteration, we should be able to improve the quality of the clusters, so some kind of iteration should be possible.

0:18:46 | In fact, this work has been done already: we submitted a paper to Interspeech and it will be presented there.

0:18:55 | So I can already say that using the weighting, the results really do get better,

0:19:05 | and the iterative procedure also works: with two or three iterations, we are able to slowly improve the DER.

0:19:14 | Another way of improving remains to be seen, but we would like to try to bootstrap the system with any unlabeled data.

0:19:22 | For example, we could take the training data, drop the labels, and perform cosine-based clustering, because we saw that in our approach maybe we didn't have enough data in the target corpus to apply this idea. So maybe trying to bootstrap with more unlabeled data could work.

0:19:47 | Well, thank you, that's all.

0:19:55 | [unintelligible]

0:20:06 | (Question) Thank you for the talk. I think this is more a comment than a question, but I believe that some of your problems with the EM for the PLDA are because your speaker subspace dimension is higher than your number of speakers.

0:20:20 | (Answer) I think that's the problem, with the dimensions I mentioned for the TV and PLDA. When we don't have enough target data, the problem is that it is difficult to estimate the one hundred dimensions of the PLDA parameters if you don't have that many speakers.

0:20:42 | (Question) Did you try to reduce the dimension? (Answer) No, I didn't focus on that.

0:20:56 | (Question) Thanks for the presentation. In previous work, ILP clustering was used for this kind of task; how does it compare here?

0:21:15 | (Answer) Well, in my experiments, the results are not very different between ILP and agglomerative clustering. I just decided to use agglomerative clustering because it's simpler, and also for computation time; there is not really a big difference between the two, I think.

0:21:50 | (Question) Dealing with these two different datasets, train and target: one thing I saw in your work was that you put them together. Did you weight the data in any specific way?

0:22:07 | (Answer) No, we didn't weight the data: we just took the target clusters and the training clusters and put them together in the same dataset.

0:22:20 | So if you look at the equations, it's the same as if you used a weighting parameter whose value is the relative amount of data between the target and train data, so it is almost equal to zero.

0:22:43 | That's why we need to work on the weighting, because for now the target data has almost no influence.
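A hedged sketch of the kind of weighting being discussed, in the spirit of classic interpolation-based PLDA adaptation; the covariances and the alpha value are placeholders, and the talk does not specify the exact scheme used in the follow-up work.

```python
import numpy as np

def interpolate_plda(Sb_train, Sw_train, Sb_target, Sw_target, alpha):
    """Blend out-of-domain and in-domain PLDA covariances with weight alpha.
    Unweighted pooling corresponds to alpha ~ n_target / (n_train + n_target),
    which is nearly zero here, as the answer above points out."""
    Sb = alpha * Sb_target + (1.0 - alpha) * Sb_train
    Sw = alpha * Sw_target + (1.0 - alpha) * Sw_train
    return Sb, Sw

d = 100
Sb, Sw = interpolate_plda(np.eye(d), np.eye(d),              # toy train covariances
                          2.0 * np.eye(d), 0.5 * np.eye(d),  # toy target covariances
                          alpha=0.5)                         # explicit, balanced weight
```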

0:23:03 | (Question) In your clustering experiments, how do you decide how many clusters to keep?

0:23:13 | (Answer) Well, the clustering is a function of a threshold, and we just select the threshold by experiment. That's why we chose two target corpora: this way, we are able to do an exhaustive search for the threshold on one corpus,

0:23:38 | and then look at whether the same threshold applies to the other corpus. And the clustering threshold is around zero.

0:24:01 | We still have time for a few questions.

0:24:07 | (Question) Okay, so I was curious: you mentioned that in this work the iterations were not helpful, but then you were able to somehow fix that in the next work. What do you think was the main problem?

0:24:25 | (Answer) In this work, the problem is that we don't balance the influence of the training and target data. In the combination of training and target data, we have so much training data that the implicit weighting is heavily in favor of the training data.

0:24:48 | When we change the balance between the train and target data and give more importance to the target data, the system gets better results, and then you see that by iterating you can improve some more, over two or three iterations.

0:25:05 | We also did some kind of score normalization, because when you use the target data to obtain the PLDA parameters, the distribution of the PLDA scores also tends to shift a lot,

0:25:24 | so you need to normalize to keep the same clustering threshold; otherwise you don't cluster at the same operating point at all.

0:25:40 | Okay, if there are no further questions, let's thank the speaker.