0:00:15 | Hi everyone, my name is Gaël and I'm working at Orange Labs in France.

0:00:24 | I'm going to talk about the concept of self-trained speaker diarization.

0:00:31 | So the application we are working on is

0:00:35 | the task of cross-recording speaker diarization, applied to French TV archives,

0:00:42 | and the goal is to index the speakers of collections of multiple recordings,

0:00:48 | in order, for example, to provide new means of dataset exploration by creating links between different episodes.

0:00:57 | So our system is based on a two-pass approach: we first

0:01:04 | process each recording separately, applying speaker segmentation and clustering,

0:01:10 | and then we perform cross-recording speaker linking, trying to link all within-recording clusters across the whole collection.
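For readers, here is a minimal structural sketch of that two-pass flow. The function bodies are placeholders and the helper names are hypothetical; only the control flow reflects what is described above, not the actual Orange Labs implementation.

```python
from typing import List, Tuple

# A within-recording cluster: (label, list of (start_sec, end_sec) segments).
Cluster = Tuple[str, List[Tuple[float, float]]]

def within_recording_diarization(wav_path: str) -> List[Cluster]:
    """Pass 1 (placeholder): segmentation + clustering of a single recording."""
    raise NotImplementedError  # stands in for SAD, BIC clustering, i-vector HAC

def speaker_linking(clusters: List[Cluster]) -> List[Cluster]:
    """Pass 2 (placeholder): link within-recording clusters collection-wide."""
    raise NotImplementedError  # stands in for collection-wide i-vector clustering

def diarize_collection(wav_paths: List[str]) -> List[Cluster]:
    per_recording: List[Cluster] = []
    for path in wav_paths:
        per_recording.extend(within_recording_diarization(path))
    return speaker_linking(per_recording)
```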

0:01:22 | Our framework is based on a state-of-the-art speaker recognition framework:

0:01:30 | we use i-vector PLDA modeling, and for clustering we use hierarchical agglomerative clustering.

0:01:39 | We know that the goal of PLDA is to maximize the between-speaker variability while minimizing the within-speaker variability.
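For reference, here is the common two-covariance formulation of PLDA; this is a standard textbook form, not necessarily the exact variant used in the talk. An i-vector w from speaker s is modeled as:

```latex
\begin{aligned}
  w &= m + y_s + \epsilon, \\
  y_s &\sim \mathcal{N}(0,\, \Sigma_b) \quad \text{(between-speaker variability)}, \\
  \epsilon &\sim \mathcal{N}(0,\, \Sigma_w) \quad \text{(within-speaker variability)},
\end{aligned}
```

so estimating the PLDA parameters amounts to estimating m, Sigma_b and Sigma_w from i-vectors labeled by speaker, which is why labeled multi-session data matters later in the talk.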

0:01:50 | So what we want to investigate in our paper is: can we use the target data as training material, and how well could we estimate the speaker variability?

0:02:07 | So first, I'm going to briefly present our framework. Let's take an audio file from the target data.

0:02:17 | Our target data is unlabeled, so we just have audio files.

0:02:21 | First we extract some features: we use MFCC features with delta and delta-delta coefficients.
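As an illustration of that front end, here is a hedged sketch using librosa; the file name and sample rate are assumptions, and the actual system uses the SIDEKIT front end mentioned later in the talk.

```python
import librosa
import numpy as np

# Sketch: 13 MFCCs plus delta and delta-delta, stacked into 39-dim features.
y, sr = librosa.load("episode.wav", sr=16000)        # hypothetical file name/rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
d1 = librosa.feature.delta(mfcc)                     # first-order derivatives
d2 = librosa.feature.delta(mfcc, order=2)            # second-order derivatives
features = np.vstack([mfcc, d1, d2])                 # shape (39, n_frames)
```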

0:02:29 | Then we perform a combination of speech activity detection and BIC clustering to extract speaker segments.

0:02:38 | On top of those segments we can extract i-vectors, using a pre-trained UBM and total variability matrix.

0:02:49 | Once we obtain the i-vectors, we are able to score all i-vectors against each other and compute a similarity score matrix.

0:02:59 | For that we use the PLDA likelihood ratio; the PLDA parameters are estimated separately.
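As an illustration, here is a hedged sketch of pairwise PLDA scoring in the two-covariance formulation sketched earlier. It assumes centered i-vectors and uses toy placeholder covariances; this is a generic textbook scoring, not the toolkit's exact implementation.

```python
import numpy as np

def plda_llr(w1, w2, Sb, Sw):
    """Log-likelihood ratio that centered i-vectors w1 and w2 share a speaker,
    under two-covariance PLDA with between/within covariances Sb and Sw."""
    d = len(w1)
    x = np.concatenate([w1, w2])
    Stot = Sb + Sw
    same = np.block([[Stot, Sb], [Sb, Stot]])            # shared speaker factor
    diff = np.block([[Stot, np.zeros((d, d))],
                     [np.zeros((d, d)), Stot]])          # independent speakers
    return _log_gauss(x, same) - _log_gauss(x, diff)

def _log_gauss(x, cov):
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (x @ np.linalg.solve(cov, x) + logdet + len(x) * np.log(2 * np.pi))

# Toy similarity matrix over a handful of i-vectors:
rng = np.random.default_rng(0)
ivecs = rng.standard_normal((8, 50))                     # 8 toy i-vectors, dim 50
Sb, Sw = 0.6 * np.eye(50), 0.4 * np.eye(50)              # placeholder covariances
scores = np.array([[plda_llr(wi, wj, Sb, Sw) for wj in ivecs] for wi in ivecs])
```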

0:03:09 | Once we have our similarity matrix, we can apply speaker clustering,

0:03:15 | and the result of the diarization is a set of speaker clusters.

0:03:21 | We can repeat the process for each of the recordings.

0:03:27 | Once we've done that, we can compute a collection-wide similarity matrix and repeat the clustering process.

0:03:36 | This time I call it speaker linking, because the goal is to link the within-recording clusters across the whole collection.
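A hedged sketch of this linking step: hierarchical agglomerative clustering on a precomputed score matrix, here with SciPy. The toy score matrix and the threshold are placeholders; the Q&A later mentions that the tuned threshold on PLDA scores ends up around zero.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
scores = rng.standard_normal((8, 8))
scores = (scores + scores.T) / 2            # toy symmetric collection-wide scores

plda_threshold = 0.0                        # stop merging below this PLDA score
dist = scores.max() - scores                # turn similarities into distances
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=scores.max() - plda_threshold, criterion="distance")
print(labels)                               # linked-speaker index per cluster
```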

0:03:45 | After the linking part, we obtain the cross-recording diarization.

0:03:54 | The usual way of training the UBM and TV matrix and estimating the PLDA parameters is to use a training dataset which is labeled by speaker; the training procedure is then pretty straightforward.

0:04:11 | The problem when we apply this technique is that we have some kind of mismatch between the target and training data: first, we don't have the same acoustic conditions,

0:04:25 | and second, we don't necessarily have the same speakers in the target and training data.

0:04:32 | So if we could use information about the target data, maybe we could get better results.

0:04:38 | What we want to investigate is the concept of self-trained diarization, meaning we would like to use only the target data itself to estimate the parameters.

0:04:51 | Then we are going to compare the results with a combination of target and training data.

0:05:00 | So the goal of self-trained diarization is to avoid the acoustic mismatch between the training and target data.

0:05:09 | So what do we need to train an i-vector PLDA system? To train the UBM and the TV matrix, we only need clean speech segments; the training is then straightforward.

0:05:22 | As for the PLDA parameter estimation, we need several sessions per speaker, in various acoustic conditions.

0:05:29 | So what we need to investigate is: do we have several speakers appearing in different episodes of the target data? And assuming we know how to effectively cluster the target data in terms of speakers, can we estimate PLDA parameters with those clusters?

0:05:48 | So let's have a look at the data. We have around two hundred hours of French broadcast news, drawn from previous French evaluation campaigns.

0:05:59 | So it's a combination of TV and radio data.

0:06:04 | Out of these two hundred hours, we selected two shows as target corpora: LCP Info and BFM Story.

0:06:15 | We took all the other available recordings to build what we call the train corpus.

0:06:24 | If we take a look at the data, we see that we have more than forty episodes for each show, and what we can note is the speech proportion of what I call the recurring speakers, which is above fifty percent for both corpora.

0:06:47 | A recurring speaker is a speaker who appears in more than one episode, as opposed to a one-time speaker, who appears in only one episode.
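For illustration, this is how such statistics can be computed from reference labels; the turn list here is toy data, not the actual LCP or BFM annotations.

```python
from collections import defaultdict

# Toy list of (episode_id, speaker_id, speech_duration_seconds).
turns = [("ep01", "anchor_A", 620.0), ("ep02", "anchor_A", 580.0),
         ("ep01", "guest_B", 310.0)]

episodes_per_speaker = defaultdict(set)
speech_per_speaker = defaultdict(float)
for ep, spk, dur in turns:
    episodes_per_speaker[spk].add(ep)
    speech_per_speaker[spk] += dur

recurring = {s for s, eps in episodes_per_speaker.items() if len(eps) > 1}
total = sum(speech_per_speaker.values())
proportion = sum(speech_per_speaker[s] for s in recurring) / total
print(f"{len(recurring)} recurring speakers, {proportion:.0%} of the speech")
```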

0:06:56 | So the answer to the first question is: yes, we have several speakers appearing in different episodes of the target data.

0:07:07 | So now, we decided to train an oracle system, meaning we suppose we know how to cluster the target data.

0:07:21 | We used the target data labels; in real life we do not have those labels, but for these experiments we decided to use them.

0:07:32 | To train the UBM and the TV matrix and estimate the PLDA parameters, we proceed in the same way as with the training data: we just replace the labeled training data with the labeled target data.

0:07:46 | What we see is that for the LCP show, we are able to obtain a result.

0:07:52 | The results are presented in terms of the cross-recording diarization error rate.

0:08:02 | So for the LCP show we got results, whereas for the BFM show we were not able to estimate the PLDA parameters. We suppose we don't have enough data to do so; we're going to investigate that.

0:08:18 | If we compare with the baseline results, we see that if we use the information about the speakers in the target data, we should be able to improve on the baseline system.

0:08:33 | What we want to investigate is the minimum amount of data we need to estimate the PLDA parameters, because we saw that for the BFM show we were not able to train the PLDA, while for the LCP show we were.

0:08:51 | We decided to find out the minimum number of episodes we could take from the LCP corpus to estimate suitable PLDA parameters.

0:09:01 | The graph you see here is the DER on the LCP show as a function of the number of episodes taken to estimate the PLDA parameters.

0:09:16 | The total number of episodes is forty-five, and we started the experiments with thirty episodes, because below that the estimation fails.

0:09:27 | What's interesting to see is that we need around thirty-seven episodes to be able to improve on the baseline results, and when we have thirty-seven episodes, we have forty recurring speakers.

0:09:44 | What's also interesting to see is that we have the same number of speakers here and here, with a different number of episodes, but the resulting DERs are really close.

0:10:10 | So we have the same speakers; what's happening here is just that more and more data is gathered for each speaker.

0:10:15 | And we need a minimum amount of data for each speaker: if we take a look at the average number of sessions per speaker, it's around seven when we have thirty-seven episodes.

0:10:31 | As for the BFM show, when we take all the episodes, we only have thirty-five recurring speakers, appearing in five episodes on average. That's far less than for the LCP corpus, and that's why we are not able to train the PLDA parameters.

0:10:50 | So now let's place ourselves in the real case: we are no longer allowed to use the target data labels.

0:11:00 | First, to train the UBM and TV matrix, we need clean speech, so we just decided to take the output of the speaker segmentation and compute the UBM and TV matrix from it.

0:11:14 | But we don't have any information about the speakers, so we are not able to estimate the PLDA parameters.

0:11:21 | So we just replace the PLDA likelihood scoring with cosine-based scoring.
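A minimal sketch of that fallback scoring, assuming i-vectors are centered on the collection and length-normalized before taking cosine similarities:

```python
import numpy as np

rng = np.random.default_rng(2)
ivecs = rng.standard_normal((10, 200))                   # toy i-vectors
ivecs = ivecs - ivecs.mean(axis=0)                       # center on the collection
ivecs /= np.linalg.norm(ivecs, axis=1, keepdims=True)    # unit length
cosine_scores = ivecs @ ivecs.T                          # (n x n) similarity matrix
```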

0:11:28 | Then we have a working system. When we look at the results, it is worse than when using PLDA, but that's not a surprise; we expected that.

0:11:43 | Now we obtain speaker clusters. So the idea is to use those speaker clusters and try to estimate the PLDA parameters with them.

0:11:55 | But when we do so, the training procedure doesn't succeed.

0:12:04 | Well, we saw in the oracle experiment that the amount of data was limited, and we also suspect that the purity of the clusters we use is too bad to allow us to estimate the PLDA parameters.

0:12:21 | So, to summarize the self-training experiment:

0:12:25 | for the UBM and TV training, we selected segments produced by the speaker segmentation. We only keep the segments with a duration above ten seconds,

0:12:37 | and we also tuned the BIC parameters so that the segments are pure, because to train the TV matrix we need clean segments: we need only one speaker in each segment for training.

0:12:53 | As for the PLDA, we need several sessions per speaker, from various episodes.

0:12:57 | So first we perform an i-vector clustering-based diarization, and use the output speaker clusters to perform i-vector normalization and estimate the PLDA parameters.

0:13:08 | We just select the output speaker clusters with i-vectors coming from more than three episodes.
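A hedged sketch of that selection step; `assignments` is a hypothetical list of (cluster_id, episode_id, i-vector) triples from the cosine-based diarization, and only clusters spanning more than three episodes are kept for PLDA estimation.

```python
from collections import defaultdict

assignments = [("c1", "ep01", None), ("c1", "ep02", None),
               ("c1", "ep03", None), ("c1", "ep04", None),
               ("c2", "ep01", None)]                 # toy data; None = i-vector

episodes_in_cluster = defaultdict(set)
for cluster_id, episode_id, _ in assignments:
    episodes_in_cluster[cluster_id].add(episode_id)

kept = {c for c, eps in episodes_in_cluster.items() if len(eps) > 3}
plda_training_set = [(c, iv) for c, ep, iv in assignments if c in kept]
print(kept)                                          # here: {'c1'}
```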

0:13:22 | So we saw that we are not able to train a sufficient system with only the target data, so we decided to add some training data into the mix.

0:13:36 | It's the classic idea of domain adaptation.

0:13:41 | The main difference in this system compared with the baseline is that, in this experiment, the UBM and TV matrix are trained on the target data instead of the training data. Then we extract i-vectors from the training data and estimate the PLDA parameters on the training data.

0:14:05 | When replacing the UBM and TV matrix, we are able to improve by around one percent absolute in terms of DER.

0:14:18 | Now, why not try to apply the same process as in the self-training experiments, and take the speaker clusters to estimate new PLDA parameters?

0:14:30 | As before, the estimation of the PLDA parameters fails; we think we really don't have enough data to do so.

0:14:40 | So we just decided to combine the training data and the target data to update the PLDA parameters, the classic domain adaptation scenario, but we don't use any weighting parameter to balance the influence of the train and target data.

0:15:01 | We just take the i-vectors from the training data and the i-vectors from the output speaker clusters, combine them, and train new PLDA parameters.
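A hedged sketch of that unweighted pooling; the sizes are toy values, `train_plda` is a stand-in for the toolkit's PLDA training call, and the only real point is that the target clusters get labels in a space disjoint from the training speakers.

```python
import numpy as np

rng = np.random.default_rng(3)
train_ivecs = rng.standard_normal((5000, 200))       # labeled training i-vectors
train_labels = rng.integers(0, 500, size=5000)       # 500 training speakers (toy)

target_ivecs = rng.standard_normal((300, 200))       # i-vectors from output clusters
target_labels = 500 + rng.integers(0, 40, size=300)  # disjoint cluster labels (toy)

pooled_ivecs = np.vstack([train_ivecs, target_ivecs])
pooled_labels = np.concatenate([train_labels, target_labels])
# plda = train_plda(pooled_ivecs, pooled_labels)     # hypothetical toolkit call
```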

0:15:13 | When we combine the data, we again improve on the baseline system, gaining around one percent in terms of DER.

0:15:28 | Now that we've done that, why not try to iterate? As long as we obtain speaker clusters, we can always use them and try to improve the estimation of the PLDA parameters.

0:15:43 | Well, it doesn't work. If you iterate, it doesn't improve the system; we tried two to four iterations, but it doesn't help.

0:15:58 | Let's have a look at the system parameters. We use the S4D speaker diarization toolkit, which is a package built on top of the SIDEKIT library.

0:16:12 | For the front end, we use thirteen MFCCs with delta and delta-delta.

0:16:18 | We use two hundred and fifty-six components to train the UBM; the covariance matrix is diagonal.

0:16:27 | The dimension of the TV matrix is two hundred, and the dimension of the PLDA eigenvoice matrix is one hundred. We don't use any eigenchannel matrix.

0:16:38 | For the speaker clustering task, we use a combination of connected-components clustering and hierarchical agglomerative clustering.

0:16:48 | As I said before, the metric is the diarization error rate, and we use a two-hundred-and-fifty-millisecond collar.
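For illustration, this is how such a DER can be computed with pyannote.metrics; the reference and hypothesis here are toy annotations, not the campaign's scoring setup, and the collar convention should be checked against the library's documentation.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk1"
reference[Segment(10.0, 20.0)] = "spk2"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "A"
hypothesis[Segment(11.0, 20.0)] = "B"

metric = DiarizationErrorRate(collar=0.25)   # 250 ms forgiveness collar
print(f"DER = {metric(reference, hypothesis):.2%}")
```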

0:17:01 | So, to summarize, we compared several different systems. First, we performed supervised training using only external data.

0:17:12 | Then we used the same training process but replaced the training data with the labeled target data: this is the oracle experiment.

0:17:22 | Then we focused on unsupervised training using only the target data, and we saw that it's not good enough when compared with the baseline system.

0:17:34 | So we decided to take back some training data, applying some kind of unsupervised domain adaptation that combines train and target data.

0:17:46 | To conclude, we can say that if we don't have enough data, we absolutely need to use external data to bootstrap the system.

0:17:57 | But even using unlabeled target data, which is imperfectly clustered, with some kind of domain adaptation we are able to improve the system.

0:18:09 | So in our future work, we want to focus on the adaptation framework, where we would like to introduce a weighting between the train and target data.

0:18:27 | We would also like to work on the iterative procedure, because we think that if we are able to better estimate the PLDA parameters after one adaptation iteration, we should be able to improve the quality of the clusters, so some kind of iteration should be possible.

0:18:46 | In fact, this work has been done already: we submitted a paper to Interspeech and it will be presented there.

0:18:55 | So I can already say that using the weighting, the results really do get better,

0:19:05 | and the iterative procedure also works: with two or three iterations, we are able to slowly improve the DER.

0:19:14 | Another way of improving remains to be seen, but we would like to try to bootstrap the system with any unlabeled data.

0:19:22 | For example, we could take the training data, drop the labels, and perform cosine-based clustering, because we saw that in our approach maybe we didn't have enough data in the target corpus to apply this idea. So maybe trying to bootstrap with more unlabeled data could work.

0:19:47 | Well, thank you, that's all.

0:19:55 | [unintelligible]

0:20:06 | (Question) Thank you for the talk. I think this is more a comment than a question, but I believe that some of your problems with the EM for the PLDA are because your speaker subspace dimension is higher than your number of speakers.

0:20:20 | (Answer) I think that's the problem, with the dimensions I mentioned for the TV and PLDA. When we don't have enough target data, the problem is that it is difficult to estimate the one hundred dimensions of the PLDA parameters if you don't have that many speakers.

0:20:42 | (Question) Did you try to reduce the dimension? (Answer) No, I didn't focus on that.

0:20:56 | (Question) Thanks for the presentation. In previous work, ILP clustering was used for this kind of task; how does it compare here?

0:21:15 | (Answer) Well, in my experiments, the results are not very different between ILP and agglomerative clustering. I just decided to use agglomerative clustering because it's simpler, and also for computation time; there is not really a big difference between the two, I think.

0:21:50 | (Question) Dealing with these two different datasets, train and target: one thing I saw in your work was that you put them together. Did you weight the data in any specific way?

0:22:07 | (Answer) No, we didn't weight the data: we just took the target clusters and the training clusters and put them together in the same dataset.

0:22:20 | So if you look at the equations, it's the same as if you used a weighting parameter whose value is the relative amount of data between the target and train data, so it is almost equal to zero.

0:22:43 | That's why we need to work on the weighting, because for now the target data has almost no influence.
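A hedged sketch of the kind of weighting being discussed, in the spirit of classic interpolation-based PLDA adaptation; the covariances and the alpha value are placeholders, and the talk does not specify the exact scheme used in the follow-up work.

```python
import numpy as np

def interpolate_plda(Sb_train, Sw_train, Sb_target, Sw_target, alpha):
    """Blend out-of-domain and in-domain PLDA covariances with weight alpha.
    Unweighted pooling corresponds to alpha ~ n_target / (n_train + n_target),
    which is nearly zero here, as the answer above points out."""
    Sb = alpha * Sb_target + (1.0 - alpha) * Sb_train
    Sw = alpha * Sw_target + (1.0 - alpha) * Sw_train
    return Sb, Sw

d = 100
Sb, Sw = interpolate_plda(np.eye(d), np.eye(d),              # toy train covariances
                          2.0 * np.eye(d), 0.5 * np.eye(d),  # toy target covariances
                          alpha=0.5)                         # explicit, balanced weight
```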

0:23:03 | (Question) In your clustering experiments, how do you decide how many clusters to keep?

0:23:13 | (Answer) Well, the clustering is a function of a threshold, and we just select the threshold by experiment. That's why we chose two target corpora: this way, we are able to do an exhaustive search for the threshold on one corpus,

0:23:38 | and then look at whether the same threshold applies to the other corpus. And the clustering threshold is around zero.

0:24:01 | We still have time for a few questions.

0:24:07 | (Question) Okay, so I was curious: you mentioned that in this work the iterations were not helpful, but then you were able to somehow fix that in the next work. What do you think was the main problem?

0:24:25 | (Answer) In this work, the problem is that we don't balance the influence of the training and target data. In the combination of training and target data, we have so much training data that the implicit weighting is heavily in favor of the training data.

0:24:48 | When we change the balance between the train and target data and give more importance to the target data, the system gets better results, and then you see that by iterating you can improve some more, over two or three iterations.

0:25:05 | We also did some kind of score normalization, because when you use the target data to obtain the PLDA parameters, the distribution of the PLDA scores also tends to shift a lot,

0:25:24 | so you need to normalize to keep the same clustering threshold; otherwise you don't cluster at the same operating point at all.

0:25:40 | Okay, if there are no further questions, let's thank the speaker.