0:00:15 | Hi everyone, my name is Gaël and I'm working at Orange Labs in France.

0:00:24 | I'm going to talk about the concept of self-trained speaker diarization.

0:00:31 | So the application we are working on is

0:00:35 | the task of cross-recording speaker diarization, applied to French TV archives,

0:00:42 | and the goal is to index the speakers of collections of multiple recordings,

0:00:48 | in order, for example, to provide new means of dataset exploration by creating links between different episodes.

0:00:57 | So our system is based on a two-pass approach: we first

0:01:04 | process each recording separately, applying speaker segmentation and clustering,

0:01:10 | and then we perform cross-recording speaker linking, trying to link all within-recording clusters across the whole collection.
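For readers, here is a minimal structural sketch of that two-pass flow. The function bodies are placeholders and the helper names are hypothetical; only the control flow reflects what is described above, not the actual Orange Labs implementation.

```python
from typing import List, Tuple

# A within-recording cluster: (label, list of (start_sec, end_sec) segments).
Cluster = Tuple[str, List[Tuple[float, float]]]

def within_recording_diarization(wav_path: str) -> List[Cluster]:
    """Pass 1 (placeholder): segmentation + clustering of a single recording."""
    raise NotImplementedError  # stands in for SAD, BIC clustering, i-vector HAC

def speaker_linking(clusters: List[Cluster]) -> List[Cluster]:
    """Pass 2 (placeholder): link within-recording clusters collection-wide."""
    raise NotImplementedError  # stands in for collection-wide i-vector clustering

def diarize_collection(wav_paths: List[str]) -> List[Cluster]:
    per_recording: List[Cluster] = []
    for path in wav_paths:
        per_recording.extend(within_recording_diarization(path))
    return speaker_linking(per_recording)
```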

0:01:22 | Our framework is based on a state-of-the-art speaker recognition framework:

0:01:30 | we use i-vector PLDA modeling, and for clustering we use hierarchical agglomerative clustering.

0:01:39 | We know that the goal of PLDA is to maximize the between-speaker variability while minimizing the within-speaker variability.
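For reference, here is the common two-covariance formulation of PLDA; this is a standard textbook form, not necessarily the exact variant used in the talk. An i-vector w from speaker s is modeled as:

```latex
\begin{aligned}
  w &= m + y_s + \epsilon, \\
  y_s &\sim \mathcal{N}(0,\, \Sigma_b) \quad \text{(between-speaker variability)}, \\
  \epsilon &\sim \mathcal{N}(0,\, \Sigma_w) \quad \text{(within-speaker variability)},
\end{aligned}
```

so estimating the PLDA parameters amounts to estimating m, Sigma_b and Sigma_w from i-vectors labeled by speaker, which is why labeled multi-session data matters later in the talk.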

0:01:50 | So what we want to investigate in our paper is: can we use the target data as training material, and how well could we estimate the speaker variability?

0:02:07 | So first, I'm going to briefly present our framework. Let's take an audio file from the target data.

0:02:17 | Our target data is unlabeled, so we just have audio files.

0:02:21 | First we extract some features: we use MFCC features with delta and delta-delta coefficients.
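As an illustration of that front end, here is a hedged sketch using librosa; the file name and sample rate are assumptions, and the actual system uses the SIDEKIT front end mentioned later in the talk.

```python
import librosa
import numpy as np

# Sketch: 13 MFCCs plus delta and delta-delta, stacked into 39-dim features.
y, sr = librosa.load("episode.wav", sr=16000)        # hypothetical file name/rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
d1 = librosa.feature.delta(mfcc)                     # first-order derivatives
d2 = librosa.feature.delta(mfcc, order=2)            # second-order derivatives
features = np.vstack([mfcc, d1, d2])                 # shape (39, n_frames)
```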

0:02:29 | Then we perform a combination of speech activity detection and BIC clustering to extract speaker segments.

0:02:38 | On top of those segments we can extract i-vectors, using a pre-trained UBM and total variability matrix.

0:02:49 | Once we obtain the i-vectors, we are able to score all i-vectors against each other and compute a similarity score matrix.

0:02:59 | For that we use the PLDA likelihood ratio; the PLDA parameters are estimated separately.
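As an illustration, here is a hedged sketch of pairwise PLDA scoring in the two-covariance formulation sketched earlier. It assumes centered i-vectors and uses toy placeholder covariances; this is a generic textbook scoring, not the toolkit's exact implementation.

```python
import numpy as np

def plda_llr(w1, w2, Sb, Sw):
    """Log-likelihood ratio that centered i-vectors w1 and w2 share a speaker,
    under two-covariance PLDA with between/within covariances Sb and Sw."""
    d = len(w1)
    x = np.concatenate([w1, w2])
    Stot = Sb + Sw
    same = np.block([[Stot, Sb], [Sb, Stot]])            # shared speaker factor
    diff = np.block([[Stot, np.zeros((d, d))],
                     [np.zeros((d, d)), Stot]])          # independent speakers
    return _log_gauss(x, same) - _log_gauss(x, diff)

def _log_gauss(x, cov):
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (x @ np.linalg.solve(cov, x) + logdet + len(x) * np.log(2 * np.pi))

# Toy similarity matrix over a handful of i-vectors:
rng = np.random.default_rng(0)
ivecs = rng.standard_normal((8, 50))                     # 8 toy i-vectors, dim 50
Sb, Sw = 0.6 * np.eye(50), 0.4 * np.eye(50)              # placeholder covariances
scores = np.array([[plda_llr(wi, wj, Sb, Sw) for wj in ivecs] for wi in ivecs])
```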

0:03:09 | Once we have our similarity matrix, we can apply speaker clustering,

0:03:15 | and the result of the diarization is a set of speaker clusters.

0:03:21 | We can repeat the process for each of the recordings.

0:03:27 | Once we've done that, we can compute a collection-wide similarity matrix and repeat the clustering process.

0:03:36 | This time I call it speaker linking, because the goal is to link the within-recording clusters across the whole collection.
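A hedged sketch of this linking step: hierarchical agglomerative clustering on a precomputed score matrix, here with SciPy. The toy score matrix and the threshold are placeholders; the Q&A later mentions that the tuned threshold on PLDA scores ends up around zero.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
scores = rng.standard_normal((8, 8))
scores = (scores + scores.T) / 2            # toy symmetric collection-wide scores

plda_threshold = 0.0                        # stop merging below this PLDA score
dist = scores.max() - scores                # turn similarities into distances
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=scores.max() - plda_threshold, criterion="distance")
print(labels)                               # linked-speaker index per cluster
```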

0:03:45 | After the linking part, we obtain the cross-recording diarization.

0:03:54 | The usual way of training the UBM and TV matrix and estimating the PLDA parameters is to use a training dataset which is labeled by speaker; the training procedure is then pretty straightforward.

0:04:11 | The problem when we apply this technique is that we have some kind of mismatch between the target and training data: first, we don't have the same acoustic conditions,

0:04:25 | and second, we don't necessarily have the same speakers in the target and training data.

0:04:32 | So if we could use information about the target data, maybe we could get better results.

0:04:38 | What we want to investigate is the concept of self-trained diarization, meaning we would like to use only the target data itself to estimate the parameters.

0:04:51 | Then we are going to compare the results with a combination of target and training data.

0:05:00 | So the goal of self-trained diarization is to avoid the acoustic mismatch between the training and target data.

0:05:09 | So what do we need to train an i-vector PLDA system? To train the UBM and the TV matrix, we only need clean speech segments; the training is then straightforward.

0:05:22 | As for the PLDA parameter estimation, we need several sessions per speaker, in various acoustic conditions.

0:05:29 | So what we need to investigate is: do we have several speakers appearing in different episodes of the target data? And assuming we know how to effectively cluster the target data in terms of speakers, can we estimate PLDA parameters with those clusters?

0:05:48 | So let's have a look at the data. We have around two hundred hours of French broadcast news, drawn from previous French evaluation campaigns.

0:05:59 | So it's a combination of TV and radio data.

0:06:04 | Out of these two hundred hours, we selected two shows as target corpora: LCP Info and BFM Story.

0:06:15 | We took all the other available recordings to build what we call the train corpus.

0:06:24 | If we take a look at the data, we see that we have more than forty episodes for each show, and what we can note is the speech proportion of what I call the recurring speakers, which is above fifty percent for both corpora.

0:06:47 | A recurring speaker is a speaker who appears in more than one episode, as opposed to a one-time speaker, who appears in only one episode.
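For illustration, this is how such statistics can be computed from reference labels; the turn list here is toy data, not the actual LCP or BFM annotations.

```python
from collections import defaultdict

# Toy list of (episode_id, speaker_id, speech_duration_seconds).
turns = [("ep01", "anchor_A", 620.0), ("ep02", "anchor_A", 580.0),
         ("ep01", "guest_B", 310.0)]

episodes_per_speaker = defaultdict(set)
speech_per_speaker = defaultdict(float)
for ep, spk, dur in turns:
    episodes_per_speaker[spk].add(ep)
    speech_per_speaker[spk] += dur

recurring = {s for s, eps in episodes_per_speaker.items() if len(eps) > 1}
total = sum(speech_per_speaker.values())
proportion = sum(speech_per_speaker[s] for s in recurring) / total
print(f"{len(recurring)} recurring speakers, {proportion:.0%} of the speech")
```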

0:06:56 | So the answer to the first question is: yes, we have several speakers appearing in different episodes of the target data.

0:07:07 | So now, we decided to train an oracle system, meaning we suppose we know how to cluster the target data.

0:07:21 | We used the target data labels; in real life we do not have those labels, but for these experiments we decided to use them.

0:07:32 | To train the UBM and the TV matrix and estimate the PLDA parameters, we proceed in the same way as with the training data: we just replace the labeled training data with the labeled target data.

0:07:46 | What we see is that for the LCP show, we are able to obtain a result.

0:07:52 | The results are presented in terms of the cross-recording diarization error rate.

0:08:02 | So for the LCP show we got results, whereas for the BFM show we were not able to estimate the PLDA parameters. We suppose we don't have enough data to do so; we're going to investigate that.

0:08:18 | If we compare with the baseline results, we see that if we use the information about the speakers in the target data, we should be able to improve on the baseline system.

0:08:33 | What we want to investigate is the minimum amount of data we need to estimate the PLDA parameters, because we saw that for the BFM show we were not able to train the PLDA, while for the LCP show we were.

0:08:51 | We decided to find out the minimum number of episodes we could take from the LCP corpus to estimate suitable PLDA parameters.

0:09:01 | The graph you see here is the DER on the LCP show as a function of the number of episodes taken to estimate the PLDA parameters.

0:09:16 | The total number of episodes is forty-five, and we started the experiments with thirty episodes, because below that the estimation fails.

0:09:27 | What's interesting to see is that we need around thirty-seven episodes to be able to improve on the baseline results, and when we have thirty-seven episodes, we have forty recurring speakers.

0:09:44 | What's also interesting to see is that we have the same number of speakers here and here, with a different number of episodes, but the resulting DERs are really close.

0:10:10 | So we have the same speakers; what's happening here is just that more and more data is gathered for each speaker.

0:10:15 | And we need a minimum amount of data for each speaker: if we take a look at the average number of sessions per speaker, it's around seven when we have thirty-seven episodes.

0:10:31 | As for the BFM show, when we take all the episodes, we only have thirty-five recurring speakers, appearing in five episodes on average. That's far less than for the LCP corpus, and that's why we are not able to train the PLDA parameters.

0:10:50 | So now let's place ourselves in the real case: we are no longer allowed to use the target data labels.

0:11:00 | First, to train the UBM and TV matrix, we need clean speech, so we just decided to take the output of the speaker segmentation and compute the UBM and TV matrix from it.

0:11:14 | But we don't have any information about the speakers, so we are not able to estimate the PLDA parameters.

0:11:21 | So we just replace the PLDA likelihood scoring with cosine-based scoring.
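A minimal sketch of that fallback scoring, assuming i-vectors are centered on the collection and length-normalized before taking cosine similarities:

```python
import numpy as np

rng = np.random.default_rng(2)
ivecs = rng.standard_normal((10, 200))                   # toy i-vectors
ivecs = ivecs - ivecs.mean(axis=0)                       # center on the collection
ivecs /= np.linalg.norm(ivecs, axis=1, keepdims=True)    # unit length
cosine_scores = ivecs @ ivecs.T                          # (n x n) similarity matrix
```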

0:11:28 | Then we have a working system. When we look at the results, it is worse than when using PLDA, but that's not a surprise; we expected that.

0:11:43 | Now we obtain speaker clusters. So the idea is to use those speaker clusters and try to estimate the PLDA parameters with them.

0:11:55 | But when we do so, the training procedure doesn't succeed.

0:12:04 | Well, we saw in the oracle experiment that the amount of data was limited, and we also suspect that the purity of the clusters we use is too bad to allow us to estimate the PLDA parameters.

0:12:21 | So, to summarize the self-training experiment:

0:12:25 | for the UBM and TV training, we selected segments produced by the speaker segmentation. We only keep the segments with a duration above ten seconds,

0:12:37 | and we also tuned the BIC parameters so that the segments are pure, because to train the TV matrix we need clean segments: we need only one speaker in each segment for training.

0:12:53 | As for the PLDA, we need several sessions per speaker, from various episodes.

0:12:57 | So first we perform an i-vector clustering-based diarization, and use the output speaker clusters to perform i-vector normalization and estimate the PLDA parameters.

0:13:08 | We just select the output speaker clusters with i-vectors coming from more than three episodes.
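A hedged sketch of that selection step; `assignments` is a hypothetical list of (cluster_id, episode_id, i-vector) triples from the cosine-based diarization, and only clusters spanning more than three episodes are kept for PLDA estimation.

```python
from collections import defaultdict

assignments = [("c1", "ep01", None), ("c1", "ep02", None),
               ("c1", "ep03", None), ("c1", "ep04", None),
               ("c2", "ep01", None)]                 # toy data; None = i-vector

episodes_in_cluster = defaultdict(set)
for cluster_id, episode_id, _ in assignments:
    episodes_in_cluster[cluster_id].add(episode_id)

kept = {c for c, eps in episodes_in_cluster.items() if len(eps) > 3}
plda_training_set = [(c, iv) for c, ep, iv in assignments if c in kept]
print(kept)                                          # here: {'c1'}
```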

0:13:22 | So we saw that we are not able to train a sufficient system with only the target data, so we decided to add some training data into the mix.

0:13:36 | It's the classic idea of domain adaptation.

0:13:41 | The main difference in this system compared with the baseline is that, in this experiment, the UBM and TV matrix are trained on the target data instead of the training data. Then we extract i-vectors from the training data and estimate the PLDA parameters on the training data.

0:14:05 | When replacing the UBM and TV matrix, we are able to improve by around one percent absolute in terms of DER.

0:14:18 | Now, why not try to apply the same process as in the self-training experiments, and take the speaker clusters to estimate new PLDA parameters?

0:14:30 | As before, the estimation of the PLDA parameters fails; we think we really don't have enough data to do so.

0:14:40 | So we just decided to combine the training data and the target data to update the PLDA parameters, the classic domain adaptation scenario, but we don't use any weighting parameter to balance the influence of the train and target data.

0:15:01 | We just take the i-vectors from the training data and the i-vectors from the output speaker clusters, combine them, and train new PLDA parameters.
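A hedged sketch of that unweighted pooling; the sizes are toy values, `train_plda` is a stand-in for the toolkit's PLDA training call, and the only real point is that the target clusters get labels in a space disjoint from the training speakers.

```python
import numpy as np

rng = np.random.default_rng(3)
train_ivecs = rng.standard_normal((5000, 200))       # labeled training i-vectors
train_labels = rng.integers(0, 500, size=5000)       # 500 training speakers (toy)

target_ivecs = rng.standard_normal((300, 200))       # i-vectors from output clusters
target_labels = 500 + rng.integers(0, 40, size=300)  # disjoint cluster labels (toy)

pooled_ivecs = np.vstack([train_ivecs, target_ivecs])
pooled_labels = np.concatenate([train_labels, target_labels])
# plda = train_plda(pooled_ivecs, pooled_labels)     # hypothetical toolkit call
```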

0:15:13 | When we combine the data, we again improve on the baseline system, gaining around one percent in terms of DER.

0:15:28 | Now that we've done that, why not try to iterate? As long as we obtain speaker clusters, we can always use them and try to improve the estimation of the PLDA parameters.

0:15:43 | Well, it doesn't work. If you iterate, it doesn't improve the system; we tried two to four iterations, but it doesn't help.

0:15:58 | Let's have a look at the system parameters. We use the S4D speaker diarization toolkit, which is a package built on top of the SIDEKIT library.

0:16:12 | For the front end, we use thirteen MFCCs with delta and delta-delta.

0:16:18 | We use two hundred and fifty-six components to train the UBM; the covariance matrix is diagonal.

0:16:27 | The dimension of the TV matrix is two hundred, and the dimension of the PLDA eigenvoice matrix is one hundred. We don't use any eigenchannel matrix.

0:16:38 | For the speaker clustering task, we use a combination of connected-components clustering and hierarchical agglomerative clustering.

0:16:48 | As I said before, the metric is the diarization error rate, and we use a two-hundred-and-fifty-millisecond collar.
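For illustration, this is how such a DER can be computed with pyannote.metrics; the reference and hypothesis here are toy annotations, not the campaign's scoring setup, and the collar convention should be checked against the library's documentation.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk1"
reference[Segment(10.0, 20.0)] = "spk2"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "A"
hypothesis[Segment(11.0, 20.0)] = "B"

metric = DiarizationErrorRate(collar=0.25)   # 250 ms forgiveness collar
print(f"DER = {metric(reference, hypothesis):.2%}")
```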

0:17:01 | So, to summarize, we compared several different systems. First, we performed supervised training using only external data.

0:17:12 | Then we used the same training process but replaced the training data with the labeled target data: this is the oracle experiment.

0:17:22 | Then we focused on unsupervised training using only the target data, and we saw that it's not good enough when compared with the baseline system.

0:17:34 | So we decided to take back some training data, applying some kind of unsupervised domain adaptation that combines train and target data.

0:17:46 | To conclude, we can say that if we don't have enough data, we absolutely need to use external data to bootstrap the system.

0:17:57 | But even using unlabeled target data, which is imperfectly clustered, with some kind of domain adaptation we are able to improve the system.

0:18:09 | So in our future work, we want to focus on the adaptation framework, where we would like to introduce a weighting between the train and target data.

0:18:27 | We would also like to work on the iterative procedure, because we think that if we are able to better estimate the PLDA parameters after one adaptation iteration, we should be able to improve the quality of the clusters, so some kind of iteration should be possible.

0:18:46 | In fact, this work has been done already: we submitted a paper to Interspeech and it will be presented there.

0:18:55 | So I can already say that using the weighting, the results really do get better,

0:19:05 | and the iterative procedure also works: with two or three iterations, we are able to slowly improve the DER.

0:19:14 | Another way of improving remains to be seen, but we would like to try to bootstrap the system with any unlabeled data.

0:19:22 | For example, we could take the training data, drop the labels, and perform cosine-based clustering, because we saw that in our approach maybe we didn't have enough data in the target corpus to apply this idea. So maybe trying to bootstrap with more unlabeled data could work.

0:19:47 | Well, thank you, that's all.

0:19:55 | [unintelligible]

0:20:06 | (Question) Thank you for the talk. I think this is more a comment than a question, but I believe that some of your problems with the EM for the PLDA are because your speaker subspace dimension is higher than your number of speakers.

0:20:20 | (Answer) I think that's the problem, with the dimensions I mentioned for the TV and PLDA. When we don't have enough target data, the problem is that it is difficult to estimate the one hundred dimensions of the PLDA parameters if you don't have that many speakers.

0:20:42 | (Question) Did you try to reduce the dimension? (Answer) No, I didn't focus on that.

0:20:56 | (Question) Thanks for the presentation. In previous work, ILP clustering was used for this kind of task; how does it compare here?

0:21:15 | (Answer) Well, in my experiments, the results are not very different between ILP and agglomerative clustering. I just decided to use agglomerative clustering because it's simpler, and also for computation time; there is not really a big difference between the two, I think.

0:21:50 | (Question) Dealing with these two different datasets, train and target: one thing I saw in your work was that you put them together. Did you weight the data in any specific way?

0:22:07 | (Answer) No, we didn't weight the data: we just took the target clusters and the training clusters and put them together in the same dataset.

0:22:20 | So if you look at the equations, it's the same as if you used a weighting parameter whose value is the relative amount of data between the target and train data, so it is almost equal to zero.

0:22:43 | That's why we need to work on the weighting, because for now the target data has almost no influence.
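A hedged sketch of the kind of weighting being discussed, in the spirit of classic interpolation-based PLDA adaptation; the covariances and the alpha value are placeholders, and the talk does not specify the exact scheme used in the follow-up work.

```python
import numpy as np

def interpolate_plda(Sb_train, Sw_train, Sb_target, Sw_target, alpha):
    """Blend out-of-domain and in-domain PLDA covariances with weight alpha.
    Unweighted pooling corresponds to alpha ~ n_target / (n_train + n_target),
    which is nearly zero here, as the answer above points out."""
    Sb = alpha * Sb_target + (1.0 - alpha) * Sb_train
    Sw = alpha * Sw_target + (1.0 - alpha) * Sw_train
    return Sb, Sw

d = 100
Sb, Sw = interpolate_plda(np.eye(d), np.eye(d),              # toy train covariances
                          2.0 * np.eye(d), 0.5 * np.eye(d),  # toy target covariances
                          alpha=0.5)                         # explicit, balanced weight
```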

0:23:03 | (Question) In your clustering experiments, how do you decide how many clusters to keep?

0:23:13 | (Answer) Well, the clustering is a function of a threshold, and we just select the threshold by experiment. That's why we chose two target corpora: this way, we are able to do an exhaustive search for the threshold on one corpus,

0:23:38 | and then look at whether the same threshold applies to the other corpus. And the clustering threshold is around zero.

0:24:01 | We still have time for a few questions.

0:24:07 | (Question) Okay, so I was curious: you mentioned that in this work the iterations were not helpful, but then you were able to somehow fix that in the next work. What do you think was the main problem?

0:24:25 | (Answer) In this work, the problem is that we don't balance the influence of the training and target data. In the combination of training and target data, we have so much training data that the implicit weighting is heavily in favor of the training data.

0:24:48 | When we change the balance between the train and target data and give more importance to the target data, the system gets better results, and then you see that by iterating you can improve some more, over two or three iterations.

0:25:05 | We also did some kind of score normalization, because when you use the target data to obtain the PLDA parameters, the distribution of the PLDA scores also tends to shift a lot,

0:25:24 | so you need to normalize to keep the same clustering threshold; otherwise you don't cluster at the same operating point at all.

0:25:40 | Okay, if there are no further questions, let's thank the speaker.