Speech Transcript - Short-Duration Speaker Modelling with Phone Adaptive Training

0:00:15	good morning everybody my name is giovanni soldi and i'm going to present
0:00:19	our work short-duration speaker modelling with phone adaptive training
0:00:23	i will first start with a brief introduction in which i will explain
0:00:28	the main motivation of our work
0:00:31	i will then continue by explaining our approach phone adaptive training and then
0:00:36	present the experimental setup and results and i will finally conclude with some conclusions
0:00:41	and some future work directions
0:00:45	so
0:00:46	linguistic variation is a significant source of degradation in many automatic speech
0:00:50	processing applications such as short-duration speaker verification and speaker diarization
0:00:57	in both cases we find ourselves dealing with short-duration segments when we want
0:01:03	to learn a speaker model
0:01:04	so let's suppose we have a ubm model
0:01:07	and we want to learn a speaker model through map adaptation
0:01:10	when we use a long utterance or we have plenty of data the estimated
0:01:15	speaker model
0:01:16	will be near to the ideal speaker model since the phonetic variation will
0:01:22	be marginalised
0:01:23	while when we use a short utterance for example three seconds during the
0:01:28	map adaptation process then the estimated speaker model is far from the
0:01:33	ideal speaker model since the phonetic variation is not marginalised
0:01:38	so the objective of our work is to improve speaker modelling by decreasing
0:01:43	the variation due to the phonetic content while increasing the speaker discrimination
0:01:51	to do this we started from a method called speaker adaptive
0:01:56	training sat which is a technique commonly used in automatic speech recognition and
0:02:02	is used to reduce the speaker variation to estimate models that are more
0:02:08	phonetically discriminant
0:02:12	and to get better estimation
0:02:16	so
0:02:17	the idea is this let's suppose we have the original
0:02:20	acoustic feature space in which we can discriminate between speakers and
0:02:26	we can discriminate also between phones
0:02:28	so in an ideal scenario what sat does is to project the acoustic
0:02:33	features in a space in which all the phonetic information is retained while the
0:02:39	speaker information is discarded
0:02:43	so
0:02:44	if we interchange the roles of speakers and phones we can reach the
0:02:51	opposite result
0:02:53	that means suppressing the phonetic variation while emphasizing the speaker variation
0:02:59	so we have always our original acoustic feature space in which we can discriminate between phones
0:03:06	and speakers and what pat in this ideal scenario does is to
0:03:10	project the features in another acoustic feature space in which all the speaker information
0:03:14	is retained while we cannot discriminate anymore between phones
0:03:20	so
0:03:22	pat was first applied to speaker diarization we had a relative increase in
0:03:27	speaker discrimination of twenty seven percent and a relative decrease in phone discrimination
0:03:32	of features by percent and the speaker and phone discrimination was calculated through the fisher
0:03:38	score as we will see later
0:03:41	however the improvement in the diarization error rate was disappointing
0:03:47	so
0:03:48	what we do in the work we present now is to try to optimize and
0:03:53	evaluate pat in isolation from the convolutive complexity of speaker diarization
0:03:59	and this is done by addressing the problem of speaker modelling in the
0:04:04	case when the training data is scarce and by performing
0:04:10	small-scale speaker verification experiments
0:04:15	by using a database that is manually labelled at the phonetic level
0:04:21	so i will proceed to our approach in phone adaptive training we use
0:04:26	extensively constrained maximum likelihood linear regression cmllr
0:04:32	so cmllr is a technique to reduce
0:04:37	the mismatch between an adaptation dataset and an initial model
0:04:44	by estimating an affine transformation
0:04:48	so let's assume we have a gaussian mixture model with initial mean mu and
0:04:53	initial covariance sigma so the cmllr transform estimates an n times n matrix
0:05:00	a where n is the dimension of the feature space and an n
0:05:06	dimensional bias vector b
0:05:08	so that we can adapt the mean and the covariance sigma of our initial model
0:05:12	through this equation we just shift the mean and we calculate the variance in
0:05:17	this way
0:05:17	and what is the really important thing of cmllr that we use extensively
0:05:22	in pat is that we have the possibility of transforming an acoustic feature an
0:05:27	adaptation feature so that we can map it back to our initial model by
0:05:32	applying
0:05:36	the inverse transformation
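
The equivalence between adapting the model and inverse-transforming the feature can be checked numerically. The sketch below is only an illustration: it assumes the common convention in which the adapted mean is `A @ mu + b` and the adapted covariance is `A @ sigma @ A.T` (the slide's exact equation is not in the transcript), and the values of `A`, `b`, `mu`, `sigma` and `x` are random placeholders rather than estimated transforms.

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Log-density of a full-covariance Gaussian at a single vector x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = d @ np.linalg.solve(sigma, d)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)

def adapt_model(mu, sigma, A, b):
    """Model-space view: shift the mean and transform the covariance."""
    return A @ mu + b, A @ sigma @ A.T

def normalise_feature(x, A, b):
    """Feature-space view: map an adaptation feature back to the initial model."""
    return np.linalg.solve(A, x - b)

rng = np.random.default_rng(0)
n = 3                                           # feature dimension
mu = rng.normal(size=n)                         # initial mean (placeholder)
sigma = np.diag(rng.uniform(0.5, 2.0, size=n))  # initial covariance (placeholder)
A = np.eye(n) + 0.1 * rng.normal(size=(n, n))   # hypothetical n x n transform
b = rng.normal(size=n)                          # hypothetical bias vector
x = rng.normal(size=n)                          # one adaptation feature

mu_hat, sigma_hat = adapt_model(mu, sigma, A, b)
# Likelihood of x under the adapted model ...
lhs = gauss_logpdf(x, mu_hat, sigma_hat)
# ... equals likelihood of the normalised feature under the initial model,
# up to the Jacobian term log|det A|.
rhs = gauss_logpdf(normalise_feature(x, A, b), mu, sigma) - np.linalg.slogdet(A)[1]
print(np.isclose(lhs, rhs))
```

Under this convention the two views differ only by the `log|det A|` term, which is why the features can be normalised while the initial model is left untouched.
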
0:05:38	and
0:05:40	at the base of pat there is cmllr and pat as we
0:05:44	said aims to suppress the phonetic variability in order to provide
0:05:49	more speaker-discriminative features
0:05:52	so we suppose to have a set of s speakers and p phones
0:05:56	and a set of utterances that are parameterized
0:06:01	by a set of acoustic features
0:06:03	and we suppose also to have a set of initial speaker models
0:06:07	always gaussian mixture models
0:06:10	so what
0:06:11	pat does is to estimate a set of gaussian mixture models that are
0:06:16	normalized across phones and for each phone p we aim to estimate a cmllr
0:06:22	transform that captures the phone variation across the speakers and this
0:06:28	is done
0:06:28	by solving a maximum likelihood problem and it is
0:06:34	done iteratively
0:06:35	so
0:06:37	let's suppose we have a fixed number of iterations and in the initial iteration
0:06:42	we start by giving as input the initial features and the initial
0:06:49	speaker models
0:06:51	so for each phone we estimate the cmllr transform that is common to all the
0:06:56	speaker models and we estimate this transform by using the data for
0:07:04	that particular phone from all the speakers
0:07:07	so we use the cmllr transform to normalize the feature vectors by using the transformation that
0:07:14	we saw before
0:07:16	that is by applying the inverse transformation
0:07:19	the normalized features are then used to estimate the set
0:07:24	of speaker models that are normalized across phones
0:07:28	so if we are at the last iteration then we get our
0:07:32	normalized features and finally our speaker models that
0:07:39	are normalized across the phones
0:07:41	otherwise we give as input to the next iteration
0:07:45	the obtained features and the obtained
0:07:48	speaker models
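
The iterative loop described above can be sketched in a deliberately reduced form. This is not the full method from the talk: it assumes a bias-only transform (the matrix A fixed to the identity), single-Gaussian speaker models with a shared identity covariance, and synthetic data. Under those assumptions the maximum-likelihood bias for a phone is simply the mean residual between that phone's frames and the speaker means. The function name `pat_bias_only` and all data below are hypothetical.

```python
import numpy as np

def pat_bias_only(feats, spk, phn, n_iter=10):
    """Bias-only stand-in for PAT: alternate phone transforms and speaker models."""
    x = feats.astype(float).copy()
    speakers, phones = np.unique(spk), np.unique(phn)
    for _ in range(n_iter):
        # Speaker models: one mean vector per speaker, from the current features.
        spk_means = {s: x[spk == s].mean(axis=0) for s in speakers}
        residual = x - np.stack([spk_means[s] for s in spk])
        for p in phones:
            frames = phn == p
            # Bias shared by all speaker models, estimated from every
            # speaker's frames of phone p.
            bias = residual[frames].mean(axis=0)
            # Inverse transform with A = I: remove the phone-specific bias.
            x[frames] -= bias
    # Final speaker models, re-estimated on the normalised features.
    spk_means = {s: x[spk == s].mean(axis=0) for s in speakers}
    return x, spk_means

def phone_mean_spread(x, phn):
    """Total variance of the per-phone mean vectors (lower = less phone variation)."""
    means = np.stack([x[phn == p].mean(axis=0) for p in np.unique(phn)])
    return means.var(axis=0).sum()

# Synthetic frames: speaker offset + phone offset + noise (placeholder data).
rng = np.random.default_rng(0)
n_frames = 2000
spk = rng.integers(0, 4, n_frames)
phn = rng.integers(0, 5, n_frames)
spk_offsets = rng.normal(scale=3.0, size=(4, 2))
phn_offsets = rng.normal(scale=3.0, size=(5, 2))
feats = spk_offsets[spk] + phn_offsets[phn] + rng.normal(size=(n_frames, 2))

norm_feats, models = pat_bias_only(feats, spk, phn)
print(phone_mean_spread(feats, phn), phone_mean_spread(norm_feats, phn))
```

On this toy data the spread of the per-phone means shrinks after normalisation while the speaker offsets are preserved, which is the qualitative behaviour the talk attributes to the full cmllr-based procedure.
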
0:07:51	so
0:07:53	in our case when we deal with short-duration utterances of course
0:07:57	we don't have much data to estimate the transform for each phone
0:08:02	so what we do is to estimate the transform for an acoustic class
0:08:08	that is
0:08:10	a set of phones so the transforms and acoustic classes are determined
0:08:16	by using a binary regression tree in which the main node is initialized
0:08:21	with all the phones
0:08:23	and according to linguistic rules we split this main node by choosing
0:08:30	the split that maximizes the likelihood of the training data
0:08:34	and this is done as long as the increase in the likelihood
0:08:38	is above a fixed threshold
0:08:41	so when
0:08:43	we reach the last
0:08:46	iteration we calculate a transform
0:08:50	for each of the acoustic classes
0:08:53	and the phones in an acoustic class share the same transform
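
The greedy splitting of the phone set into acoustic classes might look roughly like the sketch below. It rests on several assumptions that are not in the talk: each node is scored with a single diagonal-covariance Gaussian fitted to its own frames, the candidate splits come from a hypothetical list `questions` of phone sets standing in for the linguistic rules, and the data are synthetic.

```python
import numpy as np

def node_loglik(x):
    """Log-likelihood of frames under their own ML diagonal Gaussian (floored variance)."""
    n_frames, dim = x.shape
    var = x.var(axis=0) + 1e-8
    return -0.5 * n_frames * (dim * np.log(2 * np.pi) + np.log(var).sum() + dim)

def grow_classes(feats, phn, questions, threshold):
    """Split phone sets greedily while the best likelihood gain exceeds threshold."""
    classes = [set(phn.tolist())]            # root node holds all the phones
    while True:
        best = None
        for i, node in enumerate(classes):
            parent = node_loglik(feats[np.isin(phn, list(node))])
            for q in questions:
                left, right = node & q, node - q
                if not left or not right:
                    continue                  # question does not split this node
                gain = (node_loglik(feats[np.isin(phn, list(left))])
                        + node_loglik(feats[np.isin(phn, list(right))])
                        - parent)
                if best is None or gain > best[0]:
                    best = (gain, i, left, right)
        if best is None or best[0] <= threshold:
            return classes                    # no split is worth making any more
        _, i, left, right = best
        classes[i:i + 1] = [left, right]      # replace the node by its two children

# Toy data: two vowel-like and two fricative-like phones (placeholder labels).
rng = np.random.default_rng(0)
phn = np.repeat(np.array(["a", "i", "s", "f"]), 100)
feats = rng.normal(size=(400, 2))
feats[np.isin(phn, ["s", "f"])] += 5.0
questions = [{"a", "i"}, {"a"}, {"s"}]        # hypothetical linguistic questions
classes = grow_classes(feats, phn, questions, threshold=50.0)
print(classes)
```

Each resulting class would then receive one shared transform, so phones with too little data borrow statistics from acoustically similar phones.
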
0:08:59	so
0:09:00	in our experimental setup what we need
0:09:05	to evaluate and optimize pat in an ideal scenario
0:09:09	is a database in which we have short-duration sentences
0:09:16	we have clear and accurate phonetic transcriptions and we have a limited level
0:09:23	of noise and channel variation this is because we want
0:09:26	to estimate the
0:09:30	performance of pat in an ideal and optimized scenario
0:09:33	so
0:09:35	by taking into account these considerations we concluded that the nist databases
0:09:46	do not fit our needs due to
0:09:50	the lack of target speaker and phonetic transcriptions
0:09:55	to the channel variation and to the different types of noise that compromise the recordings
0:10:01	so we based our choice on the timit database because
0:10:07	it is a collection of high-quality read speech sentences
0:10:14	each sentence lasts three seconds on
0:10:19	average and is manually transcribed at the phonetic level and the
0:10:25	database has limited noise and
0:10:30	there is no channel variation
0:10:33	so
0:10:34	the timit database is composed of six hundred thirty speakers of which four hundred thirty
0:10:39	eight are males and one hundred ninety two are females and each speaker contributes
0:10:45	ten sentences with an average duration of three seconds
0:10:51	we divided the database so that data from four hundred
0:10:57	sixty two speakers were used to learn the ubm while the recordings
0:11:03	from the remaining speakers
0:11:05	one hundred sixty eight were used for asv experiments automatic speaker verification experiments
0:11:10	and pat performance is analyzed by using from one to seven sentences
0:11:16	to learn the speaker model
0:11:21	so the first step was to
0:11:24	segment our utterances into speech and non-speech segments according to the ground
0:11:32	truth transcriptions
0:11:33	we then extracted the features that were canonical mel-frequency cepstral coefficients twelve
0:11:39	plus energy plus delta and acceleration coefficients
0:11:43	we then estimated speaker models by map adaptation from the ubm models
0:11:52	with from four to one thousand twenty four gmm components
0:11:57	and by using the initial features and
0:12:01	the initial speaker models we applied pat starting from acoustic classes that
0:12:10	were obtained from the initial set of thirty eight phones and we
0:12:15	finally got our normalized features and our normalized speaker models
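
For readers unfamiliar with the map adaptation step, a minimal sketch of mean-only relevance map adaptation of a diagonal-covariance ubm follows. The relevance factor of 16 is a commonly used value and is an assumption here (the talk does not state it), as are the toy ubm parameters and the function name `map_adapt_means`.

```python
import numpy as np

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (UBM) to frames x."""
    # Log of the diagonal-Gaussian densities, shape (frames, components).
    diff = x[:, None, :] - means[None, :, :]
    log_dens = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                       + (diff ** 2 / variances).sum(axis=2))
    # Component posteriors per frame, computed stably in the log domain.
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    n_k = post.sum(axis=0)                               # soft frame counts
    e_k = (post.T @ x) / np.maximum(n_k, 1e-10)[:, None]  # data mean per component
    alpha = (n_k / (n_k + relevance))[:, None]           # data-dependent weight
    # Interpolate between the data mean and the UBM mean.
    return alpha * e_k + (1.0 - alpha) * means

# Toy UBM with two components and a very short "utterance" of three frames.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
x = np.ones((3, 2))
adapted = map_adapt_means(x, weights, means, variances)
print(adapted)
```

With only three frames the first mean moves just 3/19 of the way towards the data and the second component is left at its ubm value, which illustrates why a speaker model learned from a short utterance stays close to the ubm.
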
0:12:22	so pat performance was assessed on two different asv systems a traditional
0:12:27	gmm-ubm system and a state-of-the-art i-vector plda system
0:12:33	we performed our baseline experiments with the initial set of features that
0:12:39	we defined before whereas
0:12:43	the asv experiments with pat were performed by using the pat
0:12:49	normalized speaker features
0:12:53	so i will go on with the experimental results
0:12:56	as i said before to assess the speaker and the phone discrimination we
0:13:01	decided to use the fisher score
0:13:05	let's suppose we have
0:13:08	s classes and a set of n labelled features
0:13:12	by labelled i mean each feature is
0:13:16	annotated with the
0:13:17	class it belongs to
0:13:19	so the speaker and phone discrimination
0:13:22	is calculated through the fisher score
0:13:27	in which we have at the numerator the inter-class distance where mu
0:13:32	is the mean of each class
0:13:34	and at the denominator we have the
0:13:38	intra-class distance
0:13:41	which basically represents the spread of the features around their own mean in the
0:13:45	class
0:13:46	so
0:13:52	if the inter-class distance increases it means that the numerator is higher while
0:13:58	the more the features are spread out
0:14:03	around their mean the higher the denominator is
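
A small sketch of a fisher-style score with this numerator and denominator is given below. The exact formula on the slide is not in the transcript, so the form used here (count-weighted squared distance of class means from the global mean, divided by the summed squared distance of features from their own class mean) is an assumption, and the data are synthetic.

```python
import numpy as np

def fisher_score(feats, labels):
    """Ratio of inter-class scatter to intra-class scatter for labelled features."""
    overall = feats.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in np.unique(labels):
        xc = feats[labels == c]
        mu_c = xc.mean(axis=0)
        # Numerator: distance of the class mean from the global mean.
        between += len(xc) * np.sum((mu_c - overall) ** 2)
        # Denominator: spread of the class features around their own mean.
        within += np.sum((xc - mu_c) ** 2)
    return between / within

# Two classes with the same spread, far apart versus close together.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
noise = rng.normal(size=(400, 2))
far = noise + 4.0 * labels[:, None]
near = noise + 0.5 * labels[:, None]
print(fisher_score(far, labels), fisher_score(near, labels))
```

Using speaker identities as `labels` gives a speaker discrimination value and using phone identities gives a phone discrimination value, so the same function can track both quantities across iterations.
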
0:14:08	so
0:14:10	in our experiments we calculated the speaker discrimination and the phone discrimination after ten iterations
0:14:16	of pat and we show that the speaker discrimination has a relative increase of forty
0:14:22	percent after ten iterations
0:14:25	while the phone discrimination
0:14:27	has a
0:14:28	relative decrease of fifty percent
0:14:31	so
0:14:35	this is good because
0:14:37	it goes along with the previous results that we got in our previous work
0:14:42	however i will pass now to the automatic speaker verification experiments
0:14:49	so as it is possible to observe
0:14:51	we performed
0:14:53	our speaker verification experiments by using models from
0:14:57	four
0:14:58	to one thousand twenty four gmm components for both gmm-ubm and i-vector
0:15:03	plda the first thing to note is that i-vector plda
0:15:09	performs much better than the gmm-ubm system the scale is different
0:15:16	also we can see that we have always
0:15:22	better performance than the baseline system
0:15:28	another thing to note is that for lower model complexity we can
0:15:35	reach better performance than the baseline or similar performance
0:15:40	to the baseline
0:15:42	and the result for the models trained with one
0:15:47	sentence
0:15:48	and for seven sentences is that
0:15:52	we get comparable performance with the baseline
0:15:56	but with
0:15:58	lower model complexity for example with forty two gmm-ubm components
0:16:03	we get the same performance as the baseline when using two hundred fifty six
0:16:08	components
0:16:09	and the same in the i-vector system where we got better performance with forty two
0:16:14	components
0:16:16	by using pat
0:16:19	compared to the baseline system with two hundred fifty six
0:16:23	components
0:16:25	so
0:16:27	in these two tables i'm going to
0:16:30	present the results
0:16:35	independently from the model size used
0:16:38	so these are the results
0:16:41	obtained by using an optimal model size for the speaker model
0:16:46	and we can see that for the i-vector plda system we have a
0:16:52	fifty percent increase in performance of course in an ideal and optimized environment
0:16:59	while for the gmm-ubm system we got better performance
0:17:04	when using one and three training sentences and we
0:17:09	got comparable results when using five and seven sentences
0:17:14	and
0:17:16	in these det plots we can still see the results
0:17:21	as before these are the results when using one single sentence and we can see
0:17:27	that we have a
0:17:28	fifty percent decrease in the eer of the i-vector plda system and in
0:17:33	the gmm-ubm system the lines are less far apart but we have
0:17:38	a decrease in the eer from forty two to three six percent
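
The equal error rate read off a det plot, and a relative decrease such as the fifty percent mentioned above, can be computed as in the sketch below. The threshold sweep is a simple approximation chosen for brevity, and the scores and the error rates passed to `relative_reduction` are made-up placeholders, not results from the talk.

```python
import numpy as np

def equal_error_rate(target, impostor):
    """Approximate EER: operating point where false accepts and false rejects meet."""
    thresholds = np.sort(np.concatenate([target, impostor]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([np.mean(target < t) for t in thresholds])
    # False acceptance rate: impostor trials scored at or above the threshold.
    far = np.array([np.mean(impostor >= t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

def relative_reduction(baseline, new):
    """Relative decrease of an error rate with respect to a baseline."""
    return (baseline - new) / baseline

# Placeholder scores: perfectly separated target and impostor trials.
target_scores = np.array([2.0, 3.0, 4.0])
impostor_scores = np.array([0.0, 1.0])
print(equal_error_rate(target_scores, impostor_scores))

# Placeholder error rates: halving the error is a fifty percent relative decrease.
print(relative_reduction(0.10, 0.05))
```
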
0:17:45	and
0:17:47	to conclude
0:17:48	in this work we addressed the problem of speaker modelling in
0:17:52	the case when training data is scarce that is when using short-duration
0:17:57	utterances
0:17:58	and we optimized and evaluated pat at the speaker modelling level
0:18:04	by performing small-scale speaker verification experiments
0:18:09	and by using the timit database that is labelled at the phonetic level
0:18:15	and we showed that pat is successful in reducing phone bias
0:18:22	and improves significantly the performance of both systems gmm-ubm and i-vector
0:18:27	plda
0:18:28	and what is worth noting is also that pat is able to provide
0:18:34	equivalent or better performance by using a lower model complexity
0:18:41	and
0:18:43	for the future work we aim to go back to our original goal that is pat
0:18:47	for speaker diarization
0:18:49	we want to explore automatic approaches to acoustic
0:18:57	class transcription because pat actually doesn't need
0:19:02	phonetic transcription as long as we
0:19:07	are able to label the audio in a way that we can map the features to
0:19:11	a particular acoustic class
0:19:13	we can calculate the transform for that particular class and we can
0:19:19	finally improve the performance of the system
0:19:23	and one final point is combining pat and speaker-independent approaches to phone
0:19:30	normalization
0:19:33	thank you for your attention
0:20:02	your i-vector extractor was trained with short sentences as well because
0:20:08	plda was obviously trained on i think with the sentences okay so you didn't
0:20:13	merge for example sentences from the same speaker to create a big
0:20:17	sentence so that you train the i-vector extractor this way
0:20:21	for example for one speaker we used all the sentences we put
0:20:24	them together and we
0:20:26	okay so it's okay so what i wanted to understand was if you used short sentences so
0:20:31	that is exactly what
0:20:50	wise that's used you don't have a couple of minutes
0:20:55	i think balls
0:20:57	r is a
0:21:00	and timit
0:21:01	is a much too simple database for
0:21:05	because of all the
0:21:08	you a beetle
0:21:11	well close to zero percent
0:21:14	so what are the challenges in
0:21:17	text dependent i mean this should be applied everywhere i mean if it works so
0:21:21	well
0:21:23	what are the challenges
0:21:26	sorry what in real life i mean this is not being
0:21:30	used in many systems
0:21:33	right so
0:21:34	i mean you should be employed if we have no ever
0:21:40	so actually this was the aim of the work to optimize and
0:21:45	evaluate pat in an ideal scenario because as i said we tried to apply pat
0:21:49	to speaker diarization but the problem was that we didn't that comp in this
0:21:55	enough to
0:21:57	because the result was disappointing like we got a really little improvement
0:22:02	in the diarization error rate so we said okay pat is not
0:22:08	we tried to find out the
0:22:12	upper limit of performance that can be shown by pat but i mean you have been using
0:22:16	timit there are many versions of timit
0:22:22	with noise conditions or telephone bandwidth conditions
0:22:28	it has been transformed in many ways so why don't you use these
0:22:34	because we used timit with the phonetic transcription also for this
0:22:40	since we want to optimize pat in the ideal condition we
0:22:44	this
0:22:47	so i would think would be interesting to see that this is one
0:22:57	of the risk of a primary all is very quickly
0:23:02	the major impediment to progress in this field is lack of data
0:23:12	i towards been very generous and making available we also data but that's pretty much
0:23:16	three
0:23:17	the only
0:23:19	dataset a recent phone search the so that we have so to work on
0:23:24	so
0:23:26	i mean one was experience with realtors novels that the problem really as part of
0:23:31	our
0:23:32	we're not going to be able to make progress as far as i can
0:23:36	see
0:23:38	unless we find some way of sharing resources among researchers we
0:23:42	need a on this program
0:23:48	you are working with an industrial partner that probably collects some data right
0:24:09	mutual benefit to sharing data
0:24:12	then we could probably
0:24:15	make sure progress in this way we otherwise it's not software or how we're going
0:24:20	to the one
0:24:30	thank you for those points patrick
0:24:33	i just wanted to mention that in odyssey two thousand one when we became
0:24:39	odyssey there was considerable effort put forward to creating these standard text-dependent corpora
0:24:47	to distribute to the participants both per se converse and new ones
0:24:54	put together these nice text-dependent datasets we distributed them to the odyssey members in advance
0:25:03	and planned to have a whole track with text-dependent speaker verification
0:25:08	and the sad news was only a couple of sites participated
0:25:13	so i think craig greenberg was
0:25:18	implying maybe a similar issue with the hasr evaluations so a lot of these
0:25:26	you know it has to be a two-way street to go to the effort
0:25:28	and expense to put together corpora
0:25:31	and then have a reasonable number of participants want to take on the challenge so
0:25:37	if there's been a shift in interest to
0:25:41	text-dependent verification i
0:25:44	think it would be good as a community to get together and figure that out
0:25:47	and put together some evaluation

Short-Duration Speaker Modelling with Phone Adaptive Training

Text-dependent Speaker Recognition

Giovanni Soldi, Simon Bozonnet, Federico Alegre, Christophe Beaugeant and Nicholas Evans