Speech Transcript - Text Dependent Speaker Verification Using a Small Development Set

0:01:00	with a ten
0:01:00	my name is hagai aronowitz and the
0:01:03	title of this talk is text dependent speaker verification using a small development set
0:01:09	okay
0:01:09	so this is the background for this work in two thousand ten
0:01:15	a speaker recognition evaluation
0:01:17	was held by wells fargo bank
0:01:22	the evaluation focused mostly on text-dependent speaker verification
0:01:25	ibm research participated in this evaluation
0:01:29	so basically we presented the results of this evaluation at the last interspeech
0:01:34	and the results were quite satisfactory
0:01:38	however there was some criticism regarding the setup of the evaluation
0:01:45	because in the evaluation the development set was quite large about two hundred
0:01:51	speakers and four sessions per speaker
0:01:54	and the criticism was that for many practical applications and customers
0:02:00	it's not practical to collect such a large dataset
0:02:04	so it was very interesting to see what are the results of the technology when
0:02:10	using a small development set and the small development set was specified as
0:02:14	consisting of one hundred speakers
0:02:18	and only one session per speaker
0:02:20	so there is no multi-session data
0:02:25	there's only one session per speaker
0:02:28	okay so the outline of the talk is as follows
0:02:31	i will first quickly describe the evaluation then i will describe the speaker verification systems that we use
0:02:37	and then i will talk about how we coped with
0:02:42	the reduced development set
0:02:43	and then i will present results
0:02:48	okay so there were three text-dependent authentication conditions in the evaluation the first one is
0:02:55	denoted the global condition
0:02:57	where we use a global digit string such as zero to nine for authentication
0:03:03	the second authentication condition uses a
0:03:06	speaker dependent password
0:03:08	which is also a digit string
0:03:11	and this is denoted the speaker condition
0:03:14	now of course there is the issue of whether the impostor also knows the password or not so
0:03:18	in the evaluation the worst case assumption is made that the impostor
0:03:24	knows
0:03:25	the password so all the trials use the same password the
0:03:30	target
0:03:31	password
0:03:33	and the last condition is called the prompted condition where a prompted random
0:03:38	string is
0:03:40	used for authentication this is the hardest case for accuracy but it's
0:03:46	the most resilient condition against attacks
0:03:50	such as recording attacks
0:03:53	yes
0:03:56	okay
0:03:57	so basically the wells fargo data looks like this it has seven hundred fifty speakers
0:04:04	two hundred were used for development and five hundred fifty for evaluation the data was recorded over
0:04:09	four weeks
0:04:10	with four sessions per speaker two landline and two cellular
0:04:15	and each session consists of all these authentication conditions and a lot more data
0:04:20	that we are going to use in the future like
0:04:23	instead of using digit strings just text
0:04:27	it's
0:04:31	okay so
0:04:32	and for the global and speaker conditions we use
0:04:36	three repetitions of the password for enrollment so
0:04:41	basically someone who wants to enroll to the system has to say three times for example zero
0:04:45	nine and then in
0:04:47	the verification just one time zero nine
0:04:51	suppose that
0:04:53	and the development data is supposed to be used as follows for the global
0:04:57	condition
0:04:58	we are allowed to use the same digit strings as evaluated so if the password is
0:05:03	zero nine then we can use the repetitions of zero nine in the development set
0:05:09	for the speaker and prompted conditions we're not allowed to use repetitions of the
0:05:15	same digit strings for the
0:05:19	the reduced development set is a subset of this it has one hundred of the speakers
0:05:24	with a single session each
0:05:25	the speakers are recorded over landline or cellular
0:05:30	and by the way we were allowed to use any other sources of publicly available
0:05:37	resources
0:05:38	such as nist or switchboard
0:05:41	on top of this development set
0:05:47	okay so here are the
0:05:48	systems that we use for the evaluation we use three text-independent systems the first
0:05:55	one is a joint factor analysis based
0:05:58	system the second one is an i-vector based system
0:06:06	and the third one is gmm nap
0:06:08	we also use a text-dependent system which is hmm supervector based
0:06:15	with nap compensation and we use this system currently only for the global condition
0:06:20	and the final score is a fusion of the scores of all
0:06:26	these systems
0:06:27	which are weighted with
0:06:29	a simple rule
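The fusion step described above can be sketched as a weighted average of per-system scores. The talk only says that the scores are weighted with a simple rule, so the weight values, the system names and the `fuse_scores` helper below are illustrative assumptions rather than the speaker's implementation. Systems that produce no score for a trial (for example the hmm system outside the global condition) are simply left out.

```python
def fuse_scores(scores, weights):
    """Weighted average of the scores of the systems that produced a score.

    scores  -- dict mapping system name to its verification score
    weights -- dict mapping system name to a non-negative fusion weight
    Names and weights are hypothetical; the talk does not give them.
    """
    total = 0.0
    weight_sum = 0.0
    for name, score in scores.items():
        total += weights[name] * score
        weight_sum += weights[name]
    if weight_sum == 0.0:
        raise ValueError("at least one system needs a positive weight")
    return total / weight_sum


# Hypothetical weights; the hmm system is only scored for the global condition.
WEIGHTS = {"jfa": 1.0, "ivector": 1.0, "gmm_nap": 2.0, "hmm_nap": 2.0}

global_trial = {"jfa": 0.2, "ivector": 0.4, "gmm_nap": 1.0, "hmm_nap": 1.5}
speaker_trial = {"jfa": 0.2, "ivector": 0.4, "gmm_nap": 1.0}
print(fuse_scores(global_trial, WEIGHTS))
print(fuse_scores(speaker_trial, WEIGHTS))
```

Normalizing by the sum of the weights actually used keeps fused scores on a comparable scale across conditions with different numbers of systems; that is a design choice of this sketch.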
0:06:32	yeah
0:06:34	okay so
0:06:36	just a few details about the jfa based system
0:06:41	so it's quite standard but we have two specific
0:06:47	modifications that we presented also at interspeech
0:06:50	the first one is
0:06:52	robust scoring
0:06:54	and the second one is estimated with the scoring
0:06:57	and we built the system only from telephone
0:07:02	nist data
0:07:03	we don't use the wells fargo data for building the system
0:07:07	the only component that uses the wells fargo data
0:07:12	is the score normalization the score normalization is actually done
0:07:17	using the
0:07:21	same thing for the i-vector based system it is built from the same data sources
0:07:26	and we only use the wells fargo data for score normalization
0:07:33	the nap system actually makes heavy use of the development data the wells fargo
0:07:39	data we train a ubm and nap from the wells fargo data and we do it
0:07:45	as text matched as possible so for example for the global condition we
0:07:50	train the ubm and nap just from the same text that is being used
0:07:54	in verification
0:07:56	for the speaker and prompted conditions we are not allowed to do that so we just use for example
0:08:01	digit strings
0:08:02	but not the same text
0:08:06	we found that by doing that we gain a lot
0:08:10	we also use a variant of nap which we call two-wire nap which
0:08:15	on top of removing the channel subspace also removes
0:08:21	some dominant components
0:08:23	of the interspeaker variability subspace
0:08:27	because we have consistently found over the years that this
0:08:32	helps
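A minimal sketch of the idea just described: build one projection basis from the dominant channel (within-speaker) directions plus a few dominant inter-speaker directions. It assumes multi-session development data with speaker labels, estimates both subspaces by PCA via SVD, and uses arbitrary subspace sizes. The function names and these choices are assumptions for illustration, not the speaker's implementation.

```python
import numpy as np


def top_directions(rows, k):
    """Return the k dominant right singular vectors of rows as columns."""
    _, _, vt = np.linalg.svd(rows, full_matrices=False)
    return vt[:k].T


def two_wire_nap_basis(supervectors, speaker_ids, n_channel, n_speaker):
    """Orthonormal basis: channel directions plus dominant inter-speaker directions."""
    ids = np.asarray(speaker_ids)
    speakers = np.unique(ids)  # sorted unique speaker labels
    means = np.stack([supervectors[ids == s].mean(axis=0) for s in speakers])
    # Within-speaker deviations approximate session/channel variability.
    within = supervectors - means[np.searchsorted(speakers, ids)]
    # Centered speaker means approximate inter-speaker variability.
    between = means - means.mean(axis=0)
    channel = top_directions(within, n_channel)
    # Deflate the channel part first so the stacked basis stays orthonormal.
    between = between - (between @ channel) @ channel.T
    speaker = top_directions(between, n_speaker)
    return np.hstack([channel, speaker])


rng = np.random.default_rng(0)
data = rng.normal(size=(20, 10))       # toy data: 5 speakers x 4 sessions
ids = np.repeat(np.arange(5), 4)
basis = two_wire_nap_basis(data, ids, n_channel=3, n_speaker=2)
compensated = data - (data @ basis) @ basis.T
print(basis.shape)  # (10, 5)
```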
0:08:33	yeah
0:08:34	we also use a geometric mean comparison kernel
0:08:38	was
0:08:41	but which control
0:08:43	and
0:08:45	okay
0:08:46	and we do score normalization again using the wells fargo data
0:08:50	the hmm supervector based system is very similar to the gmm nap system
0:08:56	the only difference is that instead of extracting gmm supervectors we extract hmm supervectors
0:09:02	and the rest of the system is the same so basically hmm supervectors are
0:09:06	extracted as follows instead of training a ubm we train a speaker independent hmm from the development
0:09:12	data
0:09:13	and then in order to extract these supervectors we just
0:09:18	take a session and use its data to estimate a session dependent hmm using map
0:09:25	adaptation and we just take the gmm means from the different states normalize them and
0:09:31	concatenate them
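The last step of the extraction described above (taking the adapted means of every state, normalizing them and stacking them) can be sketched as follows. The talk does not say which normalization is used; dividing by the standard deviations of the speaker-independent hmm is an assumption borrowed from common gmm supervector practice, and `hmm_supervector` is a hypothetical helper name. The map adaptation itself is not shown.

```python
import numpy as np


def hmm_supervector(state_means, state_stds):
    """Stack normalized per-state GMM means into one supervector.

    state_means -- list with one (n_gaussians, dim) array of adapted means per state
    state_stds  -- matching list of standard deviations from the
                   speaker-independent HMM (assumed normalization)
    """
    parts = []
    for means, stds in zip(state_means, state_stds):
        parts.append((np.asarray(means) / np.asarray(stds)).ravel())
    return np.concatenate(parts)


# Toy example: 3 states, 2 Gaussians per state, 4-dimensional features.
rng = np.random.default_rng(0)
means = [rng.normal(size=(2, 4)) for _ in range(3)]
stds = [np.full((2, 4), 2.0) for _ in range(3)]
sv = hmm_supervector(means, stds)
print(sv.shape)  # (24,)
```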
0:09:33	okay so
0:09:35	now i will talk about how we were able to cope with the reduced dataset
0:09:40	is a
0:09:41	when we look at these four different systems we can see that the jfa
0:09:45	and i-vector based systems
0:09:47	are not very
0:09:48	should not be very sensitive to the development set because we're not
0:09:53	using it very much we only use it for score normalization
0:09:57	so for the moment we didn't work on these systems we just
0:10:01	use these systems as is and see what happens
0:10:05	for the nap based systems the problem is much more serious because
0:10:11	we are using the development set very extensively and first of all we
0:10:15	have less data for the ubm and the speaker independent hmm
0:10:20	and second we don't have any multisession speakers
0:10:24	so if we want for example to train nap we will not be able to
0:10:30	and also as quantisation began mistake
0:10:34	so
0:10:36	our focus is on these two systems the gmm based nap and the hmm based
0:10:40	nap systems
0:10:41	and a weekend
0:10:43	we have also a we consider in the in some slides in the results
0:10:48	we focus on these systems because they work much better than jfa and i-vector on this
0:10:52	task so
0:10:53	it's very important to do this
0:10:57	okay so for the gmm based nap system the first component is the ubm
0:11:02	we compare two ways to estimate it training on the
0:11:07	reduced dataset or training on nist data
0:11:10	for nap we compare three methods the first one is to train nap from the
0:11:16	nist data
0:11:17	the second one is to estimate nap from the reduced data although we
0:11:22	don't have multisession speakers
0:11:25	by using an approach that we call common speaker subspace
0:11:30	compensation which we used in two thousand seven
0:11:34	and i will in the next slide explain a bit more about this
0:11:38	approach
0:11:39	and of course the third method is to just combine the two compensation methods and
0:11:44	use both of them
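One way to combine two compensation subspaces, as in the third method, is to stack their bases and re-orthonormalize, so that a single projection removes both. The talk does not say how the combination is done; the SVD-based merge, the tolerance and the names `combine_subspaces`, `nist_basis` and `css_basis` below are assumptions of this sketch.

```python
import numpy as np


def combine_subspaces(basis_a, basis_b, tol=1e-10):
    """Orthonormal basis for the span of two subspaces (columns are directions)."""
    stacked = np.hstack([basis_a, basis_b])
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    # Drop directions that are numerically redundant between the two bases.
    return u[:, s > tol * s.max()]


def remove_subspace(x, basis):
    """Project x onto the orthogonal complement of span(basis)."""
    return x - (x @ basis) @ basis.T


eye = np.eye(4)
nist_basis = eye[:, [0, 1]]   # stand-in for a subspace trained on nist data
css_basis = eye[:, [1, 2]]    # stand-in for the common speaker subspace
combined = combine_subspaces(nist_basis, css_basis)
print(combined.shape)  # (4, 3)
```

Using the SVD instead of plain concatenation matters when the two subspaces overlap, as in this toy case where one direction is shared.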
0:11:46	so this common speaker subspace compensation method is basically
0:11:50	as follows
0:11:51	first
0:11:53	we estimate this subspace from a large set of sessions from many speakers
0:11:59	so in our case we have the one hundred speakers and we
0:12:03	just extract supervectors for these one hundred sessions and we just do pca on these
0:12:10	supervectors
0:12:12	okay and now why is it called common speaker it's just because it is in some way
0:12:18	a representation of
0:12:22	the speaker subspace
0:12:24	i
0:12:25	so maybe contrary to what seems logical we remove this subspace
0:12:30	so instead of focusing the recognition in the speaker subspace we just remove
0:12:35	the dominant components of the speaker subspace
0:12:38	actually the common speaker subspace also contains
0:12:42	components of the channel subspace
0:12:45	so we remove this subspace
0:12:47	and what we get after removing it we call the speaker unique subspace
0:12:53	because
0:12:54	in the space that you get after this removal we
0:12:58	don't expect to have any information that is common to many speakers
0:13:03	because we already removed this
0:13:05	subspace that is common to many
0:13:07	speakers
0:13:08	and the intuition that we have and also examined is that it may be wise to
0:13:12	do recognition in this speaker unique subspace and we got quite interesting results
0:13:18	so this is what i mean
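The common speaker subspace compensation described above (pca on one supervector per speaker, then removal of the dominant components) can be sketched as below. The random toy data, the number of removed components and the function names are assumptions for illustration; the talk does not state the subspace size.

```python
import numpy as np


def common_speaker_subspace(supervectors, n_components):
    """Top principal directions of single-session supervectors (one row each)."""
    centered = supervectors - supervectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components].T  # shape (dim, n_components), orthonormal columns


def remove_subspace(x, basis):
    """Project x onto the orthogonal complement of span(basis)."""
    return x - (x @ basis) @ basis.T


rng = np.random.default_rng(0)
dev = rng.normal(size=(100, 50))   # 100 speakers, one session each (toy data)
basis = common_speaker_subspace(dev, n_components=10)  # 10 is an arbitrary choice
compensated = remove_subspace(dev, basis)
print(np.abs(compensated @ basis).max())  # numerically close to zero
```

What remains after `remove_subspace` corresponds to what the talk calls the speaker unique subspace. No multi-session data is needed, since only one supervector per speaker enters the pca.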
0:13:21	right
0:13:23	okay for the hmm based nap system for the speaker independent hmm we cannot use the nist
0:13:29	data because we need it to be text dependent so
0:13:33	the only choice is to use the reduced dev set
0:13:37	for nap
0:13:38	we have three different methods the first one
0:13:42	is training nap using the common speaker subspace method from the
0:13:48	reduced dev set
0:13:50	the second is to use feature space nap
0:13:54	which is trained from the nist data and the third one is a combination
0:14:01	okay so just before i present the results just to see the
0:14:06	quality of the systems that we use so for nist two thousand eight
0:14:11	on the standard telephone
0:14:15	condition males only
0:14:18	we see that they get the point two
0:14:21	quite a reasonable results in zero the scores jfa of four and i-vector are now
0:14:28	also for that the question is still
0:14:34	okay so these are the results for the jfa and i-vector based systems
0:14:39	first for the matched channel condition that is both enrollment and verification use
0:14:44	the same channel either landline or cellular
0:14:47	what we see here is that
0:14:50	we get a degradation of around twenty five percent for jfa and also
0:14:57	something similar for i-vectors
0:14:59	we don't really understand why
0:15:01	thus
0:15:03	now this is for the mixed channel condition we also see similar
0:15:09	degradation for jfa and i-vectors
0:15:12	i
0:15:13	between seven percent and
0:15:16	each
0:15:17	okay this is what expected because we have a only one hundred sessions those conversations
0:15:23	speaker
0:15:24	okay so for the gmm nap system
0:15:28	we see
0:15:29	that for example training the ubm from nist doesn't give us as
0:15:34	good results as training from the reduced dev set
0:15:38	and also when we look at nap we see
0:15:42	it's actually better to train nap on the reduced dataset using the common speaker subspace
0:15:49	method
0:15:51	and of course if we just combine these subspaces
0:15:55	we get the best results
0:15:57	and
0:15:58	we see that a
0:15:59	we still get quite a large degradation for the global condition forty one percent relative
0:16:04	this is because the global condition makes the most use of the
0:16:08	development data
0:16:10	and the speaker and prompted conditions don't make as much use of
0:16:14	the data because they are not text matched
0:16:17	thus we think that the degradation
0:16:20	is not as severe
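For reference, a relative degradation figure such as the forty one percent quoted above is the increase in error rate divided by the error rate obtained with the full development set. The error rates in this small sketch are hypothetical, picked only to reproduce that percentage; they are not results from the talk.

```python
def relative_degradation(eer_full, eer_reduced):
    """Relative increase in error rate when moving to the reduced set."""
    if eer_full <= 0:
        raise ValueError("baseline error rate must be positive")
    return (eer_reduced - eer_full) / eer_full


# Hypothetical error rates chosen only to illustrate a 41% relative degradation.
print(round(100 * relative_degradation(1.00, 1.41)))  # 41
```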
0:16:23	for the mismatched condition we see quite similar
0:16:28	trends
0:16:30	oh
0:16:32	this is for the hmm based system
0:16:35	i
0:16:36	again
0:16:37	we see that it's better to train nap on the reduced dataset using the common
0:16:43	speaker subspace
0:16:44	compensation
0:16:46	but we do get a some improvements when we just a
0:16:50	two results the user was not used and
0:16:55	and the combination does help some
0:16:57	so
0:17:02	we tried to analyze the hmm system which is the best
0:17:06	system for the global condition which is
0:17:08	the most important of all
0:17:11	to see what is the main source of degradation because we see that we
0:17:15	have some significant degradation
0:17:19	so what we can see from some of these results is that
0:17:23	if we compare the full development set to a system
0:17:28	in which we
0:17:29	use the reduced development set for hmm training
0:17:34	but we don't use it for nap we see that we don't get such a
0:17:39	significant degradation
0:17:41	so the bottom line is that probably the main source of
0:17:47	degradation is the nap training
0:17:53	okay now when we fuse the systems
0:17:55	okay so we see that we get a degradation between thirty percent and points
0:18:02	what we can be
0:18:04	still image database of the results
0:18:07	especially for the global condition which is important in this task
0:18:12	so we still
0:18:13	yeah the zero point six for the right channel condition but we said no mismatch
0:18:20	in addition we might be
0:18:25	so to conclude we validated our systems
0:18:28	on the three authentication conditions using the full development set and the reduced development set
0:18:34	jfa and i-vector degradation is roughly five to fifteen percent
0:18:39	for the nap based systems the degradation is more dramatic due to the
0:18:43	strong use of the wells fargo development data
0:18:46	especially for the global condition
0:18:48	so for
0:18:50	for ubm and speaker independent hmm training the reduced dataset is fine
0:18:55	to use you get some small degradation due to it
0:18:58	but for nap it's really important to do something that is to
0:19:03	do a combination of training on nist and
0:19:06	using the common speaker subspace method
0:19:09	note to get the documents that's
0:19:12	finally for the fused system we got a degradation
0:19:16	percent average
0:19:17	therefore we conclude that a text-dependent system can
0:19:22	be built
0:19:24	even if we don't have any multi
0:19:26	session data
0:19:33	i
0:19:42	i
0:19:43	i
0:19:45	oh
0:19:47	you
0:19:49	what
0:19:50	so for the global condition we are allowed to use the same text
0:19:56	from the development set that is the one hundred
0:20:02	sessions
0:20:03	but for the speaker condition and the prompted condition we are not allowed to use the same
0:20:09	text
0:20:11	we use other digit strings
0:20:16	i
0:20:19	oh
0:20:21	oh
0:20:21	without yeah
0:20:26	yes and it's not obvious it's not
0:20:45	okay the global is just
0:20:47	a fixed digit string
0:20:50	what i can say is that
0:20:51	the
0:20:53	speaker condition is in practice very similar to global because you always
0:21:00	use the same digit string for both enrollment and
0:21:07	verification
0:21:08	but
0:21:10	the use case is that each person has his own digit string
0:21:15	yeah
0:21:16	well the only difference is the use of the development
0:21:20	data
0:21:21	and the prompted condition is where you're prompted with a digit string
0:21:39	a
0:21:41	okay we actually didn't really
0:21:44	what can this so basically that
0:21:48	the results that i that i
0:21:53	presents a actually i
0:21:55	used in some cases you can say it you don't this so i'm not
0:22:02	i
0:22:04	i
0:22:07	i
0:22:10	i
0:22:11	i
0:22:16	they basically we did look at it and we don't see here we don't feel
0:22:20	that is a problem for this application we only need a single class
0:22:25	oh
0:22:26	okay
0:22:27	we just a result
0:22:29	i don't
0:22:34	i
0:22:39	so the idea there is that for example for the development
0:22:44	set here for the global condition
0:22:46	we actually needed to record speakers saying zero nine now what happens if
0:22:52	someone wants to change the password to a different one then we would
0:22:56	need to go again and record speakers
0:22:58	saying the same thing
0:23:01	because we are actually using this for development
0:23:04	i think it's not a weighted thing but i don't business and marketing a
0:23:11	person but from their experience customers really say this
0:23:17	is not practical
0:23:18	when you want to deploy such a system most times
0:23:23	you would not be able to record
0:23:26	so many recordings and i think that it is practical to take one hundred
0:23:32	speakers and record them once but i don't think it's practical to
0:23:36	take two hundred speakers and
0:23:39	record them over four weeks four
0:23:54	yeah because if you have a development set that is
0:23:58	using the same text then you get much better results
0:24:02	if you train your models on actual utterances saying zero nine
0:24:09	and we have this in the paper from last interspeech you
0:24:13	will get much more like i don't know fifty percent reduction in error rate
0:24:18	seventy
0:24:20	than if you just try to use other text for
0:24:24	other things
0:24:32	oh
0:24:33	oh
0:24:35	i
0:24:35	oh
0:24:39	yeah
0:24:42	that there are some cases the more them are a not saying
0:24:49	i
0:24:51	i
0:24:53	oh
0:24:55	i
0:24:56	oh
0:24:57	yeah
0:24:59	the other reason
0:25:01	which are not at a sensory technological perspective
0:25:05	i

Text Dependent Speaker Verification Using a Small Development Set

SESSION 10: Speaker Recognition - Application

Hagai Aronowitz