Speech transcript - Investigation of Speaker-Clustered UBMs based on Vocal Tract Lengths and MLLR matrices for Speaker Verification

0:00:07	but it's not
0:00:08	yeah um
0:00:10	i guess it's the end of a long day so
0:00:12	thanks for staying
0:00:14	um
0:00:15	this is basically well almost
0:00:17	there is a lot of overlap with whatever
0:00:19	the first
0:00:20	speaker did for this particular session
0:00:23	um which is basically trying to find out
0:00:25	if uh
0:00:29	um basically
0:00:31	we can have
0:00:31	uh different background models
0:00:33	for different sets of speakers
0:00:35	and what we are proposing at least in this paper is that uh
0:00:39	we can uh have
0:00:40	uh speakers clustered according to the vocal tract length
0:00:43	and also uh another way of doing it is trying to use
0:00:46	a similarity between the mllr matrices
0:00:49	and we show that uh using uh
0:00:50	a few sets of
0:00:51	these uh speaker clusters
0:00:53	we can obviously get some improvement in uh performance as opposed to
0:00:57	using a single
0:00:58	uh ubm
0:00:59	so uh the overview of the talk is
0:01:02	uh pretty much uh
0:01:04	as the title indicated one is to
0:01:06	do a brief review of
0:01:07	the conventional speaker verification where we
0:01:09	very often use
0:01:10	a single background model
0:01:12	and uh then uh
0:01:14	the the reason why we might want to use
0:01:16	speaker cluster wise
0:01:17	background models
0:01:18	and there are two ways you could do the clustering at least in this paper that's what we are suggesting
0:01:23	one is to use
0:01:24	the vocal
0:01:25	tract
0:01:25	tract length parameter itself
0:01:27	and the other is to use a speaker dependent mllr matrix
0:01:30	supervector
0:01:32	and uh then we
0:01:33	show how we can build background models for each of these
0:01:36	uh individual speaker clusters
0:01:38	and then we compare the performance
0:01:40	first with the
0:01:41	uh single gender independent ubm
0:01:44	and then we compare with a gender dependent uh ubm
0:01:48	and the gender dependent uh
0:01:50	speaker cluster model
0:01:53	so uh some of this
0:01:55	overlaps with what the first speaker
0:01:57	actually said here
0:01:58	uh so
0:01:59	as he was pointing out uh it's basically a binary decision problem so given the features
0:02:04	and some claimed identity
0:02:06	we're trying to find out
0:02:07	if the identity uh we compare
0:02:10	uh the log likelihood ratio with the claimed model
0:02:13	and an alternate model
0:02:15	and see if if it's
0:02:17	beyond a certain threshold uh then we accept that or
0:02:20	we reject it
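The accept or reject rule described here can be sketched in a few lines. This is only an illustration: the function names, the use of an average per-frame log likelihood ratio, and the default threshold of 0.0 are assumptions, not details given in the talk.

```python
def log_likelihood_ratio(frame_ll_claimed, frame_ll_alternate):
    """Average per-frame log likelihood ratio (claimed model vs alternate model).

    Both arguments are lists of per-frame log likelihoods of the same
    test utterance, one under each model.
    """
    if len(frame_ll_claimed) != len(frame_ll_alternate):
        raise ValueError("both models must score the same frames")
    diffs = [c - a for c, a in zip(frame_ll_claimed, frame_ll_alternate)]
    return sum(diffs) / len(diffs)


def verify(frame_ll_claimed, frame_ll_alternate, threshold=0.0):
    """Accept the claimed identity when the ratio exceeds the threshold.

    The threshold value is a hypothetical default; in practice it is tuned
    on development data.
    """
    return log_likelihood_ratio(frame_ll_claimed, frame_ll_alternate) > threshold
```

The only thing that changes between the systems discussed in the talk is which model supplies the alternate scores.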
0:02:21	and so uh the question is
0:02:23	what should be uh the alternate hypothesis uh one of course
0:02:27	is
0:02:27	yeah
0:02:28	to say that
0:02:30	uh yeah
0:02:31	one is to say that the alternate hypothesis is
0:02:33	a universal background model
0:02:36	which is a single model
0:02:37	that people use for all speakers
0:02:39	in the database
0:02:40	um then there are other approaches where we have
0:02:43	uh
0:02:43	a set of
0:02:44	uh
0:02:45	speaker models
0:02:46	cohorts that are close to a particular speaker
0:02:49	uh and so we take a linear combination or some combination of these
0:02:52	uh
0:02:53	scores
0:02:54	or we could build a background model
0:02:56	using these cohorts themselves
0:02:58	for that particular speaker so one
0:03:00	has
0:03:00	one background model for all speakers
0:03:03	the other has
0:03:04	uh a background model for each speaker
0:03:06	uh
0:03:07	the the other way of doing it is
0:03:09	to have some compromise between the two
0:03:11	which is to say that i'll have a background model for a group of speakers
0:03:15	and then the question becomes how do i group
0:03:17	speakers
0:03:17	so we're proposing
0:03:19	two different ways that we can group the speakers
0:03:21	uh one is
0:03:22	basically using
0:03:23	the vocal tract length parameter
0:03:25	and the other is to use a speaker
0:03:27	specific
0:03:28	mllr matrix
0:03:30	so um so this is the basic idea that we are talking about so instead of using one background model
0:03:35	and then uh comparing the likelihood with that background model and the corresponding speaker uh
0:03:40	claimed model
0:03:42	well we actually have
0:03:44	different sets of models
0:03:45	for different speaker clusters
0:03:47	and uh so how we're gonna build these speaker cluster background models is what we are uh what we are
0:03:53	going to talk about in the next slide
0:03:55	and the speaker clustering itself was basically done uh using either the vocal tract length parameter
0:04:00	or the maximum likelihood mllr supervector
0:04:04	so um the idea or the motivation for trying to use the vocal tract length parameter for speaker clustering
0:04:10	is
0:04:11	because
0:04:11	uh
0:04:12	uh you know basically
0:04:13	physiological differences in the vocal tract are going to give rise to some differences in the
0:04:18	spectra
0:04:19	so uh
0:04:21	what's shown here is uh in the dark line a male speaker and the
0:04:26	uh solid line a female speaker so there are differences in the spectra for the same vowel
0:04:31	for the simple reason that the uh physiology of the production
0:04:34	system is very different for a male speaker and a female speaker
0:04:37	in terms of size
0:04:38	and therefore
0:04:39	we assume that if
0:04:40	uh
0:04:41	if
0:04:41	a group of speakers have a similar
0:04:44	vocal tract length parameter or similar physiology
0:04:47	they probably produce a very similar
0:04:49	set of spectral characteristics for the sound
0:04:52	and therefore we can group these speakers together
0:04:54	and assume that they have very similar characteristics in terms of
0:04:57	uh
0:04:58	features that they produce for a particular sound
0:05:00	uh obviously when we do vtln
0:05:04	i mean we do not have
0:05:05	uh a reference
0:05:06	speaker
0:05:07	and therefore one has to uh you know use some sort of model a reference model
0:05:12	if you will
0:05:12	and that's where we're trying to use uh the background model
0:05:15	itself as a reference model against which we are going to score
0:05:19	uh different
0:05:20	uh features for the different
0:05:22	warp parameters
0:05:23	and choose the one that does
0:05:25	uh the best for that so for each speaker basically
0:05:28	uh
0:05:28	uh his or her
0:05:30	uh vocal tract parameter is estimated with respect to the background model
0:05:34	this is similar to whatever is done in speech recognition too
0:05:37	but we use the
0:05:38	ubm
0:05:39	instead of the speaker independent model
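The warp factor search described here, scoring differently warped features against the background model and keeping the best, can be sketched as a grid search. Everything concrete below is an assumption for illustration: the reference model is a single one-dimensional Gaussian rather than a GMM UBM, the warp is a plain scaling of the feature (real vtln warps the frequency axis before feature extraction and may include a Jacobian term), and the grid of 0.80 to 1.20 in steps of 0.02 is a common choice rather than something stated in the talk.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log density of a 1-D Gaussian, standing in for the UBM score."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def estimate_warp_factor(features, warp, ref_mean, ref_var, alphas):
    """Return the alpha whose warped features score highest under the reference model."""
    def total_loglik(alpha):
        return sum(gaussian_loglik(warp(x, alpha), ref_mean, ref_var) for x in features)
    return max(alphas, key=total_loglik)


def toy_warp(x, alpha):
    """Hypothetical warp: scale the feature by alpha."""
    return x * alpha


# Candidate warp factors 0.80, 0.82, ..., 1.20 (assumed grid).
alphas = [round(0.80 + 0.02 * i, 2) for i in range(21)]

# Toy speaker whose features sit at 10 / 1.1, so warping by about 1.1
# moves them onto the reference mean of 10.
features = [10.0 / 1.1 + d for d in (-0.1, 0.0, 0.1)]
best_alpha = estimate_warp_factor(features, toy_warp, 10.0, 1.0, alphas)
```

Because every speaker ends up on one of a small number of grid values, the estimated alpha can be used directly as a cluster label.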
0:05:41	uh the other way that we could
0:05:43	uh possibly classify speakers
0:05:45	uh into groups
0:05:46	is to use
0:05:47	the mllr matrix itself
0:05:49	uh and there has been lots of evidence that mllr does capture quite a bit of information about a
0:05:55	particular speaker
0:05:56	so we
0:05:57	stack
0:05:57	the columns of the mllr uh matrix to form a supervector
0:06:01	and then we do a very simple uh
0:06:04	clustering of these mllr supervectors
0:06:07	among speakers in the database uh using the standard k-means algorithm
0:06:11	and just using the simple euclidean distance
0:06:13	to cluster these different speakers
0:06:15	so given the ubm
0:06:17	and the speaker training data we get an mllr matrix for each speaker
0:06:20	we stack the columns
0:06:22	uh to form a supervector
0:06:23	so this identifies or characterizes a speaker
0:06:27	and then we
0:06:28	group the speakers depending on
0:06:30	uh the clusters that are formed by those
0:06:32	uh mllr supervectors
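The two steps just described, stacking the columns of each speaker's mllr matrix into a supervector and clustering the supervectors with Euclidean distance, might look roughly like this. It is a minimal sketch: the k-means variant, its farthest-point initialisation, the iteration count and the tiny 2 by 2 matrices are assumptions chosen to keep the example deterministic, not details from the talk.

```python
def stack_columns(matrix):
    """Stack the columns of a row-major matrix into one flat supervector."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every supervector to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy per-speaker transforms (hypothetical values).
speaker_matrices = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.1, 0.0], [0.0, 0.9]],
    [[5.0, 1.0], [1.0, 5.0]],
    [[5.1, 1.0], [1.0, 4.9]],
]
supervectors = [stack_columns(m) for m in speaker_matrices]
labels, _ = kmeans(supervectors, k=2)
```

With these toy matrices the first two speakers fall into one cluster and the last two into the other.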
0:06:36	so um
0:06:37	and then
0:06:38	uh so now that we have
0:06:40	uh sort of grouped
0:06:41	these speakers into different classes
0:06:43	uh we build
0:06:45	a different background model for each of these
0:06:47	uh groups of speakers
0:06:49	and what we have done here is to just basically use a simple
0:06:52	mllr adaptation of the
0:06:54	ubm model
0:06:55	to uh get a new set of
0:06:57	means
0:06:58	for
0:06:59	each of these speaker cluster background models and each of these cluster adapted models
0:07:03	is got from the ubm by just a transformation of the means
0:07:06	and these are estimated
0:07:08	the the transformation matrix
0:07:10	is estimated by using
0:07:11	all the data from a particular speaker cluster so that's what is written here
0:07:15	so given the ubm you form uh for each cluster
0:07:18	its own
0:07:19	background model
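Once a transform has been estimated for a cluster, forming that cluster's background model amounts to applying it to every ubm mean. The sketch below shows only this application step, assuming a single global affine transform per cluster of the form mu' = A mu + b; estimating A and b from the cluster's data needs the usual maximum likelihood procedure, which is not shown, and the function name is hypothetical.

```python
def mllr_transform_means(ubm_means, A, b):
    """Apply mu' = A * mu + b to every Gaussian mean of the UBM.

    Weights and covariances are assumed to be left exactly as in the UBM;
    only the means move.
    """
    adapted = []
    for mu in ubm_means:
        new_mu = [
            sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)
        ]
        adapted.append(new_mu)
    return adapted


# Toy two-component UBM with two-dimensional means (hypothetical values).
ubm_means = [[1.0, 2.0], [0.0, 0.0]]
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
cluster_means = mllr_transform_means(ubm_means, A, b)
```

Since one transform is shared by all Gaussians, a cluster model costs very little extra storage beyond the ubm itself.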
0:07:20	so this clustering will be based on either using
0:07:23	vtln alpha as a parameter so speakers with the same alpha are clustered into one group
0:07:28	or it could be uh
0:07:30	a set of
0:07:31	mllr
0:07:32	uh
0:07:33	clustered
0:07:34	speakers
0:07:35	so this is the implementation aspect so given the ubm
0:07:38	um first i do an identification of each of the speakers in the database
0:07:43	and find out what was the corresponding vtln parameter so let's say if i'm looking at the vtln
0:07:48	parameter one point
0:07:49	two zero
0:07:50	i find that speakers three four six all of them have this parameter so i just group them together
0:07:55	so and then if i'm looking at the vtln parameter point eight two
0:07:58	the speaker ids
0:07:59	two eight nine
0:08:00	uh possibly belong to this
0:08:02	so i group them together
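The grouping in this example is just a bucketing of speaker ids by their estimated warp factor. A small sketch using the numbers mentioned in the talk (speakers 3, 4 and 6 at 1.20, speakers 2, 8 and 9 at 0.82); the function name and the dictionary layout are assumptions.

```python
from collections import defaultdict


def group_by_warp_factor(speaker_alphas):
    """Map each warp factor to the sorted list of speaker ids that share it."""
    clusters = defaultdict(list)
    for speaker_id, alpha in sorted(speaker_alphas.items()):
        clusters[alpha].append(speaker_id)
    return dict(clusters)


# Speaker id -> estimated warp factor, as in the spoken example.
speaker_alphas = {3: 1.20, 4: 1.20, 6: 1.20, 2: 0.82, 8: 0.82, 9: 0.82}
clusters = group_by_warp_factor(speaker_alphas)
```

Each resulting group then supplies the data for one cluster background model.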
0:08:03	and then using this group of
0:08:05	speakers
0:08:06	i transform the gmm ubm
0:08:08	to form
0:08:09	a background model
0:08:11	which basically is an
0:08:12	mllr adaptation to this particular
0:08:14	group of speakers
0:08:16	and then i do the individual speaker modelling by doing a map adaptation
0:08:21	uh
0:08:22	so this is the background model
0:08:23	uh and then from the background model for each of the individual speakers
0:08:26	i use
0:08:27	the corresponding data to do map adaptation
0:08:31	so
0:08:31	the same idea can be used
0:08:32	uh if i had used uh uh clustering
0:08:35	of speakers based on
0:08:36	mllr itself
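The map adaptation step that derives each speaker model from its cluster background model can be sketched with the common relevance-factor form of mean-only adaptation. The talk does not say which variant or relevance factor was used, so the formula choice, the value 16 and the argument names below are assumptions; the sufficient statistics (soft counts and first-order sums per Gaussian) are taken as already computed.

```python
def map_adapt_means(prior_means, counts, first_order_sums, relevance=16.0):
    """Mean-only MAP adaptation of a background model towards speaker data.

    prior_means:      per-Gaussian means of the cluster background model
    counts:           per-Gaussian soft counts n_k from the speaker's data
    first_order_sums: per-Gaussian sums of posterior-weighted feature vectors
    relevance:        assumed relevance factor (hypothetical default)
    """
    adapted = []
    for mu, n, fx in zip(prior_means, counts, first_order_sums):
        if n == 0:
            # No data seen for this Gaussian: keep the background mean.
            adapted.append(list(mu))
            continue
        w = n / (n + relevance)
        adapted.append([w * (s / n) + (1.0 - w) * m for s, m in zip(fx, mu)])
    return adapted


# Toy model: two 1-D Gaussians, only the first one sees speaker data.
speaker_means = map_adapt_means(
    prior_means=[[0.0], [10.0]],
    counts=[16.0, 0.0],
    first_order_sums=[[32.0], [0.0]],
)
```

In the toy call the first mean moves halfway from 0 towards the data mean of 2, and the second mean stays at the background value.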
0:08:41	so
0:08:41	um so if you look at the test phase it is almost similar to whatever the conventional case is
0:08:46	except for two small differences so given the test utterance
0:08:50	i find uh
0:08:51	the uh basically likelihood ratio by comparing the speaker model
0:08:55	and the background model
0:08:56	in the conventional case there will be one single ubm sitting
0:09:00	and the
0:09:01	the the speaker model is got by adapting this
0:09:04	to get the particular speaker model
0:09:06	and then i do uh you know uh uh
0:09:09	threshold based analysis whether to accept or reject
0:09:12	um the exact same thing is done
0:09:13	but slightly different models are used here
0:09:16	here the background model is actually
0:09:18	built
0:09:18	specifically for that particular speaker
0:09:21	cluster
0:09:22	and then the speaker model is got by adapting this
0:09:25	uh cluster ubm so i have the
0:09:27	speaker model that's slightly different you know
0:09:29	and then again i do a log likelihood ratio
0:09:31	test
0:09:31	so basically both these systems use
0:09:33	identical uh computation cost except that
0:09:36	the models
0:09:37	are slightly different in what we use for the background
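The point about identical test cost can be made concrete: the claimed identity fixes both the speaker model and the cluster, so a trial still scores against exactly two models. In this sketch each model is reduced to a single one-dimensional Gaussian stored as a (mean, variance) pair, and all names and values are hypothetical stand-ins for real gmms.

```python
import math


def gauss_loglik(x, mean, var):
    """Log density of a 1-D Gaussian, standing in for a GMM score."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def score_trial(frames, claimed_id, speaker_models, speaker_to_cluster, cluster_ubms):
    """Average log likelihood ratio against the claimed speaker's own cluster UBM."""
    spk_mean, spk_var = speaker_models[claimed_id]
    # The cluster is looked up from the claimed identity, not searched for.
    bkg_mean, bkg_var = cluster_ubms[speaker_to_cluster[claimed_id]]
    total = sum(
        gauss_loglik(x, spk_mean, spk_var) - gauss_loglik(x, bkg_mean, bkg_var)
        for x in frames
    )
    return total / len(frames)


speaker_models = {"s3": (1.0, 1.0)}
speaker_to_cluster = {"s3": "c0"}
cluster_ubms = {"c0": (0.0, 1.0), "c1": (5.0, 1.0)}
llr = score_trial([1.0, 1.0], "s3", speaker_models, speaker_to_cluster, cluster_ubms)
```

The other cluster model, c1, is never touched for this trial, which is why the cost matches the single ubm system.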
0:09:40	uh
0:09:40	this is uh just
0:09:41	a standard database
0:09:43	that we use uh this is nist
0:09:44	uh
0:09:46	two thousand two
0:09:47	um
0:09:48	for background modelling
0:09:50	and uh
0:09:51	evaluation
0:09:52	uh is one side train and one side test
0:09:54	in nist two thousand four
0:09:58	so uh what we notice is that
0:10:01	um
0:10:02	depending on the number of vtln clusters that we form uh
0:10:06	uh you know depending on how many clusters we allow
0:10:09	uh we see that as the number of clusters increases you do
0:10:12	see some decrease in the
0:10:14	uh in the eer so this is what you would get if you use a single
0:10:18	gender independent ubm
0:10:20	so this is the eer that you would get
0:10:22	if you use vtln
0:10:24	and this is the eer that you would get if you use
0:10:26	mllr based
0:10:27	speaker clustering
0:10:29	we find that uh mllr does
0:10:30	slightly better than
0:10:32	vtln but both of them
0:10:33	give
0:10:34	significantly better
0:10:35	performance
0:10:36	than
0:10:36	um
0:10:37	than
0:10:37	the single ubm based system
0:10:39	the same thing holds true for the minimum dcf also
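For reference, the equal error rate quoted in these results is the operating point where false acceptance and false rejection are equal. The sketch below estimates it from lists of target and impostor scores by sweeping the observed scores as thresholds; this simple sweep and the toy scores are assumptions for illustration, not the evaluation tooling used in the paper.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is greater than or equal to the
    threshold. The estimate is the average of the two error rates at the
    threshold where they are closest.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        false_reject = sum(s < t for s in target_scores) / len(target_scores)
        false_accept = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(false_accept - false_reject)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (false_accept + false_reject) / 2.0
    return best_eer


# Toy scores (hypothetical): one impostor overlaps the target range.
eer = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
```

A lower value means the target and impostor score distributions overlap less.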
0:10:43	uh so the couple of
0:10:45	things that you notice one is that both of them uh both vtln and mllr give some improvement in
0:10:49	performance
0:10:50	as opposed to the single
0:10:51	uh ubm
0:10:53	and mllr performs
0:10:55	slightly
0:10:55	uh
0:10:56	sometimes
0:10:56	uh
0:10:57	quite a bit uh better than vtln
0:10:59	and what we find is that
0:11:00	fourteen uh alpha parameter uh clusters give the best performance
0:11:07	and this is
0:11:07	the corresponding det curve which again shows that mllr is doing a little better than
0:11:13	this black curve which is got by vtln clustering
0:11:16	and the blue one is the regular single uh ubm based
0:11:21	verification
0:11:22	so the question to be asked is why is mllr uh performing better than vtln
0:11:28	and so what i did is i mean there is lots of other information that was available so uh if
0:11:32	you look at the black and the white at the bottom
0:11:34	the black corresponds to basically having female speakers and the white corresponds to having
0:11:39	male speakers so here we have chosen
0:11:42	fourteen clusters which was the one that gave the maximum performance for vtln
0:11:46	and you see that there are a lot of clusters where
0:11:48	vtln has both male and female speakers so if you look at this
0:11:52	uh alpha equals one
0:11:54	there is black and white here which means
0:11:56	there are both male and female speakers for this particular one
0:11:59	similarly for point ninety point nine six
0:12:01	you see that there is some overlap between the male and female speakers
0:12:05	on the other hand when you look at the mllr supervector
0:12:08	and if you look at
0:12:09	uh the black and white
0:12:11	they are very distinct uh the the the clusters that
0:12:13	pick up
0:12:14	the
0:12:14	female speakers are there
0:12:16	and these supervector uh mllr clusters pick up only the male speakers
0:12:21	so there seems to be a
0:12:23	uh
0:12:24	nice
0:12:24	uh purity in terms of clustering
0:12:27	uh according to gender
0:12:28	uh when you use mllr uh
0:12:30	supervectors
0:12:31	and
0:12:32	we think possibly that's one of the reasons why mllr seems to uh consistently
0:12:36	perform better than
0:12:38	vtln
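The gender purity observation can be quantified per cluster as the fraction of speakers belonging to the majority gender. This is a hypothetical measure added for illustration; the talk shows the effect graphically and does not report a purity number.

```python
from collections import Counter, defaultdict


def cluster_gender_purity(cluster_labels, genders):
    """Per cluster, the fraction of speakers that belong to the majority gender."""
    by_cluster = defaultdict(Counter)
    for cluster, gender in zip(cluster_labels, genders):
        by_cluster[cluster][gender] += 1
    return {
        cluster: max(counts.values()) / sum(counts.values())
        for cluster, counts in by_cluster.items()
    }


# Toy assignment: cluster 0 is all female, cluster 1 mixes both genders.
purity = cluster_gender_purity([0, 0, 1, 1, 1], ["f", "f", "m", "m", "f"])
```

A purity of 1.0 for every cluster corresponds to the clean split seen with the mllr supervectors, while mixed vtln clusters would score below 1.0.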
0:12:39	so we just wanted to go one step further and see
0:12:41	if if that was indeed the case then if we separate the clusters
0:12:44	according to gender
0:12:46	then would the gap between mllr and vtln disappear and would we get
0:12:50	very similar performance
0:12:51	using both of them
0:12:52	and that's what the next set of experiments
0:12:54	basically indicate
0:12:56	so here what we have done is
0:12:58	uh now we have a gender wise
0:12:59	ubm so
0:13:00	there are two ubms one for males and one for females obviously you see some improvement in performance
0:13:05	compared to the gender independent ubm
0:13:07	but also what was
0:13:09	uh what what we conjectured seems to be holding true
0:13:12	once we classify once we just uh
0:13:14	do gender wise
0:13:15	uh
0:13:16	splitting of the clusters
0:13:17	then vtln and mllr give almost comparable performance
0:13:23	still mllr is slightly better but uh nevertheless
0:13:26	uh the performance
0:13:27	is
0:13:28	almost comparable
0:13:29	the same thing holds true for
0:13:31	uh the minimum dcf also
0:13:33	so
0:13:33	so the point that we want to make is vtln if you use it just by itself for clustering
0:13:38	sometimes
0:13:39	it gives
0:13:39	not as good performance
0:13:41	for the simple reason that it
0:13:43	uh
0:13:43	picks up both the male and female speakers for the same alpha
0:13:46	but if i do gender wise
0:13:47	uh clustering
0:13:48	then both
0:13:49	mllr and vtln give almost
0:13:51	the same
0:13:52	uh comparable performance
0:13:53	uh and in any case both of these methods of clustering uh always
0:13:57	outperform
0:13:58	uh
0:13:58	the gender wise
0:13:59	single ubm for each
0:14:01	gender
0:14:01	yes
0:14:04	so and that's reflected also in the det curve
0:14:07	you can see that both the ubms uh both
0:14:09	the mllr clustered and the
0:14:12	vtln clustered both of which are gender wise clustered now
0:14:15	have very similar performance and they always do better than a gender wise
0:14:19	ubm
0:14:22	so uh
0:14:23	so the bottom line is that if you are willing to increase
0:14:26	uh the number of background models
0:14:28	and it's not much uh we find that you get a reasonably good performance
0:14:32	if you just use
0:14:33	uh
0:14:34	something like two uh male and two female
0:14:37	clusters
0:14:37	you get
0:14:38	um
0:14:39	some gain in performance
0:14:41	uh both in the gender independent and gender dependent case
0:14:44	uh the computational cost at least at test
0:14:46	is the same as a single ubm because we are just uh comparing the two models
0:14:51	uh mllr supervector uh performs better than vtln in most cases
0:14:55	but the gap
0:14:56	narrows down
0:14:57	if you're willing to use
0:14:58	a gender wise
0:14:59	speaker clustering
0:15:01	so
0:15:02	that's it
0:15:09	we have time for one last question
0:15:20	you cluster the speakers so that they use a uh different ubm depending on the training speakers right
0:15:27	i was like to
0:15:28	you have
0:15:28	but your training
0:15:30	yeah samples yeah you cluster the training speakers so
0:15:33	one training speaker only has one
0:15:36	is associated with one ubm
0:15:37	did i get it right
0:15:38	okay
0:15:38	but now
0:15:39	on
0:15:40	yes
0:15:41	when you have a new sample
0:15:42	uh_huh um
0:15:44	if you're doing
0:15:45	are you talking about
0:15:46	i was
0:15:47	speaker
0:15:48	whether it's speaker verification or oh i see it's a speaker verification task
0:15:52	yeah one particular person only
0:15:55	and then he's using only that ubm yeah
0:15:57	it's not speaker identification uh speaker identification would be much more expensive
0:16:01	yeah
0:16:02	so here
0:16:03	each speaker has an associated uh
0:16:05	background model
0:16:06	but
0:16:06	it's not each speaker having one uh each cluster of speakers has the one background model
0:16:10	right
0:16:10	okay but the the the
0:16:11	reason it's not more expensive is that you're only considering one training speaker
0:16:15	right
0:16:23	'kay
0:16:24	i think there's no more time so thank you very much for attending this session and
0:16:29	joey
0:16:30	oh

SESSION 3: Background modeling in Speaker recognition, Forensics

Added: 14. 7. 2010 11:08, Author: Achintya Kumar Sarkar, S. Umesh (Indian Institute of Technology Madras), Length: 0:16:36