0:00:42 I am a student of computer science in college,
0:00:49 and I am glad to show you
0:00:51 this study of the effects of intrinsic variation using i-vectors in text-independent speaker verification.
0:01:00 First I will introduce the main challenge in speaker verification,
0:01:04 and then I will
0:01:07 talk about the problem of our research
0:01:10 and our proposal.
0:01:13 Then I will introduce the i-vector framework for the variation modeling,
0:01:19 including the compensation of the intrinsic variation in speech.
0:01:25 And then I will introduce
0:01:29 the experiments,
0:01:30 including the intrinsic variation corpus,
0:01:32 the description of the speaker verification systems,
0:01:35 and the experimental results,
0:01:38 and finally the conclusions.
0:01:43 The variability in speaker verification comes from two parts. The first one is
0:01:48 the extrinsic variability,
0:01:51 and the other one is the intrinsic variability.
0:01:55 The extrinsic variability is associated with factors that
0:01:58 come from outside of the speakers, such as mismatched channels
0:02:02 or environmental noise.
0:02:05 The intrinsic variability is associated with factors that come
0:02:10 from the speakers themselves,
0:02:12 such as speaking style,
0:02:15 speaking rate and health state.
0:02:19 There has been a lot of research
0:02:22 focused on the extrinsic variability,
0:02:25 but only a few studies about
0:02:29 the intrinsic variability have been proposed, so
0:02:32 in this paper we focus on the intrinsic variability,
0:02:37 that is, the effects of the intrinsic variation
0:02:41 in speaker verification.
0:02:46 The problem we focus on is
0:02:48 the performance of speaker verification
0:02:51 under matched and mismatched conditions of the intrinsic variation in speech.
0:02:55 So there are two questions.
0:02:58 The first one is:
0:02:59 how does the speaker verification system perform
0:03:01 when enrollment and testing are under matched and mismatched conditions of the intrinsic variation?
0:03:07 And the second one is:
0:03:09 how can we compensate for or model the intrinsic variation,
0:03:13 that is, address the effects of the intrinsic variation in speaker verification?
0:03:19 So our work
0:03:23 is to propose modeling the intrinsic variation with the i-vector framework and
0:03:28 session compensation techniques.
0:03:34 First we have to define the variation forms,
0:03:38 because the intrinsic variation comes from
0:03:41 factors associated with the speakers,
0:03:44 but these factors are complex.
0:03:46 So we first
0:03:48 define the base form, that is, neutral spontaneous
0:03:53 speech at a normal rate,
0:03:58 which covers most cases.
0:03:59 Based on that,
0:04:02 we then define the
0:04:04 variation forms
0:04:06 from six aspects, including speaking rate,
0:04:09 physical state,
0:04:11 speaking volume,
0:04:12 emotional state, speaking style and speaking language.
0:04:16 For example, in the speaking rate
0:04:19 we have
0:04:19 fast speech or slow speech.
0:04:22 In the physical state, we have
0:04:27 muffled speech.
0:04:29 For example, the muffled speech means
0:04:32 the speakers have a candy in their mouths;
0:04:37 in that way
0:04:41 the resulting speech data
0:04:45 has a muffled quality.
0:04:50 In the speaking volume we have loud, soft and whisper.
0:04:55 In the emotional state we have the happy
0:04:58 emotion and the angry emotion.
0:05:01 And in
0:05:03 the speaking style
0:05:04 we have the reading style.
0:05:07 And for the speaking language we have a local Chinese dialect.
0:05:13 So from these six aspects
0:05:15 we have the
0:05:18 variation forms, and we
0:05:21 recorded the data
0:05:23 for the experiments.
0:05:29 Then we use the i-vector framework to model the variation,
0:05:33 since i-vector modeling has been successful in the application
0:05:38 for channel compensation.
0:05:42 The i-vector framework is composed of two parts. First,
0:05:46 we can project the supervector
0:05:51 into the i-vector through the total variability space,
0:05:57 which is a low-dimensional space.
0:06:01 The second part is the scoring:
0:06:04 we can use the cosine similarity score
0:06:08 to evaluate the
0:06:10 similarity between a test
0:06:13 utterance and a target speaker.
0:06:19 So both modeling and scoring are performed with the
0:06:22 i-vector framework.
0:06:25 Because
0:06:28 the i-vector framework was originally proposed for
0:06:32 channel compensation,
0:06:37 we want to see if it is suitable for
0:06:40 the intrinsic variation we are interested in.
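The cosine scoring step described here can be sketched as follows. This is a minimal illustration with hypothetical names, not the system's actual code, assuming the i-vectors are plain NumPy arrays:

```python
import numpy as np

def cosine_score(w_test, w_target):
    """Cosine similarity between a test i-vector and a target-speaker i-vector."""
    return float(np.dot(w_test, w_target) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target)))

# Toy 3-dimensional "i-vectors": identical vectors score 1, orthogonal ones 0.
same = cosine_score(np.array([1.0, 2.0, 2.0]), np.array([1.0, 2.0, 2.0]))
orth = cosine_score(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
print(same, orth)  # 1.0 0.0
```

A verification trial would then compare this score against a decision threshold.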
0:06:46 The second part is how to alleviate the effects of the intrinsic variation.
0:06:49 For this we use a set of techniques
0:06:53 which are usually used for session compensation.
0:06:59 Here we use LDA and WCCN.
0:07:04 The idea behind the LDA is to
0:07:07 minimize the within-speaker variability while maximizing the between-speaker variability.
0:07:14 After we
0:07:15 define the criterion,
0:07:17 the LDA projection matrix is obtained,
0:07:23 which is composed of
0:07:25 the eigenvectors
0:07:26 with the largest eigenvalues of the corresponding eigenvalue equation.
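The LDA criterion just described (maximize between-speaker scatter while minimizing within-speaker scatter) can be sketched like this. It is a simplified illustration with hypothetical names, not the paper's implementation:

```python
import numpy as np

def lda_matrix(ivectors, labels, k):
    """Top-k eigenvectors of the between/within speaker scatter ratio."""
    mu = ivectors.mean(axis=0)
    d = ivectors.shape[1]
    Sw = np.zeros((d, d))  # within-speaker scatter
    Sb = np.zeros((d, d))  # between-speaker scatter
    for spk in np.unique(labels):
        X = ivectors[labels == spk]
        m = X.mean(axis=0)
        Sw += (X - m).T @ (X - m)
        Sb += len(X) * np.outer(m - mu, m - mu)
    # Generalized eigenvalue problem Sb v = lambda Sw v (small ridge keeps Sw invertible)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:k]]

# Two toy "speakers" separated along the first axis, projected to one dimension.
X = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 0.1], [5.2, -0.1]])
y = np.array([0, 0, 1, 1])
A = lda_matrix(X, y, 1)
proj = X @ A
```

In the projected space the two speakers' utterances stay well separated while their within-speaker jitter is suppressed.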
0:07:33 The second technique is within-class
0:07:34 covariance normalization,
0:07:37 that is, WCCN.
0:07:38 The idea is to
0:07:42 de-emphasize the directions of high within-speaker variability,
0:07:47 which is
0:07:49 done through a projection.
0:07:51 The WCCN projection matrix is obtained by
0:07:56 a Cholesky decomposition, as in this equation:
0:08:02 in the decomposition,
0:08:04 B is the
0:08:06 projection matrix,
0:08:08 and it satisfies B B^T = W^{-1},
0:08:13 where W is
0:08:26 the within-class
0:08:30 covariance matrix.
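The WCCN matrix described here, a Cholesky factor of the inverse averaged within-class covariance with B B^T = W^{-1}, can be sketched as follows; function and variable names are hypothetical:

```python
import numpy as np

def wccn_matrix(ivectors, labels):
    """Return B with B @ B.T = inv(W), W = average within-speaker covariance."""
    d = ivectors.shape[1]
    speakers = np.unique(labels)
    W = np.zeros((d, d))
    for spk in speakers:
        X = ivectors[labels == spk]
        W += np.cov(X, rowvar=False, bias=True)  # per-speaker covariance (1/n)
    W /= len(speakers)
    return np.linalg.cholesky(np.linalg.inv(W))

# Toy data: 4 "speakers", 10 utterance i-vectors each, dimension 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3)) + np.repeat(rng.normal(size=(4, 3)), 10, axis=0)
y = np.repeat(np.arange(4), 10)
B = wccn_matrix(X, y)
```

Applying B^T to the i-vectors whitens the within-speaker covariance (B^T W B = I), which is exactly the de-emphasis of high intra-speaker directions mentioned above.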
0:08:36 Now I will introduce the experiments, which show how these techniques perform
0:08:42 under the intrinsic
0:08:45 variation.
0:08:47 First I will introduce the intrinsic variation corpus, which we have recorded,
0:08:52 and how we
0:08:53 use the data for the training and the testing.
0:08:57 Then I will give a description of
0:09:00 the speaker recognition systems:
0:09:04 we use the GMM-UBM baseline system,
0:09:08 which is the classical speaker recognition system, and then we use
0:09:14 the
0:09:15 i-vector based speaker recognition systems with different
0:09:19 intrinsic variation compensation
0:09:22 techniques. And then we show the experimental results.
0:09:30 The intrinsic variation corpus that we use
0:09:34 consists of
0:09:36 one hundred
0:09:37 university students,
0:09:42 whose native language is Chinese.
0:09:45 Each of them was asked to produce
0:09:51 the variation forms just described.
0:09:58 Each student speaks for three minutes
0:10:02 for each variation form.
0:10:07 Then each utterance lasts about ten to
0:10:14 eighteen seconds, and that is used for training and testing.
0:10:27 We use the data divisions in the intrinsic variation corpus as follows.
0:10:33 The portion of the data used for training the UBM
0:10:38 includes thirty speakers,
0:10:41 fifteen male and fifteen female, which is used to train the gender-dependent and gender-independent UBMs;
0:10:48 it lasts for eighteen hours and covers all the
0:10:52 variation forms.
0:10:54 Then we use thirty speakers'
0:10:58 data to train
0:11:00 the total variability space, which is the matrix T;
0:11:05 it also lasts for eighteen hours.
0:11:08 And of course we have to
0:11:11 train
0:11:15 the intrinsic variation compensation matrices,
0:11:18 that is, the LDA and the WCCN projection matrices.
0:11:23 We use forty
0:11:24 speakers' data
0:11:29 for training the projection matrices,
0:11:32 which includes two thousand four hundred utterances
0:11:38 for the task,
0:11:41 covering all the variation forms.
0:11:46 Then we have five
0:11:48 speaker recognition systems.
0:11:50 We use the GMM-UBM
0:11:54 speaker recognition system as the baseline system.
0:11:56 In this system,
0:11:58 the GMM-UBM is composed as follows:
0:12:05 the feature front-end is thirteen-dimensional MFCC, and the UBM is composed of
0:12:11 five hundred and twelve Gaussian mixtures.
0:12:15 And the i-vector based speaker verification systems
0:12:20 use the LDA and the WCCN, and also the combination of the two,
0:12:27 and the i-vector dimension in them is two hundred.
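Putting the pieces together, a compensated i-vector system like those listed here projects each i-vector with the LDA matrix and then the WCCN matrix before cosine scoring. A minimal sketch with hypothetical names (in the real systems A and B are trained on held-out speakers as described above):

```python
import numpy as np

def compensated_score(w_test, w_target, A, B):
    """LDA projection (A), then WCCN projection (B), then cosine scoring."""
    def project(w):
        return B.T @ (A.T @ w)
    a, b = project(w_test), project(w_target)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With identity projections this reduces to plain cosine scoring.
s = compensated_score(np.array([3.0, 4.0]), np.array([3.0, 4.0]),
                      np.eye(2), np.eye(2))
print(s)  # 1.0
```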
0:12:35 This table
0:12:38 shows the equal error rate for each enrollment condition when the testing utterances
0:12:44 cover the total variation forms.
0:12:47 First, for the speaker recognition
0:12:50 we choose the spontaneous speech as the base case.
0:12:55 Then we have
0:12:56 the six aspects, including speaking style, speaking rate, emotional state, physical state, volume of
0:13:05 speech, and language.
0:13:07 These give
0:13:08 the variation forms,
0:13:11 and each variation form
0:13:15 is used
0:13:16 for the enrollment condition
0:13:19 and tested.
0:13:21 Here we tested with all the variation forms, and we can see that the
0:13:27 i-vector based systems
0:13:29 perform much better than the GMM-UBM
0:13:33 baseline system.
0:13:34 The best results are obtained by the system
0:13:38 which uses a combination
0:13:40 of LDA and WCCN.
0:13:44 And also we have
0:13:50 looked at the different variation forms.
0:13:52 We found that if you use the whisper
0:13:55 as the enrollment,
0:13:59 the EER is
0:14:01 very high,
0:14:03 so the system performs poorly.
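The EER figures reported throughout are the operating point where the false-acceptance and false-rejection rates are equal. A small self-contained sketch of how one might compute it from trial scores (a hypothetical helper, not the paper's scoring tool):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Scan score thresholds for the point where FAR and FRR meet."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    best = (1.0, 0.0)  # (FAR, FRR) at an extreme threshold
    for t in np.sort(np.concatenate([target_scores, impostor_scores])):
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted
        frr = np.mean(target_scores < t)     # targets wrongly rejected
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2.0

# Perfectly separated scores give an EER of zero.
e = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
print(e)  # 0.0
```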
0:14:10 Then we
0:14:11 calculated the average EER for each speaker recognition system on the intrinsic
0:14:16 variation corpus,
0:14:17 and from this table we can see that
0:14:21 the i-vector based speaker recognition systems are better
0:14:26 than the GMM-UBM
0:14:28 speaker recognition system
0:14:29 on the intrinsic variation corpus.
0:14:34 The best results are obtained by the i-vector
0:14:39 speaker recognition system with the combination of LDA and WCCN.
0:14:44 Finally,
0:14:52 this is the DET curve of the speaker recognition systems:
0:14:57 it shows the GMM-UBM baseline system
0:15:04 and
0:15:07 the i-vector system with LDA and WCCN.
0:15:11 We can see
0:15:12 there are big improvements in the performance.
0:15:19 This figure shows
0:15:22 the comparison between the GMM-UBM system and the i-vector system
0:15:29 under matched and mismatched conditions.
0:15:31 First, we can see that the first two columns are for the matched conditions,
0:15:38 and the last two are for the mismatched conditions.
0:15:42 And here
0:15:43 we computed the EER for each of the
0:15:47 variation forms,
0:15:48 and we can see that, for each variation form,
0:15:54 in the mismatched conditions
0:15:56 the EER is much bigger
0:15:59 than in the matched conditions.
0:16:02 And second, we can
0:16:04 compare the GMM-UBM system and the i-vector based system.
0:16:11 For example, for the spontaneous
0:16:14 form,
0:16:16 one set of bars is for the GMM-UBM and the yellow ones are for the i-vector systems, and
0:16:27 we can also see that,
0:16:30 when enrolling with the whisper
0:16:32 variation, the i-vector system still
0:16:36 has a high EER.
0:16:40 Next,
0:16:45 this table shows, for each testing condition, the results when spontaneous
0:16:51 utterances are used for enrollment,
0:16:54 since
0:16:57 the most natural way we speak
0:17:01 is spontaneously. So we can see what happens when testing with each variation form.
0:17:09 When the enrollment form is the spontaneous one, if you also test with the spontaneous form,
0:17:14 the EER should be small, and the best results are obtained
0:17:20 with this matched condition.
0:17:24 Also, if for the enrollment we use the
0:17:29 spontaneous form but test with the whisper
0:17:33 variation, then we found that
0:17:36 the EER is much higher,
0:17:39 and the whole performance
0:17:42 of the
0:17:45 speaker recognition system drops.
0:17:48 So, since the whisper variation is
0:17:53 very different from the other variation forms,
0:17:55 we
0:17:58 present this table, which shows, if you
0:18:01 enroll with whisper utterances,
0:18:03 what the EER is for each testing condition.
0:18:08 And we can see that
0:18:11 the results
0:18:13 become much worse;
0:18:15 for example, for the GMM-UBM system it is around
0:18:20 forty percent.
0:18:22 The best results are obtained in the matched condition, which is
0:18:28 seventeen percent.
0:18:32 From the whole picture we can see that
0:18:34 the i-vector based speaker recognition system
0:18:37 still
0:18:38 performs well:
0:18:39 it performs better than the GMM-UBM system,
0:18:45 and the combination of LDA and WCCN
0:18:48 again performs best.
0:18:55 So we have the following conclusions.
0:18:58 First, the mismatch caused by the intrinsic variation degrades the speaker recognition performance, as the channel variation does.
0:19:04 And the second is that the i-vector framework works better than the GMM-UBM in modeling
0:19:10 the variations,
0:19:12 and especially the combination of
0:19:16 LDA and WCCN can get the best results.
0:19:20 Third, the whisper utterances are much different from the other variation forms,
0:19:26 which even degrades the matched-condition speaker recognition performance.
0:19:31 So for future work, in the model domain we will try more useful
0:19:38 intrinsic variation compensation,
0:19:40 and also in the feature domain we
0:19:45 will propose some
0:19:46 robust features.
0:19:50 For example,
0:19:53 with the whisper variation the results are the worst;
0:19:59 maybe this is
0:20:02 because, after the VAD,
0:20:04 the length of the
0:20:07 whisper utterances is much shorter than that of the normal
0:20:11 speech.
0:20:13 The second reason is that the whisper speech
0:20:18 spectrum is much different from the other speech sounds, so we can do some
0:20:24 work in the feature domain
0:20:26 to improve the performance of the speaker verification system.
0:20:31 That's all, thank you.
0:20:51 We recorded this database ourselves, and the speakers are
0:20:58 all students.
0:21:11 They have to act the emotion,
0:21:16 yes, to act it somehow.
0:21:44 Yes, for example,
0:21:55 if the speech parameters change, we may
0:21:59 have to alter how we model the emotional states.
0:22:07 So when we recorded the database we tried to change just
0:22:12 one variation, for example the emotion, while keeping
0:22:15 the other forms of variation fixed,
0:22:17 so we just tried to
0:22:20 separate the variation factors.
0:22:41 Maybe we will
0:22:44 consider that in future work.
0:22:47 Thank you.