0:00:06 Good morning everyone, I'm Mitchell McLaren. The work I'll be presenting was done at QUT, although I have now relocated elsewhere, in case anyone is wondering. I'm presenting on behalf of the co-authors as well: Robbie Vogt, Brendan Baker and Sridha Sridharan.
0:00:25 The work today is basically an experimental study of how SVMs perform when you decrease the amount of speech that is available to them for speaker verification.
0:00:36 A brief outline: first the motivation for why we did this study, and then some experiments looking at how each of the components of a standard GMM-SVM system responds to the reduced amount of speech available to it. This includes the background dataset and session compensation, particularly NAP, where we also look at a bit of an analysis of the variation in the kernel space with short utterances, and the score normalisation dataset. Then I'll present some conclusions.
0:01:09 So, motivation. It's quite well known that as you reduce the amount of speech available to a system, you're going to have a reduction in performance. Now, there have been some previous studies, which generally focus on the GMM-UBM approach and, more recently, on joint factor analysis, but nothing really targeted at the SVM case, and this is why we're doing this work here.
0:01:34 One of the things to mention here is that QUT participated in Evalita, which is almost a miniature NIST evaluation, I guess you'd say, in 2009. Some of the observations we got from that evaluation were that the SVM outperformed the JFA system when we had an ample amount of speech (six minutes), whereas the opposite was true for the twenty-second condition, where JFA performed better. There was a distinct difference between the generative and discriminative approaches, depending on the duration of speech in each condition.
0:02:10 Another observation was that JFA was more effective when estimating the session and speaker subspaces on a duration of speech that was similar to the evaluation condition. So we're going to look at that a bit further in this work.
0:02:26 Of course, SVMs are quite prominent in the speaker verification community; we just have to look at the presentations last week for the SRE 2010 evaluation, where almost all submissions had a GMM-SVM configuration in there somehow. So we're looking now at how best to select development data when we have mismatched training and test segment durations in the SVM configuration.
0:02:56 So the main questions here for the SVM systems are: to what degree does limited speech affect the SVM-based classifier, and which system components are most sensitive to reduced speech quantity? We present these results with the hope of pointing out directions in which to counteract these effects, I should say.
0:03:19 Most of you know about the GMM-SVM system, I would suppose, where we use stacked GMM component means as the features for SVM classification. We know we can get good performance when there is plenty of speech available. In this work we're looking at the importance of matching the development datasets to the evaluation conditions for each of the individual components.
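As an illustrative aside: a minimal sketch of how stacked GMM component means are typically turned into an SVM supervector. The function name and the square-root weight and covariance scaling are assumptions for illustration, not necessarily the exact configuration used in this work.

```python
import numpy as np

def gmm_mean_supervector(adapted_means, ubm_weights, ubm_covars):
    """Stack MAP-adapted component means into one supervector.

    adapted_means: (C, D) adapted means of the C mixture components
    ubm_weights:   (C,)   UBM mixture weights
    ubm_covars:    (C, D) UBM diagonal covariances
    The sqrt(weight)/sqrt(covariance) scaling is one common choice of
    linear-kernel normalisation; the talk does not specify its exact form.
    """
    scale = np.sqrt(ubm_weights)[:, None] / np.sqrt(ubm_covars)  # (C, D)
    return (scale * adapted_means).reshape(-1)                   # (C*D,)
```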
0:03:43 Let's take a look at a flow diagram of the system. Basically we have three main datasets that go into development. First of all, we want to train a transform matrix for session compensation, particularly NAP, so we have a transform training dataset. We also have a background dataset, which provides the negative information during SVM training. And lastly we have a score normalisation dataset, should we choose to apply score normalisation.
0:04:15 The approach for this study is to start from a baseline SVM system, that's one without score normalisation and without session compensation, and build onto that progressively, looking at how each of the additional components is affected by the duration of speech. So these three sets, as I mentioned, are the background dataset, the transform training dataset for session compensation, and lastly the score normalisation dataset.
0:04:42 Maybe a quick look at the system we're working with. It's a GMM-SVM system with a 512-component UBM and 12-dimensional MFCCs with appended deltas. Impostor data was taken from the NIST SRE'04 corpus, and we used this both for the background dataset and for T-norm score normalisation. For NAP we remove only the dimensions of greatest session variation, and the Z-norm data was drawn from a separate SRE corpus.
0:05:12 The evaluations we performed here are from the NIST 2008 SRE corpora, particularly the short2-short3 condition, which usually has two and a half minutes of conversational speech per utterance. The way we introduce reduced durations is via two focus conditions: the full-short condition and the short-short condition. For the full-short condition we leave the training segment as is, full length, and we progressively truncate the test utterance to the desired duration. In the short-short case we truncate both train and test to the same duration, so there is essentially no duration mismatch in this evaluation.
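As an illustrative aside: one way the truncation described here might be implemented on frame-level features, keeping only the first N seconds of detected speech. The helper name and the VAD input are assumptions for illustration only.

```python
import numpy as np

def truncate_utterance(features, vad_labels, target_seconds, frame_rate=100):
    """Keep only the first `target_seconds` of active speech.

    features:   (T, D) frame-level features (e.g. MFCCs)
    vad_labels: (T,) boolean speech/non-speech decisions per frame
    frame_rate: frames per second (100 for a 10 ms frame shift)
    """
    target_frames = int(target_seconds * frame_rate)
    speech_idx = np.flatnonzero(vad_labels)[:target_frames]
    return features[speech_idx]
```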
0:05:53 So let's look at the baseline SVM performance. I'm not going to go into particular detail here; we'll do that later. This is just to see how it compares to the GMM, as a point of reference for what we will see. Here we're using a baseline and what we're terming a state-of-the-art system, which is now not so true with the i-vector work coming out. We're looking at the baseline and state-of-the-art versions of both the GMM and SVM systems, all of which were developed using the full two and a half minutes of speech in training and test, so we're not explicitly dealing with the reduced durations in this figure.
0:06:36 The first thing we notice here, on the solid lines for the baseline systems, is that the baseline SVM actually gives us better performance than the GMM baseline, despite the fact that the GMM baseline has session compensation and score normalisation, which you might not expect. But as we reduce the duration of speech, the SVM quickly deteriorates in performance compared to the GMM system. It's not quite as noticeable for the state-of-the-art systems, but the GMM stays in front of the SVM the whole way.
0:07:15 Now if we look at the short-short conditions, where both train and test have been reduced, we actually see that the SVM baselines level out on the shorter durations, once we go below the eighty-second mark. Having developed the system on the full two and a half minutes of speech might be the reason for this, but we've got to look into that. In the case of the GMM system, however, at less than ten seconds we actually see the baseline jump in front of the supposedly better configuration.
0:07:50 So there are some significant differences and issues we need to look into here, and hopefully the development datasets that we look at will help us out with that.
0:08:00 Let's start with the background dataset. Here we're going to look at the SVM system and how changing the speech duration in the background dataset affects performance, without score normalisation and without session compensation.
0:08:15 As we know, the background dataset gives us the negative information in SVM training; we generally have many more negative examples than positive examples in the NIST SREs, and we previously found that the choice of this dataset greatly affects model quality. The real question that comes up for the SVMs here is how we select this dataset under mismatched train and test durations: should we be matching the duration to the training utterance, to the test utterance, or to the shorter of the two?
0:08:48 So there are a few plots here, three slides to present. Firstly we've got the short-short condition, with matched training and testing durations, and it's quite obvious that it's better to match the background to the evaluation conditions here. In the full-short case, that's full training and short testing, we actually see it's better to match the background dataset to the test, the shorter test utterance.
0:09:15 In the last condition, which we have introduced, the short-full case, short training and full testing, we don't see as large a discrepancy at the shorter durations, but we're actually seeing that matching to the shorter training utterance gives us a little bit of an improvement towards the larger durations.
0:09:37 So what conclusions can we draw from this? Let's look at the equal error rate as well on this slide, to give us a bit more information, focusing particularly on the ten-second condition. The first thing we can see is that matching the background dataset to the training segment does not always maximise performance. However, if we match to the test segment, in our results we are always getting the best DCF performance. In contrast, if we want the best equal error rate, we match to the shortest utterance. So there's a bit of a choice to be made, depending on which operating point you want to optimise.
0:10:20 So in the following experiments we've chosen to use the shorter test utterance as the duration that we match the background dataset to.
0:10:31 Let's look now at session compensation: nuisance attribute projection, or NAP. It aims to identify the directions of greatest session variation and, as most of you are probably aware, the dimensions captured in the U transform matrix are projected out of the kernel space. This transform U has to be learned from a training dataset. Now, what should we be using in this transform training dataset when we've got limited test speech, or when both train and test speech are limited?
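As an illustrative aside: a minimal sketch of the NAP idea just described, training a nuisance subspace U on a transform-training set and projecting it out of each supervector. The function names, the SVD-based estimation and the within-speaker centring are assumptions for illustration; the talk does not specify the exact training recipe.

```python
import numpy as np

def train_nap(supervectors, speaker_labels, rank):
    """Estimate the NAP nuisance (session) subspace U.

    supervectors:   (N, K) one kernel-scaled supervector per session
    speaker_labels: (N,)   speaker identity of each session
    rank:           number of nuisance directions to remove
    """
    X = supervectors.copy()
    for spk in np.unique(speaker_labels):      # remove each speaker's mean,
        idx = speaker_labels == spk            # leaving within-speaker
        X[idx] -= X[idx].mean(axis=0)          # (session) variation only
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:rank].T                         # (K, rank) columns span the subspace

def apply_nap(w, U):
    """Project the nuisance directions out of a supervector: w <- (I - U U^T) w."""
    return w - U @ (U.T @ w)
```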
0:11:05 On this plot we're first looking at the full-short condition. The system here has no score normalisation, but the background dataset has been matched to the shorter test utterance in each of these cases. It's quite clear that using matched NAP training, that is, matching to the short test utterance, gives us the best performance. In fact, if we use full-length NAP training, the reference system, the one without NAP, jumps in front at the longer durations. So here we really want to match the NAP training to the shorter test duration rather than to the training utterance, and that was particularly true for the most challenging trials, the shortest durations.
0:11:52 Now let's look at the short-short case. This is an interesting one, because we actually observe that, even though we match the NAP training dataset to the ten-second duration, we're still finding that the best performance comes from the baseline system, the one without NAP. So why is this? We pointed out that full NAP training comes at a cost, quite significantly, but matched NAP just isn't jumping in front of the baseline. So NAP seems, at some point, to fail to provide benefits with limited training and testing speech.
0:12:30well he's a plot where would match than that
0:12:32training
0:12:33based on the yeah duration
0:12:35in the short short
0:12:36remember this is short short condition whereas
0:12:39for sure we actually
0:12:40got more
0:12:41a benefit out of that
0:12:43well actually see that
0:12:45just below forty second mark a nasty
0:12:47uh is where the reference system jobs in front
0:12:50i compensated
0:12:53 So then, why is this happening? Let's look at the variability in the kernel space. It appears NAP wasn't quite robust to limited training and testing speech. In the context of JFA-based systems, the session subspace variation was found to increase as the length of the training and testing utterances is reduced, so we're going to see if that's the same in the SVM kernel space.
0:13:18so we're going to say that's assigned times in the svm kernel
0:13:25on the slide we have a table with um number of durations
0:13:29will be short short
0:13:30uh draw condition
0:13:32and we
0:13:33also got a
0:13:34top i reference on that rare
0:13:36relevance factor all night
0:13:38uh and we're
0:13:39presenting the total variability
0:13:42uh in the
0:13:44they get space and session space
0:13:46um
0:13:47oh the svm kernel
0:13:49and we actually say that
0:13:50in contrast to what was observed which i pi
0:13:53we're getting a reduction in both of these bases as duration is
0:13:57great
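As an illustrative aside: a minimal sketch of how the speaker and session variation reported in such a table might be measured from background supervectors with multiple sessions per speaker. The function name and the between/within variance decomposition are assumptions for illustration, not necessarily the exact measures used in this work.

```python
import numpy as np

def kernel_space_variation(supervectors, speaker_labels):
    """Split total variation in the SVM kernel space into speaker and session parts.

    supervectors:   (N, K) one kernel-scaled supervector per session
    speaker_labels: (N,)   speaker identity of each session
    Returns (speaker_variation, session_variation) as average total variances.
    """
    grand_mean = supervectors.mean(axis=0)
    speaker_var, session_var = 0.0, 0.0
    for spk in np.unique(speaker_labels):
        s = supervectors[speaker_labels == spk]
        spk_mean = s.mean(axis=0)
        speaker_var += len(s) * np.sum((spk_mean - grand_mean) ** 2)  # between-speaker
        session_var += np.sum((s - spk_mean) ** 2)                    # within-speaker
    n = len(supervectors)
    return speaker_var / n, session_var / n
```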
0:13:58 Now, why is this the case? What is the difference here? What we did was to take an inconsequential tau, close to zero, so that the supervectors have more room to move. We then find that we do, in fact, agree with the JFA observations, in that we get a greater magnitude of variation in each of these spaces when we change the relevance factor to something close to zero. So here we can see the MAP adaptation relevance factor has a significant influence on the observable variation in the SVM kernel space; that's just something to be aware of.
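As an illustrative aside: a minimal sketch of relevance MAP mean adaptation, showing why the relevance factor tau limits how far the adapted means, and hence the supervectors, can move when only a few frames are observed. The function name and the default tau of 8 follow the value mentioned above; the rest is a standard formulation rather than the exact implementation used here.

```python
import numpy as np

def relevance_map_means(frames, posteriors, ubm_means, tau=8.0):
    """Relevance MAP adaptation of the UBM component means.

    frames:     (T, D) feature frames
    posteriors: (T, C) component occupation probabilities
    ubm_means:  (C, D) prior (UBM) means
    tau:        relevance factor; alpha = n_c / (n_c + tau) approaches 0 when a
                component sees few frames, so short utterances barely move the means.
    """
    n = posteriors.sum(axis=0)                        # (C,)  soft counts
    f = posteriors.T @ frames                         # (C, D) first-order statistics
    alpha = n / (n + tau)                             # (C,)  adaptation weights
    ml_means = f / np.maximum(n, 1e-10)[:, None]      # per-component ML estimate
    return alpha[:, None] * ml_means + (1 - alpha)[:, None] * ubm_means
```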
0:14:37 Now, what's interesting is that, irrespective of the tau we use, we're getting a very similar session-to-speaker variation ratio. So the session variation that's coming out is more dominant as the duration is reduced, and of course this is why speaker verification is more difficult with shorter speech segments.
0:15:01 So why, then, if we're getting more session variation, is NAP struggling to estimate it as we reduce the duration?
0:15:10 Let's look at this figure. We have the magnitude of session variability and speaker variability in the top one hundred eigenvectors estimated by NAP, for durations of eighty seconds and ten seconds. The solid lines are eighty seconds, the other lines ten seconds, and session variability is the black line. The first thing we notice is that when we have longer durations of speech, the slope of the session variation is greater, so more of the session variation can be represented in a lower number of dimensions. Whereas as the duration reduces, we're flattening out, becoming a bit more isotropic in our session variation. In contrast, the speaker variation slope is actually quite similar. This aligns with the table we just saw, where the session variation was becoming more dominant.
0:16:12 Now, NAP was developed on the assumption that the majority of session variation lies in a low-dimensional space. So it's our understanding that, because of the more isotropic session variation that comes about with these reduced utterances, that assumption no longer holds, and this is why NAP is unable to offer a benefit in the short-short condition. So how do we overcome this problem? We're still working on that.
0:16:45 Next we move on to score normalisation. I won't cover it in a lot of detail, because everyone knows what score normalisation is; I think the last few presentations covered it. It basically corrects statistical variation in classification scores, and attempts to scale the scores from a given trial, or combination of trials, using the Z-norm and T-norm, the train-centric and test-centric approaches respectively. And again we're using an impostor cohort, something we need to select the right way.
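As an illustrative aside: a minimal sketch of Z-norm and T-norm as just described, estimating impostor score statistics from a cohort and normalising a raw trial score. The function names are assumptions, and applying the two steps in simple sequence is a simplification of full ZT-norm, shown only to illustrate the idea.

```python
import numpy as np

def znorm_params(model_scorer, impostor_segments):
    """Z-norm: score a target model against an impostor segment cohort."""
    scores = np.array([model_scorer(seg) for seg in impostor_segments])
    return scores.mean(), scores.std()

def tnorm_params(segment, impostor_models):
    """T-norm: score a test segment against an impostor model cohort."""
    scores = np.array([m(segment) for m in impostor_models])
    return scores.mean(), scores.std()

def zt_normalise(raw_score, z_mu, z_sigma, t_mu, t_sigma):
    """Apply Z-norm (model-centric) then T-norm (test-centric); a simplified sketch."""
    z = (raw_score - z_mu) / z_sigma
    return (z - t_mu) / t_sigma
```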
0:17:16 Now, typically, score normalisation cohorts should match the evaluation conditions. In the context of the SVMs, we want to know how important it is to match these conditions, and how much score normalisation actually benefits us when we have limited speech.
0:17:34 On this slide we've got the full-short condition on the second row and the short-short condition down the bottom, looking at the ten-second condition in particular. We have three different cohort selection methods: 'none', in which no scores are normalised; 'full', which means both the Z-norm and T-norm cohorts use the full two and a half minutes of speech; and then 'matched'. In the case of the full-ten-second condition, 'matched' simply means that the cohort data is truncated to ten seconds to match the test side, whereas in the ten-second-ten-second case, both cohorts are truncated to that duration.
0:18:12 It's quite obvious that the full cohort is going to give us the worst performance, as we can see, and that the matched cohorts offer the best. So quite elementary, but the interesting observation here is that the relative performance gain from applying score normalisation seems quite minimal. So the question is: at what point is it worth the effort of choosing a score normalisation set to try and help performance?
0:18:45 So to try and help answer that question, we looked at the relative gain in min DCF that score normalisation provides as we reduce the duration of speech. We see that from the full duration down to eighty seconds we're getting around an eight to ten percent gain, which is quite reasonable, but at the lower durations of speech, five and ten seconds, we've got less than two percent relative gain. Is it really worth the effort of trying to choose a good normalisation set, with the risk that a normalisation set that is not chosen well will actually reduce performance? That's another question to consider.
0:19:22 So, in conclusion, we've investigated the sensitivity of the popular SVM system to reduced training and testing segments. We found the best performance came from selecting a background dataset matched to the test or to the shortest utterance duration, depending on whether you want to optimise the DCF or the equal error rate. NAP transforms trained on data matching the shortest utterance duration gave the best performance, and score normalisation cohorts matched to the evaluation conditions were also the best. We highlighted an issue with NAP when dealing with limited speech, and this is due to the session variability becoming more isotropic as the speech duration is reduced. And score normalisation provided little benefit in the shorter-duration conditions. Thank you.
0:20:17 Thank you for that, a systematic investigation into the effects of duration. As far as I can see, Patrick gave a presentation this morning which I'm not sure you have had a chance to take into account; we won't go over it here. But I think Patrick's observations this morning give a nice explanation of what you see.
0:20:50 So the short explanation is: if you're using relevance MAP, then you are introducing speaker-dependent within-speaker variability; that's what, as I recall, Patrick described in his presentation. So, do you agree with me that that explains, or perhaps explains, what you see?
0:21:24 I'll have to look over that presentation, to be honest.
0:21:29 So, any others? Any other questions?
0:21:37 A question about how you estimate your U matrix for NAP: you do relevance MAP, and then maybe PCA on that information?
0:21:49 Sorry, what were you saying? I didn't quite hear that, sorry.
0:21:52 My question is regarding how you're learning the U matrix that you use to project away the nuisance directions. So you're doing relevance MAP, MAP adaptation, and are you then computing PCA on your centred supervectors, or something else?
0:22:13 I know that to estimate the U matrix we are doing some kind of PCA, going to a lower dimensionality for computational reasons, but then we go back to the original space.
0:22:24 No, my question is about what is actually happening when you learn that U matrix. If you're just doing a regular PCA, which computes a low-dimensional approximation of your data, if you put all your background supervectors together and do a low-rank approximation of that matrix, which is basically what PCA does, then you're not taking into account the counts. When you do your factor analysis, you are using the counts somehow to weight the portions of information in different parts of the supervector.
0:23:04 So my question is mostly: are you somehow incorporating the information that, when you have a lot of Gaussians and very few data points, not all the Gaussians get assigned points, and when you then train your subspace, your subspace does not know that? Maybe that accounts for a lot of these observations you are seeing.
0:23:28 That's an interesting point, actually. I don't believe we're explicitly taking into account the fact that some Gaussians might miss out on observations. And yes, I can understand what you're saying, that it might have an effect on the estimation of the U matrix.
0:24:01 I'm a little unsure about the... I mean, I'm all for these kinds of studies, because you want to see what works best, but you also want to understand why it works best. So what you've said, from a magnitude standpoint, is that you're doing this process of MAP to get Gaussians, and then you're comparing the means of some training Gaussians you got with MAP against some test Gaussians you got by MAP from the UBM, and if it's not the same amount of data, things go wrong, basically. And so the solution you're applying is to simply make it the same length.
0:24:44 It would seem like... well, you did that study without normalisation, okay? Of course, all kinds of normalisation are there, as you said, to deal with differences, just these sorts of differences. I'm wondering whether, by doing it without normalisation, you were actually creating the worst possible condition, one that wouldn't otherwise be fixed, and your solution ended up being to discard data. So the first question, I guess, is: when you truncated the training samples, did you literally just discard the rest of the data, or did you create additional short training utterances out of it?
0:25:26 We would just discard that data.
0:25:28 Okay. So one obvious thing is, if you take a thirty-second utterance and truncate it to ten seconds, it would be wasteful not to use the other twenty seconds as two more ten-second utterances.
0:25:38 But besides that observation, I'm worried about the... if you had used normalisation, you might have fixed the problem to begin with.
0:25:50 We did actually run these experiments with score normalisation as well, and we found basically similar trends, but we wanted to get back to a very basic system, just to help, I guess you'd say, the reader's understanding and the flow of the analysis.
0:26:06 I'm hearing in many papers, especially today, a strong desire on everyone's part to find a way to do things without normalisation, as if somehow normalisation were a bad thing. When it seems to me that normalisation is, beyond the obvious thing that you have to model the speech, almost the only other thing, in a very high-level sense. After all, we're doing some kind of hypothesis test, verification, and that inherently requires knowing how to set a threshold, which requires some kind of normalisation. And to the extent that we try to get away from that, we're tying our hands behind our back. I mean, it's good to look for methods that are inherently better, but I guess I would say, you know what, we should still do normalisation; it can never hurt, done properly.
0:27:18 Oh, what was... well, my claim was that it's good to look for better models, but I don't understand the desire to do away with normalisation. It seems like normalisation is at the crux of the problem, and ultimately fixes whatever else you do wrong, and it can never hurt.
0:27:53 Yes, normalisation does exactly that. But what we are unhappy with is that we did do something wrong, so we're trying to do that a bit better, and if we then find it's still not perfect, then I'm sure we will keep on normalising.
0:28:14 The other way to look at it is that normalisation is just another modelling stage. Extracting the MFCC features is modelling the acoustic signal, then the GMMs are modelling the MFCCs, and the i-vectors, again from this morning, model the GMM supervectors, and in the end there's a score modelling stage. So you're just adding more modelling stages. It might be nice to reduce the number of stages, but we might just go on normalising forever.
0:28:59 Can we have the next speaker.