| 0:00:13 | hello, my name is Anssi Kanervisto, and I'm here to tell you about |
|---|
| 0:00:15 | our work, an initial investigation on optimizing tandem speaker verification and countermeasure systems |
|---|
| 0:00:21 | using reinforcement learning |
|---|
| 0:00:23 | so that we are all on the same page: |
|---|
| 0:00:26 | a speaker verification system |
|---|
| 0:00:29 | verifies that the |
|---|
| 0:00:30 | claimed identity |
|---|
| 0:00:31 | and the |
|---|
| 0:00:32 | provided speech sample come from the same person. So |
|---|
| 0:00:37 | a speaker verification system takes in a claimed identity |
|---|
| 0:00:40 | and a speech sample, and if the claimed identity matches |
|---|
| 0:00:43 | the identity of the person who spoke, |
|---|
| 0:00:46 | then all is good, the system will let them pass. And then |
|---|
| 0:00:49 | likewise if somebody else |
|---|
| 0:00:52 | claims that they are someone they are not and provides a speech sample, |
|---|
| 0:00:56 | it shouldn't let them pass |
|---|
| 0:00:59 | good |
|---|
| 0:00:59 | very simple; many of you work in this field |
|---|
| 0:01:02 | of course, when it comes to security and systems like this, there are some bad |
|---|
| 0:01:05 | guys who want to break the system |
|---|
| 0:01:08 | so for example in this case |
|---|
| 0:01:10 | somebody could record |
|---|
| 0:01:11 | Tomi's speech here with a mobile phone and then later use that recorded speech |
|---|
| 0:01:16 | to claim that they are Tomi, by saying that they are Tomi and |
|---|
| 0:01:19 | playing the audio, and the system will gladly accept that. And previous work has |
|---|
| 0:01:24 | shown, in the ASVspoof 2017 |
|---|
| 0:01:27 | challenge, that if you do not protect against this, the ASV system will |
|---|
| 0:01:32 | gladly accept this kind of a trial even though it shouldn't |
|---|
| 0:01:37 | likewise |
|---|
| 0:01:38 | you could |
|---|
| 0:01:39 | use a gathered dataset, or collect data of |
|---|
| 0:01:41 | somebody speaking, |
|---|
| 0:01:42 | and then use speech synthesis or voice conversion |
|---|
| 0:01:45 | to again generate speech that sounds like Tomi and feed it to the system, and |
|---|
| 0:01:49 | it will again accept it all fine |
|---|
| 0:01:52 | and again this has been shown in previous competitions to be a problem, but |
|---|
| 0:01:56 | you can also protect against this |
|---|
| 0:01:58 | so |
|---|
| 0:01:59 | this is |
|---|
| 0:02:00 | where |
|---|
| 0:02:01 | the countermeasures come in. So a countermeasure system takes in |
|---|
| 0:02:05 | the |
|---|
| 0:02:06 | speech sample that was provided to the ASV system as well, and also |
|---|
| 0:02:11 | checks that the sample comes from a |
|---|
| 0:02:15 | live human speaker, instead of, say, a mobile phone replay, synthesized speech or voice- |
|---|
| 0:02:21 | converted speech. So |
|---|
| 0:02:23 | it checks that it is bona fide human speech |
|---|
| 0:02:26 | so for example somebody |
|---|
| 0:02:29 | has recorded somebody else's speech |
|---|
| 0:02:31 | feeds it to the system, but now it is fed to the countermeasure system as well, |
|---|
| 0:02:35 | and the countermeasure system says reject, and then the |
|---|
| 0:02:39 | attacker does not get access, so it keeps the attacker out |
|---|
| 0:02:43 | good so far. And these competitions have shown that when you train for these kinds |
|---|
| 0:02:48 | of situations, when you train to detect these replay |
|---|
| 0:02:50 | attacks or |
|---|
| 0:02:52 | synthesized speech, |
|---|
| 0:02:53 | you can detect them and all works fine |
|---|
| 0:02:57 | so |
|---|
| 0:02:59 | but one issue we had with this |
|---|
| 0:03:02 | setup is that |
|---|
| 0:03:04 | the |
|---|
| 0:03:05 | ASV system |
|---|
| 0:03:06 | and the |
|---|
| 0:03:07 | countermeasure system are trained completely independently from each other. So the ASV system has |
|---|
| 0:03:13 | its own datasets, its own losses, its own training protocols |
|---|
| 0:03:17 | and so on and so on |
|---|
| 0:03:19 | likewise, the CM system has its own datasets, its own losses, its own training protocols and its |
|---|
| 0:03:23 | own network architecture and so on and so on |
|---|
| 0:03:27 | these are trained separately, but then they are evaluated together, so they are evaluated as |
|---|
| 0:03:31 | one bigger system, |
|---|
| 0:03:33 | so |
|---|
| 0:03:34 | where you have a completely different |
|---|
| 0:03:36 | evaluation metric, and these two systems have never been trained to actually minimize this |
|---|
| 0:03:42 | evaluation metric; they have been trained on their own tasks |
|---|
| 0:03:46 | so |
|---|
| 0:03:47 | we had this |
|---|
| 0:03:48 | coffee-room idea |
|---|
| 0:03:50 | where, |
|---|
| 0:03:52 | what if when we have this kind of |
|---|
| 0:03:55 | bigger whole system |
|---|
| 0:03:57 | what if we train the |
|---|
| 0:04:00 | ASV and the CM system on the evaluation metric directly. So |
|---|
| 0:04:05 | maybe, on top of the already existing training they had, we also optimize them |
|---|
| 0:04:11 | to minimize or maximize the evaluation metric for better results |
|---|
| 0:04:16 | and |
|---|
| 0:04:17 | however |
|---|
| 0:04:18 | sadly |
|---|
| 0:04:19 | it is not very straightforward |
|---|
| 0:04:22 | so |
|---|
| 0:04:23 | we have this system where we feed the speech to the ASV and CM systems |
|---|
| 0:04:27 | they produce, |
|---|
| 0:04:28 | both of them produce an accept or reject label, so either accept or |
|---|
| 0:04:33 | reject, and these are then fed to the evaluation metric, which usually computes |
|---|
| 0:04:38 | the error rates, so the false rejection rate and the false acceptance rate, |
|---|
| 0:04:43 | and |
|---|
| 0:04:44 | these are then used in various ways, depending on the evaluation metric, |
|---|
| 0:04:48 | to kind of come up with one number to show how good the system |
|---|
| 0:04:51 | as a whole is |
|---|
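(A quick aside: the error rates mentioned above can be computed from the hard accept/reject labels as below. This is a minimal sketch with made-up decisions and labels, not data from the paper.)

```python
import numpy as np

# Made-up hard decisions (1 = accept) and ground truth (1 = target trial).
decisions = np.array([1, 1, 0, 0, 1, 0, 1, 0])
is_target = np.array([1, 0, 0, 1, 1, 0, 0, 1])

# False acceptance rate: fraction of non-target trials that were accepted.
far = np.mean(decisions[is_target == 0] == 1)
# False rejection rate: fraction of target trials that were rejected.
frr = np.mean(decisions[is_target == 1] == 0)

print(far, frr)  # 0.5 0.5
```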
| 0:04:54 | however |
|---|
| 0:04:55 | if we assume that these two systems are differentiable, so they |
|---|
| 0:04:59 | are, say, neural networks, which is quite common these days, |
|---|
| 0:05:04 | if we wanted to |
|---|
| 0:05:05 | minimize the evaluation metric, we would need to compute the gradient of the evaluation metric with |
|---|
| 0:05:10 | respect to the |
|---|
| 0:05:11 | two systems we have, or their parameters |
|---|
| 0:05:14 | sadly |
|---|
| 0:05:15 | we cannot compute the gradient over these hard decisions of accept or reject, |
|---|
| 0:05:21 | and |
|---|
| 0:05:22 | but these are all required for the error rates, |
|---|
| 0:05:25 | for the whole evaluation metric. So, for example, the tandem |
|---|
| 0:05:28 | decision cost function which we will be using later |
|---|
| 0:05:32 | requires these two error rates |
|---|
| 0:05:34 | and it is going to weight them in different ways, but we cannot, from that, |
|---|
| 0:05:38 | compute the gradient all the way back to the systems, |
|---|
| 0:05:42 | because this hard decision is not a differentiable operation and thus we cannot calculate the |
|---|
| 0:05:48 | gradient |
|---|
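(To illustrate why: below is a toy weighted cost in the spirit of the t-DCF, with made-up scores and cost weights, not the official formula. Because it is built from hard decisions, the cost is piecewise constant, so its gradient with respect to the threshold, or the model parameters behind the scores, is zero almost everywhere.)

```python
import numpy as np

# Illustrative weighted cost built from hard decisions (made-up weights,
# NOT the official t-DCF formula).
def weighted_cost(scores, is_target, threshold, c_miss=1.0, c_fa=10.0):
    decisions = scores > threshold            # hard accept/reject
    frr = np.mean(~decisions[is_target])      # false rejection rate
    far = np.mean(decisions[~is_target])      # false acceptance rate
    return c_miss * frr + c_fa * far

scores = np.array([0.9, 0.2, 0.6, 0.4])       # made-up trial scores
is_target = np.array([True, False, True, False])

# Nearby thresholds give identical costs, then the cost suddenly jumps:
# the function is a staircase, so its gradient is zero or undefined.
print(weighted_cost(scores, is_target, 0.30))  # 5.0
print(weighted_cost(scores, is_target, 0.35))  # 5.0
print(weighted_cost(scores, is_target, 0.45))  # 0.0
```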
| 0:05:49 | so |
|---|
| 0:05:54 | in a |
|---|
| 0:05:55 | related topic, |
|---|
| 0:05:57 | other work has suggested soft versions of error metrics |
|---|
| 0:06:02 | so for example |
|---|
| 0:06:04 | by softening the F1 score or the area-under-curve loss, you can come |
|---|
| 0:06:10 | up with a differentiable version of the |
|---|
| 0:06:13 | score metric |
|---|
| 0:06:13 | and then you can do this computation. So, |
|---|
| 0:06:17 | by softening, it means these hard decisions are kind of softened so that |
|---|
| 0:06:20 | you have a function you can actually take the derivative of, |
|---|
| 0:06:24 | and then you can compute this gradient. However, |
|---|
| 0:06:27 | the tandem decision cost function we have here does not have such a soft version |
|---|
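(For intuition, here is what such softening looks like for a simpler metric, the F1 score. The probabilities and labels below are made up; the point is that replacing hard 0/1 decisions with probabilities yields a differentiable function of the model outputs.)

```python
import numpy as np

# "Soft" F1 score: use predicted probabilities instead of hard 0/1
# decisions, so the metric is differentiable in the model outputs.
def soft_f1(probs, labels):
    tp = np.sum(probs * labels)           # soft true positives
    fp = np.sum(probs * (1 - labels))     # soft false positives
    fn = np.sum((1 - probs) * labels)     # soft false negatives
    return 2 * tp / (2 * tp + fp + fn)

labels = np.array([1.0, 0.0, 1.0, 0.0])   # made-up ground truth
probs = np.array([0.9, 0.1, 0.8, 0.3])    # made-up model probabilities
print(round(soft_f1(probs, labels), 3))   # 0.829
```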
| 0:06:31 | so |
|---|
| 0:06:32 | instead |
|---|
| 0:06:34 | we looked into reinforcement learning |
|---|
| 0:06:37 | so |
|---|
| 0:06:38 | in reinforcement learning, the simplified setup is like this: |
|---|
| 0:06:43 | the computer agent |
|---|
| 0:06:45 | sees an image or, more generally, is |
|---|
| 0:06:47 | provided some information from the game or the environment, |
|---|
| 0:06:51 | the agent chooses an action a |
|---|
| 0:06:54 | and |
|---|
| 0:06:55 | the action is then executed on the environment |
|---|
| 0:06:58 | and depending on whether the outcome of the action is good or not, the agent receives |
|---|
| 0:07:03 | some reward. And |
|---|
| 0:07:05 | in reinforcement learning |
|---|
| 0:07:06 | the goal of this whole setup is to |
|---|
| 0:07:11 | get as much reward as possible, so to modify the agent so that it gets |
|---|
| 0:07:15 | as much reward as possible |
|---|
| 0:07:17 | and one way |
|---|
| 0:07:18 | to do this is |
|---|
| 0:07:20 | through the gradient |
|---|
| 0:07:22 | well so |
|---|
| 0:07:23 | we could take the gradient |
|---|
| 0:07:25 | of the expected reward, so the reward averaged |
|---|
| 0:07:29 | over all the different situations in this setup, |
|---|
| 0:07:32 | and take the gradient of that with respect to the policy, with respect to the |
|---|
| 0:07:37 | agent. So if we can do this, we can then, |
|---|
| 0:07:39 | of course update the agent |
|---|
| 0:07:41 | to move it |
|---|
| 0:07:43 | towards the direction that increases the amount of reward |
|---|
| 0:07:46 | however |
|---|
| 0:07:47 | we here also have the problem that you cannot really |
|---|
| 0:07:51 | differentiate the decision part, where we |
|---|
| 0:07:53 | choose |
|---|
| 0:07:54 | one specific action of many |
|---|
| 0:07:56 | and |
|---|
| 0:07:58 | and execute that on the environment, so we cannot differentiate that and we cannot |
|---|
| 0:08:01 | compute the gradient |
|---|
| 0:08:04 | however, there is a thing called the policy gradient, |
|---|
| 0:08:07 | which kind of estimates this gradient. |
|---|
| 0:08:12 | It is kind of an equation where, |
|---|
| 0:08:15 | instead of calculating the gradient |
|---|
| 0:08:17 | of the reward directly, it computes the gradient of the |
|---|
| 0:08:21 | log probabilities of the |
|---|
| 0:08:23 | selected actions |
|---|
| 0:08:25 | and weights them by the reward |
|---|
| 0:08:28 | we got |
|---|
| 0:08:29 | and this has been shown in reinforcement learning |
|---|
| 0:08:32 | to |
|---|
| 0:08:33 | be quite effective. It has also been shown that you can replace the |
|---|
| 0:08:38 | reward with any function, and then |
|---|
| 0:08:43 | by running this you will also |
|---|
| 0:08:45 | get the correct gradient, so you can |
|---|
| 0:08:47 | get the same correct results with enough samples |
|---|
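(This claim can be checked in a toy example: a two-action policy with a single logit theta. Everything here is illustrative. Sampling the action is non-differentiable, yet averaging reward-weighted gradients of the log-probability recovers the true gradient of the expected reward.)

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.0                              # single policy parameter (logit)
p1 = 1.0 / (1.0 + np.exp(-theta))        # P(action = 1) = 0.5

samples = []
for _ in range(50000):
    a = int(rng.random() < p1)           # sample an action (hard decision)
    reward = 1.0 if a == 1 else 0.0      # only action 1 is rewarded
    glogp = a - p1                       # d log pi(a | theta) / d theta
    samples.append(glogp * reward)       # reward-weighted log-prob gradient

estimate = np.mean(samples)
# True gradient: d/dtheta E[reward] = d/dtheta P(a=1) = p1 * (1 - p1) = 0.25
print(abs(estimate - 0.25) < 0.02)  # True
```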
| 0:08:52 | so |
|---|
| 0:08:52 | going back to our tandem optimization |
|---|
| 0:08:55 | where we had the |
|---|
| 0:08:56 | same problem of |
|---|
| 0:08:57 | hard decisions, which we cannot |
|---|
| 0:09:00 | differentiate; we just apply this |
|---|
| 0:09:03 | policy gradient theorem here, |
|---|
| 0:09:07 | where the |
|---|
| 0:09:08 | equation is more or less the same, |
|---|
| 0:09:10 | just with the terms having a different meaning |
|---|
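(A rough sketch of how this could look for the tandem case. Everything here is illustrative: made-up trial scores, a t-DCF-like weighted cost, and a single score-scaling parameter w standing in for the systems' parameters. The policy-gradient estimate points downhill for the expected cost even though the cost itself is built from non-differentiable hard decisions.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up trial scores (positive = target-like) and ground-truth labels.
scores = np.array([1.0, 0.8, 1.2, -1.0, -0.7, -1.2])
is_target = np.array([True, True, True, False, False, False])

def estimate_gradient(w, n_samples=20000):
    """Policy-gradient estimate of d E[cost] / d w."""
    p_accept = 1.0 / (1.0 + np.exp(-w * scores))      # accept probabilities
    estimates = []
    for _ in range(n_samples):
        accept = rng.random(len(scores)) < p_accept   # sampled hard decisions
        frr = np.mean(~accept[is_target])             # false rejection rate
        far = np.mean(accept[~is_target])             # false acceptance rate
        cost = frr + 10.0 * far                       # t-DCF-like weighted cost
        glogp = np.sum((accept - p_accept) * scores)  # d log P(decisions) / d w
        estimates.append(cost * glogp)
    return np.mean(estimates)

# A negative estimate means increasing w is expected to lower the cost.
grad = estimate_gradient(w=0.0)
print(grad < 0)  # True
```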
| 0:09:17 | so |
|---|
| 0:09:18 | we then proceeded to test how well this works |
|---|
| 0:09:22 | with a rather trivial setup. So |
|---|
| 0:09:25 | we have two datasets |
|---|
| 0:09:26 | the VoxCeleb1 |
|---|
| 0:09:28 | as |
|---|
| 0:09:28 | the dataset, and more specifically the speaker verification |
|---|
| 0:09:31 | part of it |
|---|
| 0:09:32 | and then we have the ASVspoof 2019 dataset |
|---|
| 0:09:35 | for the synthesized speech and the countermeasure trials and labels. And |
|---|
| 0:09:41 | for the ASV task we extract the x-vectors using the pretrained |
|---|
| 0:09:46 | Kaldi models, |
|---|
| 0:09:46 | and for the ASVspoof data we extract LFCC features, |
|---|
| 0:09:50 | and these are fixed, so these are not being trained in this |
|---|
| 0:09:53 | setup |
|---|
| 0:09:55 | we then train the ASV system and the CM system separately, as normally |
|---|
| 0:10:00 | done, using these two datasets, |
|---|
| 0:10:03 | and then evaluate them together using the t-DCF cost function as |
|---|
| 0:10:10 | presented in the ASVspoof 2019 competition |
|---|
| 0:10:13 | then we take the two pretrained systems and perform tandem optimization in the manner previously |
|---|
| 0:10:19 | shown, and then finally we evaluate the results |
|---|
| 0:10:23 | and compare them with the pretrained results and see if it actually helps |
|---|
| 0:10:28 | so |
|---|
| 0:10:29 | somewhat surprisingly, |
|---|
| 0:10:30 | the tandem optimization helps. In a |
|---|
| 0:10:34 | very short nutshell: |
|---|
| 0:10:35 | so |
|---|
| 0:10:37 | one way we see this |
|---|
| 0:10:39 | is by looking at this learning curve, where on the x-axis you have |
|---|
| 0:10:44 | the number of updates |
|---|
| 0:10:47 | you do, so you can compare this to the curves where you have the loss and |
|---|
| 0:10:50 | the number of epochs, |
|---|
| 0:10:51 | and on the y-axis |
|---|
| 0:10:53 | you will have |
|---|
| 0:10:54 | the relative change on the evaluation set |
|---|
| 0:10:57 | compared to the pretrained system. So if it is zero percent, |
|---|
| 0:11:01 | it means |
|---|
| 0:11:02 | the metric did not change since the pretrained system |
|---|
| 0:11:07 | so |
|---|
| 0:11:08 | the main metric we wanted to minimize |
|---|
| 0:11:11 | is the minimum |
|---|
| 0:11:12 | normalized tandem decision cost function, |
|---|
| 0:11:15 | and this indeed |
|---|
| 0:11:17 | decreases over time as we do this tandem optimization. So |
|---|
| 0:11:21 | from zero percent change it went to minus twenty-five percent change, so yes, it |
|---|
| 0:11:26 | improved |
|---|
| 0:11:28 | and |
|---|
| 0:11:30 | then we also studied |
|---|
| 0:11:31 | how the |
|---|
| 0:11:33 | individual systems changed over the training |
|---|
| 0:11:36 | so |
|---|
| 0:11:37 | for example, the countermeasure's equal error rate in the |
|---|
| 0:11:42 | countermeasure's own task of detecting if it is spoofed or not |
|---|
| 0:11:45 | it also improved by around ten percent |
|---|
| 0:11:47 | in this task, but interestingly the ASV |
|---|
| 0:11:51 | EER |
|---|
| 0:11:52 | increased over time, and we hypothesize that |
|---|
| 0:11:56 | this is because, |
|---|
| 0:11:58 | when |
|---|
| 0:11:58 | we evaluated |
|---|
| 0:11:59 | the ASV system on the countermeasure task and the countermeasure system on the |
|---|
| 0:12:04 | ASV task, so the swapped tasks, we noticed that after this tandem optimization |
|---|
| 0:12:10 | the |
|---|
| 0:12:12 | two have improved in each |
|---|
| 0:12:13 | other's task. So the ASV was better in the countermeasure |
|---|
| 0:12:18 | task, |
|---|
| 0:12:18 | and the countermeasure was also slightly better in the speaker verification task. So |
|---|
| 0:12:23 | we hypothesize that this kind of outweighed the speaker verification system's |
|---|
| 0:12:31 | normal task of detecting the correct speaker, and it started to kind of |
|---|
| 0:12:35 | detect the countermeasure's spoofed samples instead |
|---|
| 0:12:39 | so |
|---|
| 0:12:40 | we also compared this to a simple baseline |
|---|
| 0:12:43 | where, instead of using this tandem optimization, we just independently |
|---|
| 0:12:49 | continued training the two systems, using the same samples as in the policy gradient method |
|---|
| 0:12:54 | so |
|---|
| 0:12:56 | basically we just use the same samples: we use the ASV samples to |
|---|
| 0:13:01 | continue updating the ASV system, and then we use the countermeasure |
|---|
| 0:13:06 | samples to update the countermeasure system independently, completely separate from each other. And |
|---|
| 0:13:11 | we see the same ASV behaviour here, but |
|---|
| 0:13:15 | the countermeasure system's equal error rate just exploded in the beginning and then slowly |
|---|
| 0:13:20 | creeps back down. And in the end, averaged over multiple runs, |
|---|
| 0:13:26 | we see that the |
|---|
| 0:13:27 | policy gradient method improves the results by twenty-six percent and the fine-tuning |
|---|
| 0:13:31 | improves the results by seven point eight four percent, |
|---|
| 0:13:35 | but note that the results of the fine-tuning |
|---|
| 0:13:39 | have a lot higher variance |
|---|
| 0:13:42 | than in the |
|---|
| 0:13:44 | policy gradient |
|---|
| 0:13:45 | version |
|---|
| 0:13:48 | but |
|---|
| 0:13:49 | these results are all very positive, but as said, this was a very initial investigation, and |
|---|
| 0:13:55 | i highly recommend that you |
|---|
| 0:13:57 | check out the paper for more results and figures |
|---|
| 0:14:00 | and that's all |
|---|
| 0:14:02 | so |
|---|
| 0:14:02 | thank you for listening. Be sure to check out the paper and the code |
|---|
| 0:14:05 | behind that link |
|---|