Speech Transcript - Black-box Attacks on Automatic Speaker Verification using Feedback-controlled Voice Conversion

0:00:15	hello everyone, my name's Xiaohai Tian
0:00:18	i come from the National University of Singapore
0:00:21	i will present our recent work about
0:00:24	black-box attacks
0:00:26	on automatic speaker verification using feedback-controlled voice conversion
0:00:31	this work has been done with Rohan Kumar Das
0:00:33	and Haizhou Li
0:00:37	i divide my presentation into four parts
0:00:39	the introduction
0:00:41	related works and proposed method
0:00:43	experiments and results
0:00:45	and finally go to the conclusion
0:00:48	let's start with the introduction
0:00:51	with the development of automatic speaker verification
0:00:56	the speaker verification system has been used in many applications
0:01:00	such as banking
0:01:02	matching authentication
0:01:04	and i have been c applications
0:01:07	however, the ASV system also faces threats from spoofing attacks
0:01:13	it is found that
0:01:14	the ASV system is vulnerable to various kinds of spoofing attacks
0:01:20	to handle this problem
0:01:21	different countermeasures are developed for spoofing attacks
0:01:25	to enhance the security of the speaker verification system
0:01:30	in practice
0:01:31	a spoofing system
0:01:37	can be realised with different techniques
0:01:42	for example impersonation
0:01:45	playback and synthetic speech
0:01:48	to generate synthetic speech
0:01:51	different
0:01:52	models can be used, for example TTS
0:01:55	and voice conversion
0:01:57	in this work
0:01:59	we focus on
0:01:59	the attacks
0:02:01	generated by voice conversion systems
0:02:05	from an attacker's point of view
0:02:08	it is possible to generate a kind of adversarial attack
0:02:11	with feedback from the ASV
0:02:14	system
0:02:16	as an impostor, the attacker needs to have some knowledge of the target ASV system
0:02:22	in order to increase the spoofing threat
0:02:25	as explained here, this is an example from image processing
0:02:30	given an image, usually
0:02:32	the system recognises it as a panda
0:02:35	however, with the adversarial noise added, the image is misclassified
0:02:41	by the system
0:02:42	as a gibbon
0:02:43	this shows the potential threat associated with adversarial attacks
0:02:48	in this work we would like to study the threat of adversarial
0:02:52	attacks
0:02:53	on a speaker verification system
0:02:55	this will help to build a more robust ASV system
0:02:59	in the future
0:03:02	let's look at the
0:03:04	spoofing problem from the attacker's perspective
0:03:07	in the non-adversarial attack scenario
0:03:12	the attacker can use an existing spoofing system to generate
0:03:15	the spoofed samples
0:03:18	to attack the ASV system
0:03:21	on the other hand
0:03:23	in the adversarial attack scenario
0:03:26	the attacker can update the spoofing system
0:03:29	with the feedback of the ASV and generate adversarial spoofed
0:03:35	samples
0:03:36	to attack the ASV system again
0:03:41	of course
0:03:42	this kind of
0:03:43	adversarial sample
0:03:46	poses more threat
0:03:48	to the ASV system
0:03:53	with
0:03:54	different levels of
0:03:56	knowledge
0:03:59	there are three types of adversarial attacks
0:04:02	including the black-box attack
0:04:05	the grey-box attack
0:04:06	and the white-box attack
0:04:08	in the black-box attack the attacker only has
0:04:11	information on the output of the
0:04:14	ASV system
0:04:16	for the
0:04:18	grey-box attack
0:04:19	the attacker has
0:04:21	information on both the input and output of the ASV system
0:04:27	for the white-box attack
0:04:29	the attacker has
0:04:31	the full information of the ASV system
0:04:34	so the white-box attack poses the most serious threat
0:04:37	however, in real practice the attacker may not
0:04:42	have
0:04:44	as much information as above
0:04:46	so
0:04:48	the black-box attack is more
0:04:51	easy to realise
0:04:54	in reality
0:04:57	so we
0:04:58	in this case
0:04:59	focus on this form of attack
0:05:03	then we go to the related work and the proposed method
0:05:07	first we will introduce voice conversion
0:05:11	voice conversion is a
0:05:13	technique that modifies the speaker identity from a source speaker to a target speaker
0:05:18	without changing the linguistic information
0:05:22	in the
0:05:23	conventional framework
0:05:24	the conversion model is
0:05:26	trained with parallel data from the source and target speakers
0:05:31	so the conversion model will be
0:05:36	specific to a speaker pair
0:05:40	however, for the spoofing attack
0:05:43	a more
0:05:45	useful technique is non-parallel voice conversion
0:05:50	for example PPG-based voice conversion
0:05:53	the basic idea is to train a feature mapping model between the
0:05:58	speaker-independent feature and the speaker-dependent feature
0:06:02	for example
0:06:03	given a target speech, we first extract the speaker-independent PPG feature and the speaker-dependent acoustic
0:06:10	feature
0:06:12	then we use these two features to train a
0:06:16	conversion model
0:06:19	as the PPG feature
0:06:22	is assumed to be speaker
0:06:24	independent
0:06:25	that means
0:06:26	regardless of the speaker identity, as long as the speech content
0:06:30	is the same
0:06:31	the PPG does not change
0:06:35	so
0:06:36	in such a framework
0:06:38	it is easy to achieve many-to-one conversion
0:06:41	and
0:06:42	in this framework the source speech is not required during training
0:06:47	so this will be
0:06:49	easier to use for spoofing attacks
0:06:56	so let's look at the non-adversarial attack scenario
0:07:00	in the non-adversarial attack scenario
0:07:03	during training
0:07:07	the PPG and acoustic features are extracted from the target speech to train
0:07:12	the
0:07:12	conversion model
0:07:14	the model is updated
0:07:17	with a loss
0:07:18	calculated between the predicted acoustic features
0:07:21	and
0:07:23	the actual acoustic features from the target speech
0:07:28	during attacking
0:07:30	the
0:07:30	PPG is extracted from the source speech
0:07:34	then
0:07:35	we feed the source PPG into the conversion model to get the converted acoustic
0:07:39	feature
0:07:42	we use a vocoder to convert the acoustic feature
0:07:47	into the
0:07:49	converted speech samples
0:07:51	to perform
0:07:53	the attack
0:07:54	on the ASV system
0:08:00	this is a
0:08:01	typical voice conversion model
0:08:04	it is optimised for speaker similarity and quality
0:08:08	so it is not designed for attacking the ASV system
0:08:11	and may be non-optimal
0:08:13	for spoofing the ASV
0:08:17	for our proposed feedback-controlled voice conversion
0:08:23	the main difference is
0:08:24	we provide feedback from the ASV system
0:08:27	during training
0:08:31	as negative example
0:08:33	during training, for each mini-batch
0:08:35	we take the
0:08:37	target speech and extract the PPG
0:08:41	from the target speech to generate the predicted acoustic feature
0:08:45	the first part of the loss is calculated between the predicted acoustic feature and the actual acoustic feature
0:08:51	as in the baseline PPG-based voice conversion
0:08:56	for the other part
0:08:59	we also use a vocoder to generate the converted speech signal
0:09:04	from the predicted acoustic features
0:09:06	and
0:09:07	feed this
0:09:08	speech signal to the ASV system
0:09:11	to get the
0:09:15	ASV feedback as another part of the loss
0:09:18	for the model updating
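
The two-part loss described above can be sketched as a weighted sum. This is only an illustration under stated assumptions: the talk gives a combination ratio (0.7) but does not say which term it weights, so the sketch assumes it weights the conversion loss; the function names, the mean-squared-error choice, and the "1 - score" form of the ASV feedback loss (with the score assumed to lie in [0, 1]) are hypothetical.

```python
def mse(predicted, actual):
    """Mean squared error between two equal-length feature vectors."""
    if len(predicted) != len(actual):
        raise ValueError("feature vectors must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def combined_loss(predicted, actual, asv_score, ratio=0.7):
    """Weighted sum of the conversion loss and the ASV feedback loss.

    Assumptions (not confirmed by the talk): `ratio` weights the
    conversion loss, and the ASV feedback loss is 1 - score, where the
    score is the black-box ASV output normalised to [0, 1].
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    vc_loss = mse(predicted, actual)   # first part: acoustic feature loss
    asv_loss = 1.0 - asv_score         # second part: feedback from the ASV system
    return ratio * vc_loss + (1.0 - ratio) * asv_loss


# Toy numbers, made up for illustration only.
loss = combined_loss([0.0, 1.0], [0.0, 0.0], asv_score=0.5, ratio=0.7)
```

With a ratio of 1.0 the sketch reduces to the baseline conversion loss with no feedback.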
0:09:23	during the attacking
0:09:25	it is the same as
0:09:27	the baseline framework
0:09:30	a PPG extractor is used to extract the PPG from the source speech and we
0:09:36	feed this source PPG into the conversion model to get the
0:09:40	converted acoustic feature
0:09:42	and
0:09:43	a vocoder is used to generate the converted speech signal to spoof the ASV
0:09:52	now let's see
0:09:54	how the combined loss
0:09:56	is used for the
0:09:58	voice conversion model training
0:10:01	as we know
0:10:03	for the black-box attack scenario
0:10:06	we do not have knowledge of the relationship between the voice conversion loss
0:10:14	and the ASV loss
0:10:16	so
0:10:17	no
0:10:18	there's no
0:10:19	within phone the signals part
0:10:25	but
0:10:25	the ASV loss will
0:10:28	change the combined loss curve
0:10:30	so to leverage the ASV signal for the voice conversion model training we use
0:10:36	an adaptive learning rate schedule
0:10:38	based on the loss
0:10:40	on the validation set to achieve the convergence
0:10:43	for example
0:10:44	the learning rate will be adjusted
0:10:48	or reduced
0:10:49	once the total loss increases on the validation set
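
The learning rate rule described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class name, the halving factor, the lower bound, and the initial rate are all assumptions; the talk only says the learning rate is reduced once the total loss increases on the validation set.

```python
class LossBasedLearningRate:
    """Reduce the learning rate whenever the validation loss goes up.

    `factor` and `min_lr` are hypothetical values chosen for illustration.
    """

    def __init__(self, initial_lr, factor=0.5, min_lr=1e-6):
        self.lr = initial_lr
        self.factor = factor
        self.min_lr = min_lr
        self.previous_loss = None

    def step(self, validation_loss):
        """Update the rate from the latest validation loss and return it."""
        if self.previous_loss is not None and validation_loss > self.previous_loss:
            # Total loss increased on the validation set: reduce the rate.
            self.lr = max(self.lr * self.factor, self.min_lr)
        self.previous_loss = validation_loss
        return self.lr


# Toy validation losses, made up for illustration only.
scheduler = LossBasedLearningRate(initial_lr=0.001)
history = [scheduler.step(loss) for loss in [1.0, 0.8, 0.9, 0.7]]
```

In this toy run the rate is halved once, after the third epoch, when the loss rises from 0.8 to 0.9. Because the rule only needs the loss value, it works even when the ASV part of the combined loss cannot be differentiated.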
0:10:56	let's go to the experiments and the results
0:11:00	first let's see the database used in our experiments, which consists of two parts
0:11:05	the training part and the validation part
0:11:09	for training
0:11:11	we have three models
0:11:13	i
0:11:14	course the images structure which is trained out of the target strata
0:11:19	the i-vector extractor is trained on the combined
0:11:23	corpora of
0:11:25	Switchboard and NIST SRE from two thousand six to two thousand ten
0:11:31	the conversion model is trained on the ASVspoof two thousand nineteen development set
0:11:37	we
0:11:38	choose six target speakers, including three male and three female
0:11:43	for each speaker we choose
0:11:45	but hundred and channel utterances
0:11:48	for model training
0:11:51	for the validation set we use the ASVspoof two thousand nineteen evaluation dataset, which contains
0:11:58	sixty-seven speakers
0:12:01	we choose twenty utterances per speaker
0:12:03	so in total we have one thousand
0:12:06	and
0:12:07	three hundred and forty utterances
0:12:12	we build two systems
0:12:15	to perform our experiments
0:12:17	the first is the PPG-based voice conversion system without feedback
0:12:23	the other is the
0:12:24	feedback-controlled voice conversion system, which is our proposed
0:12:28	system
0:12:30	empirically
0:12:32	the combination ratio
0:12:33	is set to zero point seven
0:12:38	for both models
0:12:40	we use the same network structure, which consists of two BLSTM
0:12:46	layers
0:12:48	with
0:12:48	one can find one two
0:12:51	hidden units in each layer
0:12:54	the network input is
0:12:56	a forty-two-dimensional PPG feature
0:13:00	while the
0:13:01	output dimension is two hundred and forty
0:13:04	consisting of
0:13:05	the eighty-dimensional mel-spectrum
0:13:08	with its dynamic and acceleration features
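
The output dimension of two hundred and forty is consistent with eighty static mel-spectrum values plus two sets of dynamic features (80 x 3 = 240). The sketch below shows one way such a frame could be assembled. It is an assumption-laden illustration: the simple central-difference formula, the edge handling, and the function names are hypothetical and are not taken from the talk.

```python
def delta(frames):
    """First-order dynamic features: 0.5 * (x[t+1] - x[t-1]).

    The first and last frames are repeated at the edges. This window is
    an assumed, commonly used choice, not the one stated in the talk.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([0.5 * (b - a) for a, b in zip(prev_frame, next_frame)])
    return out


def stack_dynamic_features(mel):
    """Concatenate static, dynamic and acceleration features per frame."""
    d1 = delta(mel)   # dynamic (delta) features
    d2 = delta(d1)    # acceleration (delta-delta) features
    return [s + a + b for s, a, b in zip(mel, d1, d2)]


N_MEL = 80  # static mel-spectrum dimension
# Toy input: 5 frames whose values grow linearly over time.
mel = [[float(t)] * N_MEL for t in range(5)]
features = stack_dynamic_features(mel)
frame_dim = len(features[0])  # 80 * 3
```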
0:13:11	the Griffin-Lim vocoder is used for speech signal reconstruction
0:13:17	this figure shows the training curves
0:13:21	on the
0:13:23	training and validation sets
0:13:26	the line shows the baseline PPG-based voice conversion
0:13:31	the
0:13:32	orange line shows the
0:13:34	feedback-controlled voice conversion with a combination ratio of zero point five
0:13:39	the line shows the
0:13:42	feedback-controlled voice conversion with a
0:13:45	combination ratio of zero point seven
0:13:49	from the results of the training curves we can see
0:13:57	that the
0:13:58	feedback-controlled voice conversion
0:14:01	can
0:14:02	generally achieve a lower overall loss during training, on both the training
0:14:08	and validation sets, especially for the
0:14:13	ASV loss
0:14:18	and according to this curve we can see
0:14:21	we combine loss
0:14:23	no
0:14:24	come biracial otherwise database
0:14:27	there is in
0:14:28	there won't find so
0:14:30	we chose zero point seven as our
0:14:34	setting
0:14:35	for the other experiments
0:14:39	for the objective evaluation we use EER to evaluate the speaker verification system
0:14:48	from the first row
0:14:50	we can see that
0:14:51	the ASV system performs
0:14:53	very effectively when the impostor trials are used
0:14:57	reason you carried little but those represent
0:15:02	and the performance
0:15:03	decreases significantly
0:15:05	when the PPG-based voice conversion
0:15:08	attacks are performed
0:15:12	where the EER is increased to over
0:15:16	twenty-five percent for
0:15:18	all the scenarios
0:15:21	and
0:15:23	it is also seen that
0:15:25	the proposed feedback-controlled voice conversion
0:15:28	is able to further increase the attack performance
0:15:32	which shows the vulnerability of ASV systems to adversarial attacks
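
The equal error rate used in this evaluation is the operating point where the false acceptance rate equals the false rejection rate. The sketch below shows a simple threshold sweep that approximates it. The function name is hypothetical and every score is made up for illustration; none of these numbers come from the talk.

```python
def equal_error_rate(genuine_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. The returned
    value is the mean of the false acceptance and false rejection rates
    at the threshold where the two are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for th in thresholds:
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < th for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores, made up for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]   # ordinary impostor trials
spoofed = [0.85, 0.75, 0.6, 0.3]  # converted speech scoring closer to genuine

eer_impostor = equal_error_rate(genuine, impostor)
eer_spoofed = equal_error_rate(genuine, spoofed)
```

When the non-target scores move towards the genuine scores, as they do for the spoofed list here, the two distributions overlap more and the EER rises.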
0:15:40	we also
0:15:41	use two figures as an example to show the effectiveness of our proposed
0:15:49	adversarial attack
0:15:50	no
0:15:52	the
0:15:53	no set
0:15:55	the brown line shows the impostor score distribution and the blue line shows the
0:16:02	score distribution of the genuine trials
0:16:06	and the yellow line shows the score distribution of the PPG-based
0:16:11	voice conversion baseline
0:16:13	and
0:16:14	the purple line shows the score distribution of our proposed method
0:16:19	we can see our proposed method can push the
0:16:24	score
0:16:25	towards the genuine scores
0:16:28	which shows the effectiveness of our proposed method
0:16:34	and let's go to the conclusion
0:16:37	in this work
0:16:38	we formulate a black-box adversarial attack scenario with a feedback-controlled voice conversion
0:16:42	system
0:16:43	which effectively
0:16:45	degrades the speaker verification system performance
0:16:49	we also evaluated the proposed
0:16:51	and non-adversarial voice conversion frameworks
0:16:54	on the ASVspoof two thousand nineteen corpus
0:16:57	which is widely used for spoofing system benchmarking
0:17:02	we also provide
0:17:04	an analytical study
0:17:07	of the proposed framework that exposes the weak links
0:17:10	of the common speaker verification systems
0:17:13	in facing
0:17:14	voice conversion attacks
0:17:17	that's all for my presentation
0:17:19	thank you for your attention

Black-box Attacks on Automatic Speaker Verification using Feedback-controlled Voice Conversion

Spoofing and Countermeasure 1

Xiaohai Tian, Rohan Kumar Das, Haizhou Li