Speech Transcript - A Multi-condition Training Strategy for Countermeasures Against Spoofing Attacks to Speaker Recognizers

0:00:16	hi everyone i'm jahangir alam from computer research institute of montreal
0:00:20	today i'm going to talk about a multi-condition training strategy for countermeasures against
0:00:27	spoofing attacks
0:00:29	to speaker recognizers
0:00:30	this is the joint work with joao monteiro and tiago falk
0:00:35	in this presentation i'm going to provide an overview of our work
0:00:40	on end-to-end spoofing detection
0:00:44	employing a tdnn that utilizes data augmentation to increase
0:00:50	the amount of
0:00:51	training data for improving
0:00:54	performance on the unseen test data
0:01:04	the outline of my talk is as follows i will start with the
0:01:09	introduction and background then i'm going to talk about
0:01:13	spoofing detection data augmentation
0:01:16	baselines used for this task
0:01:20	an end-to-end approach to spoofing detection using a tdnn architecture
0:01:26	and finally i'm going to provide some results for performance evaluation and
0:01:32	then i'm going to conclude my talk
0:01:39	first is the introduction and background
0:01:43	given a pair of recordings
0:01:45	the goal of
0:01:47	a speaker verification system is to determine
0:01:50	whether the recordings are from the same speaker or
0:01:55	from two different speakers
0:01:57	in order to do so
0:01:59	a speaker verification system utilizes a set of recognizable
0:02:04	and verifiable voice characteristics
0:02:09	which are normally considered unique and specific to a person
0:02:15	these characteristics are normally extracted
0:02:19	in the feature extraction module of a speaker verification system
0:02:23	in a controlled setting
0:02:25	speaker verification systems perform very well
0:02:29	but their
0:02:30	performance degrades in real-world settings
0:02:34	where an impostor can pretend to be a genuine speaker by forging
0:02:39	a genuine speaker's voice recording
0:02:43	or when there is a mismatch between training and test environments
0:02:48	in this work we are mainly concerned with the
0:02:51	forging of a genuine speaker's voice by an impostor
0:02:59	in a speaker verification system
0:03:02	the claimed identity
0:03:05	can be genuine or forged by an impostor
0:03:09	whose goal is to gain illegitimate access to the system
0:03:15	so the manipulation of an authentication system by impostors is normally known as spoofing
0:03:22	speaker verification systems are vulnerable to spoofing attacks generated by
0:03:29	replay
0:03:31	speech synthesis voice conversion
0:03:34	and impersonation
0:03:37	except impersonation all other three attacks are normally considered major threats
0:03:44	to a speaker verification system
0:03:47	among the three major attack types
0:03:50	replay is known as the physical access attack whereas
0:03:55	speech synthesis
0:03:57	and voice conversion attacks are known as the logical access attacks
0:04:05	next i'm going to talk about spoofing detection
0:04:10	fortunately all the attacks
0:04:14	discussed in the previous slide that means replay
0:04:19	speech synthesis and voice conversion leave some traces in the converted speech in the
0:04:25	form of audible artifacts
0:04:28	spoofing detection techniques normally use these audible artifacts
0:04:33	in order to distinguish
0:04:35	spoofed speech from genuine speech
0:04:40	to make speaker verification systems robust to spoofing attacks
0:04:46	speaker verification and spoofing detection systems can be
0:04:51	connected in cascade or in parallel
0:04:53	in the left side of the figure the spoofing detection is followed
0:04:58	by the speaker verification system
0:05:01	where the recording of the claimed identity is
0:05:05	initially passed through the speaker verification system to make the verification decision
0:05:10	if the identity
0:05:12	is accepted by the verification system
0:05:15	it is then passed through the spoofing detection system
0:05:18	to find out whether the claimed identity is actually
0:05:22	genuine or spoofed
0:05:26	in the right side of the figure the speaker verification system is followed
0:05:31	by the spoofing detection system in this case
0:05:35	the
0:05:37	claimed identity if the claimed identity is found genuine only then it is passed
0:05:42	to the verification system to make the verification decision
0:05:48	finally
0:05:51	speaker verification and spoofing detection systems can be connected in parallel
0:05:57	in this case
0:05:59	the fused score of
0:06:02	speaker verification and spoofing detection
0:06:06	systems is used to make the accept or reject decision
0:06:11	the advantage of this approach is that only one threshold is required
0:06:17	to make the verification decision
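The cascaded and parallel connections described above can be sketched as two small decision rules. This is only an illustration: the talk does not say how the two scores are fused, so the weighted sum, the function names, and the `asv_weight` parameter are assumptions.

```python
def cascade_decision(asv_score, cm_score, asv_threshold, cm_threshold):
    """Cascade: the trial is accepted only if both systems accept it.

    Two thresholds are needed, one per system.
    """
    return asv_score >= asv_threshold and cm_score >= cm_threshold


def fused_decision(asv_score, cm_score, threshold, asv_weight=0.5):
    """Parallel: a single threshold is applied to one fused score.

    The weighted sum is an assumed fusion rule, chosen for simplicity.
    """
    fused = asv_weight * asv_score + (1.0 - asv_weight) * cm_score
    return fused >= threshold
```

With the parallel rule only `threshold` has to be tuned, which is the advantage mentioned in the talk; the cascade needs a separate operating point for each system.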
0:06:20	like the two thousand fifteen and two thousand seventeen editions of the asvspoof
0:06:24	challenge
0:06:26	in the two thousand nineteen edition of the asvspoof challenge
0:06:33	the participants were asked to build a standalone spoofing detection system irrespective of
0:06:38	a speaker verification system
0:06:42	but in the two thousand nineteen edition of the asvspoof challenge the organisers provided the verification
0:06:48	scores for the participants
0:06:51	so the participants can evaluate their spoofing detection scores
0:06:55	in terms of tandem detection cost function when used alongside the verification system
0:07:08	next i'm going to talk about data augmentation
0:07:13	modern machine learning models such as deep learning architectures
0:07:19	may have billions of parameters and normally require a large amount of data for training
0:07:26	but in
0:07:28	most of the application cases having a large amount of data is normally not possible
0:07:35	for example consider the case of the asvspoof challenges where the training data provided
0:07:41	to the participants are not sufficient
0:07:46	to expect generalized performance using
0:07:50	deep learning based approaches
0:07:53	so
0:07:56	to use a deep learning architecture we need to increase the training data
0:08:00	the process of increasing the amount and the diversity of training data is
0:08:06	normally known as
0:08:09	data augmentation
0:08:10	data augmentation normally serves
0:08:12	two purposes
0:08:13	one purpose is domain adaptation or domain generalization
0:08:18	in this case the main goal is to compensate for the environment mismatch
0:08:22	between training and test data
0:08:25	and this approach is widely used in speech based applications
0:08:30	for example speaker recognition and speech recognition
0:08:34	another purpose of data augmentation is regularization
0:08:38	the main goal is to improve performance on unseen test data
0:08:43	by
0:08:44	increasing the training data
0:08:46	in this work our purpose was to
0:08:49	do regularization
0:08:53	for this work we tried to adopt a data augmentation strategy that preserves the
0:08:59	artifacts of the spoofing attacks
0:09:02	and at the same time does not
0:09:05	use any external data such as
0:09:08	noise reverberation et cetera
0:09:11	the data augmentation strategy adopted in this work is presented
0:09:17	in this figure
0:09:19	of the slide
0:09:21	here additional training data were created by using speed perturbation with
0:09:27	speed perturbation factors of
0:09:29	zero point nine and one point one
0:09:32	and low-pass and high-pass filtering of the training data
0:09:38	by doing the data augmentation in this case we were able to increase the training data
0:09:44	to five times the original training data
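A minimal sketch of how one recording could be turned into five training copies, as described above. The speed factors 0.9 and 1.1 come from the talk; the linear-interpolation resampler, the one-pole filters, the 4000 Hz cutoff, and all function names are assumptions made only for illustration, since the talk does not specify the filter design.

```python
import math


def speed_perturb(samples, factor):
    """Resample by linear interpolation so playback is `factor` times faster."""
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def low_pass(samples, cutoff_hz, sample_rate):
    """One-pole low-pass filter (assumed design, not from the talk)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out


def high_pass(samples, cutoff_hz, sample_rate):
    """Complementary high-pass: the input minus its low-pass version."""
    low = low_pass(samples, cutoff_hz, sample_rate)
    return [x - l for x, l in zip(samples, low)]


def augment(samples, sample_rate=16000, cutoff_hz=4000.0):
    """Return the original plus four perturbed copies (five in total)."""
    return [
        samples,
        speed_perturb(samples, 0.9),
        speed_perturb(samples, 1.1),
        low_pass(samples, cutoff_hz, sample_rate),
        high_pass(samples, cutoff_hz, sample_rate),
    ]
```

No external noise or reverberation data is used here, which matches the constraint stated in the talk.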
0:09:51	next i'm going to talk about
0:09:53	the speech representations we used
0:09:56	for this work
0:09:59	in the course of
0:10:00	the two thousand fifteen and two thousand seventeen editions of
0:10:04	the asvspoof challenges
0:10:06	and after the evaluations it became almost clear that the most effective countermeasures
0:10:13	for spoofing detection are the local speech representations
0:10:18	by local i mean frame level features
0:10:21	which are typically extracted over ten millisecond intervals
0:10:25	for the asvspoof two thousand nineteen challenge task
0:10:30	we
0:10:31	used three widely used local speech representations
0:10:35	one of them is
0:10:37	the linear frequency cepstral coefficient feature
0:10:40	and the various steps to compute this feature are presented in the left
0:10:46	hand side of the figure
0:10:48	another
0:10:50	feature is the
0:10:53	cqcc that is the constant q cepstral coefficient feature
0:10:57	which was found very effective for the
0:11:00	two thousand fifteen edition of the asvspoof challenge task
0:11:04	and
0:11:05	we used
0:11:06	this feature also in this task
0:11:10	as this feature was provided along
0:11:13	with the baseline
0:11:15	and to compute the cqcc feature the various steps are presented in the right
0:11:21	hand side of the figure
0:11:27	another local speech representation we used for this work is the
0:11:33	product spectrum which is the product of the power spectrum and the group delay function
0:11:39	this feature incorporates both the amplitude and phase spectral components
0:11:44	and the various steps for computing this feature are presented
0:11:50	in this figure of the slide
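As a rough sketch of the product spectrum for one frame: if X is the DFT of the frame x[n] and Y is the DFT of n times x[n], the group delay is (X_R Y_R + X_I Y_I) / |X|^2, so multiplying by the power spectrum |X|^2 leaves X_R Y_R + X_I Y_I. The naive DFT and the function names below are illustrative assumptions; the remaining steps shown on the slide, such as any windowing or cepstral processing, are not reproduced here.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform, adequate for short frames."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def product_spectrum(frame):
    """Power spectrum times group delay, computed without any division."""
    x_spec = dft(frame)
    # Y is the DFT of the frame weighted by its sample index.
    y_spec = dft([t * s for t, s in enumerate(frame)])
    return [x.real * y.real + x.imag * y.imag for x, y in zip(x_spec, y_spec)]
```

A unit impulse delayed by three samples has a flat power spectrum of one and a group delay of three samples, so every bin of its product spectrum is three, which is a convenient sanity check.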
0:11:55	next i'm going to talk about the baselines used for spoofing detection in this task
0:12:02	in order to make a comparison of performance we used two baselines provided by the
0:12:07	organizers one of the baselines is the cqcc feature with a gmm classifier
0:12:12	and another baseline is the
0:12:14	lfcc feature with the same gmm classifier
0:12:19	besides we also created our own baselines one of our baselines is mfcc with the
0:12:25	gmm and another is the i-vector plda system
0:12:28	both of our baselines were
0:12:31	built using the kaldi toolkit
0:12:37	the figure of this slide presents the gmm based framework for spoofing
0:12:42	detection
0:12:44	in this framework
0:12:45	genuine and spoof gmm models are trained
0:12:49	using genuine and
0:12:51	spoofed speech training data
0:12:54	then given a test recording a genuine or spoofed decision is made based on the likelihood ratio
0:13:02	computed using the trained gmm models
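The likelihood ratio decision can be sketched as follows, assuming diagonal-covariance GMMs whose parameters are already trained. Each model is represented here as a hypothetical `(weights, means, variances)` tuple, and the score is averaged over frames; these representation choices are assumptions, not details given in the talk.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; positive favours genuine."""
    total = 0.0
    for frame in frames:
        total += gmm_log_likelihood(frame, *genuine_gmm)
        total -= gmm_log_likelihood(frame, *spoof_gmm)
    return total / len(frames)
```

Comparing the returned score with a threshold gives the genuine or spoofed decision.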
0:13:09	next i'm going to talk about the end-to-end approach that we used for
0:13:13	spoofing detection in this task
0:13:19	in an end-to-end approach the local speech representations are normally directly mapped to a
0:13:25	spoofing detection score
0:13:27	in this approach for modeling we adopted the tdnn
0:13:31	and more detail of the architecture is presented in
0:13:34	table one of the slide
0:13:37	in this architecture several
0:13:40	one dimensional convolutions are performed
0:13:44	to encode
0:13:47	the input local speech representations into local countermeasures
0:13:51	a statistics pooling layer is used to summarize this sequence of
0:13:56	local countermeasures into a global countermeasure
0:14:00	finally the global countermeasure is projected into a final output score
0:14:07	through an affine transformation learned along with the complete model
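The pooling and projection steps can be illustrated with a short sketch. It assumes that statistics pooling means concatenating the per-dimension mean and standard deviation over time, which is the common choice in tdnn systems but is not confirmed in the talk; the function names and the hand-set weights are hypothetical.

```python
import math


def statistics_pooling(frame_embeddings):
    """Summarize a sequence of frame-level vectors as [means, std devs]."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    means = [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frame_embeddings) / n)
        for d in range(dim)
    ]
    return means + stds


def affine_score(global_vector, weights, bias):
    """Project the pooled vector to one scalar score.

    In the real model the weights and bias would be learned jointly
    with the convolutional layers rather than set by hand.
    """
    return sum(w * x for w, x in zip(weights, global_vector)) + bias
```

Because the pooled vector has a fixed size, recordings of any duration map to a single score.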
0:14:16	for training
0:14:18	binary cross entropy loss is used as in a standard binary classification setting
0:14:24	as you can see
0:14:27	as we have seen previously the training data is quite unbalanced for the asvspoof challenge
0:14:32	data
0:14:33	the spoofed data is almost nine times the bona fide training data
0:14:39	so
0:14:41	mini-batches are created in such a way that genuine examples are sampled
0:14:46	several times per epoch so as to ensure
0:14:50	the mini-batches are balanced
0:14:53	training was carried out using the stochastic gradient descent algorithm with a mini-batch size
0:14:59	of sixteen
0:15:01	only have to selection is also employed for this does
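A minimal sketch of the balanced mini-batch construction and the loss described above. The batch size of sixteen is from the talk; sampling genuine examples with replacement, the half-and-half split, the label convention (genuine is 1, spoofed is 0), and the function names are assumptions for illustration.

```python
import math
import random


def balanced_minibatches(genuine, spoofed, batch_size=16, seed=0):
    """Yield batches with equal numbers of genuine and spoofed examples.

    Each spoofed example is visited once per pass, while genuine examples
    are drawn with replacement and therefore repeat several times.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    spoofed = list(spoofed)
    rng.shuffle(spoofed)
    for start in range(0, len(spoofed) - half + 1, half):
        batch = [(x, 1) for x in rng.choices(genuine, k=half)]
        batch += [(x, 0) for x in spoofed[start:start + half]]
        rng.shuffle(batch)
        yield batch


def binary_cross_entropy(score, label):
    """Loss for one example; `score` is a raw logit and `label` is 0 or 1."""
    z = score if label == 1 else -score
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

With roughly nine spoofed recordings per genuine one, each genuine recording is expected to appear about nine times in one pass over the spoofed data.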
0:15:09	next i'm going to present some results on the asvspoof two thousand nineteen
0:15:13	challenge development and evaluation data
0:15:17	the metrics used for performance evaluation are the normalized minimum
0:15:23	tandem detection cost function and the equal error rate
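For reference, the equal error rate can be approximated with a simple threshold sweep. This sketch assumes that higher scores mean more likely genuine and it picks the threshold where the miss and false alarm rates are closest, without interpolation, so it is an approximation rather than the official challenge scoring code. The tandem detection cost function is not sketched because it also needs the verification system's error rates and cost parameters, which the talk does not give.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER for scores where higher means more likely genuine."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Miss: a genuine trial is rejected because its score is below t.
        miss = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False alarm: a spoofed trial is accepted because its score is >= t.
        false_alarm = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer
```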
0:15:28	for experiments on the logical and physical access tasks
0:15:32	the
0:15:33	asvspoof two thousand nineteen challenge data were used
0:15:37	for the physical access task spoofed data were generated using simulated replay attacks whereas
0:15:43	for
0:15:44	for the physical access task
0:15:46	the spoofed data were generated using simulated replay attacks whereas for the
0:15:52	logical access task spoofed data were generated using various
0:15:56	speech synthesis and voice conversion algorithms
0:16:00	table two presents
0:16:03	the number of genuine and spoofed recordings and the number of speakers
0:16:07	contained in the training development and evaluation partitions of the
0:16:12	logical access and physical access tasks
0:16:17	we can see from this table that the training data is quite unbalanced
0:16:25	physical access spoofing detection results in terms of tandem detection cost function and equal error
0:16:31	rate
0:16:32	on the development as well as evaluation sets are reported in
0:16:38	tables three and five
0:16:41	we can observe from the presented results that
0:16:45	data augmentation helped
0:16:46	to improve performance on both test sets
0:16:52	in this
0:16:54	logical access spoofing detection results
0:16:57	in this slide we presented logical access spoofing detection results in terms of
0:17:02	tandem detection cost function and equal error rate
0:17:06	on the development as well as the evaluation sets
0:17:11	for the logical access task
0:17:14	data augmentation is found effective only on the development set
0:17:18	and overall we can see that the end-to-end approach employing the tdnn architecture provided
0:17:24	better performance
0:17:26	than the baselines
0:17:29	on both logical and physical access tasks
0:17:36	finally in conclusion we can say
0:17:40	data augmentation is found helpful specifically for the physical access task for spoofing detection
0:17:47	employing a deep learning architecture
0:17:51	for
0:17:52	in order to do data augmentation for spoofing detection we have to make sure the
0:17:57	signal transformations employed
0:18:00	in the data augmentation technique preserve the artifacts introduced by
0:18:06	spoofing algorithms
0:18:09	the end-to-end approach employing tdnn and data augmentation outperformed the
0:18:15	baselines
0:18:16	the end-to-end approach without
0:18:19	data augmentation
0:18:21	the end-to-end approach
0:18:22	with the tdnn architecture with
0:18:26	and without data augmentation
0:18:28	outperformed all the baselines on both the logical access and physical access tasks
0:18:35	data augmentation by speed perturbation and
0:18:41	filtering
0:18:42	basically low-pass and high-pass filtering is
0:18:46	found useful for the physical access task but for the
0:18:49	logical access task
0:18:51	speed perturbation is found harmful
0:18:56	feature normalization use of voice activity detection and already to different abbing deviation of the
0:19:02	filters less than
0:19:04	sixty four
0:19:06	filters are not recommended for the spoofing detection task
0:19:15	thank you very much for your attention

A Multi-condition Training Strategy for Countermeasures Against Spoofing Attacks to Speaker Recognizers

Spoofing and Countermeasure 2

Joao Monteiro, Jahangir Alam, Tiago Falk