Speech Transcript - Using Multi-Resolution Feature Maps with Convolutional Neural Networks for Anti-Spoofing in ASV

0:00:13	hi everyone thank you for joining my presentation
0:00:17	i am qiongqiong wang from nec corporation
0:00:20	today i would like to present my paper
0:00:23	using multiresolution feature maps
0:00:26	with
0:00:27	convolutional neural networks for anti-spoofing in asv
0:00:33	here are the contents first i would like to give the introduction and the review
0:00:38	of multiple
0:00:39	feature maps popular in asv spoofing detection
0:00:44	next
0:00:44	i will introduce our proposed
0:00:47	multiresolution feature map
0:00:49	and how it is used with feature extraction
0:00:52	and with neural networks
0:00:55	we'll give three popular cnn variants as examples
0:01:00	resnet eighteen senet fifty and lc-nn
0:01:05	and we show the effectiveness of the proposed method
0:01:08	in experiments
0:01:10	and also give an analysis in terms of computational cost
0:01:14	finally i'd like to summarise this presentation
0:01:21	automatic speaker verification asv
0:01:25	offers flexible biometric authentication and has been increasingly employed in such telephone based services as
0:01:34	telephone banking
0:01:35	in four and six at call center and so on
0:01:39	asv's reliability depends on its resilience to spoofing
0:01:44	as is true of any biometric technology
0:01:48	therefore with the increase of
0:01:50	use
0:01:52	of asv spoofing detection in speech is also getting more attention
0:01:58	there are two scenarios of spoofing attacks
0:02:01	logical access and physical access
0:02:04	logical access includes text-to-speech synthesis and voice conversion
0:02:11	physical access is mainly replay where the target identity's voice is recorded and
0:02:17	replayed
0:02:18	replay is very easy
0:02:20	to implement and as you may know very hard to detect
0:02:25	asvspoof challenges
0:02:28	two thousand fifteen to two thousand nineteen have been driving efforts on
0:02:33	anti-spoofing countermeasures
0:02:35	and have resulted in significant findings
0:02:38	asvspoof two thousand fifteen
0:02:40	focused on spoofing attacks generated with different speech synthesis
0:02:44	and voice conversion
0:02:47	asvspoof two thousand seventeen focused on
0:02:50	replay attacks
0:02:52	then
0:02:53	asvspoof
0:02:54	two thousand nineteen
0:02:57	addressed all types
0:02:58	of
0:03:00	spoofing
0:03:00	in the previous two challenges
0:03:02	and further extended the data sets
0:03:04	in terms of spoofing technology
0:03:08	number of
0:03:09	conditions and volume of data
0:03:13	with a lot of research done along with the challenges
0:03:17	the trend has been
0:03:21	shifted from gmm with features like mfccs and cqccs in the
0:03:27	beginning
0:03:28	to
0:03:29	deep neural networks
0:03:31	with
0:03:31	high time-frequency resolution features
0:03:35	that have been proved to achieve higher accuracy
0:03:39	following this
0:03:41	conventional methods
0:03:43	choose carefully which type of acoustic feature to use
0:03:47	as it is essential in speech processing tasks
0:03:51	including spoofing detection
0:03:53	however
0:03:54	utilising only one type of acoustic features may not be sufficient to
0:03:59	detect
0:04:00	global spoofing vectors when facing unseen
0:04:04	spoofing speech
0:04:10	as we know
0:04:12	from a given
0:04:14	audio segment
0:04:16	multiple acoustic feature maps can be
0:04:20	extracted
0:04:21	such as mfcc cqcc
0:04:23	so if sissy fft
0:04:26	security and so on
0:04:28	it may be difficult to determine
0:04:31	which type of acoustic feature maps
0:04:34	will be the best for asv spoofing detection
0:04:37	even for one type
0:04:39	of acoustic feature
0:04:42	different settings
0:04:44	used in the extraction will result in obtaining different information
0:04:49	for example
0:04:50	fft spectrograms extracted
0:04:53	with different window lengths contain spectral information with
0:04:58	resolutions
0:04:59	that
0:04:59	differ in higher and lower frequency bands
0:05:04	a shorter window
0:05:05	will lead to high resolution in terms of time
0:05:09	and a
0:05:10	lower resolution in terms of frequency
0:05:14	on the contrary a longer window
0:05:20	will extract an fft spectrogram
0:05:23	which has
0:05:24	a higher resolution in frequency and a lower resolution in time
0:05:30	the trade-off between time and frequency resolution makes it difficult to extract
0:05:37	sufficient information with one fft
0:05:39	spectrogram alone
0:05:41	therefore
0:05:42	the use of multiple acoustic feature maps
0:05:46	is needed to alleviate the problem
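The time and frequency trade-off described here can be illustrated with a little arithmetic. This is only a sketch: the 16 kHz sampling rate is an assumption that the talk does not state, the function name is hypothetical, and the window lengths are the ones mentioned later in the talk. The frequency resolution is approximated as the reciprocal of the window duration.

```python
# Illustrative arithmetic only. SAMPLE_RATE_HZ is an assumption, not a value
# given in the talk, and window_resolution is a hypothetical helper name.
SAMPLE_RATE_HZ = 16_000


def window_resolution(window_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (samples per window, approximate frequency resolution in Hz)."""
    samples = round(sample_rate_hz * window_ms / 1000)
    # The spectral resolution of an analysis window scales with 1 / duration,
    # so a longer window separates closer frequencies but blurs timing.
    freq_resolution_hz = 1000.0 / window_ms
    return samples, freq_resolution_hz


for window_ms in (18, 25, 30):
    samples, resolution = window_resolution(window_ms)
    print(f"{window_ms} ms window: {samples} samples, about {resolution:.1f} Hz")
```

A longer window gives a finer frequency grid at the price of coarser timing, which is why no single window length captures everything.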
0:05:50	the question is
0:05:52	how to use multiple acoustic feature maps together
0:06:01	there are feature fusion and score fusion
0:06:03	score fusion is a kind of late fusion which can be used
0:06:07	to fuse
0:06:09	scores produced from systems
0:06:12	using
0:06:14	individual feature maps
0:06:17	however score fusion can be computationally costly since it needs to train neural
0:06:23	networks multiple times
0:06:26	in addition fusion weights need to be determined in advance
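As a rough illustration of score fusion, the sketch below combines per-system scores for one utterance as a weighted sum. The weighted-sum form, the function name, and the example numbers are assumptions made for illustration; the talk only says that fusion weights must be fixed in advance.

```python
def fuse_scores(system_scores, weights):
    """Weighted sum of the scores that separate systems give one utterance.

    Both arguments are hypothetical inputs: one score per single-feature-map
    system, and one weight per system chosen before evaluation.
    """
    if len(system_scores) != len(weights):
        raise ValueError("need exactly one weight per system score")
    return sum(w * s for w, s in zip(weights, system_scores))


# Three systems, for example one per window length, with made-up scores.
fused = fuse_scores([1.0, 2.0, 4.0], [0.5, 0.25, 0.25])
print(fused)
```

Each score comes from a separately trained network, which is where the extra training cost of score fusion comes from.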
0:06:31	as for feature fusion
0:06:33	there is feature map concatenation
0:06:37	along a single dimension such as time or frequency dimension
0:06:42	there is also linear interpolation too
0:06:45	feature addition is also a way feature fusion followed with
0:06:51	neural networks has the advantage of automatic
0:06:55	feature selection so we chose feature fusion over score fusion
0:07:08	we proposed
0:07:10	multiresolution feature maps
0:07:12	which stack
0:07:14	multiple feature maps
0:07:15	of the same dimensionality
0:07:18	into
0:07:20	a three-dimensional input for deep neural networks
0:07:25	it is suitable for
0:07:27	two d cnns
0:07:29	the modification of the neural network is also very simple
0:07:33	one only needs to
0:07:35	modify the first layer of the neural network from dimension
0:07:42	one times
0:07:44	c one c one here is the output dimension of the first layer of the neural network
0:07:49	from this dimension to
0:07:54	the number of channels
0:07:57	which means the number of feature maps
0:08:01	times
0:08:01	the output dimension
0:08:05	so the proposed method makes
0:08:07	it possible to extract more information
0:08:10	from input signals
0:08:13	with relatively little
0:08:15	additional cost
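A minimal sketch of the stacking idea and of why the extra cost is small, using plain nested lists. The toy map size, the 64 output channels, and the 3 by 3 kernel are hypothetical values chosen for illustration, and both helper names are invented here.

```python
def stack_feature_maps(feature_maps):
    """Stack K maps of identical (freq, time) shape into a K x F x T nested list."""
    shapes = {(len(m), len(m[0])) for m in feature_maps}
    if len(shapes) != 1:
        raise ValueError("all feature maps must share the same dimensionality")
    return [[row[:] for row in m] for m in feature_maps]


def first_conv_weights(in_channels, out_channels, kernel_h, kernel_w):
    """Weight count of a 2-D convolution layer (bias terms ignored)."""
    return in_channels * out_channels * kernel_h * kernel_w


# Toy maps standing in for spectrograms from three window lengths.
FREQ_BINS, FRAMES = 4, 6
maps = [[[float(k)] * FRAMES for _ in range(FREQ_BINS)] for k in range(3)]
stacked = stack_feature_maps(maps)
shape = (len(stacked), len(stacked[0]), len(stacked[0][0]))
print(shape)

# Only the first layer changes: 1 x C1 input-output pairs become K x C1.
single = first_conv_weights(1, 64, 3, 3)  # hypothetical C1 = 64, 3 x 3 kernel
multi = first_conv_weights(3, 64, 3, 3)
print(multi - single)
```

Every later layer keeps its size, so the added weights are confined to the first convolution.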
0:08:25	the experimental data used in this
0:08:28	study was the physical access subset
0:08:31	of the asvspoof two thousand nineteen challenge
0:08:35	it contains about fifty thousand spoofed
0:08:37	and about
0:08:39	five thousand
0:08:40	bona fide
0:08:41	utterances in the training partition
0:08:43	as well as twenty four thousand spoofed
0:08:46	and
0:08:47	five thousand bona fide
0:08:49	utterances in the development
0:08:51	partition
0:08:52	the development dataset was in the conditions seen
0:08:56	in the training data
0:08:59	it contains
0:09:01	twenty seven recording acoustic configurations
0:09:04	and nine replay configurations
0:09:07	evaluation
0:09:09	contains
0:09:09	much larger size of data
0:09:11	and includes
0:09:13	spoofed utterances of unseen conditions as well
0:09:19	this is
0:09:20	the scenario of replay in asvspoof two thousand nineteen
0:09:25	replay attacks
0:09:27	are simulated
0:09:29	within an acoustic environment
0:09:32	the talker here
0:09:34	speaks to the asv
0:09:37	system
0:09:38	and an attacker
0:09:41	records the speech
0:09:44	at defined distances
0:09:46	and then
0:09:47	he or she will replay
0:09:49	the speech back
0:09:51	to the and this
0:09:53	the point
0:09:54	to the asv system
0:09:59	in our experiments we used fft spectrograms and we used window lengths of eighteen
0:10:05	twenty five and thirty
0:10:07	milliseconds
0:10:08	the spectrogram's dimension was trimmed to two hundred fifty seven times four hundred
0:10:15	we used unified
0:10:17	fft feature maps
0:10:21	since the lengths of evaluation utterances are usually not known beforehand
0:10:28	we first extended utterances
0:10:31	to their minimum multiple of four hundred
0:10:36	frames
0:10:37	and then cut them into
0:10:40	multiple four hundred frame
0:10:41	segments with
0:10:43	two hundred
0:10:45	frame overlap
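The extension and segmentation steps can be sketched as follows. The talk does not say how utterances are extended, so repeating frames from the start is an assumption made here, and the function names are hypothetical. A hop of two hundred frames gives the stated two hundred frame overlap between four hundred frame segments.

```python
import math

SEGMENT_FRAMES = 400  # segment length from the talk
HOP_FRAMES = 200      # 400 - 200 frame overlap


def extend_to_multiple(frames, segment=SEGMENT_FRAMES):
    """Extend to the smallest multiple of `segment` frames.

    Assumption: the extension repeats the utterance from its beginning.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("utterance has no frames")
    target = math.ceil(n / segment) * segment
    return [frames[i % n] for i in range(target)]


def cut_segments(frames, segment=SEGMENT_FRAMES, hop=HOP_FRAMES):
    """Cut a frame sequence into fixed-length segments that overlap by segment - hop."""
    return [frames[s:s + segment]
            for s in range(0, len(frames) - segment + 1, hop)]


utterance = list(range(900))             # frame indices stand in for spectrogram columns
extended = extend_to_multiple(utterance)  # padded up to 1200 frames
segments = cut_segments(extended)
print(len(extended), len(segments))
```

In practice each frame would be a column of the two hundred fifty seven bin spectrogram, and the same cut would be applied to every resolution so the stacked maps stay aligned.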
0:10:52	experiments were carried out
0:10:54	using the following three cnn variants
0:10:57	resnet eighteen and senet fifty
0:11:00	and the light cnn
0:11:02	all the networks
0:11:03	have
0:11:04	ten output nodes
0:11:08	one stands for
0:11:10	the bona fide condition
0:11:12	also called genuine
0:11:13	and the other nine
0:11:15	represent nine replay
0:11:17	configurations
0:11:20	the log probability of the bona fide
0:11:24	class
0:11:25	is used as the spoofing detection score to make the final decision
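A small sketch of how such a score could be computed from the ten network outputs. Placing the bona fide class at index 0 is an assumption, as are the function names; the log-softmax itself is the standard way to turn raw outputs into log probabilities.

```python
import math

BONA_FIDE_INDEX = 0  # assumption: position of the bona fide class among the ten outputs


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw network outputs."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def spoofing_score(logits, bona_fide_index=BONA_FIDE_INDEX):
    """Log probability of the bona fide class; higher means more likely genuine."""
    return log_softmax(logits)[bona_fide_index]


uniform = spoofing_score([0.0] * 10)             # no evidence either way
confident = spoofing_score([5.0] + [0.0] * 9)    # strong bona fide output
print(uniform, confident)
```

Comparing this score with a threshold gives the accept or reject decision.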
0:11:37	the model parameters and architectures of resnet eighteen and senet fifty are
0:11:42	shown in this table
0:11:44	the basic and the bottleneck residual blocks
0:11:48	are described in the original resnet paper
0:11:52	resnet eighteen and senet
0:11:54	fifty have been shown in
0:11:56	this paper
0:11:58	to be effective for replay
0:12:01	spoofing detection
0:12:03	that is why we chose these two networks
0:12:06	light cnn is a kind of cnn with max feature map
0:12:11	activation
0:12:12	the use of mfm
0:12:14	allows us to reduce the number of channels
0:12:18	by half
0:12:19	that is why it's called light cnn
0:12:22	light cnns worked the best in asvspoof two thousand seventeen
0:12:28	and also ranked highly
0:12:30	in asvspoof two thousand
0:12:32	nineteen challenge
0:12:34	that's why we also include
0:12:36	light cnn in our study
0:12:40	we first compare spoofing detection equal error rates
0:12:44	when using single feature maps of different resolutions
0:12:49	in these three
0:12:50	networks
0:12:52	for different neural network architectures
0:12:54	the respective
0:12:56	best performances were obtained using
0:12:59	well
0:13:00	different feature maps
0:13:08	so here the fft eighteen twenty five thirty represent fft spectrograms extracted with
0:13:17	these window lengths
0:13:22	the fft spectrogram
0:13:25	extracted with twenty five
0:13:27	millisecond fft
0:13:29	twenty five
0:13:31	and extracted with
0:13:33	thirty millisecond
0:13:35	give similar results in resnet eighteen
0:13:39	and they are significantly better
0:13:42	than those with eighteen millisecond
0:13:47	and for senet fifty however
0:13:50	fft eighteen and twenty five give
0:13:54	similar results
0:13:56	while
0:13:57	those with thirty millisecond were the best
0:14:03	for lc-nn the fft spectrograms of twenty five millisecond
0:14:08	gave the best performance
0:14:11	so there may not be one single optimal fft configuration for
0:14:17	different neural network structures
0:14:23	next we applied the proposed multiresolution feature maps
0:14:27	as input to the cnns
0:14:31	we also show at the same time the results of score fusion
0:14:37	which is a straightforward method when multiple feature maps
0:14:41	are available
0:14:46	here are the results on the development set
0:14:50	where the replay conditions are seen
0:14:55	the grey bars here represent
0:14:58	performance for single feature maps
0:15:01	and blue bars represent score fusion and the yellow bars represent the proposed method
0:15:09	we can see from the figure
0:15:10	the use of two-resolution feature maps
0:15:16	here
0:15:17	improves
0:15:18	the times a forty four and thirty percent for these three cnns
0:15:25	respectively
0:15:27	well
0:15:28	and these results are compared to the better
0:15:32	one of the two single feature systems
0:15:36	then we also have three-resolution
0:15:39	results here
0:15:42	which show the best performance for
0:15:45	resnet eighteen and senet fifty
0:15:49	which
0:15:50	are fifty two percent and fifty seven percent in
0:15:54	error reduction
0:15:56	for lc-nn
0:15:58	three resolution input shows less improvement
0:16:01	we think the reason may be that lc-nn has a much smaller number of parameters
0:16:07	which we will show on a later page
0:16:11	we also can see score fusion
0:16:13	achieved
0:16:14	better
0:16:16	results compared to the single feature map systems
0:16:20	clearly
0:16:21	in all the cases
0:16:22	the proposed method yellow
0:16:26	bars
0:16:26	is significantly better than that of the score fusion which is blue
0:16:36	the same trend has been seen in
0:16:40	the evaluation set
0:16:41	where there are unseen replay
0:16:44	conditions
0:16:48	the improvement compared to the seen condition is less but is still consistently better
0:16:55	than score fusion and also of course the original single feature map systems
0:17:07	then we investigated
0:17:09	the computational cost of the proposed method
0:17:12	and the score fusion
0:17:16	using the proposed two-resolution feature maps
0:17:19	only resulted in a parameter number increase
0:17:23	the s and zero point two
0:17:25	the of one two percent
0:17:27	while the increase of the use of the best
0:17:30	three-resolution feature maps
0:17:32	was roughly zero point two to percent
0:17:40	in score fusion
0:17:42	as is well-known we train two or more systems
0:17:46	and then fuse the scores at score level
0:17:48	scoring level
0:17:51	this did not improve the performance
0:17:53	much in our experiments
0:17:55	but it doubled
0:17:57	or even tripled the number of parameters
0:18:03	so in conclusion
0:18:05	our proposed method
0:18:06	will be able to be more helpful in practical use
0:18:12	now i would like to summarise this presentation
0:18:16	we proposed multiresolution feature maps
0:18:19	which stack multiple feature maps into a three dimensional input
0:18:24	followed with cnns
0:18:25	the optimal resolutions will be automatically selected
0:18:31	it is proposed to alleviate the problem
0:18:34	that
0:18:34	feature maps commonly used in anti-spoofing networks are likely to be insufficient
0:18:40	for building
0:18:41	discriminative representations of audio segments
0:18:45	and they are often extracted
0:18:47	by fixed-length windows
0:18:50	the effectiveness of the proposed method was confirmed on asvspoof
0:18:55	two thousand nineteen challenge
0:18:57	physical access
0:18:58	with three cnn variants resnet eighteen senet fifty and lc-nn
0:19:05	experiments showed two- and three-resolution feature maps
0:19:09	achieved on average thirty seven and forty five percent equal error rate reduction
0:19:15	it was significantly better
0:19:17	than score fusion
0:19:19	and also it cost only one third
0:19:22	to a half
0:19:22	in terms of
0:19:24	computational cost
0:19:27	for future work
0:19:28	we would like to introduce an attention mechanism to make
0:19:32	better use of multiresolution feature maps
0:19:35	we also would like to extend the proposed method with other feature extractors
0:19:42	that's all for my presentation thank you for watching please let me know if you
0:19:47	have any questions

Using Multi-Resolution Feature Maps with Convolutional Neural Networks for Anti-Spoofing in ASV

Spoofing and Countermeasure 1

Qiongqiong Wang, Kong Aik Lee, Takafumi Koshinaka