Speech Transcript - Residual Networks for Resisting Noise: Analysis of an Embeddings-based Spoofing Countermeasure

0:00:13	i well i guess how these is you don't
0:00:17	in the residual also
0:00:20	and today i'm going to present you a
0:00:24	residual networks for resisting noise
0:00:27	an analysis
0:00:28	of an embeddings-based spoofing countermeasure
0:00:31	with the increasing quality of text-to-speech
0:00:34	and voice conversion methods
0:00:37	there is it will we need for solving
0:00:40	the ASVspoof challenge series has resulted in great progress
0:00:45	the what is a
0:00:47	there are so open challenge is how
0:00:51	the elements of comedy shows
0:00:53	in realistic noisy scenarios there is very little research
0:01:00	and i a lot of the problem
0:01:02	is that i work so phenomena
0:01:05	the acoustic information
0:01:07	exploited by actually just
0:01:09	exactly
0:01:11	it is challenging looking size box
0:01:15	in this study
0:01:17	we propose a new
0:01:19	CQT-ResNet-GMM architecture
0:01:22	and we compare systematically
0:01:25	its performance
0:01:26	to the ideas is able to times and i think
0:01:31	this includes
0:01:33	two hundred uses
0:01:35	or performance
0:01:36	in various types of noisy scenarios
0:01:40	and we also
0:01:42	to look inside this
0:01:43	seemingly
0:01:45	in data
0:01:46	black box
0:01:48	model
0:01:49	so this will be encountered a problem
0:01:52	is a mixture of a ResNet
0:01:55	and a GMM
0:01:58	we train basically a GMM
0:02:00	with the embeddings
0:02:02	the ones
0:02:04	extracted by a ResNet
0:02:07	in i able to that vectors
0:02:11	we use CQT spectrograms
0:02:14	as input features
0:02:17	and
0:02:19	but it is easy to
0:02:20	then there are convolutional layers
0:02:24	in each of them we can see that there is a max pooling which essentially
0:02:29	results in a downsampling factor of two
0:02:34	well i think so is there is that so you can actually includes
0:02:39	skip connections
0:02:41	connecting the convolutional layers, aiding
0:02:44	the training of very deep neural architectures
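The layer structure described above (convolution, a skip connection around it, and max pooling that downsamples by two) can be sketched roughly as follows. This is only an illustration: it uses a 1-D signal and a fixed, hand-picked kernel, whereas the network in the talk presumably uses learned 2-D convolutions over spectrograms. The function names `conv1d_same`, `residual_block` and `max_pool` are hypothetical and do not come from the talk.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding so the output keeps the input length.

    Assumes an odd-length kernel (illustrative simplification).
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_block(x, kernel):
    """Skip connection: the input is added back to the convolution output."""
    transformed = conv1d_same(x, kernel)
    return relu([t + xi for t, xi in zip(transformed, x)])


def max_pool(x, size=2):
    """Non-overlapping max pooling; size=2 halves the length."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Toy usage: one residual block followed by downsampling by two.
features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
out = max_pool(residual_block(features, [0.25, 0.5, 0.25]))
```

Because the input is added back before the non-linearity, a block whose kernel is all zeros simply passes the (rectified) input through, which is one intuition for why skip connections ease the training of deep stacks.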
0:02:49	finally
0:02:50	in the gmm is
0:02:53	we have a fully connected layer
0:02:56	and we use the embeddings
0:02:58	to train
0:03:00	the gmm or vector
0:03:02	a gmm their true can have the and the h
0:03:06	but including putting
0:03:09	a likelihood ratio
0:03:11	or worse still mask
0:03:14	this enables
0:03:15	to include a human little
0:03:18	for the automatic speaker verification
0:03:20	or just implement the rejection based
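The likelihood-ratio scoring mentioned above can be illustrated with a minimal sketch. It assumes two diagonal-covariance GMMs, one for bona fide speech and one for spoofed speech, and scores an embedding by the difference of their log-likelihoods. The two-model setup, the toy parameters and the names `diag_gmm_logpdf` and `llr_score` are assumptions made for illustration, not details confirmed by the talk.

```python
import math


def diag_gmm_logpdf(x, weights, means, variances):
    """Log-density of vector x under a diagonal-covariance Gaussian mixture."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        log_exp = -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var))
        component_logs.append(math.log(w) + log_norm + log_exp)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(x, bonafide_gmm, spoof_gmm):
    """Log-likelihood ratio: positive values favour the bona fide model."""
    return diag_gmm_logpdf(x, *bonafide_gmm) - diag_gmm_logpdf(x, *spoof_gmm)


# Toy models as (weights, means, variances); values are made up for illustration.
bonafide_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
spoof_gmm = ([1.0], [[4.0, 4.0]], [[1.0, 1.0]])

score = llr_score([0.5, 0.5], bonafide_gmm, spoof_gmm)
accept = score > 0.0  # the threshold 0.0 is an arbitrary illustrative choice
```

A single scalar score of this kind is convenient because it can be thresholded directly or passed on to a downstream system.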
0:03:27	in this slide i present
0:03:29	the overall performance
0:03:31	of the two baselines
0:03:33	the two challenge baselines
0:03:35	MFCC-GMM
0:03:38	CQCC-GMM
0:03:40	and the proposed data
0:03:43	see you did the gmm the an
0:03:45	and the usenet oneida
0:03:48	security the
0:03:49	but all
0:03:50	with the sum fusion system which is the fusion of the MFCC-GMM
0:03:56	CQCC-GMM
0:03:59	and the cuda gmm system
0:04:04	we can see that role
0:04:06	the sum fusion
0:04:08	that's cool was
0:04:11	but also that s
0:04:13	and a very straight north
0:04:16	using the different architectures
0:04:18	in the different kind of smoothing types and thus
0:04:23	i would like to emphasise to you
0:04:25	the table we apply
0:04:26	one minute or
0:04:28	the logical access portion
0:04:30	of ASVspoof 2019
0:04:35	because the ASVspoof 2019 dataset has no noise, it is not very suitable
0:04:43	to test
0:04:44	a noisy scenario
0:04:47	there is no noise originally in this data
0:04:50	so we have to create
0:04:52	but noise is the
0:04:55	it is computationally very expensive
0:04:59	to create
0:05:01	noisy scenarios
0:05:02	for the speech samples
0:05:05	so instead of this we took
0:05:08	a less computationally intensive approach
0:05:12	by sampling
0:05:13	a subset of the ASVspoof 2019 dataset
0:05:17	in a balanced manner, and by balanced we mean
0:05:23	the bonds respect to the data used to be s
0:05:27	the there exists
0:05:28	in the dataset
0:05:30	then
0:05:31	we added noise samples from the MUSAN dataset
0:05:35	these are all three
0:05:37	at a signal-to-noise ratio
0:05:39	of five decibels
0:05:43	we have a selection of c six
0:05:46	speakers from the speech portion of the MUSAN dataset
0:05:50	a random music file
0:05:52	and a random noise file
0:05:55	from the MUSAN dataset
0:05:57	by noise
0:05:58	we refer to the noise category of the MUSAN dataset
0:06:04	white noise was also added
0:06:07	by since the functional generation
0:06:10	at a signal-to-noise ratio of five decibels
0:06:14	and also reverberation was applied
0:06:16	using simulated woman close this is from the y alright
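Adding noise at a fixed signal-to-noise ratio, as described above, amounts to scaling the noise so that the ratio of speech power to noise power matches the target before summing. The sketch below shows one common way to do this on plain lists of samples. The looping of short noise clips and the names `power` and `mix_at_snr` are illustrative assumptions rather than the procedure used in the study.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise to the requested SNR (in dB) and add it to the speech.

    Returns the noisy mixture and the scaled noise that was added.
    """
    if len(noise) < len(speech):
        # Loop the noise clip if it is shorter than the speech (assumption).
        repeats = -(-len(speech) // len(noise))  # ceiling division
        noise = list(noise) * repeats
    noise = noise[:len(speech)]
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


# Toy usage: a sine "speech" signal and Gaussian noise mixed at 5 dB.
rng = random.Random(0)
speech = [math.sin(0.05 * t) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(300)]
noisy, added_noise = mix_at_snr(speech, noise, snr_db=5.0)
```

The same function covers music, babble and white noise, since only the noise samples passed in change.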
0:06:22	we can see the overall performance results
0:06:25	all the all vectors
0:06:27	in the presence of
0:06:28	also i
0:06:30	we see
0:06:32	the results of noise
0:06:34	this but architecture for best
0:06:38	and without noise
0:06:40	this is usually gmm vienna
0:06:43	and the sum fusion on a circle
0:06:49	we have also
0:06:52	that is this sort of a tradeoff
0:06:54	big in the security in the n f c but the gmm
0:07:00	the c d v d n
0:07:02	performs
0:07:03	better in noisy cases
0:07:06	but slightly worse
0:07:08	in the noiseless case
0:07:10	compresses but gmm
0:07:13	finally
0:07:14	we have to also the
0:07:17	that old s e c g and the c use is e g m all
0:07:21	characters
0:07:23	a the performing compared to the that the proposed architecture and compared to the cu
0:07:29	maybe a
0:07:32	in these noisy and noiseless scenarios
0:07:35	you we see the same feature but in
0:07:39	therefore that occurs
0:07:40	rather than
0:07:41	you know all of that they
0:07:46	we can see the sum fusion
0:07:48	performs best
0:07:50	in the noiseless scenario
0:07:52	the noiseless setup
0:07:54	is not by this
0:07:56	though not installed s
0:07:59	why the noisy scenario is denoted by six right
0:08:03	the continuous time
0:08:06	overall we can observe that the CQT-ResNet architecture is the
0:08:12	most robust to noise in this whole audio
0:08:16	and we can observe
0:08:18	this kind of trade-off there
0:08:20	with this but gmm
0:08:22	and the cu and
0:08:25	three shows that we have also seen previously
0:08:29	in a you know the
0:08:33	we then proceeded to do
0:08:35	visualisations
0:08:37	of the CQT-ResNet embeddings
0:08:39	first
0:08:40	with pca
0:08:45	really the visualisation
0:08:47	and so to solve the class is
0:08:51	it became apparent
0:08:53	that most of the spoof classes
0:08:55	separate very well
0:08:58	from this green
0:09:00	point cloud
0:09:02	which corresponds to the bona fide
0:09:06	except
0:09:07	the VC classes
0:09:09	the classes corresponding to voice conversion
0:09:14	which sort of overlap
0:09:16	with the bona fide class
0:09:19	this explains
0:09:21	the fusion detection performance
0:09:23	with some p c s is
0:09:26	because these can be separated
0:09:30	linearly
0:09:30	in the to these days
0:09:33	we see
0:09:34	a similar
0:09:35	consistent picture
0:09:37	with another dimensionality reduction technique
0:09:40	t-SNE
0:09:42	which stands for t-distributed stochastic neighbour embedding
0:09:46	and what we see
0:09:48	is this same feature
0:09:50	of the VC classes
0:09:52	overlapping
0:09:54	with the bona fide
0:09:57	we then proceeded to do an additional experiment
0:10:03	the goal of this experiment was
0:10:05	to see
0:10:07	how and then he's moving is
0:10:10	then there so gently
0:10:12	to these different kind of noise and i was
0:10:16	in the figure what you can see
0:10:19	is what happens
0:10:21	in case
0:10:23	of reverberation
0:10:26	those also that this figure
0:10:31	the blue point cloud
0:10:33	the red point cloud
0:10:35	and the green point cloud
0:10:37	are actually the same
0:10:39	as in the PCA slide
0:10:42	now
0:10:43	we proceed to solve a whole
0:10:46	some samples
0:10:49	these ones
0:10:50	from the one of the
0:10:52	and these ones
0:10:54	but the ones
0:10:55	from this tool
0:10:58	and the be
0:10:59	following the lee
0:11:01	noise
0:11:03	with this reverberation
0:11:06	and what we see
0:11:09	is that being the ones
0:11:10	corresponding to the one thing
0:11:13	big on these green dots
0:11:16	moving closer to the actual decision boundary
0:11:22	and we can also see
0:11:24	that a little
0:11:27	become these orange dolls
0:11:30	we closer to the decision boundary
0:11:34	but still on the right side of this each
0:11:38	then well
0:11:40	this gives us a according to the u
0:11:42	the architecture is robust to reverberation
0:11:47	because you know
0:11:48	no one's
0:11:51	matrix
0:11:52	this is all
0:11:54	close as a decision boundary which is exactly
0:12:00	we can see that a mass
0:12:03	the right classification decision
0:12:06	is retained
0:12:07	now i'm going to talk about
0:12:09	our explainable audio based techniques
0:12:14	the first thing i'm going to talk about
0:12:17	is the gradient-based technique
0:12:19	which is a basis
0:12:23	first
0:12:24	we compute the CQT spectrogram
0:12:27	based
0:12:28	on the all we also
0:12:31	down with the reckon
0:12:34	we obtain a sensitivity map
0:12:37	this sensitivity map tells us
0:12:40	which parts of the spectrogram
0:12:42	are most important to make
0:12:45	the classification decision of whether the speech is spoofed or whether it is natural
0:12:51	what we can do
0:12:53	is threshold the sensitivity map
0:12:57	obtaining this binary mask
0:13:00	in c
0:13:02	which is basically segments this for four hours
0:13:06	does not reach five important on
0:13:09	and you can be should be
0:13:13	see that
0:13:14	if we will lie
0:13:15	the original security spectral again but i mean
0:13:20	really all the in this
0:13:22	picture
0:13:23	which we again
0:13:24	i don't normalization
0:13:27	sensitive refuelling waller
0:13:30	to obtain
0:13:32	a reconstructed waveform
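The masking step described above can be sketched as follows, assuming a sensitivity map has already been computed for each time-frequency bin (for example from gradients of the classifier output, which is not reproduced here). The map is normalised, thresholded into a binary mask, and the mask is applied to the spectrogram before resynthesis. The normalisation by the peak absolute value, the threshold and the function names are illustrative assumptions.

```python
def normalise(saliency):
    """Scale absolute sensitivity values to the range [0, 1]."""
    peak = max(abs(v) for row in saliency for v in row)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in saliency]
    return [[abs(v) / peak for v in row] for row in saliency]


def binary_mask(saliency, threshold):
    """1 where the normalised sensitivity reaches the threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in normalise(saliency)]


def apply_mask(spectrogram, mask):
    """Keep only the time-frequency bins selected by the mask."""
    return [
        [s * m for s, m in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(spectrogram, mask)
    ]


# Toy 2x2 example; rows are frequency bins, columns are frames (made-up values).
spectrogram = [[1.0, 2.0], [3.0, 4.0]]
saliency = [[0.1, -0.9], [0.5, 0.0]]
mask = binary_mask(saliency, threshold=0.5)
masked = apply_mask(spectrogram, mask)
```

Inverting the masked spectrogram back to a waveform requires an inverse transform, which is outside the scope of this sketch.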
0:13:35	and how what we rewrite when you is a series of trainable all it was
0:13:41	right of each other
0:13:43	first you are going to hear
0:13:45	the original waveform
0:13:47	then
0:13:48	you are going to hear a reconstruction of the original using all the features
0:13:54	and finally going to you possible the audio that the no one extra innings sports
0:14:17	so you can do something about the real the speech
0:14:20	and the again here on a particular type of
0:14:24	viewing these examples bridge indicate what what's of the speech signal
0:14:29	might be important
0:14:33	that i think that we have five
0:14:36	this is that both
0:14:38	mean you know we'll technique
0:14:42	we order the audio files based on how challenging they are
0:14:48	the more challenging audio files
0:14:51	are usually the ones
0:14:53	that are closer to the CM threshold
0:14:58	and the less challenging ones are those
0:15:00	which are far from the CM threshold
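Ordering audio files by how challenging they are, in the sense described above, can be sketched as sorting by the absolute distance between each countermeasure score and the decision threshold. The score dictionary, the threshold value and the function name `order_by_difficulty` are hypothetical and only serve to illustrate the idea.

```python
def order_by_difficulty(scores, threshold):
    """Sort (utterance_id, score) pairs from closest to farthest from the threshold.

    Utterances near the threshold are treated as the most challenging ones.
    """
    return sorted(scores.items(), key=lambda item: abs(item[1] - threshold))


# Made-up countermeasure scores for three utterances.
cm_scores = {"utt_a": 5.0, "utt_b": 0.2, "utt_c": -3.0}
ranked = order_by_difficulty(cm_scores, threshold=0.0)
ranked_ids = [utt_id for utt_id, _ in ranked]
```

Listening through the files in this order moves from borderline cases towards the ones the countermeasure is most certain about.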
0:15:05	and what we can do
0:15:07	is we can exploit this phenomenon
0:15:10	this closeness to the CM threshold
0:15:12	and use these
0:15:14	two or two was
0:15:16	based on this yes of course
0:15:18	and i think he's grew out was the main noticeable o as we can obtain
0:15:23	and you all recently collected by consent
0:15:27	where we don't understand the needle individual
0:15:31	but three
0:15:33	a fourth the voters on the acoustics
0:15:38	so
0:15:39	i'm going to show you what okay given the case of a eighteen
0:15:44	and
0:15:45	i'm going to and you are going to
0:15:48	he of progressively so was that
0:15:53	i first variability so
0:15:56	five from the c and search for in the direction of what you
0:16:00	and then finally ones that are there is to someone's the batteries two
0:16:28	so let us was to here is that there is a noise more aggressively present
0:16:35	when you use a listening to a morse two
0:16:38	all videos
0:16:39	in general we also that there is a more
0:16:44	no one set of speech in the school speech can be also
0:16:49	in general
0:16:50	in this actually involve your examples
0:16:53	you can hear more these extended while you're was
0:16:56	by scan disk you a whole or just picking the mean
0:17:01	we also did some additional experiments using the MOSNet architecture which can be used
0:17:08	to cooperate
0:17:09	objective measure on estimation
0:17:12	we find a correlation of zero point three five
0:17:16	these being the mean opinion score and of the screen
0:17:20	the s is that
0:17:22	is the first principal axes the first nine dimensional well i principal component
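The comparison described above, between predicted opinion scores and the first principal component of the embeddings, can be illustrated with a Pearson correlation. The sketch assumes the first principal axis has already been estimated elsewhere and simply projects mean-centred embeddings onto it. All numbers and the names `project` and `pearson` are made up for illustration and do not reproduce the reported result.

```python
import math


def project(embeddings, axis):
    """Project mean-centred embeddings onto a given (unit-length) axis."""
    dims = len(axis)
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dims)]
    return [
        sum((e[d] - mean[d]) * axis[d] for d in range(dims))
        for e in embeddings
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data: three 2-D embeddings, an assumed first principal axis, and MOS values.
embeddings = [[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]]
first_axis = [1.0, 0.0]
mos = [2.5, 3.0, 4.0]
correlation = pearson(project(embeddings, first_axis), mos)
```

A positive value would indicate that utterances further along the axis tend to receive higher naturalness scores.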
0:17:28	and this year
0:17:29	that was
0:17:31	actually a single
0:17:33	then a bonus aspects of the speech
0:17:38	and interestingly we see that these voice conversion categories
0:17:43	are also rated as more natural than the actual bona fide signals
0:17:50	waiting to the most
0:17:51	and point out directions for future
0:17:56	recognizer redeemable water
0:17:59	as an image reconstruction what you
0:18:01	in a minimal audio case
0:18:04	so in the future we want to use an l one based solution
0:18:09	trained on CQT
0:18:10	spectrograms
0:18:12	because these have been previously shall tools lingual speech coding i bit conventional fft spectrum
0:18:20	finally we also recognise
0:18:22	that's the data bases clicks voice activity detection
0:18:27	would be essential
0:18:30	but is always this each region both this can be important for cm investigation
0:18:37	in the case of logical access data
0:18:41	but it would be thing important
0:18:43	to design a good calibration stuff i
0:18:45	we investigate to what extent
0:18:48	this is thus
0:18:49	really i
0:18:50	but non speech
0:18:51	versus their use
0:18:54	to summarize
0:18:55	we have found
0:18:57	that are known to have a second the measures
0:19:00	a robust to noise and you know have a better understanding that even though
0:19:06	then i don't exactly know
0:19:08	well for doing
0:19:10	well the
0:19:13	we know that they are robust to noise, more robust to noise than the GMM
0:19:17	can
0:19:20	nevertheless we have managed to gain more insight into these
0:19:27	by generating explainable audio
0:19:31	finally
0:19:33	we have also
0:19:34	investigated an important concept
0:19:38	which is the and the things correlate with subjective naturalness i'll show the diagonal
0:19:46	meaning that a texture
0:19:48	no in a
0:19:50	considers the naturalness
0:19:52	s i si
0:19:54	i hope this presentation encourages you
0:19:57	to not be afraid
0:19:59	of using the minutes of this
0:20:02	in your work
0:20:04	due to
0:20:05	to just the sheer ease an unexplained i
0:20:08	and i would like to thank you
0:20:10	for your attention

Spoofing and Countermeasure 2

Bence Halpern, Finnian Kelly, Rob van Son, Anil Alexander