Speech Transcript - Effects of Waveform PMF on Anti-spoofing Detection for Replay Data - ASVspoof 2019

0:00:13	i am Itshak Lapidot and i will be your guide for the next twenty
0:00:17	minutes if you have questions please press the pause button and ask whatever you want
0:00:23	meanwhile listen and enjoy
0:00:38	okay
0:00:39	this work was done together with Jean-Francois Bonastre
0:00:43	we worked on the effect of the waveform PMF on spoofing detection and
0:00:48	this time it's for replay data or the physical condition
0:00:52	it is a continuation of the work done on the same
0:00:56	challenge
0:00:58	for the logical condition
0:01:01	we will define the problem and the motivation why to use the waveform PMF
0:01:06	we will show several examples
0:01:09	of waveform PMFs
0:01:11	and
0:01:13	will describe the genuinization process
0:01:16	which changes the PMF of the replay data
0:01:20	and show how it affects
0:01:22	the
0:01:23	anti-spoofing recognition and other effects
0:01:29	after the examples we will show the results of the evaluation and then conclude
0:01:37	so
0:01:39	we can
0:01:39	define the problem then
0:01:42	we want to
0:01:44	classify a speech segment as either genuine speech
0:01:48	or spoofed speech
0:01:50	generally spoofed speech can be synthesized or replayed
0:01:55	or produced in any other way but this work will focus on the replay data
0:02:04	the motivation for this work is due to the fact that a lot of work was
0:02:08	done on anti-spoofing in the frequency domain
0:02:12	many features were applied like MFCC CQCC and LFCC
0:02:19	and more
0:02:21	but not much was done in the time domain
0:02:25	and we want to learn what happens
0:02:28	with the time domain statistics of the waveform
0:02:32	and see how
0:02:36	we can find differences between the genuine speech and the spoofed speech
0:02:42	so let's take an example
0:02:45	of a speech segment
0:02:48	and see what
0:02:50	if we look at the waveform amplitude
0:02:54	we see a genuine speech segment
0:02:57	and then
0:02:58	we want to find the probability mass function
0:03:02	of the amplitudes
0:03:04	this segment was sampled with a resolution of
0:03:07	sixteen bits
0:03:09	we also normalized it
0:03:11	so we have sixteen bit uniformly spaced levels between minus one and one
0:03:21	we show here only the part
0:03:24	in the
0:03:25	narrow range between minus zero one three and zero one three
0:03:31	it can be seen that the
0:03:33	amplitude
0:03:34	distribution is
0:03:35	very similar to a Laplace distribution or a normal distribution
0:03:40	and it's well known in the literature at least for speech
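
The PMF estimate described here can be sketched as a normalized histogram over the integer amplitude levels. This is only a minimal illustration, assuming signed sixteen bit integer samples; the names `waveform_pmf` and `normalize` are hypothetical and not taken from the talk.

```python
from collections import Counter


def waveform_pmf(samples, n_bits=16):
    """Empirical PMF over the integer amplitude levels of a quantized waveform."""
    lowest, highest = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    if any(s < lowest or s > highest for s in samples):
        raise ValueError("sample outside the signed n_bits range")
    counts = Counter(samples)
    total = len(samples)
    # One probability per amplitude level that actually occurs, in level order.
    return {level: counts[level] / total for level in sorted(counts)}


def normalize(level, n_bits=16):
    """Map an integer level to the range [-1, 1) (assumed normalization)."""
    return level / 2 ** (n_bits - 1)


# Toy waveform, not real speech data.
pmf = waveform_pmf([0, 0, 1, -1, 0, 2, -1, 0])
```

Plotting `pmf` against `normalize(level)` for each level would give the kind of graph the speaker refers to.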
0:03:46	now
0:03:47	let's take the samples from the evaluation set of ASVspoof
0:03:54	twenty nineteen physical condition
0:03:58	we
0:04:00	evaluated the PMF
0:04:02	of the genuine speech the graph above
0:04:05	and of the spoofed speech the graph
0:04:09	below
0:04:10	and we see that there is a
0:04:12	big difference between them
0:04:14	especially
0:04:17	around zero
0:04:19	so
0:04:21	it can be done
0:04:23	maybe easily even
0:04:25	by a human only by looking at the PMF
0:04:30	to distinguish between
0:04:32	these two
0:04:34	classes genuine and
0:04:36	replay data
0:04:38	so if we want to make a good spoofing
0:04:41	of course it should not be so easy to distinguish between them
0:04:47	and we would like to have similar distributions for both classes
0:04:52	so this process we named
0:04:54	genuinization
0:04:59	we will start from a continuous random variable
0:05:02	and the goal is as an example
0:05:07	to show how we
0:05:09	get
0:05:11	our genuinized samples
0:05:13	so suppose we have
0:05:16	a source PDF
0:05:18	and
0:05:19	we want to make a transformation so that it will have
0:05:24	the
0:05:26	PDF of the destination
0:05:28	PDF
0:05:30	so we have
0:05:31	two probability distribution functions
0:05:33	of the source
0:05:35	and of the destination
0:05:37	in our case the source is the spoofed speech while the destination is the genuine
0:05:43	speech as we want to convert the
0:05:47	spoofed
0:05:48	samples to have the same statistics as the genuine speech
0:05:54	so first for every sample
0:05:58	from the spoofed speech
0:06:01	we will find the
0:06:05	value of the
0:06:07	CDF
0:06:09	then we will go to the genuine speech CDF
0:06:14	where there will be the same value
0:06:17	of the CDF
0:06:19	and the amplitude
0:06:21	that corresponds to it
0:06:24	is the value we will take
0:06:27	so
0:06:28	i have to zero
0:06:30	for this spoofed speech sample we will now have a new value
0:06:37	it's that simple
0:06:38	and this procedure we can do sample by sample for all the samples in the
0:06:44	spoofed speech
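
The continuous case amounts to passing each source sample through the source CDF and then through the inverse of the destination CDF. A minimal sketch follows, assuming purely for illustration that both distributions are zero-mean Laplace distributions with different scales; the function names and the Laplace choice are assumptions, not the speaker's setup.

```python
import math


def laplace_cdf(x, scale):
    """CDF of a zero-mean Laplace distribution."""
    if x < 0:
        return 0.5 * math.exp(x / scale)
    return 1.0 - 0.5 * math.exp(-x / scale)


def laplace_inv_cdf(p, scale):
    """Inverse CDF of a zero-mean Laplace distribution, valid for 0 < p < 1."""
    if p < 0.5:
        return scale * math.log(2.0 * p)
    return -scale * math.log(2.0 * (1.0 - p))


def match_distribution(x, src_scale, dst_scale):
    """Map x so that source-distributed samples follow the destination law."""
    p = laplace_cdf(x, src_scale)       # step 1: CDF value in the source
    return laplace_inv_cdf(p, dst_scale)  # step 2: same CDF value in the destination


mapped = match_distribution(1.0, src_scale=1.0, dst_scale=2.0)
```

With these toy scales the mapping simply stretches the amplitudes, which makes the result easy to check by hand.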
0:06:47	of course in our case the distributions are not continuous but
0:06:52	discrete
0:06:54	and the algorithm will be a bit more complicated
0:06:59	in the discrete case
0:07:01	the line is not smooth anymore but has discontinuities
0:07:06	and
0:07:07	it looks like steps
0:07:09	so for each sample from the spoofed speech
0:07:13	we see the value of the
0:07:16	CMF the cumulative mass function
0:07:20	and now we will move to the genuine speech CMF
0:07:24	and there is not exactly the same value at the same place
0:07:29	so we decided to take the lower bound
0:07:32	in this case
0:07:34	instead of this sample of four we have
0:07:39	still a value equal to four for the new value but it's not true for every sample
0:07:46	so it can change from sample to sample
0:07:50	and of course we do it
0:07:52	for all the samples up to the exact boundaries
0:07:56	of the resolution in our case it's sixteen bits
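
The discrete procedure can be sketched as a lookup between two step-shaped CMFs. The code below is one possible reading of "take the lower bound", namely the largest genuine level whose CMF value does not exceed the CMF value of the spoofed sample; that interpretation, the function names and the toy data are all assumptions.

```python
from bisect import bisect_right
from collections import Counter
from fractions import Fraction
from itertools import accumulate


def cmf(samples):
    """Return the sorted levels and their exact cumulative mass values."""
    counts = Counter(samples)
    levels = sorted(counts)
    total = len(samples)
    cumulative = [Fraction(c, total) for c in accumulate(counts[l] for l in levels)]
    return levels, cumulative


def genuinize(spoofed, genuine):
    """Map every spoofed sample to a genuine level with a matching CMF value."""
    s_levels, s_cum = cmf(spoofed)
    g_levels, g_cum = cmf(genuine)
    mapping = {}
    for level, p in zip(s_levels, s_cum):
        # Assumed "lower bound": last genuine step that is not above p.
        idx = bisect_right(g_cum, p) - 1
        mapping[level] = g_levels[max(idx, 0)]
    return [mapping[s] for s in spoofed]


same_shape = genuinize([0, 1, 2, 3], [10, 20, 30, 40])
one_big_step = genuinize([0, 0, 0, 1], [10, 20, 30, 40])
```

In the second toy call a single large step in the spoofed CMF jumps over the genuine levels 10 and 20, so those levels are never produced.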
0:08:02	so we applied it on
0:08:04	the logical condition
0:08:07	and we see the results
0:08:10	the graph above
0:08:11	is the graph of the
0:08:13	spoofed speech
0:08:15	while in the middle it's a graph of the spoofed speech
0:08:19	after the genuinization process
0:08:22	and below is the
0:08:24	PMF of the genuine speech
0:08:27	we can see that the algorithm works well
0:08:29	and the
0:08:31	genuinized speech PMF
0:08:33	is similar to the genuine speech
0:08:37	however when we tried to apply the same algorithm
0:08:41	for the physical condition
0:08:44	we have a phenomenon
0:08:46	that
0:08:49	in the genuinized speech in the middle
0:08:52	we have like a big peak around zero
0:08:58	you have to notice the y-axis of the PMF
0:09:02	for this speech
0:09:03	the maximum is zero one while in the other graphs the maximum is zero one four in order
0:09:10	to make it better visible but we see that
0:09:16	the genuinized speech is far away from the genuine speech
0:09:22	this phenomenon was strange and we wanted to
0:09:26	understand what happened
0:09:29	so we can see here
0:09:34	around zero in the spoofed speech
0:09:37	we have a very big
0:09:39	jump corresponding
0:09:41	to several
0:09:44	levels
0:09:46	of a window of
0:09:48	the PMF of the genuine speech
0:09:52	so when we
0:09:54	convert
0:09:56	the spoofed speech to genuine speech by the genuinization process
0:10:02	all three levels in this example
0:10:06	three four and five
0:10:08	are merged into one bin
0:10:11	in the genuinized PMF
0:10:14	so to overcome this
0:10:17	problem
0:10:18	we can increase the resolution of each
0:10:22	sample of the spoofed speech
0:10:24	we add to it a small noise
0:10:28	and in such a way
0:10:30	we have more steps
0:10:32	more levels available from the spoofed speech in this example
0:10:36	we added
0:10:37	three bits
0:10:39	of uniform noise
0:10:42	so we have
0:10:45	eight times more
0:10:48	discrete levels
0:10:49	and the jumps are a lot smaller in this way now we can reach
0:10:55	any level
0:10:56	in the genuine speech
0:10:59	in our case
0:11:00	in the real experiment
0:11:03	to the sixteen bits we added noise of five bits
0:11:07	it means
0:11:09	each level
0:11:11	now has thirty two
0:11:13	levels
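
Adding noise bits can be sketched as shifting each integer sample up by the number of noise bits and filling the new low bits with uniform random values, so that three bits give eight sub-levels per original level and five bits give thirty two. How the noise was generated in the real experiment is not stated here; the discrete uniform draw and the name `add_noise_bits` are assumptions.

```python
import random


def add_noise_bits(samples, noise_bits, rng=None):
    """Refine integer samples by appending uniformly random low-order bits."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    factor = 2 ** noise_bits    # number of sub-levels per original level
    return [s * factor + rng.randrange(factor) for s in samples]


original = [5, -3, 0] * 100
refined = add_noise_bits(original, noise_bits=3)
sub_levels_of_five = {r for r, s in zip(refined, original) if s == 5}
```

Dropping the low bits again (`r >> 3`) recovers the original sample, so the noise only spreads each level over a finer grid without moving it to a neighbouring level.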
0:11:16	when we apply this algorithm
0:11:18	we can see the results
0:11:21	the PMF of the genuinized speech is very similar to the genuine speech
0:11:27	so we overcame the problem that we had previously
0:11:32	of course we tried it also on the logical condition
0:11:37	and the results were pretty good
0:11:41	so it doesn't diminish the previous results on the logical condition
0:11:47	but it improved dramatically the results
0:11:50	of the genuinization process for the physical condition
0:11:56	now we want to see what happens with an anti-spoofing system
0:12:03	when we use the genuinization process
0:12:08	so
0:12:09	we took the baseline system that was provided by the organisers
0:12:14	it has
0:12:16	two classes for genuine speech and for spoofed
0:12:20	speech and each class is a GMM with five hundred twelve gaussian mixtures
0:12:28	there are two models one for CQCC features and one for LFCC
0:12:34	features
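
A two-class GMM system of this kind is commonly scored with a log-likelihood ratio between the genuine model and the spoofed model. The sketch below shows that scoring on scalar toy features with hand-set parameters; the real baseline uses multi-dimensional CQCC or LFCC frames and trained 512-component models, and the function names here are hypothetical.

```python
import math


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a scalar x under a one-dimensional GMM."""
    terms = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, genuine_gmm, spoofed_gmm):
    """Average per-frame log-likelihood ratio; positive favours genuine."""
    total = sum(
        gmm_loglik(x, *genuine_gmm) - gmm_loglik(x, *spoofed_gmm) for x in frames
    )
    return total / len(frames)


# Toy single-component models: (weights, means, variances).
genuine_gmm = ([1.0], [0.0], [1.0])
spoofed_gmm = ([1.0], [5.0], [1.0])
score_near_genuine = llr_score([0.1, -0.2, 0.3], genuine_gmm, spoofed_gmm)
score_near_spoofed = llr_score([4.9, 5.2], genuine_gmm, spoofed_gmm)
```

A threshold on this score gives the genuine or spoofed decision; changing the training or test data, as in the experiments that follow, changes the models or the frames but not the scoring rule.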
0:12:35	the baseline results are shown
0:12:38	in the column of the baseline
0:12:42	in the next column
0:12:43	we used the same
0:12:47	original GMM models but now tried
0:12:50	to test
0:12:52	on the data after genuinization
0:12:56	so obviously the results
0:12:58	of all the models dropped
0:13:01	in the next step
0:13:02	we test with the real data before genuinization
0:13:08	but the GMMs
0:13:10	of these models were trained on
0:13:14	genuinized
0:13:16	data
0:13:17	and we see that
0:13:18	the generalization capability is very poor the results
0:13:23	are very bad
0:13:25	when we train
0:13:27	and test on genuinized speech
0:13:29	the results are very good
0:13:35	we can say okay
0:13:37	we trained with one data and tested on the same data
0:13:42	logically the results are good
0:13:46	but i think the goal of
0:13:49	an
0:13:50	anti-spoofing countermeasure
0:13:52	is to
0:13:53	be able to recognize new methods of spoofing because all the time new
0:14:00	methods and algorithms appear
0:14:03	and
0:14:04	if
0:14:05	the system works well
0:14:07	but is vulnerable to
0:14:10	new algorithms
0:14:14	then it's not robust it's not reliable because we never know what will be the
0:14:19	next algorithm
0:14:23	so
0:14:24	to summarize
0:14:25	what we did
0:14:27	we showed that there is a big difference between the
0:14:32	waveform distributions of the
0:14:37	genuine speech
0:14:39	and the
0:14:40	spoofed speech
0:14:41	at least for
0:14:43	replay
0:14:45	and in fact it may
0:14:48	be easy to recognise in the time-domain the
0:14:54	spoofed speech
0:14:56	so
0:14:57	first we presented the genuinization process how we can convert the spoofed
0:15:05	speech to be statistically more similar to genuine speech
0:15:11	and we showed that
0:15:12	it's better to add
0:15:14	noise
0:15:16	to the samples
0:15:18	so by means of noise we get
0:15:20	a better
0:15:22	genuinization
0:15:25	then we tested the countermeasure and we saw that the results can vary
0:15:32	dramatically
0:15:33	when we train with one data and test
0:15:37	with other data of spoofing
0:15:41	in our understanding it is unacceptable
0:15:45	for an anti-spoofing system
0:15:48	to behave like this
0:15:50	because it
0:15:51	must have very good generalization capability
0:15:55	and we think that
0:15:57	additional work will have to be done
0:16:00	in this direction to
0:16:02	make
0:16:04	systems much more reliable
0:16:14	thank you very much and if you enjoyed the talk
0:16:17	you can press play and listen to it again and again
0:16:23	stay healthy bye

Effects of Waveform PMF on Anti-spoofing Detection for Replay Data - ASVspoof 2019

Spoofing and Countermeasure 2

Itshak Lapidot, Jean-Francois Bonastre