Speech Transcript - Anti-spoofing in automatic speaker recognition

0:00:01	hello everybody
0:00:02	i welcome you to my talk on anti-spoofing in automatic speaker recognition
0:00:08	i'm a similarity score on assistant professor at your local news data
0:00:12	frames you're looks cool
0:00:14	and there's some other regions
0:00:18	at a low overall difference was moving detection rate of cognition we first you all
0:00:26	speaker verification
0:00:28	giving more attention to current research plan and progress
0:00:32	in the middle and all this information for a speech systems
0:00:37	but also we don't to the cost
0:00:43	automatic speaker verification is one of the most convenient and natural means of biometric
0:00:48	person recognition
0:00:51	this is why this technology is used in various applications and services such as smartphones
0:00:56	smart speakers and call centers
0:01:00	this technology has advanced a lot over the last years thanks to the
0:01:06	increasing use of deep neural network solutions
0:01:09	such as x-vectors
0:01:11	which to some extent have superseded traditional gaussian mixture models
0:01:16	or the so-called i-vectors
0:01:18	and end-to-end approaches are also emerging
0:01:22	we can say that speaker recognition technology has probably reached the level of performance required
0:01:29	for practical use
0:01:33	the question now is whether or not the resulting system is vulnerable to
0:01:37	attacks and unfortunately the answer is yes
0:01:41	the reliability of voice biometric technology can be compromised by its vulnerability
0:01:48	to external attacks
0:01:51	one of the major threats to the security of biometric systems is spoofing attacks
0:01:57	these are attacks
0:01:59	in which a fraudster tries to fool a biometric system into recognising an
0:02:04	illegitimate user as a genuine user or to avoid being recognised
0:02:10	this is achieved by presenting to the system a synthetic forged or manipulated
0:02:16	version
0:02:18	of the biometric trait
0:02:20	but before we look at spoofing attacks let's see how an asv system is assessed
0:02:29	an asv system tries to answer this question is the person who
0:02:34	they say they are
0:02:37	this means that a target trial in this case the genuine speaker as well as a
0:02:41	non-target trial an impostor
0:02:44	can be accepted or rejected by the speaker verification system
0:02:50	this results in two different types of errors namely false alarms and false rejections
0:02:56	as shown in the table
0:02:59	only if this user used a and a change dataset or that this user is
0:03:06	an bolster the challenge i
0:03:09	there is a v
0:03:10	system based
0:03:12	according to their change
0:03:14	here target speaks when they are now available is whining boxers makes no f or
0:03:20	when there's anything about
0:03:25	so
0:03:26	given a test trial the asv system provides a score the higher the score the greater the
0:03:32	confidence that the speaker is the claimed one
0:03:36	better discrimination is achieved by increasing the separation between target trial
0:03:41	and non-target trial scores and by selecting a threshold between the two score distributions
0:03:47	however as shown in the figure the target and non-target score distributions
0:03:53	usually overlap
0:03:55	this is captured by the detection error tradeoff curve
0:04:00	on the right where the point at which the false alarm rate is equal to
0:04:05	the false
0:04:06	rejection rate is called the equal error rate
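The relationship between the threshold, the two error rates and the equal error rate can be sketched in a few lines of Python. This is only an illustrative approximation: the scores are made-up toy values, the function names are hypothetical, and the EER is taken at the observed score where the two error rates are closest rather than by interpolating the DET curve.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (false_alarm_rate, false_rejection_rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    false_alarms = sum(s >= threshold for s in nontarget_scores)
    false_rejections = sum(s < threshold for s in target_scores)
    return (false_alarms / len(nontarget_scores),
            false_rejections / len(target_scores))


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold
    and keeping the one where the two error rates are closest."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        fa, fr = error_rates(target_scores, nontarget_scores, t)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2, t)
    return best[1], best[2]


# Hypothetical toy scores: higher means "more likely the claimed speaker".
targets = [2.1, 1.7, 0.9, 1.4, 0.2]
nontargets = [-1.3, 0.4, -0.6, 1.0, -0.2]

eer, threshold = equal_error_rate(targets, nontargets)
print(f"EER ~ {eer:.2f} at threshold {threshold}")
```

Because the two toy distributions overlap, no threshold drives both error rates to zero, which is the situation the DET curve describes.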
0:04:09	is this really realistic
0:04:11	though the impostor may can you have for performing system
0:04:15	or they can implement it is if you is my task
0:04:20	so the aim of the attacker is to provoke false alarms by increasing the asv classifier
0:04:26	scores towards the target while avoiding detection
0:04:30	we can distinguish a spoofing impostor from a naive impostor
0:04:34	the latter are also called zero-effort impostors
0:04:41	the processing to create fake speech signal you know it down for let's see that
0:04:47	the challenge here is to find a solution to detect that there are many variables
0:04:52	involved in this process and there are still many questions to ask
0:04:58	do the artefacts come from linear or nonlinear processing do they only reside in one part of the
0:05:04	spectrum should we look also at the phase of the signal
0:05:09	we will answer these questions later when we have more elements
0:05:16	there are many a general approaches for the measures improving the easily robustness for example
0:05:23	by speech or the u r c d this is an invasion that action
0:05:28	or winded executive countermeasures for example based on that for sure
0:05:33	this is and its energy detection
0:05:37	in this figure you see an example det plot illustrating the baseline performance when
0:05:43	the impostors are zero-effort impostors
0:05:47	this is the black line
0:05:49	the performance degradation when the system is attacked
0:05:53	by spoofing
0:05:55	is shown by the red line
0:05:57	and the improvement of the performance when the countermeasure is applied
0:06:03	is the one shown by the
0:06:05	blue line
0:06:06	note that even with perfect countermeasures the best performance
0:06:12	reaches the baseline performance
0:06:18	nobody six including voice volume it is becoming an instance
0:06:22	many speaker pointed out there is usually issues
0:06:26	can think speech
0:06:29	this can undermine confidence in asv and it is important to develop reliable
0:06:36	countermeasures or presentation attack detection to reduce false acceptances
0:06:42	due to spoofing attacks
0:06:46	these spoofing attacks can be generated with speech synthesis
0:06:51	or voice
0:06:52	conversion systems or just with recording and replay which is the most basic process
0:06:58	as well as attacks
0:07:00	where we inject directly the audio stream into the asv system
0:07:06	these represent the major threats
0:07:09	another attack is impersonation which consists in imitating a human voice
0:07:17	although it is a threat its consideration is limited to a small number of
0:07:23	studies
0:07:24	involving small datasets
0:07:26	it is not surprising
0:07:28	that
0:07:29	there is no previous work investigating countermeasures against impersonation
0:07:37	the possible location of attack points in a typical asv system may be before or
0:07:45	after the microphone as illustrated in the figure
0:07:48	corresponding to physical access and logical access
0:07:53	asv is more prone to spoofing than other biometric systems based on different biometric traits
0:07:59	just consider that samples of a person's voice can be collected readily
0:08:04	by standing close during a face to face or telephone conversation
0:08:09	and then replayed in order to manipulate an asv system
0:08:14	or more advanced voice conversion or speech synthesis algorithms
0:08:19	can be used to generate particularly
0:08:22	effective spoofing attacks
0:08:24	using only modest amounts of voice data collected from a target person
0:08:32	this table summarize the for splitting and that's in terms of us a single decreases
0:08:37	and in we will consider measures
0:08:40	except for the impersonation at time so that have a menu model i s is
0:08:45	unity
0:08:47	and i freeze
0:08:48	especially for text event is the scenario and the error of intermediate of dimension
0:08:55	that's the use of for scroll
0:08:58	generalization it is the meeting to the different
0:09:02	or unseen i
0:09:07	so this is the timeline which led to
0:09:10	the asvspoof initiative
0:09:13	early studies on speaker verification anti-spoofing were conducted on spoofed speech
0:09:19	created using a limited number of spoofing attacks
0:09:23	it is clear that the development of countermeasures using only a
0:09:29	small number of spoofing attacks
0:09:31	offers limited generalization
0:09:35	moreover
0:09:36	there was a lack of publicly available corpora and evaluation protocols needed
0:09:42	to compare the results obtained by different researchers
0:09:48	the asvspoof initiative aims to overcome this bottleneck by making
0:09:56	available standard speech corpora
0:09:58	with a large number of spoofing attacks
0:10:01	evaluation protocols and metrics
0:10:04	to support a common evaluation and the benchmarking of different systems
0:10:10	the asvspoof challenge has been organised three times so far
0:10:16	the first was held in two thousand fifteen
0:10:18	the second in two thousand seventeen and the third in two thousand nineteen
0:10:23	they were presented at the corresponding special sessions during the interspeech conference
0:10:32	is actually current own analyses of this visible for you as well as the their
0:10:39	finish definition to partition your see the company around the work
0:10:47	the first asvspoof challenge involved detection of artificial speech
0:10:51	created using a mixture of voice conversion and speech synthesis techniques
0:10:57	it was presented during a special session at interspeech two thousand
0:11:02	fifteen
0:11:03	and sixteen organisations participated in this challenge
0:11:08	asvspoof two thousand fifteen involved only logical access attacks and
0:11:16	the spoofed data was generated with ten different artificial speech generation algorithms
0:11:23	well based on a large collections accordingly scolding this of course
0:11:29	version well
0:11:31	and consist of but not without and t v show that a speech
0:11:37	genuine speech was recorded using a high-quality microphone
0:11:41	and without significant channel or background noise effects
0:11:48	the database was divided into three subsets
0:11:53	the training development and evaluation sets in a speaker disjoint manner
0:11:58	five attacks from s one to s five designated as known
0:12:05	were used
0:12:07	in the training development and evaluation sets
0:12:11	and another five attacks from s six to s ten designated
0:12:17	as unknown or unseen attacks
0:12:20	were used only in the evaluation set along with the known attacks
0:12:27	this table summarizes the algorithms used for
0:12:33	voice conversion and speech synthesis
0:12:36	nine of them are vocoder based with hmm or gmm based acoustic models
0:12:43	while only one s ten is the unit selection based
0:12:46	speech synthesis implemented with the open source mary
0:12:50	text-to-speech system
0:12:56	the banana but all of easy system based the on the i-vector but the is
0:13:02	pretty clear
0:13:05	except for the i guess who
0:13:08	well that that's are very effective with importantly reasoning
0:13:13	greece all equal error rate
0:13:16	in the worst case
0:13:17	that is s then
0:13:20	i don't to one
0:13:21	directly to fifty one will ones
0:13:24	it is seventeen
0:13:28	so that it will the on the left show here the challenge results
0:13:33	the in terms of the average equal error rate across all their a score the
0:13:39	evaluation set
0:13:40	for no one and i do not
0:13:44	the exactly a lack of a generalization these results
0:13:48	over the table on the left to sure that
0:13:55	i'm sorry believable the double on the on the right initials the that the top
0:14:00	performing system evaluated only
0:14:04	on the s ten
0:14:07	the unit selection based speech synthesis
0:14:11	is the most difficult to detect
0:14:13	and the most dangerous for the speaker verification system as shown previously
0:14:20	so s ten is effectively the biggest threat for the asv system in
0:14:27	this case
0:14:31	and used in one is on the
0:14:33	the front end of a against the door for a performing system
0:14:39	on the challenge
0:14:40	it will not the to read for the in this challenge is related to the
0:14:45	two features
0:14:47	and the level of the low end of the front and
0:14:51	other people between if the in the v a dynasty the use cochlear filter a
0:14:58	cepstral coefficients
0:14:59	that are related to the human auditory system
0:15:02	possible these something that john it problem
0:15:10	so a few months after the challenge evaluation of
0:15:16	asvspoof two thousand fifteen
0:15:19	we proposed a new feature named constant q cepstral coefficients
0:15:23	these are based on the constant q transform which is an alternative to the fourier transform
0:15:28	and which employs a variable time-frequency resolution that means
0:15:34	greater time resolution for higher frequencies
0:15:37	and greater frequency resolution for lower frequencies
0:15:42	so the constant q transform gives a time-frequency representation which reflects more
0:15:46	closely human perception
0:15:49	and to obtain cqcc features we combine the constant q transform
0:15:54	with traditional cepstral analysis
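The variable time-frequency resolution of a constant Q analysis can be illustrated numerically. The sketch below uses the textbook constant Q definitions (geometrically spaced centre frequencies, a fixed ratio Q between centre frequency and bandwidth, and an analysis window proportional to Q times the sampling rate over the centre frequency). The function name and the parameter values (62.5 Hz lowest bin, 12 bins per octave, 16 kHz sampling rate) are assumptions chosen for illustration, not the settings used in the talk.

```python
import math


def constant_q_bins(f_min, bins_per_octave, n_bins, sample_rate):
    """Return one (centre_frequency_hz, bandwidth_hz, window_samples) per bin."""
    # Quality factor: the same ratio of centre frequency to bandwidth for all bins.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bins = []
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)  # geometric spacing
        bandwidth = f_k / q                          # grows with frequency
        window = math.ceil(q * sample_rate / f_k)    # shrinks with frequency
        bins.append((f_k, bandwidth, window))
    return bins


# Hypothetical settings: six octaves starting at 62.5 Hz, sampled at 16 kHz.
bins = constant_q_bins(f_min=62.5, bins_per_octave=12, n_bins=73, sample_rate=16000)
low, high = bins[0], bins[-1]
print("lowest bin :", low)
print("highest bin:", high)
```

The lowest bin uses a long window and a narrow bandwidth (fine frequency resolution), while the highest bin uses a short window and a wide bandwidth (fine time resolution), which is the trade-off described above.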
0:16:02	i should be for that the only thing started in the challenge
0:16:07	where only able to the test then i probably
0:16:12	so cqcc features
0:16:15	obtain comparable results for known attacks and the best results for
0:16:21	unknown attacks with an eighty seven percent relative improvement on s ten
0:16:26	and an overall seventy two percent relative improvement
0:16:34	so to summarize asvspoof two thousand fifteen focused on voice conversion and speech
0:16:40	synthesis attacks so no replay
0:16:44	asv is decoupled from spoofing detection
0:16:48	text independent scenario
0:16:51	the participants invested effort to develop features using mostly simple classifiers
0:16:59	and the fourth line regionalisation used in the missing
0:17:04	any of
0:17:06	i think meet again we the some possible mission improvements
0:17:18	unlike the two thousand fifteen edition that used very high quality speech material the two thousand
0:17:23	seventeen edition aims to assess spoofing attack detection
0:17:27	in what we call in the wild
0:17:29	conditions
0:17:31	and focuses exclusively on replay attacks
0:17:34	the second automatic speaker verification spoofing and countermeasures challenge was presented
0:17:41	at a special session
0:17:42	at interspeech two thousand seventeen
0:17:45	and forty nine organisations participated in the challenge
0:17:52	cost function if this were from the riesz a text
0:17:58	that adults
0:17:58	course
0:18:00	was proposed was to collect speech lead to over mobile devices
0:18:05	in the form of smart phones or a black computers
0:18:10	a bible tears of from across to low
0:18:14	we collected the asvspoof two thousand seventeen database using a playback device
0:18:20	and a recording device in different acoustic environments
0:18:27	we did not to use a realistic scenario using core the recording but we made
0:18:34	actually got
0:18:35	and do the you don't call me all the target speakers voice
0:18:40	to create the plane data collection
0:18:44	this is the worst case scenario that of those the use of x sixteen speech
0:18:50	were to be linear access
0:18:56	the corpus is divided into three subsets for training development and evaluation
0:19:05	with different speakers replay sessions and replay configurations
0:19:11	the training and development subsets were collected at three different sites
0:19:16	and the evaluation subset was collected at the same three sites and also includes data
0:19:23	from a new site
0:19:27	this is the loudest most the inverse italy that
0:19:34	in terms of a basically a wider meeting t s for the challenge also here
0:19:41	is a clear
0:19:44	the this is m is based on the a gmm
0:19:48	and the really that's a big effect you
0:19:52	with an important case of the equal error rate
0:19:55	for all
0:19:55	one point eight fifty one point five
0:19:59	on these evaluation set
0:20:04	the primary evaluation metric is the equal error rate but in contrast to the asvspoof two thousand fifteen
0:20:10	challenge
0:20:12	the equal error rate is computed from scores pooled across all trial segments rather than
0:20:17	by condition averaging
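The difference between a pooled equal error rate and a condition-averaged one can be seen with a small Python sketch. The scores and condition names below are hypothetical toy values, and the EER is approximated at the observed score where the two error rates are closest. Each condition is perfectly separable on its own, but the scores are not calibrated consistently across conditions, so a single threshold over the pooled scores performs much worse.

```python
def eer(bona_fide, spoof):
    """Approximate EER: sweep observed scores as thresholds and keep the
    point where false acceptance and false rejection rates are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(bona_fide) | set(spoof)):
        fa = sum(s >= t for s in spoof) / len(spoof)
        fr = sum(s < t for s in bona_fide) / len(bona_fide)
        if best_gap is None or abs(fa - fr) < best_gap:
            best_gap, best_rate = abs(fa - fr), (fa + fr) / 2
    return best_rate


# Hypothetical countermeasure scores for two replay conditions.
conditions = {
    "condition_a": {"bona_fide": [3.0, 2.5, 2.8, 3.2],
                    "spoof": [1.1, 1.3, 0.95, 1.0]},
    "condition_b": {"bona_fide": [0.5, 0.8, 0.2, 0.9],
                    "spoof": [-2.0, -1.5, -1.8, -2.2]},
}

# Condition averaging: one EER (and one threshold) per condition, then the mean.
per_condition = [eer(c["bona_fide"], c["spoof"]) for c in conditions.values()]
averaged = sum(per_condition) / len(per_condition)

# Pooling: all scores share a single threshold.
all_bona_fide = [s for c in conditions.values() for s in c["bona_fide"]]
all_spoof = [s for c in conditions.values() for s in c["spoof"]]
pooled = eer(all_bona_fide, all_spoof)

print(f"condition-averaged EER: {averaged:.2f}, pooled EER: {pooled:.2f}")
```

Pooling therefore rewards countermeasures whose scores are comparable across recording and playback conditions, not only ones that separate classes within each condition.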
0:20:20	why fourteen estimation
0:20:22	perform the baseline while existing three and their the
0:20:28	at a performance is the old in more than seven percent relative improvement we used
0:20:33	a dismissal a
0:20:35	the baseline system is based on a gmm back-end classifier with constant q cepstral coefficient features
0:20:42	it was provided to the participants
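A common way to use a GMM back-end for spoofing detection is to train one mixture on genuine speech and one on spoofed speech and to score an utterance by the average log-likelihood ratio over its frames. The sketch below assumes that two-model setup, uses one-dimensional features instead of real cepstral vectors, and uses hand-picked hypothetical model parameters rather than trained ones, so it illustrates only the scoring step.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a scalar x under a one-dimensional Gaussian mixture."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        terms.append(math.log(w) + log_gauss)
    peak = max(terms)
    # log-sum-exp keeps the computation numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; positive favours genuine speech."""
    total = 0.0
    for x in frames:
        total += gmm_log_likelihood(x, *genuine_gmm) - gmm_log_likelihood(x, *spoof_gmm)
    return total / len(frames)


# Hypothetical, hand-picked parameters: (weights, means, variances).
genuine_gmm = ([0.6, 0.4], [0.0, 1.0], [1.0, 0.5])
spoof_gmm = ([0.5, 0.5], [3.0, 4.0], [1.0, 1.0])

genuine_like = llr_score([0.1, 0.7, -0.3], genuine_gmm, spoof_gmm)
replay_like = llr_score([3.2, 3.9, 4.4], genuine_gmm, spoof_gmm)
print(genuine_like, replay_like)
```

Frames that sit near the genuine model produce a positive score and frames near the spoof model produce a negative one; a detection threshold is then applied to this score.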
0:20:45	comparing the baseline mean zero one thing to do
0:20:49	it is important performance improvement when using wondering plus their the three
0:20:57	this is this idea of the parameter submission to residuals
0:21:02	it doesn't seventy
0:21:04	i don't training refer to the bar all the time for training
0:21:09	a sense for three and a reasonable
0:21:14	most all the systems a lower bound for the features
0:21:19	this call mom for all the systems to build a gmm classifier
0:21:24	single cost you as you can see
0:21:27	the invariant use whatever means of all around solution is twenty five one ninety one
0:21:33	understand
0:21:34	whereas the best single system result shows
0:21:39	an average detection equal error rate
0:21:41	of
0:21:42	only six point seven percent
0:21:47	the results for the two thousand seventeen challenge show that
0:21:52	the detection of replay attacks is more difficult than detecting speech synthesis and voice
0:21:58	conversion
0:22:01	countermeasure generalization also remains a problem
0:22:07	after the challenge some data anomalies were detected
0:22:10	i e zero-valued samples present at the beginning of some genuine speech utterances
0:22:17	is zero really running by for the easy to be a
0:22:23	but maybe but i for a modified versions for speech detection
0:22:29	to fix these issues asvspoof two thousand seventeen version two point zero was released to correct the
0:22:35	anomalies
0:22:37	detected after the evaluation
0:22:39	in addition metadata which describes the recording and playback devices and the acoustic
0:22:45	environments was released along with a new baseline
0:22:51	the new metadata along with the data by ching as there is the number uterrances
0:22:58	as well as the a population or the evaluation set
0:23:02	remember when i'm better than for each other
0:23:07	for a better understanding of the outcomes we can rewrite the square the regulation terms
0:23:13	of the speaker measurement recording playback devices
0:23:17	acoustic environment is a physical spacing which original stage the that basically then here or
0:23:25	it is reasonable because seventeen database was collected you have a different environment
0:23:32	the evaluation meeting there about the accent level over even more controlled noise
0:23:38	the
0:23:39	for example can be in we model noise and balcony are assumed to be noisy
0:23:46	all these
0:23:46	all right are assumed to be maybe which in your oracle room huh
0:23:53	are assumed to be are actually
0:23:58	there are under the of a twenty six a little better prices
0:24:02	a smart phones the lower bound we
0:24:07	if we the we fifteen this moral speakers
0:24:11	are assumed to be all over the
0:24:14	well e
0:24:15	a little larger lot of speakers are assumed to be your mean you rightly
0:24:20	and the professional or do we managed are assumed to be i
0:24:27	assuming only there are a total twenty five recording devices
0:24:32	some are ones that are the weights for my from source would be a little
0:24:36	windy and it's where a microphone are assumed to be over the medium by i
0:24:43	and the again the regression your and b i
0:24:50	this figure shows the impact of different replay configurations on asv performance measured in
0:24:56	terms of equal error rate
0:24:58	we have sent over a zero for impostor trials are replaced with a replaceable by
0:25:04	iteratively the each other little degradation
0:25:09	the control the demo on the right shows the resulting legal regulations sort of according
0:25:15	to the easy equal error rate in the
0:25:19	all pole a core also reflect the supposed to be a is the
0:25:25	where we are in this a little degradation
0:25:29	this is done
0:25:30	they higher than one at a very little degradation the motive for effect in a
0:25:35	the three years
0:25:39	it is this detection performance of a gmm robot
0:25:44	and i-vectors read about smoking the dimension
0:25:48	for this thing that a little degradation
0:25:52	also expressing that all the equal error rate
0:25:56	the first edition these results is that the recently the correlation between the specifically to
0:26:02	the thing
0:26:03	detection or everybody detection or
0:26:08	this is a fine reflect the final complex of overwhelmingly device
0:26:15	there was to get about a man and the recording right
0:26:19	the control on the right a to see the results in terms of the all
0:26:24	only a in a environment going back and replay value
0:26:32	results show the number of a single element of the little degradation for all i
0:26:39	trials this was all we trials corresponding with either one of the
0:26:46	i in my all their acoustic environment a system we need the effect of the
0:26:51	playback and recording device
0:26:57	to summarise asvspoof two thousand seventeen focused only on replay
0:27:02	so no speech synthesis or voice conversion
0:27:05	performances are reminding
0:27:07	even for the worst case scenarios
0:27:10	analysis is a very difficult since the data collection was the whole roll
0:27:17	remote control data collection mean thing to ensure a which is one recognition or the
0:27:24	that is useful to doesn't matter the in
0:27:27	so again is related to smoking detection so nicely where
0:27:32	text independent scenario will use
0:27:35	a there is no gave a database that for a little features and classifiers
0:27:41	it generalisation is even missing giving me a
0:27:45	it's been mitigated i mean green post evaluation improvement
0:27:53	so let's go to the third automatic speaker verification spoofing and countermeasures challenge
0:27:58	asvspoof two thousand nineteen focused on both
0:28:00	speech synthesis and replay
0:28:09	as for the previous editions it was presented at the asvspoof special session at
0:28:16	interspeech two thousand nineteen
0:28:17	and forty and fifty organisation there are basically the of the challenge order to standards
0:28:26	the asvspoof two thousand nineteen database is divided
0:28:30	into two different use case scenarios
0:28:32	logical access and physical access
0:28:35	also different is the strategy of assessing spoofing countermeasure performance on the asv
0:28:42	system
0:28:42	instead of assessing
0:28:44	stand-alone countermeasures
0:28:46	for this reason we have provided the
0:28:52	asv
0:28:52	scores to the participants
0:28:55	so we have adopted as primary metric the minimum normalized tandem detection
0:29:00	cost
0:29:01	function
0:29:02	and as secondary metric the equal error rate
0:29:06	also for most discrimination
0:29:10	use of the a dcf means that the these this design database is this i'm
0:29:17	not for the standard on this task will commercial
0:29:21	but they are on the availability in is very system where subject to scooping up
0:29:34	the normalized t-dcf is inspired by the detection cost function
0:29:41	the
0:29:42	dcf
0:29:43	used in the nist sre challenges
0:29:47	i in a this it is
0:29:51	aims to assess is the this is the last to make sure
0:29:55	to all formalize assessment
0:29:59	so long format or by rate
0:30:02	or you really motivation for a four
0:30:09	okay and the a whole basically
0:30:14	countermeasures system
0:30:17	there are a total of four possible errors
0:30:20	where
0:30:21	a bona fide
0:30:23	target is rejected by the countermeasure system
0:30:27	a bona fide target is rejected by the asv system
0:30:31	a non-target trial is accepted
0:30:34	and a spoof trial is accepted by the asv system
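The four error types of a tandem system can be counted with a short Python sketch. It assumes a simple cascade in which a trial is accepted only if it passes both the countermeasure and the ASV system. The trial scores, thresholds, priors and costs are hypothetical, and the weighted sum at the end is a simplified illustration of combining error rates with costs and priors, not the official t-DCF definition.

```python
def tandem_error_rates(trials, cm_threshold, asv_threshold):
    """trials: list of (label, cm_score, asv_score) with label in
    {'target', 'nontarget', 'spoof'}. Returns the four tandem error rates."""
    counts = {"target": 0, "nontarget": 0, "spoof": 0}
    errors = {"target_rejected_by_cm": 0, "target_rejected_by_asv": 0,
              "nontarget_accepted": 0, "spoof_accepted": 0}
    for label, cm_score, asv_score in trials:
        counts[label] += 1
        passes_cm = cm_score >= cm_threshold
        passes_asv = asv_score >= asv_threshold
        if label == "target":
            if not passes_cm:
                errors["target_rejected_by_cm"] += 1
            elif not passes_asv:
                errors["target_rejected_by_asv"] += 1
        elif label == "nontarget" and passes_cm and passes_asv:
            errors["nontarget_accepted"] += 1
        elif label == "spoof" and passes_cm and passes_asv:
            errors["spoof_accepted"] += 1
    return {
        "target_rejected_by_cm": errors["target_rejected_by_cm"] / counts["target"],
        "target_rejected_by_asv": errors["target_rejected_by_asv"] / counts["target"],
        "nontarget_accepted": errors["nontarget_accepted"] / counts["nontarget"],
        "spoof_accepted": errors["spoof_accepted"] / counts["spoof"],
    }


# Hypothetical trials: (label, countermeasure score, ASV score).
trials = [
    ("target", 2.0, 3.0), ("target", 1.5, 0.5),
    ("target", 0.2, 2.5), ("target", 1.8, 2.2),
    ("nontarget", 1.9, -1.0), ("nontarget", 1.2, 1.4),
    ("spoof", -1.0, 2.8), ("spoof", 1.1, 2.6),
    ("spoof", -0.5, 1.9), ("spoof", 0.4, 2.0),
]
rates = tandem_error_rates(trials, cm_threshold=1.0, asv_threshold=1.0)

# Simplified illustrative cost with hypothetical priors and costs.
prior = {"target": 0.9, "nontarget": 0.05, "spoof": 0.05}
cost_miss, cost_false_alarm, cost_spoof_accept = 1.0, 10.0, 10.0
cost = (prior["target"] * cost_miss
        * (rates["target_rejected_by_cm"] + rates["target_rejected_by_asv"])
        + prior["nontarget"] * cost_false_alarm * rates["nontarget_accepted"]
        + prior["spoof"] * cost_spoof_accept * rates["spoof_accepted"])
print(rates, cost)
```

Sweeping `cm_threshold` and keeping the smallest cost would give a minimum cost in the spirit of the metric described in the talk, with the ASV threshold held fixed.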
0:30:40	the for possible errors in be formally describe so it is for the costs and
0:30:46	priors are this i mean that one
0:30:49	and the classification tree
0:30:51	it
0:30:52	are computed be taken
0:30:55	the roadie dcf a venue a can be difficult to either us or forming the
0:31:02	formation of the well in the nist speaker recognition issue
0:31:08	it is useful to normalize the cost
0:31:11	the normalized t-dcf is a function of the countermeasure threshold
0:31:18	a similar to the bus the challenge efficient
0:31:22	is useful for those online dating does not goals of pressure of the set in
0:31:27	that means that the calibration
0:31:30	so we think source in this case the traditional or mutually the standard measure to
0:31:34	install involve a corresponding to go for calibration
0:31:39	that correspond to the remaining on remote i
0:31:43	in this
0:31:44	in by fitting the all my racial the to mine
0:31:48	for from the evaluation set using the
0:31:56	so the asvspoof two thousand nineteen database is based upon
0:32:01	the vctk corpus
0:32:04	okay speaker english speech database a or in the a union going
0:32:10	charmer still clearly all these things
0:32:15	either
0:32:16	before weights
0:32:19	it was created using speech from one hundred and seven speakers
0:32:25	forty six male and sixty one female
0:32:27	downsampled to sixteen khz at sixteen bits per sample
0:32:36	a collection of course uses colour that these in baseball problem in this analysis
0:32:44	it is divided in three partitions
0:32:46	for training development and evaluation in a speaker disjoint manner
0:32:52	for the logical is there are six
0:32:55	text-to-speech and voice conversion box
0:32:58	for training and there's fifteen
0:33:00	yes and b c score evaluations that
0:33:05	what the physical analysis
0:33:06	there are then these a holes the
0:33:09	environment
0:33:10	and i sleepily calculation of training
0:33:13	they're an imbalanced
0:33:17	we yes
0:33:18	the two is then of the double doors to provide state-of-the-art yes this is this
0:33:24	if you show a lot of assigning all over the course
0:33:31	this table summarize this system which are fundamentally you go first
0:33:36	the known
0:33:37	small things is the for a zero one at zero six
0:33:41	in the lab
0:33:42	two v c and four yes systems
0:33:46	then
0:33:46	well at zero seven to eighty nine d r for a sixteen and even being
0:33:55	are the eleven and or something a systems
0:33:59	and a sixteen at the eighteen nineteen i don't the reference
0:34:04	systems using the same algorithms
0:34:07	s
0:34:07	at zero four and at zero six
0:34:11	the l a verification is the lattice
0:34:14	most of our database for speech synthesis and was version is moving the results
0:34:23	this is this ensemble of problem a the weather
0:34:29	two
0:34:31	so
0:34:37	we did not complete with any of the local form
0:34:41	what if i
0:34:42	no
0:34:43	the a
0:34:47	we did not completely of any of the local phone
0:34:51	is you know there speaker one of i
0:34:55	employees are entitled to follow that contract to the latter
0:34:59	a data
0:35:02	employees are entitled followed by a contract so the latter
0:35:06	another speaker who finished
0:35:09	at that time it's telling faction like and five miles
0:35:13	a
0:35:15	i at time m is now and faction within five miles
0:35:20	as you can hear the quality of the synthesised speech is quite
0:35:24	impressive
0:35:30	this is the size of a
0:35:33	a subset evaluations and session
0:35:36	results in terms of a it is for a little baseline we are provided
0:35:44	first of all shows the results for two categories of the us to the speech
0:35:51	yes we see
0:35:53	yes and v c you might
0:35:56	and i saw show results for types of models
0:36:01	there are neural network based
0:36:03	i one
0:36:05	a neural network based and where
0:36:08	yes
0:36:09	neural network based itsy a statistical model based p c
0:36:14	last rule
0:36:16	shows the results from different with for generation that the
0:36:22	in that are
0:36:24	their own where for model classical speech moreover
0:36:28	with four combinations
0:36:29	spectral filtering with typically and orders
0:36:34	in the testing is the complementary you of your over the baseline
0:36:39	otherwise dishonest users you features and the idiot there is a someone else
0:36:45	sdc features
0:36:50	it doesn't say challenge data was created from the rio your presentation visual quality of
0:36:56	the score was somewhat cold or
0:37:00	leading to improve upon the last challenge it doesn't line in addition to this once
0:37:05	you weighted and all
0:37:07	acoustic and global calibration
0:37:10	once we use these two similarly enrollment listings and devices we establish right
0:37:19	the remainder of this work are similarly directly on that
0:37:24	we choose a the one sure on the slide
0:37:28	realistic environment winkler only holding the noise putting aside for now the additive noise
0:37:35	we really a decision we consider perfect microphones
0:37:39	and
0:37:41	only at the recording this meeting about a five user
0:37:47	and for variability representation
0:37:51	we can see the that there are
0:37:53	it's carry out that the single session as that used a
0:37:57	and will only of the device quite in this case the last speaker
0:38:07	the physical access scenario assumes use in it is the leading to convey such as
0:38:13	illustrated in fig
0:38:16	there was a single iteration which please this is then this it will it is
0:38:20	also s
0:38:22	is this the data will environment distinction room size or categorize in two different
0:38:28	in the remote's label
0:38:30	i will rule
0:38:32	we may be able
0:38:33	and see that actual
0:38:36	the position of the aec easily see that by the yellow cross
0:38:41	circle in the three or whatever position of the to go is illustrated by the
0:38:46	blue star
0:38:48	well i assess it is harder
0:38:51	maybe by the okay well we'll see change a distance yes for the microphone
0:39:00	it is also illustrated in the table environment definition there are three categories or at
0:39:06	least and
0:39:07	and unlabeled a short distance be making this that and see that at least
0:39:15	each physical space system to explain that in addition variability are according to the difference
0:39:20	between space
0:39:22	which can be seen as a wall ceiling and the for submission coefficients
0:39:28	as well as the position interval
0:39:31	the level overrated variation used busy fighting the or the is sixty two variation by
0:39:37	the by are
0:39:40	it's fifty whatever item of definition
0:39:42	they are the result is six is the u
0:39:46	a little i shall we menu and
0:39:49	see i recognition
0:39:52	it is this is the microphone and that okay or writing reading the visual speech
0:39:58	there was a shown are so well
0:40:02	we think that although there is an environment as
0:40:06	you can see that symbol on the right
0:40:12	the man and language for the that's a month it is also illustrated in this
0:40:17	paper
0:40:19	but something that is modeled by making and then recording over one of five as
0:40:25	this
0:40:26	and but are sending their according to be is the microphone
0:40:31	according are assumed to be made in one over the three zones used to people
0:40:38	each representing a different vowel the oldest the problem or
0:40:45	in the state in table are a definition if they are labeled character i shows
0:40:52	this task of the medium distance and
0:40:56	largest
0:40:58	in addition to the variation lately we release let us define the means for recording
0:41:04	and presentation devices
0:41:08	we can see that only the presentation
0:41:11	no speaker
0:41:12	encoding only and better living in the last speaker if there are four selected
0:41:19	we use the categorisation
0:41:21	and without any
0:41:24	but if there
0:41:25	that would be
0:41:26	i and it
0:41:27	currency one
0:41:30	this case we or they have online replaying configuration as you can see and the
0:41:36	table
0:41:37	on the right
0:41:40	the simulation once either two containers all the speakers
0:41:44	each with a different range of the whole by about we mean frequency and maybe
0:41:49	a linear calibration
0:41:52	the first
0:41:53	a typical vector category represent the mean dillydallying in full band lot speaker
0:42:00	i one last speaker and a megabyte bound we the icsi and units
0:42:07	and the being able to more linear or racial a study
0:42:12	and one hundred
0:42:14	addition
0:42:15	in the figure you can see an illustration of a set of the higher harmonic frequency
0:42:21	responses
0:42:23	for i don't be noise model
0:42:25	of the device estimated using the synchronized swept sine nonlinear system identification
0:42:33	based on nonlinear convolution
0:42:36	h one in the figure is the linear component
0:42:40	while from h two to h five
0:42:43	are the higher order nonlinear components
0:42:48	the blue where the shaded region represent the right boundary
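One common way to represent a device with a linear component plus higher order nonlinear components is a polynomial Hammerstein structure: each branch raises the input to a power and filters it with its own impulse response, and the branches are summed. The sketch below assumes that structure for illustration only; the talk does not confirm the exact model, and the function name and kernels are hypothetical.

```python
import numpy as np

def hammerstein(x, kernels):
    """Sum of parallel branches: branch n filters x**n with kernels[n - 1].

    kernels[0] plays the role of the linear component, kernels[1:] the
    higher order nonlinear components (assumed structure, for illustration).
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for order, h in enumerate(kernels, start=1):
        # Full convolution is longer than x, so keep only the first len(x) samples.
        y += np.convolve(x ** order, h)[: len(x)]
    return y

signal = np.array([0.0, 0.5, -1.0, 2.0])
linear_only = hammerstein(signal, [[1.0]])             # ideal linear device
with_distortion = hammerstein(signal, [[1.0], [0.5]])  # adds a second order term
```

With a single unit-impulse kernel the output equals the input, which corresponds to an ideal linear device; adding higher order kernels introduces harmonic distortion.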
0:42:57	this is a list of real devices from which measurements were taken for simulation of
0:43:03	replay presentation
0:43:05	the first table on the left indicates a multi device is why on the right
0:43:10	in the case of interest
0:43:13	device that will signifies which type of the magazines
0:43:17	what are some all but is a little speaker
0:43:21	the rightmost column indicates
0:43:24	if the device was used for the simulation of data in the training and development
0:43:30	sets as known devices
0:43:32	or in the evaluation set as unknown devices
0:43:39	this figure shows again at least commission for the different laws speakers
0:43:45	device
0:43:46	used for this evaluation
0:43:49	the top plot shows the occupied bandwidth with the lower bound of
0:43:54	the bandwidth being the minimum frequency
0:43:58	the bottom plot shows the linear to nonlinear power ratio
0:44:02	in the range of the
0:44:04	occupied bandwidth
0:44:07	devices are sorted by the bandwidth
0:44:15	this figure shows baseline results for the pa scenario of the asvspoof two
0:44:21	thousand nineteen database
0:44:23	results are used to read and fourteen you important to be in configuration
0:44:27	one you acoustic environments
0:44:29	the top panel shows standalone asv results in terms of equal error rate between
0:44:35	target and zero effort impostor trials
0:44:40	and target and replay spoof trials
0:44:44	the middle panel shows the standalone replay spoofing countermeasure results in terms of equal error
0:44:49	rate
0:44:51	for baselines b one and b two
0:44:55	and in the bottom panel there are combined asv and cm results illustrated
0:45:00	in terms of the min
0:45:03	t-dcf
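The equal error rate used in these results is the operating point at which the miss rate and the false alarm rate coincide. A minimal sketch, assuming higher scores mean "accept" and approximating the crossing point on the finite set of observed scores; the function name is illustrative and this is not the official scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: threshold where miss rate and false alarm rate are closest."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials scoring below the threshold.
    p_miss = np.array([(target < t).mean() for t in thresholds])
    # False alarm rate: fraction of nontarget trials scoring at or above it.
    p_fa = np.array([(nontarget >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))
    return float((p_miss[idx] + p_fa[idx]) / 2.0)

eer_separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
eer_overlap = equal_error_rate([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0])
```

For the standalone ASV results the two score sets would be target versus impostor (or spoof) trials; for the countermeasure they would be bona fide versus spoofed trials.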
0:45:05	for this result we guess they the to the is anyone interview medium
0:45:10	as for the previous challenges expecting clear
0:45:14	and moreover the worst the screens are
0:45:18	two or swings high when the device scenes and a little darker to talk be
0:45:23	stuff
0:45:29	let's look now at the challenge results this figure shows the profiles for the baseline
0:45:36	system b
0:45:37	zero two
0:45:39	and the best
0:45:41	performing primary system for the la scenario t zero five
0:45:46	and the same team's single system
0:45:49	it also shows the second best performing single system in the
0:45:56	la scenario
0:45:58	t forty five
0:45:59	so the lowest equal error rate is zero point two
0:46:03	percent
0:46:05	that is a great result
0:46:08	however from these results it is clear that there is a substantial gap between
0:46:14	primary and single systems
0:46:17	for la
0:46:19	so this means that fusion is important
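The talk does not say how the teams fused their systems. As one simple illustration of score-level fusion, the sketch below z-normalises each system's scores over the trial list and takes a weighted average; the normalisation choice, the equal default weights and the function name are assumptions.

```python
import numpy as np

def fuse_scores(system_scores, weights=None):
    """Weighted average of per-system z-normalised scores.

    system_scores has shape (n_systems, n_trials). Equal weights are assumed
    when none are given.
    """
    scores = np.asarray(system_scores, dtype=float)
    mean = scores.mean(axis=1, keepdims=True)
    std = scores.std(axis=1, keepdims=True)
    # Guard against constant-score systems to avoid division by zero.
    normalised = (scores - mean) / np.where(std > 0, std, 1.0)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.asarray(weights, dtype=float) @ normalised

# Two hypothetical systems that rank three trials identically on different scales.
fused = fuse_scores([[0.0, 1.0, 2.0], [0.0, 10.0, 20.0]])
```

Normalising first keeps a system with a large score range from dominating the average; in practice fusion weights are usually tuned on development data.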
0:46:25	this slide shows the min t-dcf and equal error rate
0:46:30	results for the submissions
0:46:33	to the la scenario
0:46:36	the first grouping on the x-axis denotes whether or not
0:46:41	the systems are neural network based
0:46:45	while the second denotes whether or not the systems are ensemble systems
0:46:50	which combine multiple
0:46:52	subsystems
0:46:53	or single systems
0:46:56	we can note that for la there is an advantage for the nn based
0:47:01	and the ensemble systems
0:47:04	in addition it is also clear that the equal error rate and min t-dcf
0:47:07	are measurements that are not correlated
0:47:12	as you can see in these two in red and blue
0:47:19	in this slide are shown the results for the thirteen attacks
0:47:24	in the evaluation set for the top ten primary submissions
0:47:28	first of all we can see that the baseline asv equal error rate
0:47:33	that means with no spoofing
0:47:35	is two point five percent
0:47:37	when we include spoofing attacks the asv system then becomes
0:47:43	vulnerable
0:47:45	again if the individual tax someone else a degree is the performance
0:47:52	that are easy to detect
0:47:54	there reminding the against you
0:47:57	us some degree the easy performance
0:48:03	and i difficult to the data they want in the or ranch a physical
0:48:08	and only one that is a seventeen
0:48:12	as in this entire on the knees the but is very difficult to detect
0:48:16	that is the one in the utterance to scroll
0:48:22	so let's look now at the challenge results for pa this figure shows the det profiles
0:48:29	for the baseline system b zero one
0:48:33	the best performing primary system for pa and the same team's single system
0:48:41	the lowest equal error rate here is zero point four
0:48:45	this is indeed a good result
0:48:50	was it to invade
0:48:52	here there is less of a discrepancy between primary and single systems
0:48:57	so fusion seems not to be so important
0:49:03	this slide shows the min t-dcf and equal error rate results
0:49:08	for the submissions to the pa scenario
0:49:12	the grouping as before on the x-axis denotes nn based and non nn based
0:49:18	or ensemble and
0:49:20	non ensemble systems
0:49:23	note that as for la
0:49:26	for pa there is an advantage for nn based and ensemble systems
0:49:37	in this slide are shown the results for all the nine replay configurations in the
0:49:42	evaluation set for the top ten primary submissions
0:49:47	and we can see the baseline asv equal error rate
0:49:51	well seems keys needs solos moving is he going for example for stack
0:49:57	when we include spoofing attacks the asv system then
0:50:01	becomes
0:50:01	vulnerable
0:50:03	so looking at these
0:50:06	we can see that the performance increases
0:50:10	when
0:50:12	the attacker to talker distance becomes greater
0:50:15	so there are very fancy one
0:50:18	and decreases when the quality of the device gets better
0:50:22	so real routes suitable
0:50:29	in this slide are shown all the results for the twenty seven
0:50:34	acoustic environments in the evaluation set again for the top ten primary submissions
0:50:41	so looking at these results over individual environments we can see that the performance
0:50:46	decreases when the room i recall
0:50:50	really
0:50:51	so the received go
0:50:53	increases when the reverberation level becomes higher
0:50:58	c
0:50:58	and increases when the talker to asv distance becomes higher
0:51:04	getting
0:51:05	see what
0:51:09	so to summarise asvspoof two thousand nineteen focused on
0:51:14	replay tts and voice conversion
0:51:20	a simple even if one would be evaluated
0:51:23	we have shown that the asv system is vulnerable
0:51:29	to spoofing attacks
0:51:32	we have defined a new metric the t-dcf to measure performance on
0:51:38	asv
0:51:39	instead of doing this on the standalone countermeasure
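To make the idea of the metric concrete, here is a minimal sketch of a minimum normalised t-DCF in the simplified form beta * P_miss_cm + P_fa_cm, minimised over the countermeasure threshold. It assumes `beta` is a precomputed constant that folds in the costs, the priors and the fixed ASV system's error rates; how `beta` is derived is not covered in this part of the talk, so it is treated as a given input. The function name is illustrative and this is not the official scoring code.

```python
import numpy as np

def min_normalised_tdcf(bonafide_scores, spoof_scores, beta):
    """Minimum over CM thresholds of beta * P_miss_cm + P_fa_cm.

    beta is an assumed, precomputed constant (costs, priors and ASV errors).
    Higher CM scores are taken to mean "bona fide".
    """
    bona = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    all_scores = np.concatenate([bona, spoof])
    # Every observed score, plus one threshold above the maximum (reject everything).
    thresholds = np.append(np.sort(all_scores), all_scores.max() + 1.0)
    best = np.inf
    for t in thresholds:
        p_miss_cm = (bona < t).mean()   # bona fide speech rejected by the CM
        p_fa_cm = (spoof >= t).mean()   # spoofed speech accepted by the CM
        best = min(best, beta * p_miss_cm + p_fa_cm)
    return float(best)

perfect_cm = min_normalised_tdcf([2.0, 3.0], [0.0, 1.0], beta=2.0)
inverted_cm = min_normalised_tdcf([0.0, 1.0], [2.0, 3.0], beta=0.5)
```

Unlike the equal error rate of the countermeasure alone, this quantity weights the two countermeasure errors by their impact on the combined ASV and countermeasure system.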
0:51:45	we have seen a transition from features to classifiers so neural network architectures and end to
0:51:53	end
0:51:54	and one double the fused system with the biggest challenges
0:52:00	don't demand countermeasures are very
0:52:03	how to the speech sounds are
0:52:08	very natural
0:52:10	is the recognition accuracy very clear by detection again be proven to work this time
0:52:16	of by only and stage
0:52:20	generalization is still missing
0:52:22	much more has to be done
0:52:26	so i invite you to join us and participate in
0:52:30	asvspoof two thousand
0:52:32	twenty one
0:52:38	so before finishing the tutorial i would like to show you some software
0:52:42	for speaker recognition and anti-spoofing
0:52:50	it appears to keep the from a is
0:52:54	and my results to identically to overcome my as well
0:52:58	currently silently to from the university
0:53:03	you can find links to databases for asv and anti-spoofing
0:53:10	i thought winter the is additional database misleading
0:53:13	and nist and the are star burst in that the speaker recognition database
0:53:23	a right don't
0:53:24	and the text dependent speaker recognition database
0:53:29	we also the a e for it is simply a database from
0:53:36	and the speaker wire a new speech and boxers
0:53:45	so here you can find some of the for this thing
0:53:49	matlab implementation of training and the scope of this common conditions
0:53:54	this is used as you features
0:53:56	and the three these coding systems that an easy to a last challenge
0:54:07	on our website you can find the matlab and python implementation of the t-dcf
0:54:15	and for anything regarding asvspoof please visit the
0:54:22	asvspoof website
0:54:23	we need you are cool
0:54:27	last but not least i would like to show you two projects
0:54:31	where i'm the principal investigator region two d measurement recognition also not only speech
0:54:39	a disapproving and
0:54:40	closing phase information
0:54:43	classifiers and respect
0:54:45	thus nazis ultimately increase the number eighty three networks
0:54:49	and the domain instruments increment because representing volume i mean and he uttering networks
0:54:56	and the second respect
0:54:58	is a french german project
0:55:01	and it stands for reliable secure and privacy preserving multi biometric person authentication
0:55:11	thank you for listening and see you at the q and a session

Anti-spoofing in automatic speaker recognition

Tutorials

Dr Massimiliano Todisco, Eurecom, France