| 0:00:17 | hi everyone, i'll be presenting the work by myself and | 
|---|
| 0:00:23 | my colleagues | 
|---|
| 0:00:24 | on the VOiCES from a Distance Challenge 2019: analysis of speaker verification results | 
|---|
| 0:00:30 | and remaining challenges | 
|---|
| 0:00:34 | when we look at evaluations and challenges in the community, they tend to provide | 
|---|
| 0:00:38 | common data, | 
|---|
| 0:00:39 | benchmarks, and performance metrics for the advancement of research in the speaker recognition community | 
|---|
| 0:00:44 | some examples that might be familiar to you are the NIST SRE series, | 
|---|
| 0:00:49 | the Speakers in the Wild challenge, | 
|---|
| 0:00:50 | the VoxCeleb speaker recognition challenge, | 
|---|
| 0:00:52 | and the SdSV challenge | 
|---|
| 0:00:54 | previous evaluations focused on speaker verification in domains covering telephone and microphone data, | 
|---|
| 0:01:01 | different speaking styles, | 
|---|
| 0:01:03 | noisy data, vocal effort, audio from video, short durations, and more | 
|---|
| 0:01:08 | however, there haven't been many that focus on | 
|---|
| 0:01:11 | the far-field, distant-speaker domain | 
|---|
| 0:01:17 | nowadays we've got commercial personal assistants that are really | 
|---|
| 0:01:22 | operating in this area, so trying to get a bit more of an understanding in | 
|---|
| 0:01:25 | this context is important, especially when we look at the single-microphone | 
|---|
| 0:01:29 | scenario | 
|---|
| 0:01:31 | the VOiCES from a Distance Challenge 2019 was hosted by SRI | 
|---|
| 0:01:35 | International and Lab41, | 
|---|
| 0:01:38 | as a special session at Interspeech 2019 | 
|---|
| 0:01:41 | and what this challenge focused on was both speaker recognition and speech recognition | 
|---|
| 0:01:45 | using distant, far-field speech acquired with a single microphone | 
|---|
| 0:01:49 | in noisy and realistic reverberant environments | 
|---|
| 0:01:52 | there were several objectives that we had for this challenge | 
|---|
| 0:01:55 | one was to benchmark state-of-the-art technology for far-field speech | 
|---|
| 0:02:00 | we wanted to support the development of new ideas in technology, to bring that technology | 
|---|
| 0:02:05 | forward | 
|---|
| 0:02:06 | we wanted to support new research groups entering the field of distant speech processing | 
|---|
| 0:02:11 | and to do that with a large, publicly available dataset | 
|---|
| 0:02:15 | that is realistic in its reverberation characteristics | 
|---|
| 0:02:21 | what we noticed since the release of the public database in 2019 was | 
|---|
| 0:02:25 | an increased use of the VOiCES dataset | 
|---|
| 0:02:28 | so we thought this actually called for the current special session that we're hosting | 
|---|
| 0:02:32 | here at Odyssey 2020, even though it's virtual | 
|---|
| 0:02:36 | now, the session, we're hoping, will focus on broad areas such as single- versus multi- | 
|---|
| 0:02:41 | channel speaker recognition, | 
|---|
| 0:02:43 | single- versus multi-channel speech enhancement for speaker recognition, | 
|---|
| 0:02:47 | domain adaptation for far-field speaker recognition, | 
|---|
| 0:02:51 | calibration in far-field conditions, | 
|---|
| 0:02:53 | and advancing the state of the art | 
|---|
| 0:02:55 | over what we saw in the VOiCES from a Distance Challenge 2019 | 
|---|
| 0:03:01 | let's have a look at what the VOiCES corpus actually contains | 
|---|
| 0:03:05 | VOiCES stands for Voices Obscured in Complex Environmental Settings, | 
|---|
| 0:03:09 | and it is a large, now publicly available corpus collected in real reverberant | 
|---|
| 0:03:15 | environments | 
|---|
| 0:03:16 | what we have inside the dataset is | 
|---|
| 0:03:19 | 3,900 or more hours of audio | 
|---|
| 0:03:22 | from about a million segments, | 
|---|
| 0:03:24 | multiple rooms, four in total, | 
|---|
| 0:03:26 | different distractors such as TV and babble noise, | 
|---|
| 0:03:29 | and different microphones at different distances | 
|---|
| 0:03:32 | we even have a loudspeaker that rotates to mimic human head movement | 
|---|
| 0:03:37 | the idea for this dataset was that it would be useful for speaker recognition, | 
|---|
| 0:03:41 | automatic speech recognition | 
|---|
| 0:03:43 | speech enhancement | 
|---|
| 0:03:45 | and speech activity detection | 
|---|
| 0:03:49 | here are a couple of different statistics from the VOiCES dataset | 
|---|
| 0:03:52 | it is released under the Creative Commons BY 4.0 license, and that makes it accessible for commercial, | 
|---|
| 0:03:57 | academic, and government use | 
|---|
| 0:04:00 | we have a large number of speakers, three hundred, over four different rooms, | 
|---|
| 0:04:04 | up to twenty different microphones and different microphone types | 
|---|
| 0:04:09 | the source dataset that we used was a read speech dataset, LibriSpeech | 
|---|
| 0:04:15 | and we've got a number of different background noises, including babble, | 
|---|
| 0:04:18 | music, | 
|---|
| 0:04:18 | and TV sounds | 
|---|
| 0:04:20 | the loudspeaker orientation for mimicking human head movement | 
|---|
| 0:04:25 | ranges between 0 and 180 degrees | 
|---|
| 0:04:30 | let's drill into what we saw in the challenge of | 
|---|
| 0:04:33 | 2019 | 
|---|
| 0:04:35 | we had two different tasks, speaker recognition and ASR, | 
|---|
| 0:04:39 | and they had two different task conditions; one was a fixed condition, | 
|---|
| 0:04:42 | and the idea here was that the data was constrained; | 
|---|
| 0:04:45 | everyone got to use the same constrained dataset | 
|---|
| 0:04:48 | the purpose behind this was to benchmark systems trained with that same dataset, to | 
|---|
| 0:04:53 | see if there's a dramatic difference between individual technologies versus what was commonly applied | 
|---|
| 0:04:58 | in the open condition, | 
|---|
| 0:05:00 | teams were allowed to use any available dataset, private or public | 
|---|
| 0:05:04 | now, the idea here was to quantify the gains that could be achieved when we | 
|---|
| 0:05:07 | have an unconstrained amount of data, | 
|---|
| 0:05:09 | relative to the fixed condition | 
|---|
| 0:05:14 | in terms of the goal here, | 
|---|
| 0:05:15 | we're looking at | 
|---|
| 0:05:16 | whether we can determine that a target speaker is speaking | 
|---|
| 0:05:20 | in a segment of speech, given the enrollment of that target speaker | 
|---|
| 0:05:25 | the performance metric is similar to the NIST SRE | 
|---|
| 0:05:28 | cost function, | 
|---|
| 0:05:29 | with the parameters on screen | 
|---|
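[editor's note] the NIST SRE-style cost function mentioned here can be sketched in code. This is a minimal illustration of a normalized detection cost; the parameter values below (P_target = 0.01, unit miss and false-alarm costs) are common defaults and only an assumption, standing in for the actual values shown on the slide.

```python
# Minimal sketch of a NIST SRE-style normalized detection cost function.
# The parameters (p_target = 0.01, unit costs) are assumed defaults,
# not necessarily the exact ones used in the challenge.

def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Expected cost: weighted sum of miss and false-alarm rates."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

def normalized_dcf(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalize by the best 'do-nothing' system (always accept or always reject)."""
    default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, p_target, c_miss, c_fa) / default

# e.g. a system missing 5% of targets at a 1% false-alarm rate:
print(normalized_dcf(0.05, 0.01))  # -> about 1.04, worse than rejecting everything
```

a normalized value above 1.0 means the system's decisions did worse than simply rejecting every trial, which is why calibration (choosing the decision threshold well) matters so much later in this talk.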
| 0:05:32 | as part of the challenge, we also provided a scorer so users could measure | 
|---|
| 0:05:36 | performance | 
|---|
| 0:05:37 | during development and confirm the validity of their scores before submitting them to us for evaluation | 
|---|
| 0:05:46 | and the training set in the fixed condition was limited to all speakers in the Speakers in | 
|---|
| 0:05:51 | the Wild collection | 
|---|
| 0:05:52 | and the VoxCeleb1 and VoxCeleb2 datasets | 
|---|
| 0:05:57 | in terms of development and evaluation data, the challenge participants were allowed to develop | 
|---|
| 0:06:02 | on the development data, | 
|---|
| 0:06:04 | and then there was held-out evaluation data that we benchmarked the systems on | 
|---|
| 0:06:08 | now a couple of different things to point out here about how we divided these conditions | 
|---|
| 0:06:14 | we made sure that we actually had some room mismatch between enrollment and test, | 
|---|
| 0:06:18 | as well as between the rooms used in development and evaluation | 
|---|
| 0:06:22 | and this is to help mimic | 
|---|
| 0:06:25 | what would happen with a system developed on laboratory-collected | 
|---|
| 0:06:29 | data | 
|---|
| 0:06:30 | and then sent out for real-world use | 
|---|
| 0:06:34 | similarly, we had mismatch between enrollment and test on the microphone type, | 
|---|
| 0:06:39 | comparing the studio mic to the lapel, | 
|---|
| 0:06:41 | or to the MEMS and boundary mics | 
|---|
| 0:06:45 | we also had mismatch between enrollment and verification in the specific microphone used | 
|---|
| 0:06:50 | between those two different tasks | 
|---|
| 0:06:54 | finally, the loudspeaker orientation | 
|---|
| 0:06:56 | we had quite a range, and we list those ranges so that we were able | 
|---|
| 0:07:00 | to analyze the impact of head movement on speaker recognition | 
|---|
| 0:07:05 | in terms of the results, we had twenty-one teams successfully submit scores, | 
|---|
| 0:07:09 | and four of those teams also submitted scores for the open submission, so we can | 
|---|
| 0:07:14 | get that comparison point | 
|---|
| 0:07:17 | in total, we had over fifty system submissions for the fixed condition | 
|---|
| 0:07:21 | on the slide here we've shown the top scores for each | 
|---|
| 0:07:24 | team | 
|---|
| 0:07:26 | i will dig into these a little bit on the next slide | 
|---|
| 0:07:29 | let's start analyzing some of those results | 
|---|
| 0:07:33 | the first thing we did was compute the confidence intervals, the ninety-five | 
|---|
| 0:07:36 | percent confidence intervals, | 
|---|
| 0:07:38 | and we did this by using a modified version of a joint bootstrapping technique; | 
|---|
| 0:07:42 | the reference can be found in the paper | 
|---|
| 0:07:45 | now, the reason we modified this was to account for the correlation of trials due | 
|---|
| 0:07:49 | to multiple models being available per speaker | 
|---|
| 0:07:54 | that is, different recordings from a speaker could each represent a different enrollment, | 
|---|
| 0:07:59 | and so there's correlation | 
|---|
| 0:08:01 | in the trial scores | 
|---|
| 0:08:03 | what we're calling the interval here is between the fifth and ninety-fifth percentiles | 
|---|
| 0:08:07 | of the resulting empirical distribution | 
|---|
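[editor's note] the speaker-level resampling just described can be sketched as follows. This is a simplified illustration of a joint bootstrap, assuming a flat list of (speaker, score, is_target) trials; the function name and trial format are illustrative, and it is not the exact modified procedure used in the challenge analysis.

```python
# Sketch: bootstrap a metric's confidence interval by resampling whole
# speakers, so that correlated trials (multiple enrollment models per
# speaker) stay together instead of being resampled independently.
import random
from collections import defaultdict

def bootstrap_ci(trials, metric, n_boot=1000, lo=0.05, hi=0.95, seed=0):
    """trials: iterable of (speaker_id, score, is_target).
    Returns the (lo, hi) percentiles of metric over bootstrap replicas."""
    rng = random.Random(seed)
    by_spk = defaultdict(list)
    for spk, score, tgt in trials:
        by_spk[spk].append((score, tgt))
    speakers = list(by_spk)
    stats = []
    for _ in range(n_boot):
        # draw speakers with replacement; keep each one's trials intact
        sample = [t for spk in rng.choices(speakers, k=len(speakers))
                  for t in by_spk[spk]]
        stats.append(metric(sample))
    stats.sort()
    return stats[int(lo * n_boot)], stats[int(hi * n_boot) - 1]
```

the interval reported in the talk is the 5th-to-95th percentile range of exactly this kind of empirical distribution; the percentile lookup here is a simple order statistic.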
| 0:08:10 | now, if we look at those top four scores, | 
|---|
| 0:08:13 | we can see that the confidence intervals are narrow | 
|---|
| 0:08:16 | when you don't take into account the speaker sampling or the multiple models per speaker, | 
|---|
| 0:08:20 | so they can easily mislead us if we don't take that into account | 
|---|
| 0:08:24 | what we should be looking at are the red bars; | 
|---|
| 0:08:27 | they give us a truer picture of what the confidence intervals are | 
|---|
| 0:08:33 | and if we look at those four systems with respect to the other submissions, | 
|---|
| 0:08:37 | we see that they are significantly different compared to the rest of the submissions; | 
|---|
| 0:08:41 | however, they also perform relatively similarly to one another | 
|---|
| 0:08:47 | some of the observations we found when looking at what the different groups submitted: | 
|---|
| 0:08:52 | most of the top teams applied weighted prediction error for dereverberation; remember, the VOiCES corpus | 
|---|
| 0:08:59 | has a lot of reverb and the rooms are quite noisy, | 
|---|
| 0:09:02 | and that was then the first processing step | 
|---|
| 0:09:06 | every team also used an x-vector system with data augmentation, | 
|---|
| 0:09:10 | and this was sometimes complemented with ResNet and DenseNet based architectures | 
|---|
| 0:09:16 | PLDA was the most popular choice in the backend | 
|---|
| 0:09:19 | and system calibration was actually crucial here, | 
|---|
| 0:09:23 | with all of the bottom sixteen teams failing to achieve good system calibration | 
|---|
| 0:09:27 | and what that means is there was a significant difference between the minimum | 
|---|
| 0:09:30 | and actual DCF values, for which the systems were penalized | 
|---|
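[editor's note] the minimum-versus-actual DCF gap described here can be sketched as follows, assuming scores are intended as log-likelihood ratios and assuming P_target = 0.01 with unit costs; these are illustrative defaults, not necessarily the challenge's exact parameters.

```python
# Sketch: a calibrated system thresholds its log-likelihood-ratio scores
# at the Bayes threshold implied by the application parameters; its
# "actual" DCF should then sit close to the best achievable ("minimum")
# DCF over all thresholds. A large gap signals miscalibration.
import math

def dcf(tgt_scores, non_scores, threshold, p_tgt=0.01):
    p_miss = sum(s < threshold for s in tgt_scores) / len(tgt_scores)
    p_fa = sum(s >= threshold for s in non_scores) / len(non_scores)
    return (p_tgt * p_miss + (1 - p_tgt) * p_fa) / min(p_tgt, 1 - p_tgt)

def min_dcf(tgt_scores, non_scores, p_tgt=0.01):
    # best cost over every threshold the data can distinguish
    return min(dcf(tgt_scores, non_scores, t, p_tgt)
               for t in sorted(tgt_scores + non_scores))

def act_dcf(tgt_scores, non_scores, p_tgt=0.01):
    # for true LLR scores the Bayes decision threshold is log(prior odds against)
    bayes_thr = math.log((1 - p_tgt) / p_tgt)
    return dcf(tgt_scores, non_scores, bayes_thr, p_tgt)
```

a system whose scores are shifted or mis-scaled can have a fine min_dcf but a much worse act_dcf; that gap is the kind of miscalibration observed for the bottom sixteen teams.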
| 0:09:36 | let's look now at what happens when you change the enrollment condition | 
|---|
| 0:09:41 | in particular, we're looking at what happens in a reverberant environment if we use source data, | 
|---|
| 0:09:46 | that is, no reverberation, a close-talking microphone, | 
|---|
| 0:09:49 | or use data from a different room, | 
|---|
| 0:09:52 | with reverberation, | 
|---|
| 0:09:54 | to enroll | 
|---|
| 0:09:56 | what we actually saw, looking at the blue results, the blue bars, | 
|---|
| 0:10:00 | is source enrollment tested against room-four data, whereas the red is enrolling on | 
|---|
| 0:10:08 | reverberant room-three data | 
|---|
| 0:10:10 | against the same test data | 
|---|
| 0:10:12 | we see the red bars are higher than the blue; | 
|---|
| 0:10:15 | this reverberant enrollment | 
|---|
| 0:10:18 | cost up to forty-two percent relative degradation versus source enrollment, | 
|---|
| 0:10:23 | and that depends on the system being benchmarked, of course, | 
|---|
| 0:10:26 | but it does suggest that speakers should be enrolled using close-talking segments | 
|---|
| 0:10:31 | of clean speech | 
|---|
| 0:10:33 | basically, when you have different reverberation between enrollment and test, | 
|---|
| 0:10:38 | enrolling on reverberant data doesn't help | 
|---|
| 0:10:45 | we had several different background distractors | 
|---|
| 0:10:48 | we call them distractors because they tend to distract the system from the true speech of the | 
|---|
| 0:10:52 | speaker | 
|---|
| 0:10:53 | we had TV in the background, | 
|---|
| 0:10:55 | or babble noise in the background | 
|---|
| 0:10:57 | when enrolling, we enrolled on clean speech, no distraction, | 
|---|
| 0:11:01 | but for verification we had three different types of distraction: | 
|---|
| 0:11:04 | none, TV noise, which sometimes includes speech, | 
|---|
| 0:11:08 | and babble noise | 
|---|
| 0:11:11 | and what we found was that the systems that were submitted were reasonably robust to the | 
|---|
| 0:11:14 | effect of TV noise in the background | 
|---|
| 0:11:16 | however, with babble, | 
|---|
| 0:11:18 | including speech in the environment of the true speaker | 
|---|
| 0:11:21 | resulted in a forty-five to fifty percent relative degradation, so it's quite a | 
|---|
| 0:11:26 | significant drop | 
|---|
| 0:11:30 | okay, now microphone type | 
|---|
| 0:11:33 | we had a studio mic placed close to the source for enrollment, | 
|---|
| 0:11:36 | and then three different mic classes, lapel, MEMS, and boundary, for verification at different | 
|---|
| 0:11:41 | positions | 
|---|
| 0:11:42 | we'll look at different distances on the next slide; | 
|---|
| 0:11:45 | here we just want to look at how the different microphones | 
|---|
| 0:11:49 | compare | 
|---|
| 0:11:51 | consistently across systems, | 
|---|
| 0:11:53 | we have a step down going from boundary to MEMS to the lapel microphone | 
|---|
| 0:12:00 | for looking at different distances, we're just looking at the top five systems here | 
|---|
| 0:12:04 | to constrain the results to look at, | 
|---|
| 0:12:07 | with the lapel mics placed at seven distances, for the top five teams | 
|---|
| 0:12:12 | note that, due to occlusion and masking effects in the setup, a more distant mic | 
|---|
| 0:12:17 | does not always pose a greater challenge | 
|---|
| 0:12:20 | what was interesting: the bars that really stand out, the | 
|---|
| 0:12:23 | red, | 
|---|
| 0:12:25 | teal, and blue, | 
|---|
| 0:12:27 | correspond to mics that tended to be partially obscured, | 
|---|
| 0:12:28 | so some of them are actually hidden, | 
|---|
| 0:12:30 | or very far from the | 
|---|
| 0:12:33 | speaker | 
|---|
| 0:12:34 | so those mics tend to really drop performance as well | 
|---|
| 0:12:39 | this also tends to explain the poor performance of the lapel mics in general | 
|---|
| 0:12:43 | that we saw on the previous slide | 
|---|
| 0:12:48 | now, as a summary, let's look at the remaining challenges, | 
|---|
| 0:12:51 | based on what we've seen so far from VOiCES publications and system submissions | 
|---|
| 0:12:58 | the range in the reverberation (RT60) characteristics | 
|---|
| 0:13:01 | was two to three times worse in the evaluation set than in the development set | 
|---|
| 0:13:06 | now, this was quite deliberate: | 
|---|
| 0:13:09 | we made the level of reverberation in the evaluation rooms greater | 
|---|
| 0:13:12 | than in development, | 
|---|
| 0:13:13 | and it was quite clear, we found, that this | 
|---|
| 0:13:17 | severe amount of reverberation degraded results compared to | 
|---|
| 0:13:22 | development | 
|---|
| 0:13:24 | current speaker recognition technology doesn't tend to address | 
|---|
| 0:13:27 | the impact of reverberation sufficiently | 
|---|
| 0:13:31 | the error rates are a lot higher for the reverberant conditions than for the source signal, | 
|---|
| 0:13:35 | reverberation in the presence of noise further degrades the performance, | 
|---|
| 0:13:39 | and | 
|---|
| 0:13:40 | increasing distance | 
|---|
| 0:13:42 | amplifies the impact of reverberation and degrades performance | 
|---|
| 0:13:46 | so we need to explore novel speaker modeling techniques in this context, capable of | 
|---|
| 0:13:50 | handling long-term information | 
|---|
| 0:13:53 | in utterances, dealing with reverberation like that which can happen in this domain, | 
|---|
| 0:13:57 | and being robust to multiple noise conditions | 
|---|
| 0:14:02 | system calibration, as observed, was critical for systems deployed in the real world | 
|---|
| 0:14:07 | the bottom sixteen teams failed to successfully calibrate their systems, | 
|---|
| 0:14:10 | and previous work has shown that there is actually a large degradation in calibration performance when | 
|---|
| 0:14:15 | the distance to the microphone | 
|---|
| 0:14:17 | is significantly different between the calibration training conditions and the conditions it's applied to | 
|---|
| 0:14:23 | so one way that we might be able to mitigate this kind of effect | 
|---|
| 0:14:27 | is to have calibration methods that dynamically consider the conditions of the trial, | 
|---|
| 0:14:32 | the predicted distance, for instance | 
|---|
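[editor's note] one way to sketch that idea: extend standard linear score calibration with a per-trial side feature such as predicted microphone distance. Everything below, the log-distance feature, the plain gradient-descent trainer, and all names, is an illustrative assumption rather than the recipe of any submitted system.

```python
# Sketch: condition-aware calibration. We learn an affine map
#   llr = a * raw_score + b * log(distance) + c
# by full-batch gradient descent on the logistic (cross-entropy) loss,
# so the calibrated score can shift with the trial's predicted distance.
import math

def train_calibration(scores, distances, labels, epochs=2000, lr=0.1):
    """labels: 1 for target trials, 0 for non-target trials."""
    a, b, c = 1.0, 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = gc = 0.0
        for s, d, y in zip(scores, distances, labels):
            z = a * s + b * math.log(d) + c
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - y                      # dLoss/dz for cross-entropy
            ga += err * s
            gb += err * math.log(d)
            gc += err
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c

def calibrated_llr(score, distance, params):
    a, b, c = params
    return a * score + b * math.log(distance) + c
```

in practice such a mapping would be trained on held-out trials with known conditions, and a deployed system would likely have to estimate the distance from the signal itself.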
| 0:14:36 | and while the challenge is based on a single channel, the VOiCES data was actually | 
|---|
| 0:14:40 | collected with more than one microphone, | 
|---|
| 0:14:42 | multiple microphones in the room, | 
|---|
| 0:14:44 | and we haven't looked into the effect of, for instance, beamforming | 
|---|
| 0:14:49 | and there are a number of other front-end processing techniques | 
|---|
| 0:14:52 | that we would like to look at, | 
|---|
| 0:14:53 | including speech enhancement | 
|---|
| 0:14:55 | and dereverberation, tailored a little bit more specifically for the task of speaker recognition | 
|---|
| 0:15:01 | so we hope you enjoy this special session at odyssey this year and that you | 
|---|
| 0:15:06 | continue to drive technology forward in these areas | 
|---|
| 0:15:09 | and we look forward to seeing what comes out of it | 
|---|
| 0:15:12 | thank you | 
|---|