0:00:13 Greetings, and thanks for tuning in for my second presentation in this session. My name is Omid Sadjadi, and together with my colleagues listed here on this slide, I'll be presenting an overview of the 2019 NIST Audio-Visual Speaker Recognition Evaluation, which was organized in the fall of 2019.
0:00:35 Before I start my presentation, I'd like to invite you, if you haven't already, to see my first presentation in this session, which was an overview of the 2019 NIST SRE CTS Challenge. In addition, I'd like to invite each of you to participate in the 2020 NIST CTS Challenge, which is currently ongoing.
0:01:02 So, here is the outline of my presentation. I'll start by describing the highlights of the 2019 audio-visual SRE, then define the task and give a summary of the datasets and performance metric for this evaluation. I'll then share some participation statistics, followed by results and system performance analyses. Finally, I'll wrap up with a quick summary of the audio-visual SRE19 and share the main observations.
0:01:31 This slide presents the main highlights of the 2019 SRE, which included video data for audio-visual person recognition, an open training condition, as well as a redesigned and more flexible evaluation web platform. Recently introduced highlights also included audio from VAST, meaning amateur recordings that were extracted from videos posted online.
0:02:03 So, the primary task for the 2019 audio-visual SRE was person detection, meaning that, given enrollment video data from the target person and test video data from an unknown person, the system must automatically determine whether the target person is present in the test video. This person detection problem can be posed as a two-class hypothesis testing problem, where the null hypothesis is that the test video s belongs to the target person, and the alternative hypothesis is that the test video does not belong to the target person. The system output for this task is then a statistic computed on the test video, known as the log-likelihood ratio, defined on this slide.
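For reference, the standard formulation of this statistic (the exact notation on the slide may differ) is:

```latex
\mathrm{LLR}(s) \;=\; \log \frac{p(s \mid H_0)}{p(s \mid H_1)}
```

where $H_0$ is the hypothesis that the test video $s$ belongs to the target person, and $H_1$ is the hypothesis that it does not.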
0:02:54 In terms of evaluation conditions, the audio-visual SRE19 offered an open training condition that allowed the use of unlimited data for system training, to demonstrate possible performance gains. For enrollment, the systems were given video segments with variable speech content, ranging from 10 seconds to 600 seconds. In addition, the systems were provided with diarization marks, as well as face bounding boxes for the frames containing the target individual. Lastly, the test involved video segments of variable durations in the 10 to 600 second range.
0:03:42 The development and evaluation data for the audio-visual SRE19 were extracted from the JANUS Multimedia and the VAST corpora. The JANUS Multimedia dataset was extracted from the IARPA JANUS Benchmark, and it consists of two subsets, namely Core and Full, each of which comes with its own dev and test splits. For this evaluation we only used the Core subset, because it better reflects the data conditions in SRE19. The VAST corpus, on the other hand, was collected by the LDC and contains amateur online videos, such as video blogs (vlogs), spoken in English. The videos have extremely diverse audio and visual conditions: background environments, different codecs, different illuminations, and poses. In addition, there may be multiple individuals appearing in each video.
0:04:45 This slide shows speech duration histograms for the enrollment and test segments in the audio-visual SRE19 dev and test sets, which are shown in the left and right plots, respectively. The enrollment segment speech durations were calculated after applying diarization, while no diarization was applied to the test segments. Nevertheless, the enrollment and test histograms closely follow log-normal distributions, and overall they are consistent across the dev and test sets.
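As an aside, the log-normal shape of such duration histograms can be sanity-checked with a simple distribution fit; here is a minimal sketch (not part of the actual evaluation tooling):

```python
import numpy as np
from scipy import stats

def check_lognormality(durations_sec):
    """Fit a log-normal to speech durations and report a goodness-of-fit statistic."""
    durations_sec = np.asarray(durations_sec)
    # Fix the location at zero so the fit is a pure log-normal over durations
    shape, loc, scale = stats.lognorm.fit(durations_sec, floc=0)
    ks = stats.kstest(durations_sec, "lognorm", args=(shape, loc, scale))
    return shape, scale, ks.statistic
```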
0:05:24 This table shows the data statistics for the Core subset of the JANUS Multimedia dataset, as well as for the audio-visual SRE19 dev and test sets, which were extracted from the VAST corpus. Notice that, overall, the size of the JANUS data is larger than the size of the SRE19 audio-visual dev and test sets, which makes it a good candidate for system training and development purposes.
0:05:58 For performance measurement, we used the metric known as the detection cost, or C_Det for short, which is a weighted average of the false-reject and false-alarm probabilities, with the weights defined in the table on this slide. To improve the interpretability of the C_Det, it is commonly normalized by the default cost, defined on this slide. This results in a simplified notation for the C_Det, which is parameterized by the detection threshold; for this evaluation, the detection threshold is the log of beta, where beta is also defined on this slide.
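For reference, the standard NIST definitions take the following form, where the cost weights $C_{\mathrm{FR}}$, $C_{\mathrm{FA}}$ and the target prior $P_{\mathrm{Target}}$ are the parameters given in the slide's table (not reproduced here):

```latex
C_{\mathrm{Det}} = C_{\mathrm{FR}}\,P_{\mathrm{Target}}\,P_{\mathrm{FR}}
                 + C_{\mathrm{FA}}\,(1 - P_{\mathrm{Target}})\,P_{\mathrm{FA}},
\qquad
C_{\mathrm{Norm}} = \frac{C_{\mathrm{Det}}}{C_{\mathrm{Default}}},
\qquad
\beta = \frac{C_{\mathrm{FA}}\,(1 - P_{\mathrm{Target}})}{C_{\mathrm{FR}}\,P_{\mathrm{Target}}}
```

with the detection threshold set to $\log \beta$.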
0:06:43 This slide presents the participation statistics for the SRE19 audio-visual evaluation. Overall, we received submissions from 14 teams, which were formed by 26 sites, eight of which were from industry, and the remaining 18 were from academia. Also shown on this slide is a map of the world, which shows where the participating teams were coming from.
0:07:20 This slide shows the number of submissions received per team and per track, for the audio-only, visual-only, and audio-visual tracks of the 2019 audio-visual speaker recognition evaluation. We can see that the majority of the teams participated in all three tracks, one or two teams only participated in the audio and audio-visual tracks, and one team participated in the audio-only track. In total, we received 102 submissions, which were made by the 14 teams, as mentioned.
0:08:04 This slide shows the block diagram of the baseline speaker recognition system developed for the audio-visual SRE19, using the NIST speaker and language recognition evaluation toolkit as well as Kaldi. The x-vector embedding extractor was trained using the Kaldi VoxCeleb (version 2) recipe, and to develop this system we didn't use any hyper-parameter tuning or score calibration.
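As an illustration of the scoring back end, here is a minimal sketch that compares pre-extracted x-vectors with a cosine score; the cosine back end is an assumption made here for brevity, since the Kaldi VoxCeleb recipe's standard back end is PLDA:

```python
import numpy as np

def cosine_score(enroll_xvec: np.ndarray, test_xvec: np.ndarray) -> float:
    """Cosine similarity between length-normalized x-vectors.

    Assumes x-vectors were already extracted (e.g., with the Kaldi
    VoxCeleb v2 recipe); a PLDA back end would replace this in practice.
    """
    e = enroll_xvec / np.linalg.norm(enroll_xvec)
    t = test_xvec / np.linalg.norm(test_xvec)
    return float(e @ t)
```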
0:08:40 This slide shows a block diagram of the baseline face recognition system developed for the audio-visual SRE19. To develop this, we used FaceNet as well as the NIST ancillary toolkit. We used the pre-trained multi-task cascaded convolutional neural network (MTCNN) model for face detection, and for embedding extraction we used a ResNet model that was trained on the VGGFace2 dataset. In order to tune the hyper-parameters, we used the JANUS Multimedia dataset, and, similar to the baseline speaker recognition system, no score calibration was used for the face recognition system.
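For illustration, a comparable pipeline can be reproduced with the open-source facenet-pytorch package, which bundles a pre-trained MTCNN detector and an Inception-ResNet embedding model trained on VGGFace2; this is an assumption for the sketch, not the exact tooling used for the baseline:

```python
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # pre-trained MTCNN face detector
resnet = InceptionResnetV1(pretrained="vggface2").eval()   # embedding model trained on VGGFace2

img = Image.open("frame.jpg")     # a single video frame (placeholder path)
face = mtcnn(img)                 # detect, align, and crop the most probable face
if face is not None:
    embedding = resnet(face.unsqueeze(0))  # 512-dimensional face embedding
```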
0:09:30 This slide shows the performance of the primary submissions, per team and per track, as well as the performance of the baseline systems, in terms of the actual and minimum costs on the test set. The blue bars and red bars show the minimum and actual costs, respectively. The y-axis denotes the C_Primary, and its upper limit is capped to facilitate cross-system comparisons in the lower cost regions.
0:10:09 We can make some noteworthy observations from this figure. First, compared to the most recent SRE, which was SRE18 at the time, there seems to be a notable improvement in audio-only speaker recognition performance. These improvements are largely attributed to the use of extended and more complex end-to-end neural network architectures, such as the ResNet architectures, along with soft-margin loss functions, such as the angular softmax, for speaker embedding extraction. Given the size of these models, they can effectively exploit the vast amounts of training data that are available through data augmentation.
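To make the loss-function idea concrete, here is a minimal PyTorch sketch of an additive angular margin softmax head; the scale s and margin m values are placeholders, and individual systems may instead use the multiplicative-margin angular softmax variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginSoftmax(nn.Module):
    """Additive angular margin softmax head (ArcFace-style), for illustration."""

    def __init__(self, emb_dim: int, n_speakers: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.s, self.m = s, m  # placeholder scale and margin; tuned in practice

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class logit
        target = F.one_hot(labels, cos.size(1)).bool()
        cos_m = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * cos_m, labels)
```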
0:11:02 The second observation is that the performance trends for the top four teams are generally similar, and we can see that the actual costs for the audio-only submissions are larger than those for the visual-only submissions. The audio-visual fusion, which means the combination of the speaker and face recognition systems, results in substantial gains in person recognition performance. For example, we can see a greater than 85% relative improvement in terms of the minimum detection cost for the leading system, compared to either the speaker or the face recognition system alone.
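A common way to realize such score-level fusion is linear logistic regression trained on development trials; the following is a minimal sketch of that generic technique, not the specific method any team used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(audio_dev, visual_dev, labels):
    """Learn fusion weights on development trials (labels: 1 = target, 0 = non-target)."""
    X = np.column_stack([audio_dev, visual_dev])
    return LogisticRegression().fit(X, labels)

def fused_score(model, audio, visual):
    """Fused decision logit for new trials, usable as a calibrated score."""
    return model.decision_function(np.column_stack([audio, visual]))
```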
0:11:46 Thirdly, more than half of the submissions outperformed the baseline audio-visual system, with the leading system achieving a greater than 90% improvement over the baseline.
0:12:01 The fourth observation is that, in terms of calibration performance, we can see mixed results for some teams. For example, for the top two teams, the calibration errors for the speaker recognition systems are larger than those for the face recognition systems, while for some others the opposite is true. 0:12:23 Finally, in terms of the minimum detection cost, the top performing speaker and face recognition systems achieve comparable results, which is a very promising outcome of this evaluation for the speaker recognition community, given the results we have seen in prior studies, where face recognition systems were shown to outperform speaker recognition systems by a large margin. 0:12:51 It's also worth emphasizing here that the top performing speaker and face recognition systems, which were both from Team 5, are single systems; that means no system combination or fusion was used to build these two systems.
0:13:18 Now, to gain further insight into the actual performance differences among the top performing systems, we also computed bootstrap-based 95% confidence intervals for these point estimates of performance.
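A minimal sketch of a percentile bootstrap over trials is shown below; this is for illustration only, since evaluation analyses often resample at the speaker or partition level rather than over raw trials:

```python
import numpy as np

def bootstrap_ci(scores, labels, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a trial-based metric.

    metric: callable taking (scores, labels) and returning a scalar,
    e.g., the actual detection cost at a fixed threshold.
    """
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(scores), len(scores))  # resample trials with replacement
        stats.append(metric(scores[idx], labels[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```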
0:13:36 The plots on this slide show the performance confidence intervals around the actual detection costs, per team, for the audio track, which is shown at the top, the visual track, which is shown in the middle, and the audio-visual track, which is shown at the bottom. 0:13:52 In general, the audio systems exhibit larger confidence margins than their visual counterparts. This could be partly because most of the participants were from the speaker recognition community, using off-the-shelf face recognition systems along with pre-trained models which were not necessarily optimized for the task at hand in the SRE19 audio-visual evaluation.
0:14:24 Also, on this slide, notice that several leading systems perform almost comparably under different samplings of the trial space. 0:14:35 Another interesting observation is that the audio-visual fusion seems to boost the decision-making confidence of the systems by a significant margin, to the point where the two leading systems outperformed the other systems in a statistically significant manner. 0:15:00 These observations further highlight the importance of statistical significance tests when reporting performance results, or in the model selection stage during system development, particularly when the number of trials is relatively small.
0:15:20 This slide shows the DET performance curves, where DET stands for detection error trade-off, for the top performing systems in the audio, visual, and audio-visual tracks. The solid black curves in the figure represent equal-cost contours, which means that all the points on a given contour correspond to the same detection cost value.
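For reference, DET curves plot miss versus false-alarm probabilities on normal-deviate axes; a minimal sketch of how such points can be computed from trial scores (an illustration, not the official scoring tool) is:

```python
import numpy as np
from scipy.stats import norm

def det_curve(scores, labels):
    """Compute DET curve points in the normal-deviate domain.

    labels: 1 for target trials, 0 for non-target trials.
    """
    order = np.argsort(scores)
    labels = np.asarray(labels)[order]
    n_tgt, n_non = labels.sum(), (1 - labels).sum()
    # Sweep the threshold over all sorted scores
    p_miss = np.cumsum(labels) / n_tgt           # targets rejected at/below threshold
    p_fa = 1.0 - np.cumsum(1 - labels) / n_non   # non-targets accepted above threshold
    clip = lambda p: p.clip(1e-6, 1 - 1e-6)      # avoid infinite normal deviates
    return norm.ppf(clip(p_miss)), norm.ppf(clip(p_fa))
```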
0:16:07 Here we can see that, consistent with our previous observations from the overall results a few slides back, the audio-visual fusion provides remarkable improvements in performance across all operating points, not just a single operating point on the DET curve, which is expected given how complementary the two modalities, audio and visual, are. In addition, for a wide range of operating points, the speaker and face recognition systems provide comparable performance, which is very promising for the speaker recognition community and shows how far the technology has come.
0:16:29 This slide shows the normalized target and non-target score distributions for the top performing systems across all tracks, meaning the audio, visual, and audio-visual tracks. The vertical dashed line represents the detection threshold, which is related to the value of beta that we discussed when we were talking about the performance measurement. 0:16:55 Here we can see that the score distributions from the audio-only and face-only systems are roughly aligned, with the target and non-target distributions showing some overlap around the threshold point. 0:17:11 However, after the audio-visual fusion, the target and non-target classes are well separated, with minimal overlap at the threshold point. 0:17:25 We speculate that this is actually the reason why we see such low errors, specifically low false rejects, for systems that use audio-visual fusion.
0:17:46 So, in summary: we used a new and improved evaluation web platform for automated submission validation and scoring for the audio-visual SRE19. Through this web platform, we released a software package for system output validation and scoring. We also released the baseline person recognition system description and results. 0:18:07 In terms of data, for the first time we introduced video data for audio-visual person recognition. We released large labeled datasets, extracted from the JANUS Multimedia dataset as well as the VAST corpus, and these datasets properly matched the evaluation set. 0:18:26 In terms of results, we saw substantial gains from the audio-visual fusion, and we also saw that the top performing speaker and face recognition systems performed comparably.
0:18:41 We saw major improvements that were attributed to the use of more extended and more complex neural network models, such as the ResNet models, along with angular margin losses. In addition to this, the improvements were attributed to the extensive use of data augmentation, and to the clustering of embeddings, which was done primarily for diarization purposes, as sketched below.
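As a concrete illustration of that embedding clustering step, here is a minimal agglomerative clustering sketch; the cosine distance, average linkage, and stopping threshold are placeholders that individual teams would have tuned:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_embeddings(embeddings: np.ndarray, threshold: float = 0.3):
    """Agglomerative clustering of segment embeddings, e.g., for diarization.

    embeddings: (n_segments, emb_dim) array; returns a cluster label per segment.
    """
    dists = pdist(embeddings, metric="cosine")      # pairwise cosine distances
    tree = linkage(dists, method="average")         # average-linkage dendrogram
    return fcluster(tree, t=threshold, criterion="distance")
```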
0:19:13 Effective use of the dev set, as well as the choice of the calibration set, were also very important, and they were key to performing well in this evaluation.
0:19:22 Finally, although fusion still seems to play a role, we saw that strong single systems can be as good as fusion systems. 0:19:34 And with that, I'd like to conclude this talk. Thank you for your attention, be well, and stay safe.