Speech Transcript - The 2019 NIST Audio-Visual Speaker Recognition Evaluation

Greetings, and thanks for tuning in for my second presentation in the session. My name is Omid Sadjadi,

and together with my colleagues listed here on this slide, I'll be presenting an overview

of the 2019 NIST

audio-visual speaker recognition evaluation,

which was organized

in the fall of 2019.

Before I start my presentation,

if you haven't already, I'd like to invite you to see my first presentation

in the session, which was an overview of the 2019

NIST SRE CTS Challenge.

In addition, I'd like to invite you to sign up and participate in the 2020 NIST CTS Challenge,

which is currently ongoing.

Here is the outline of my presentation.

I'll start by describing the highlights of the 2019 audio-visual SRE,

then define the task and give a summary of the datasets and performance metric

for this evaluation.

I'll share some participation statistics, followed by results and system performance analyses.

Then I'll conclude with a summary of the audio-visual SRE19 and share

the main observations.

This slide presents the main highlights

of the 2019 SRE,

which included

video data for audio-visual person recognition,

an open training condition,

as well as a redesigned

and more flexible evaluation web platform.

The recently introduced highlights also included

audio from video,

which means

audio

recordings that were extracted from

online videos.

So the primary task for the 2019 audio-visual SRE was person detection, meaning that

given enrollment video data from the target person

and test video data from an unknown person, automatically determine whether the target person is

present in the test video.

This person detection problem

can be posed as a two-class hypothesis testing problem,

where

the null hypothesis is that the test video

belongs to the target person, and the alternative hypothesis is that the test video does not

belong to the target person.

The system output for this task is then a statistic computed on the test video, known

as the log-likelihood ratio, defined on this slide.
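
As a rough illustration of the hypothesis test just described, the log-likelihood ratio and the resulting accept/reject decision can be sketched as below. The Gaussian score models, their parameters, and the function names are assumptions made purely for illustration; the talk does not say how participating systems modeled the two hypotheses.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian (an illustrative score model)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def log_likelihood_ratio(x, target=(2.0, 1.0), nontarget=(-2.0, 1.0)):
    """LLR = log p(x | target hypothesis) - log p(x | non-target hypothesis).

    `target` and `nontarget` are hypothetical (mean, std) pairs.
    """
    return gaussian_log_pdf(x, *target) - gaussian_log_pdf(x, *nontarget)

def decide(llr, threshold=0.0):
    """Accept the target hypothesis when the LLR exceeds the threshold."""
    return llr > threshold
```

With these assumed models, a positive LLR favors the hypothesis that the target person is present in the test video, and a negative one favors the alternative.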

In terms of evaluation conditions,

the audio-visual SRE19 offered an open training condition that allowed the use of

unlimited data for system training to demonstrate possible performance gains.

For enrollment,

the systems were given video segments

with variable speech content ranging from 10 seconds to 600 seconds.

In addition,

the systems were provided with

diarization marks as well as face bounding boxes

for the face frames containing the target individual.

Lastly, the test involved video segments of variable durations in the 10 to 600

seconds range.

The development and evaluation data for the audio-visual SRE19 were extracted from the

JANUS Multimedia and the VAST corpora.

The JANUS Multimedia dataset was extracted from the IARPA Janus Benchmark-C,

and it consists of two subsets, namely Core and Full,

each of which

comes with its own dev and test splits.

For this evaluation we only used the Core subset,

because it better reflects

the data conditions in SRE19.

The VAST corpus, on the other hand, was collected by the LDC and contains

amateur online videos, such as video blogs or vlogs, spoken in English. The videos have

extremely diverse audio and visual conditions:

background environments, different codecs, different illuminations and poses.

In addition,

there may be multiple individuals appearing in each video.

This slide shows speech

duration histograms for the enrollment and test segments in the audio-visual SRE19 dev and

test sets, which are shown on the left and right plots respectively.

The enrollment segment speech durations were calculated after applying diarization, while no diarization was applied

to the test segments.

Nevertheless,

the enrollment and test histograms both appear to follow

log-normal distributions, and overall

they are consistent across the dev and test sets.

This table shows the data statistics

for the Core subset of the JANUS Multimedia dataset as well as the audio-visual SRE19

dev and test sets,

which were extracted from the VAST corpus.

Notice that overall the size of the JANUS data is larger than the size of

the SRE19

audio-visual dev and test sets,

which makes it a good candidate for system training and development purposes.

For performance measurement we used a metric known as the detection cost, or C_Det

for short,

which is a weighted average of false reject and false alarm probabilities,

with the weights defined in the table on this slide.

To improve the interpretability of the C_Det, it is commonly normalized by a default

cost,

defined on this slide.

This results in a simplified notation for the C_Det,

which is parameterized by the detection threshold,

and for this evaluation

this detection threshold

is the log of beta, where beta is also defined on this slide.
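
To make the metric concrete, here is a minimal sketch of the normalized detection cost in its usual NIST form, C_norm = P_miss + beta * P_fa with beta = (C_fa / C_miss) * (1 - P_target) / P_target, evaluated at the threshold log(beta). The default parameter values below are assumptions for illustration only; the values actually used are the ones in the table on the slide, which the transcript does not reproduce. The function names are hypothetical.

```python
import math

def beta(p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Weight on the false alarm rate after normalizing by C_miss * P_target.

    The default parameter values are illustrative assumptions.
    """
    return (c_fa / c_miss) * (1.0 - p_target) / p_target

def normalized_detection_cost(p_miss, p_fa, **params):
    """Normalized cost: P_miss + beta * P_fa."""
    return p_miss + beta(**params) * p_fa

def error_rates(target_llrs, nontarget_llrs, threshold):
    """Miss and false alarm rates when accepting trials with LLR >= threshold."""
    p_miss = sum(s < threshold for s in target_llrs) / len(target_llrs)
    p_fa = sum(s >= threshold for s in nontarget_llrs) / len(nontarget_llrs)
    return p_miss, p_fa

def actual_cost(target_llrs, nontarget_llrs, **params):
    """Cost at the fixed threshold log(beta), as opposed to the minimum cost."""
    threshold = math.log(beta(**params))
    p_miss, p_fa = error_rates(target_llrs, nontarget_llrs, threshold)
    return normalized_detection_cost(p_miss, p_fa, **params)
```

Because the threshold is fixed at log(beta), the actual cost rewards systems whose scores behave like well-calibrated log-likelihood ratios, while the minimum cost is obtained by choosing the best threshold after the fact.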

This slide presents the participation statistics for the SRE19 audio-visual evaluation.

Overall we received submissions from 14 teams, which were formed by 26 sites,

eight of which were

from industry, and the remaining 18 were from academia.

Also shown on this slide

is a map of the world,

which

shows us where

the participating teams were coming from.

This slide shows the number of submissions per team per track, and there were three tracks in

total: audio, visual, and audio-visual tracks

for the 2019 audio-visual

speaker recognition evaluation.

We can see that the majority of the teams participated in all three tracks,

two teams only participated in the audio

and audio-visual tracks, and

one team

participated in the audio-only track.

In total we received 102 submissions,

which were made by

14 teams, as I mentioned.

This slide shows the block diagram of the baseline speaker recognition system developed for the

audio-visual SRE19 using the NIST

speaker and language recognition evaluation toolkit as well as Kaldi.

The embedding extractor was trained using the Kaldi VoxCeleb

version 2 recipe,

and to develop this system we didn't use any hyperparameter tuning

or score calibration.

This slide shows

a block diagram of the baseline face recognition system developed for the audio-visual SRE19,

and to develop this we used FaceNet as well as the NIST

SLRE toolkit.

We used the pre-trained multi-task cascaded convolutional neural network model for face detection, and for

embedding extraction

we used the ResNet model that was trained on the VGGFace2

dataset.

In order to tune the hyperparameters we used the JANUS Multimedia dataset,

and similar to what we had for the baseline speaker recognition system, no score calibration

was used

for the face recognition system.

This slide shows the performance of the primary submissions

per team per track,

as well as the performance of the baseline systems, in terms of the actual and

minimum costs

on the test set.

The blue bars and red bars show the minimum and actual costs respectively.

The y-axis denotes

the C_primary, and the limit for the y-axis is set to

0.5 to facilitate cross-system comparisons in the lower cost regions.

We can make several important observations from this figure. First, compared to the most recent

SRE, which was the SRE18

at the time,

there seems to be a notable improvement in audio-only speaker recognition performance,

and these improvements are largely attributed to the use of extended and more

complex end-to-end neural network architectures, such as the ResNet architectures, along with

soft-margin loss functions such as the angular softmax

for speaker embedding extraction.

And given the size of these models,

they can effectively exploit the vast amounts of training data that are available through

data augmentation.

The second observation is that performance trends for the top four teams are generally similar,

and we can see that the actual costs

for the audio-only submissions are larger than those for the visual-only submissions,

and the audio-visual fusion, which means the combination of speaker and face recognition systems, results

in substantial gains in person recognition performance.

So for example, we can see greater than 85 percent relative improvement in terms

of the minimum detection cost for the leading system compared to either of the speaker

or face recognition systems alone.

Thirdly, more than half of the submissions outperform the baseline audio-visual system,

with the leading system achieving larger than 90 percent improvement over the baseline.

The fourth observation is that, in terms of calibration performance, we can

see mixed results for some teams.

For example, for the top two teams the calibration errors for speaker recognition systems are larger

than those for the face recognition systems,

while for some others the opposite is true.

Finally, in terms of the minimum detection cost, the top performing speaker and face

recognition systems achieve comparable results, which is a very promising outcome of this evaluation for the

speaker recognition community,

given the results we have seen before in prior studies where face recognition systems were

shown to outperform speaker recognition systems by a large margin.

It's also worth emphasizing here that the top performing speaker and face recognition systems,

which are both from team five,

are both single systems, which means no system combination or fusion

was used for these

systems.

So now, to gain further insight on actual performance differences among the top performing systems,

we also computed bootstrapping-based 95 percent confidence intervals for these point

estimates of the performance.
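
One simple way to obtain such intervals is a percentile bootstrap over the trials, sketched below. This is only an assumed scheme for illustration: the talk does not describe the exact resampling procedure used in the evaluation, and the cost parameter (beta = 19) and all function names are hypothetical.

```python
import math
import random

BETA = 19.0  # assumed weight on the false alarm rate, for illustration only

def actual_cost(target_llrs, nontarget_llrs, beta=BETA):
    """Normalized detection cost at the fixed threshold log(beta)."""
    threshold = math.log(beta)
    p_miss = sum(s < threshold for s in target_llrs) / len(target_llrs)
    p_fa = sum(s >= threshold for s in nontarget_llrs) / len(nontarget_llrs)
    return p_miss + beta * p_fa

def bootstrap_ci(target_llrs, nontarget_llrs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the actual cost.

    Target and non-target trials are resampled separately, with replacement.
    """
    rng = random.Random(seed)
    costs = []
    for _ in range(n_boot):
        tgt = rng.choices(target_llrs, k=len(target_llrs))
        non = rng.choices(nontarget_llrs, k=len(nontarget_llrs))
        costs.append(actual_cost(tgt, non))
    costs.sort()
    lower = costs[int((alpha / 2) * n_boot)]
    upper = costs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Two systems whose intervals overlap heavily under this kind of resampling would be hard to rank reliably from their point estimates alone.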

The plots on this slide show the performance confidence intervals around the actual detection cost

for each team, for the audio track, which is shown on the top, the visual track, which is shown in

the middle, and the audio-visual track, which is shown at the bottom.

In general,

the audio systems exhibit narrower confidence margins than their visual counterparts. This could be

partly because most of the participants were from the speaker recognition community,

using off-the-shelf face recognition systems along with pre-trained models, which were not necessarily optimized

for the task at hand in the SRE19 audio-visual

evaluation.

It is also interesting to note that several leading systems may perform comparably under different

samplings of the trial space.

And another interesting observation is that the audio-visual fusion seems to boost the decision-

making confidence of the systems by a significant margin, to the point where the two leading systems

outperform the other systems

statistically significantly.

These observations further highlight the importance of statistical significance tests when reporting

performance results, or in the model selection stage during system development, particularly when the number

of trials

is relatively small.

This slide shows the DET performance curves, where DET stands for detection error tradeoff.

These are the performance curves for a top performing system for the audio, visual, and audio-visual

tracks.

The solid black curves in the figure represent equi-cost contours, and that means that

all points on a given contour correspond to the same detection cost value.
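
For readers who want to reproduce this kind of plot, the operating points behind a DET curve can be sketched as follows. The threshold sweep and the weight beta = 19 are assumptions for illustration, and the function names are hypothetical. An equi-cost contour is then the set of (P_fa, P_miss) points for which P_miss + beta * P_fa is constant.

```python
def det_points(target_llrs, nontarget_llrs):
    """(P_fa, P_miss) pairs obtained by sweeping the threshold over all scores.

    A final threshold of +inf is added so the 'reject everything' point is included.
    """
    thresholds = sorted(set(target_llrs) | set(nontarget_llrs)) + [float("inf")]
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_llrs) / len(target_llrs)
        p_fa = sum(s >= t for s in nontarget_llrs) / len(nontarget_llrs)
        points.append((p_fa, p_miss))
    return points

def min_cost(points, beta=19.0):
    """Minimum normalized cost over all operating points (beta is assumed)."""
    return min(p_miss + beta * p_fa for p_fa, p_miss in points)
```

A DET plot draws these points on normal-deviate axes; the operating point that touches the lowest equi-cost contour gives the minimum detection cost.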

So here we can see that, consistent with our previous observations from the overall results

a few slides back,

audio-visual fusion provides remarkable improvements in performance

across all operating points, not just a single operating point on the DET curve, which is expected

given how complementary the two modalities, audio and visual, are.

In addition, for a wide range of operating points the speaker and face recognition systems

provide comparable performance, which is very promising for the speaker recognition community

and shows how far the technology has come.

This slide shows the normalized target and non-target score distributions for

a top performing system for all tracks, which means the audio, visual, and audio-visual tracks.

The vertical dashed line

represents the detection threshold, which is related to the value of beta that we

discussed when we were talking about the performance measurement.

Here we can see that the score distributions from the audio-only and face-

only systems

are roughly aligned, with the target and non-target distributions showing some overlap at

the threshold point.

However, after the fusion, the audio-visual fusion, the target and non-target classes are

well separated, with minimal overlap

at the threshold point.

And we speculate that this is actually

the reason that

we see such low errors,

specifically low false rejects,

for systems that use audio-visual fusion.

So in summary, we used a new and improved evaluation web platform for automated submission

validation and scoring for the audio-visual SRE19. Through this web platform

we released the software package for system output validation and scoring.

We also released the baseline person recognition system description and results.

In terms of data and metrics, for the first time we introduced video data

for audio-visual person recognition.

We released large labeled datasets, which were extracted from the JANUS Multimedia dataset as

well as the VAST corpus,

and these datasets properly matched the evaluation set.

In terms of results,

we saw substantial gains from the audio-visual fusion.

We also saw that the top performing speaker and face recognition systems perform

comparably.

We saw major improvements that were attributed to the use of more

extended and more complex neural network models, such as the ResNet model,

along with angular margin losses.

In addition to this, the improvements were attributed to the extensive use of data augmentation

and to the clustering of embeddings, which was done primarily for diarization purposes.

Effective use of the dev set as well as the choice of calibration set were

also very important, and they were key to performing well in this evaluation.

And finally, although fusion still seems to

play a role, we saw that strong single systems can be as good as fusion systems.

And with that, I'd like to conclude this talk.

Thank you for your attention. Be well and stay safe.

The 2019 NIST Audio-Visual Speaker Recognition Evaluation

Evaluation and Benchmarking

Omid Sadjadi, Craig Greenberg, Elliot Singer, Douglas Reynolds, Lisa Mason, Jaime Hernandez-Cordero