0:00:13 Thanks for tuning in for my second presentation in the session. Together with my colleagues listed on this slide, I will be presenting an overview of the 2019 NIST Audio-Visual Speaker Recognition Evaluation, which was organized in the fall of 2019.
0:00:35 Before I start my presentation, if you haven't already, I would like to invite you to see my first presentation in the session, which was an overview of the 2019 NIST SRE CTS Challenge. In addition, I would like to invite each team to participate in the NIST CTS Challenge, which is currently ongoing.
0:01:02 Here is the outline of my presentation. I'll start by describing the highlights of the 2019 audio-visual SRE. Then I'll define the task and give a summary of the data sets and the performance metric for this evaluation. I'll share some participation statistics, followed by results and system performance analyses. Finally, I'll wrap up with a summary of the audio-visual SRE19 and share the main observations.
0:01:31 This slide presents the main highlights of the 2019 audio-visual SRE, which included video data for audio-visual person recognition, an open training condition, as well as a redesigned and more flexible evaluation web platform. Recently introduced highlights also included audio from video (AfV), meaning audio recordings that were extracted from amateur online videos.
0:02:03 So the primary task for the 2019 audio-visual SRE was person detection, meaning that given enrollment video data from the target person and test video data from an unknown person, the system must automatically determine whether the target person is present in the test video.
0:02:23 This person detection problem can be posed as a two-class hypothesis testing problem: the null hypothesis is that the test video belongs to the target person, and the alternative hypothesis is that the test video does not belong to the target person. The system output for this task is then a statistic computed on the test video, known as the log-likelihood ratio, defined on this slide.
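To make the two-class test concrete, here is a minimal sketch of computing and thresholding a log-likelihood ratio. This is not the evaluation's actual scoring code: the univariate Gaussian score models and their means and standard deviations are purely illustrative assumptions.

```python
import math

def gaussian_loglik(x, mean, std):
    # Log-density of a univariate Gaussian score model (an assumed toy model).
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def llr(score, target=(1.0, 0.5), nontarget=(-1.0, 0.5)):
    # Log-likelihood ratio: log p(score | target) - log p(score | non-target).
    return gaussian_loglik(score, *target) - gaussian_loglik(score, *nontarget)

def detect(score, threshold):
    # Declare "target person present" when the LLR exceeds the detection threshold.
    return llr(score) >= threshold
```

A positive LLR favors the null (target) hypothesis, a negative one favors the alternative; the threshold sets the operating point.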
0:02:54 In terms of evaluation conditions, the audio-visual SRE19 offered an open training condition that allowed the use of unlimited data for system training, to demonstrate possible performance gains. For enrollment, the systems were given video segments with variable speech content, ranging from 10 seconds to 600 seconds. In addition, the systems were provided with diarization marks as well as face bounding boxes for the frames containing the target individual. Lastly, the test involved video segments of variable durations in the 10 to 600 second range.
0:03:42 The development and evaluation data for the audio-visual SRE19 were extracted from the JANUS Multimedia and the VAST corpora. The JANUS Multimedia dataset was extracted from the JANUS benchmark, and it consists of two subsets, namely Core and Full, each of which comes with its own dev and test splits. For this evaluation we only used the Core subset, because it better reflects the data conditions in SRE19.
0:04:16 The VAST corpus, on the other hand, was collected by the LDC and contains amateur online videos, such as video blogs (vlogs), spoken in English. The videos have extremely diverse audio and visual conditions: different background environments, different codecs, different illuminations and poses. In addition, there tend to be multiple individuals appearing in each video.
0:04:45 This slide shows speech duration histograms for the enrollment and test segments in the audio-visual SRE19 dev and test sets, which are shown on the left and right plots, respectively. The enrollment segment speech durations were calculated after applying diarization, while no diarization was applied to the test segments. The enrollment and test histograms loosely appear to follow log-normal distributions, and overall they are consistent across the dev and test sets.
0:05:24 This table shows the data statistics for the Core subset of the JANUS Multimedia dataset, as well as for the audio-visual SRE19 dev and test sets, which were extracted from the VAST corpus. Notice that overall the size of the JANUS data is larger than the size of the SRE19 audio-visual dev and test sets, which makes it a good candidate for system training and development purposes.
0:05:58 For performance measurement we used the metric known as the detection cost, or C_Det for short, which is a weighted average of the false reject and false alarm probabilities, with the weights defined in the table on this slide. To improve the interpretability of the C_Det, it is commonly normalized by a default cost, defined on this slide. This results in a simplified notation for the C_Det, which is parameterized by the detection threshold. For this evaluation, this detection threshold is the log of beta, and beta is also defined on this slide.
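As a sketch, the normalized detection cost and the corresponding log-beta threshold can be computed as below. The cost weights C_MISS = C_FA = 1 and P_TARGET = 0.05 are assumptions for illustration only; the actual weights are the ones defined in the table on the slide.

```python
import math

# Assumed cost parameters for illustration; the evaluation plan defines the real ones.
C_MISS, C_FA, P_TARGET = 1.0, 1.0, 0.05

def normalized_cdet(p_miss, p_fa):
    # C_Det: weighted average of false reject (miss) and false alarm probabilities.
    c_det = C_MISS * P_TARGET * p_miss + C_FA * (1.0 - P_TARGET) * p_fa
    # Normalize by the default cost, i.e. the better of always accepting
    # or always rejecting, so that a score of 1.0 means "no better than trivial".
    c_default = min(C_MISS * P_TARGET, C_FA * (1.0 - P_TARGET))
    return c_det / c_default

def detection_threshold():
    # beta follows from the cost parameters; the LLR decision threshold is log(beta).
    beta = (C_FA * (1.0 - P_TARGET)) / (C_MISS * P_TARGET)
    return math.log(beta)
```

With these assumed weights, beta = 19 and the threshold is log(19), which is roughly 2.94.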
0:06:43 This slide presents the participation statistics for the SRE19 audio-visual evaluation. Overall we received submissions from 14 teams, which were formed by 26 sites, eight of which were from industry; the remaining 18 were from academia. Also shown on this slide is a map of the world, which shows where the participating teams came from.
0:07:20 This slide shows the number of submissions received per track, as well as in total, for the audio, visual, and audio-visual tracks of the 2019 audio-visual speaker recognition evaluation. We can see that the majority of the teams participated in all three tracks, one or two teams only participated in the audio and audio-visual tracks, and one team participated in the audio-only track. In total we received 102 submissions, which were made by 14 teams, as mentioned.
0:08:04 This slide shows the block diagram of the baseline speaker recognition system developed for the audio-visual SRE19, using the NIST speaker and language recognition evaluation toolkit as well as Kaldi. The embedding extractor was trained using the Kaldi VoxCeleb version 2 recipe, and to develop this system we didn't use any hyperparameter tuning or score calibration.
0:08:40 This slide shows a block diagram of the baseline face recognition system developed for the audio-visual SRE19. To develop this we used FaceNet as well as the NIST ancillary toolkit. We used a pre-trained multi-task cascaded convolutional neural network (MTCNN) model for face detection, and for embedding extraction we used a ResNet model that was trained on VGGFace2. In order to tune the hyperparameters we used the JANUS Multimedia dataset, and similar to the baseline speaker recognition system, no score calibration was used for the face recognition system.
0:09:30 This slide shows the performance of the primary submissions, per team and per track, as well as the performance of the baseline system, in terms of the actual and minimum costs on the test set. The blue bars and red bars show the minimum and actual costs, respectively. The y-axis denotes the C_Primary, and its upper limit is set to 2.5 to facilitate cross-system comparisons in the lower-cost regions.
0:10:09 We can make some noteworthy observations from this figure. First, compared to the most recent SRE, which was SRE18 at the time, there seems to be a notable improvement in audio-only speaker recognition performance. These improvements are largely attributed to the use of extended and more complex end-to-end neural network architectures, such as the ResNet architectures, along with soft margin loss functions, such as the angular softmax, for speaker embedding extraction. Given the size of these models, they can effectively exploit the vast amounts of training data that are available through data augmentation.
0:11:02 The second observation is that the performance trends for the top four teams are generally similar, and we can see that the actual costs for the audio-only submissions are larger than those for the visual-only submissions. The audio-visual fusion, which means the combination of the speaker and face recognition systems, results in substantial gains in person recognition performance. For example, we can see a greater than 85% relative improvement in terms of the minimum detection cost for the leading system, compared to either the speaker or the face recognition system alone.
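A hypothetical sketch of the kind of score-level fusion described here: a weighted sum of the audio and visual log-likelihood ratios. The equal weights are purely an assumption; real systems would typically learn the fusion weights on a development set, for example with logistic regression.

```python
def fuse_scores(audio_llr, visual_llr, w_audio=0.5):
    # Score-level audio-visual fusion: convex combination of the two
    # (assumed calibrated) log-likelihood ratios. Weights are illustrative.
    return w_audio * audio_llr + (1.0 - w_audio) * visual_llr

# When the modalities are complementary, one confident modality can rescue
# a borderline trial: a weak audio score plus a strong visual score still
# lands well above a threshold near zero.
fused = fuse_scores(0.2, 4.0)
```

This is why fusion helps across operating points: errors of the two modalities are largely independent, so averaging their evidence separates the classes better than either alone.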
0:11:46 Thirdly, more than half of the submissions outperformed the baseline audio-visual system, with the leading system achieving a larger than 90% improvement over the baseline.
0:12:01 The fourth observation is that, in terms of calibration performance, we can see mixed results across teams. For example, for the top two teams, the calibration errors for the speaker recognition systems are larger than those for the face recognition systems, while for some others the opposite is true.
0:12:23 Finally, in terms of the minimum detection cost, the top performing speaker and face recognition systems achieve comparable results, which is a very promising outcome of this evaluation for the speaker recognition community, given the results we have seen in prior studies, where face recognition systems were shown to outperform speaker recognition systems by a large margin. It's also worth emphasizing here that the top performing speaker and face recognition systems, which were each from Team 5, are both single systems, meaning that no system combination or fusion was used.
0:13:18 So, to gain further insight into the actual performance differences among the top performing systems, we also computed bootstrap-based 95% confidence intervals for these point estimates of the performance. The plots on this slide show the performance confidence intervals around the actual detection cost, per team, for the audio track, shown at the top, the visual track, shown in the middle, and the audio-visual track, shown at the bottom. In general, the audio systems exhibit tighter confidence margins than their visual counterparts. This could be partly because most of the participants were from the speaker recognition community, using off-the-shelf face recognition systems along with pre-trained models that were not necessarily optimized for the task and data in the SRE19 audio-visual evaluation.
0:14:24 Also, on a related note, notice that several leading systems perform almost comparably under different samplings of the trial space. Another interesting observation is that the audio-visual fusion seems to boost the decision-making confidence of the systems by a significant margin, to the point where the two leading systems outperformed the other systems in a statistically significant way.
0:15:00 These observations further highlight the importance of statistical significance tests when reporting performance results, or in the model selection stage during system development, particularly when the number of trials is relatively small.
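The bootstrap procedure mentioned above can be sketched as follows. This is a generic percentile bootstrap over trial-level values; the actual SRE19 analysis may have resampled speakers or partitions differently.

```python
import random

def bootstrap_ci(values, metric, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample the trials with replacement, recompute
    # the metric each time, and take the alpha/2 and 1 - alpha/2 quantiles
    # of the resampled metric values as the confidence interval.
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        metric([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi
```

With few trials the resampled metric varies a lot and the interval comes out wide, which is exactly why significance testing matters most when the trial count is small.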
0:15:20 This slide shows the DET (which stands for detection error tradeoff) performance curves for the top performing system on the audio, visual, and audio-visual tracks. The solid black curves in the figure represent equi-cost contours, which means that all points on a given contour correspond to the same detection cost value.
0:15:47 Here we can see that, consistent with our previous observations from the overall results a few slides back, the audio-visual fusion provides remarkable improvements in performance across all operating points, not just a single operating point on the DET curve, which is expected given how complementary the two modalities, audio and visual, are. In addition, for a wide range of operating points, the speaker and face recognition systems provided comparable performance, which is very promising for the speaker recognition community and shows how far the technology has come.
0:16:29 This slide shows the normalized target and non-target score distributions for the top performing system on all tracks, meaning the audio, visual, and audio-visual tracks. The dashed line represents the detection threshold, which is related to the value of beta that we discussed when we were talking about the performance measurement.
0:16:55 Here we can see that the score distributions from the audio-only and face-only systems are roughly aligned, with the target and non-target distributions showing some overlap around the threshold point. However, after the audio-visual fusion, the target and non-target classes are well separated, with minimal overlap around the threshold. We speculate that this is actually the reason that we see such low errors, specifically low false rejects, for systems that use audio-visual fusion.
0:17:46 So, in summary, we used the new and improved evaluation web platform for automated submission validation and scoring for the audio-visual SRE19. Through this web platform we released the software package for system validation and scoring. We also released the baseline person recognition system description and results.
0:18:07 In terms of data, for the first time we introduced video data for audio-visual person recognition. We released large labeled data sets, which were extracted from the JANUS Multimedia dataset as well as the VAST corpus, and these datasets closely matched the evaluation set.
0:18:26 In terms of results, we saw substantial gains from the audio-visual fusion. We also saw that the top performing speaker and face recognition systems performed comparably. We saw major improvements that were attributed to the use of more extended and more complex neural network models, such as the ResNet model, along with angular margin losses. In addition, the improvements were attributed to the extensive use of data augmentation, and to the clustering of embeddings, which was done primarily for diarization purposes. Effective use of the dev set, as well as the choice of calibration set, were also very important, and they were key to performing well in this evaluation.
0:19:22 And finally, although fusion still seems to be valuable, we saw that strong single systems can be as good as fusion systems. And with that, I would like to conclude this talk. I thank you for your attention; be well and stay safe.