hi everyone [inaudible] and i'll be presenting the work by myself and

my colleagues

on the voices from a distance challenge two thousand nineteen analysis of speaker verification results

and remaining challenges

when we look at evaluations and challenges in the community they tend to provide

common data

benchmarks and performance metrics for the advancement of research in the speaker recognition community

some examples that might be familiar to you are the nist sre series

the speakers in the wild challenge

the voxceleb speaker recognition challenge

and sdsvc

previous evaluations focused on speaker verification in domains considering telephone and microphone data

different speaking styles

noisy data vocal effort audio from video short duration and so on

however there haven't been that many that focus on

the far-field distant speaker domain


nowadays we've got commercial personal assistants that are really

outstanding in this area so trying to get a bit more of an understanding in

this context is important especially when we're using a single microphone


and the voices from a distance challenge in two thousand nineteen was hosted by sri

international and lab forty one

at interspeech two thousand and nineteen

and what this challenge focused on was both speaker recognition and speech recognition

using distant far-field speech acquired using a single microphone

in noisy and realistic reverberant environments

there are several objectives that we had for this challenge

one was to benchmark state-of-the-art technology for far-field speech

we wanted to support the development of new ideas and technology to bring that technology forward


we wanted to support new research groups entering the field of distant speech processing

and we wanted to provide a new publicly available dataset

that exhibits realistic reverberation characteristics

what we noticed since the release of the public database in two thousand nineteen was

an increased use of the voices dataset

so we thought this actually called for the current special session that we're hosting

here at odyssey two thousand twenty even though it's virtual

now the session we're hoping will focus on broad areas such as single versus multi

channel speaker recognition

single versus multichannel speech enhancement for speaker recognition

domain adaptation for farfield speaker recognition

calibration in far-field conditions

and advancing the state of the art

over what we saw in the voices from a distance challenge two thousand nineteen

let's have a look at what the voices corpus actually has in it

so voices stands for voices obscured in complex environmental settings

and it is a large scale now publicly available corpus collected in real reverberant environments


what we have inside the dataset is

three thousand nine hundred or more hours of audio

from about a million segments

multiple rooms four in total

different distractors such as tv and babble noise

and different microphones at different distances

we even have a loudspeaker that rotates to mimic human head movement

the idea for this dataset was that it would be useful for speaker recognition

automatic speech recognition

speech enhancement

and speech activity detection

here are a couple of different statistics from the voices dataset

it is released under the creative commons four point zero license and that makes it accessible for commercial

academic and government use

we have a large number of speakers three hundred over four different rooms

up to twenty different microphones and different microphone types

the source data that we used was a read speech dataset called

librispeech

and we've got a number of different background noises including babble


and tv sounds

the loudspeaker when it rotates to mimic human head movement

ranges between zero to one hundred and eighty degrees

let's do a brief recap of what we saw in the challenge of two thousand

nineteen

we had two different tasks speaker recognition and asr

and they had two different task conditions one was a fixed condition

and the idea here was the data was constrained

everyone got to use the same constrained dataset

the purpose behind this was to benchmark systems trained with that same dataset to

see if there's a dramatic difference between individual technologies or what was commonly applied

in the open condition

teams were allowed to use any available dataset private or public

now the idea here was to quantify the gains that could be achieved when we

have an unconstrained amount of data

relative to the fixed condition

in terms of the goal here

we're looking at

can we determine whether a target speaker speaks

in a segment of speech given the enrollment of that target speaker

the performance metric is similar to the nist sre

cost functions

with the parameters on screen
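For readers without the slide, a NIST-SRE-style detection cost can be sketched as below. This is a minimal illustration, not the challenge scorer: the function name and the default values `p_target=0.01`, `c_miss=1.0` and `c_fa=1.0` are assumptions, since the talk only says the parameters were shown on screen.

```python
def detection_cost(target_scores, nontarget_scores, threshold,
                   p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized detection cost at a fixed decision threshold.

    A trial is accepted when its score is >= threshold.
    The default parameter values are placeholders, not the official ones.
    """
    # Miss rate: target trials that were rejected.
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    # False alarm rate: non-target trials that were accepted.
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Normalize by the cost of the best trivial system
    # (one that always rejects or always accepts).
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))


# Toy scores, invented for illustration only.
targets = [2.0, 1.0, -0.5, 3.0]
nontargets = [-2.0, -1.0, 0.5, -3.0]
dcf = detection_cost(targets, nontargets, threshold=0.0)
```

With a small target prior, false alarms dominate the cost, which is why a single accepted impostor in the toy scores already pushes the normalized value well above one.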

as part of the challenge we also provided a scorer so users could measure performance


during development and confirm the validity of their scores before submitting them to us for evaluation

and the training set in the fixed condition was limited to the speakers in the wild

collection

and the voxceleb one and voxceleb two datasets

in terms of development and evaluation data the challenge participants were able to develop

on the development data

and then there was held out evaluation data that we benchmarked the systems on

there are a couple of different things to point out here about how we divided these conditions


we made sure that we actually had some room mismatch between enrollment and test

as well as rooms used between development and evaluation

and this is to help to mitigate

sorry mimic what would happen with a system developed in a laboratory with

lab data

and then sent out for real world use

similarly we had mismatch between enrollment and test on the microphone type

comparing the studio to lapel

or to the mems and boundary mics

we also had mismatch between the enrollment and verification for the microphone used

between those two different tasks

finally the loudspeaker orientation

we had quite a range and we list those ranges so that we're able

to analyze the impact of head movement on speaker recognition

in terms of the results we had twenty one teams successfully submit scores

and four of those teams also submitted scores for the open condition so we can

get that comparison point

in total we had fifty eight system submissions if i've got that right

however on the slide here we're showing the top score for each

team

and we'll dig into these a little bit on the next slide

let's start analysing some of these results

the first thing we did was we looked at the confidence intervals the ninety five

percent confidence intervals

and we did this by using a modified version of a joint bootstrapping technique

the reference can be found in the paper

now the reason we modified this was to account for the correlation of trials due

to multiple models being available per speaker

that is different recordings from a speaker could each represent a different enrollment

and so there's correlation that happens

in the trial scores

what we're showing here is the interval between the five and ninety five percentiles

of the resulting empirical distribution
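The idea of resampling at the speaker level, so that correlated trials from the same speaker stay together, can be sketched as follows. This is a simplified speaker-level percentile bootstrap, not the modified joint bootstrap used for the challenge analysis; the trial layout `(speaker_id, score, is_target)`, the fixed-threshold error metric, and all function names are assumptions made for illustration.

```python
import random


def error_rate(trials, threshold=0.0):
    """Fraction of trials decided wrongly at a fixed threshold."""
    wrong = sum((score >= threshold) != is_target
                for _, score, is_target in trials)
    return wrong / len(trials)


def speaker_bootstrap_interval(trials, metric=error_rate, n_resamples=1000,
                               lower=5.0, upper=95.0, seed=0):
    """Percentile interval obtained by resampling whole speakers.

    trials: list of (speaker_id, score, is_target) tuples, where speaker_id
    is assumed to be the enrolled speaker of the trial.
    """
    rng = random.Random(seed)
    # Group trials so that all trials of a speaker are drawn together.
    by_speaker = {}
    for trial in trials:
        by_speaker.setdefault(trial[0], []).append(trial)
    speakers = sorted(by_speaker)

    stats = []
    for _ in range(n_resamples):
        sample = []
        # Draw speakers with replacement, keeping each speaker's trials intact.
        for spk in rng.choices(speakers, k=len(speakers)):
            sample.extend(by_speaker[spk])
        stats.append(metric(sample))
    stats.sort()

    def percentile(p):
        # Nearest-rank style lookup into the sorted bootstrap statistics.
        return stats[round(p / 100.0 * (len(stats) - 1))]

    return percentile(lower), percentile(upper)


# Toy trials, invented for illustration only.
toy_trials = [("a", 1.2, True), ("a", -0.3, True), ("b", -1.0, False),
              ("b", 0.4, False), ("c", 2.0, True), ("c", -2.0, False)]
low, high = speaker_bootstrap_interval(toy_trials, n_resamples=200, seed=1)
```

Resampling individual trials instead of speakers would treat correlated trials as independent and would typically give a narrower, over-optimistic interval.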

now if we look at those top four scores and zoom in a little

we can see that the confidence intervals are narrower

when you don't take into account the speaker sampling or the multiple models per speaker

so we can easily be misled if we don't take that into account

what we should be looking at are the red bars

that gives us a truer picture of what the confidence intervals are

and if we look at those four systems with respect to the other submissions

we see that they're significantly different compared to the rest of the submissions

however they also perform relatively similarly

some other observations we found when looking at what the different groups submitted

was that the top teams applied weighted prediction error for dereverberation remember the voices corpus

has a lot of reverb the rooms are quite noisy

and that was an important preprocessing step

every team also used an x-vector system with data augmentation

and this was sometimes complemented with resnet and densenet based architectures

plda was the most popular choice in the back end

and system calibration was actually crucial here

with all the bottom sixteen teams failing to achieve good system calibration

and what that means is there was a significant difference between the minimum

and actual dcf values which for a well calibrated system should have been similar
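The gap between minimum and actual DCF can be made concrete with a small sketch. It assumes the scores are log-likelihood ratios, unit miss and false-alarm costs, and a placeholder prior `p_target=0.01`; the function names and toy scores are illustrative only and are not taken from the challenge.

```python
import math


def normalized_dcf(targets, nontargets, threshold, p_target=0.01):
    """Normalized detection cost with unit costs at a given threshold."""
    p_miss = sum(s < threshold for s in targets) / len(targets)
    p_fa = sum(s >= threshold for s in nontargets) / len(nontargets)
    cost = p_target * p_miss + (1.0 - p_target) * p_fa
    return cost / min(p_target, 1.0 - p_target)


def actual_and_min_dcf(targets, nontargets, p_target=0.01):
    """Return (actual DCF, minimum DCF) for log-likelihood-ratio scores."""
    # Actual DCF: decide at the theoretical Bayes threshold for this prior.
    bayes_threshold = math.log((1.0 - p_target) / p_target)
    actual = normalized_dcf(targets, nontargets, bayes_threshold, p_target)
    # Minimum DCF: best value over all thresholds. Sweeping every observed
    # score plus one value above the maximum covers every distinct decision.
    candidates = sorted(set(targets) | set(nontargets))
    candidates.append(candidates[-1] + 1.0)
    minimum = min(normalized_dcf(targets, nontargets, t, p_target)
                  for t in candidates)
    return actual, minimum


# Toy scores that separate the classes perfectly but sit at the wrong offset.
targets = [1.0, 2.0, 3.0]
nontargets = [-3.0, -2.0, -1.0]
actual, minimum = actual_and_min_dcf(targets, nontargets)

# Shifting every score by the same amount leaves the minimum unchanged
# but moves the scores to where the Bayes threshold separates them.
shifted_actual, shifted_minimum = actual_and_min_dcf(
    [s + 5.0 for s in targets], [s + 5.0 for s in nontargets])
```

In the toy case the discrimination is perfect both times, so the minimum DCF is zero; only the actual DCF changes with the score offset, which is the kind of loss a calibration step is meant to remove.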

let's look now at what happens when you change the enrollment condition

in particular we're looking at what happens in a reverberant environment should we use source data

that is no reverberation close talking microphone

or use data from a different room

with reverberation

to enrol

what we're actually showing here is the blue results the blue bars

are source enrollment against testing with room four data whereas the red is enrolling on

reverberant room three data

against the same test data

we see the red bars are higher than the blue

this reverberant enrollment

caused up to forty two percent relative degradation over source enrollment

and that depends on the system being benchmarked there of course

but it does suggest that speakers should be enrolled using close-talking segments

of clean speech

basically when you have this scenario of different reverberation between enrollment and test

reverberation doesn't have a role

when enrolling on it
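Relative degradation, as quoted above, is the increase in an error metric expressed as a percentage of the baseline value. A minimal sketch, with a hypothetical function name and invented cost values chosen only to show the arithmetic:

```python
def relative_degradation(baseline_cost, degraded_cost):
    """Relative increase of an error metric over a baseline, in percent."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 100.0 * (degraded_cost - baseline_cost) / baseline_cost


# Invented values: a cost of 0.50 with source enrollment rising to 0.71
# with reverberant enrollment corresponds to roughly 42 percent.
example = relative_degradation(0.50, 0.71)
```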

we had several different background distractors

we call them distractors because they tend to distract the system from the true speech


we had tv in the background

or babble noise in the background

when enrolling we enrolled on clean speech no distractor

but for verification we had three different types no distractor

tv noise which sometimes includes speech

and babble noise

and what we found was that the systems that were submitted were reasonably robust to the

effect of tv noise in the background

however with babble

included in the speech environment for the true speaker

it resulted in forty five to fifty percent relative degradation so it's quite a

significant drop there

okay now microphone type

we had a studio mic placed close to the source for enrollment

and then three different mic types lapel mems and boundary for verification at different


and we'll look at different distances in the next slide

here we just want to look at the different microphones

we quite

consistently across systems

have a step down going from boundary to mems to lapel microphone

when looking at different distances we're just looking at the top five systems here

to constrain the results to look at

with the lapel mics placed at seven distances for the top five teams

note you to be self non-overlapping masking effects them or just of my standard approach

to parse a greater challenge

what was interesting was the bars that really stand out the


teal and blue

tended to be partially obscured

so some of them were actually hidden

or very far from the


so that tended to really drop performance as well

this also tends to explain the poor performance of the lapel mics in general

compared to the mems that we saw on the previous slide

and as a summary now we're looking at the remaining challenges

based on what we've seen so far from voices publications and system submissions

the error rate characteristic

was two to three times worse on the evaluation set than on the development set

now there was quite a

greater level of reverberation in the evaluation rooms

compared to development

and it was quite clear we found that this

severe amount of reverberation tended to degrade results compared to


current speaker recognition technology doesn't tend to address

the impact of reverberation sufficiently

the error rates are a lot higher for reverberant conditions than the source signal

reverberation in the presence of noise further degrades the performance

and

increasing distance

provides a bigger impact of reverberation and degraded performance

so we need to explore novel speaker modeling techniques that are capable of

handling long time information

such as the early and late reverberation that can happen in these spaces

and try and make them robust to multiple noise conditions

system calibration as we saw was critical for systems deployed in the real world

the bottom sixteen teams failed to successfully calibrate their systems

and previous work has shown that there is actually a large degradation in calibration performance when

the distance to the microphone

is significantly different between the calibration training conditions and what it's applied to

so one way that we might be able to mitigate this effect

is to have calibration methods that dynamically consider conditions of the trial

the predicted distance for instance

and the challenge was based on single-channel microphone data but voices was actually

collected with twelve or

more microphones in the room

and we haven't looked into the effect of for instance beamforming

and there are a number of other front end processing techniques

that we'd like to look at

including speech enhancement

and dereverberation specifically for the task of speaker recognition

so we hope you enjoy this special session at odyssey this year and that you

continue to drive technology forward in these areas

and we look forward to seeing what comes out of it

thank you