Hi everyone. I'll be presenting work by myself and my colleagues
on the VOiCES from a Distance Challenge 2019: analysis of speaker verification results
and remaining challenges.
When we look at evaluations and challenges in the community, they tend to provide
common data, benchmarks, and performance metrics for the advancement of research in the speaker recognition community.
Some examples you might be familiar with are the NIST SRE series,
the Speakers in the Wild (SITW) challenge,
the VoxCeleb Speaker Recognition Challenge,
and SdSVC.
Previous evaluations focused on speaker verification in domains covering telephone and microphone data,
different speaking styles,
noisy data, vocal effort, audio from video, short durations, and more.
However, there haven't been that many that focus
on the far-field, distant-speaker domain.
Nowadays we've got commercial personal assistants operating heavily in this area,
so trying to get a bit more of an understanding in this context is important,
especially in the single-microphone scenario.
The VOiCES from a Distance Challenge 2019 was hosted by SRI International and Lab41
at Interspeech 2019.
What this challenge focused on was both speaker recognition and speech recognition
using distant, far-field speech acquired with a single microphone
in noisy and realistic reverberant environments.
There were several objectives that we had for this challenge.
One was to benchmark state-of-the-art technology for far-field speech.
We wanted to support the development of new ideas and technology, and to bring that technology forward.
We wanted to support new research groups entering the field of distant speech processing.
And this was done largely with a publicly available dataset that has realistic reverberation characteristics.
What we've noticed since the release of the public database in 2019 is an
increased use of the VOiCES dataset.
So we thought this actually called for the current special session that we're hosting
here at Odyssey 2020, which is now virtual.
Now, we're hoping the session will focus on broad areas such as single- versus multi-channel speaker recognition,
single- versus multi-channel speech enhancement for speaker recognition,
domain adaptation for far-field speaker recognition,
calibration in far-field conditions,
and advancing the state of the art
over what we saw in the VOiCES from a Distance Challenge 2019.
Let's have a look at what the VOiCES corpus actually contains.
VOiCES stands for Voices Obscured in Complex Environmental Settings,
and it is a large-scale, now publicly available corpus collected in real reverberant environments.
What we have inside the dataset is
3,900 or more hours of audio
from about a million segments,
multiple rooms (four in total),
different distractors such as TV and babble noise,
and different microphones at different distances.
We even have a loudspeaker that rotates to mimic human head movement.
The idea for this dataset was that it would be useful for speaker recognition,
automatic speech recognition,
speech enhancement,
and speech activity detection.
Here are a couple of different statistics from the VOiCES dataset.
It is released under a Creative Commons 4.0 license, which makes it accessible for commercial,
academic, and government use.
We have a large number of speakers, 300, over four different rooms,
up to twenty different microphones, and different microphone types.
The source data that we used was a clean, read-speech dataset.
We've got a number of different background noises, including babble,
music,
and TV sounds.
The loudspeaker orientation, used to mimic human head movement,
ranges between 0 and 180 degrees.
Let's briefly recap what we saw in the challenge in 2019.
We had two different tasks, speaker recognition and ASR,
and each had two different task conditions. One was a fixed condition,
where the idea was that the training data was constrained:
everyone had to use the same constrained dataset.
The purpose behind this was to benchmark systems trained with that same dataset, to
see if there's a dramatic difference between individual technologies beyond what was commonly applied.
In the open condition,
teams were allowed to use any available dataset, private or public.
The idea here was to quantify the gains that could be achieved when we
have an unconstrained amount of data,
relative to the fixed condition.
In terms of the goal here,
we're looking at whether we can determine if a target speaker spoke
in a segment of speech, given enrollment data for that target speaker.
The performance metric follows the NIST SRE
cost function,
with the parameters shown on screen.
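As a rough reminder of how a detection cost of this kind is computed, here is a minimal Python sketch; the target prior and cost weights below are placeholder defaults, not necessarily the official challenge parameters, which are given in the evaluation plan.

```python
import numpy as np

def detection_cost(scores, labels, threshold, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """NIST-style normalised detection cost at a fixed decision threshold.
    scores: trial scores; labels: 1 for target trials, 0 for non-target trials.
    The default parameters are placeholders, not the official challenge values."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    p_miss = np.mean(scores[labels == 1] < threshold)   # missed target trials
    p_fa = np.mean(scores[labels == 0] >= threshold)    # false alarms on non-targets
    cost = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    # Normalise by the best trivial system (always accept or always reject).
    return cost / min(c_miss * p_target, c_fa * (1 - p_target))

def min_detection_cost(scores, labels, **kw):
    """The 'minDCF': the lowest cost over all possible thresholds."""
    return min(detection_cost(scores, labels, t, **kw) for t in np.unique(scores))
```

The actual DCF uses the submitted decision threshold, while the minimum DCF sweeps the threshold; the gap between the two is what the calibration discussion later refers to.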
As part of the challenge, we also provided a scoring tool so participants could measure
performance
during development and confirm the validity of their scores before submitting them to us for evaluation.
The training set in the fixed condition was limited to all speakers in the Speakers in the Wild collection
and the VoxCeleb1 and VoxCeleb2 datasets.
In terms of development and evaluation data, the challenge participants were allowed to develop
on the development data,
and then there was held-out evaluation data on which we benchmarked the systems.
A couple of different things to point out here about how we divided these conditions.
We made sure that we actually had some room mismatch between enrollment and test,
as well as between the rooms used for development and evaluation.
This is to help mimic
what would happen with a system developed on available laboratory data
and then sent out for real-world use.
Similarly, we had mismatch between enrollment and test on the microphone type,
comparing the studio mic to the lapel,
or to the MEMS and boundary mics.
We also had mismatch between enrollment and verification in the microphones used
between those two different tasks.
Finally, there's the loudspeaker orientation.
We have quite a range there, and we list those ranges so that we were able
to analyze the impact of head movement on speaker recognition.
In terms of the results, we had 21 teams successfully submit scores,
and a number of those teams also submitted scores for the open condition, so we can
get that comparison point.
In total, we had some fifty system submissions across the fixed and open conditions.
On this slide we've shown the scores for each team;
I will dig into these a little bit on the next slide.
Let's start analysing some of these results.
The first thing we did was compute confidence intervals, 95 percent confidence intervals,
and we did this by using a modified version of a joint bootstrapping technique;
the reference can be found in the paper.
Now, the reason we modified this was to account for the correlation of trials due
to multiple models being available per speaker.
That is, different recordings from a speaker could each represent a different enrollment,
and so there is correlation
in the trial scores.
What we report here is the interval covering 95 percent
of the resulting empirical distribution.
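For intuition, here is a minimal sketch of speaker-level bootstrap resampling to get a 95 percent confidence interval; the joint bootstrap in the reference is more elaborate, and the data layout and metric function here are assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(trials_by_speaker, metric, n_boot=1000, seed=0):
    """95% confidence interval for `metric`, resampling whole speakers rather than
    individual trials, so trials that share a speaker (e.g. through multiple
    enrollment models) stay together and their correlation is preserved.
    trials_by_speaker: dict mapping speaker id -> (scores array, labels array)."""
    rng = np.random.default_rng(seed)
    speakers = list(trials_by_speaker)
    stats = []
    for _ in range(n_boot):
        sample = rng.choice(speakers, size=len(speakers), replace=True)
        scores = np.concatenate([trials_by_speaker[s][0] for s in sample])
        labels = np.concatenate([trials_by_speaker[s][1] for s in sample])
        stats.append(metric(scores, labels))
    # Interval covering 95 percent of the empirical distribution of the metric.
    return np.percentile(stats, [2.5, 97.5])
```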
Now, if we look at those top four scores in a little more detail,
we can see that the confidence intervals are narrower
when you don't take into account the speaker sampling or the multiple models per speaker.
So it can be misleading if we don't take that into account;
what we should be looking at are the red bars,
which give us a truer picture of what the confidence intervals are.
And looking at those four systems with respect to the other submissions,
we see that they are significantly different compared to the rest of the submissions;
however, they also perform relatively similarly to one another.
Some observations we found when looking at what the different groups submitted:
most of the top teams applied weighted prediction error (WPE) for dereverberation; remember, the VOiCES corpus
has a lot of reverberation and the rooms are quite noisy,
and WPE was applied as a pre-processing step.
Every team also used an x-vector system with data augmentation, as sketched below,
and this was sometimes complemented with ResNet and DenseNet based architectures,
but PLDA was the most popular choice in the backend.
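As a rough illustration of the reverberation-and-noise style of data augmentation used when training such systems, here is a minimal sketch; the impulse response, noise signal, and SNR value are stand-ins, not taken from any team's recipe.

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(clean, rir, noise, snr_db):
    """Reverberate clean speech with a room impulse response, then add background
    noise (e.g. babble or TV audio) scaled to a target signal-to-noise ratio."""
    reverberant = fftconvolve(clean, rir)[: len(clean)]
    noise = np.resize(noise, len(reverberant))          # loop/trim noise to length
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```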
And system calibration was actually crucial here,
with the bottom six teams all failing to achieve good system calibration.
What that means is there was a significant difference between the minimum
and actual DCF values, i.e., the systems were not operating at the threshold they should have been tuned to.
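To make that min-versus-actual gap concrete, one common, generic remedy is affine (logistic-regression) score calibration trained on held-out development trials; the sketch below illustrates the idea and is not the method any particular team used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_calibration(dev_scores, dev_labels):
    """Fit an affine map from raw scores to approximate log-likelihood ratios."""
    x = np.asarray(dev_scores, dtype=float).reshape(-1, 1)
    y = np.asarray(dev_labels)
    lr = LogisticRegression(C=1e6).fit(x, y)    # large C: effectively unregularised
    # Remove the dev-set prior log-odds so the output behaves like an LLR
    # rather than a posterior log-odds.
    prior_logit = np.log(y.mean() / (1.0 - y.mean()))
    a, b = lr.coef_[0, 0], lr.intercept_[0] - prior_logit
    return lambda s: a * np.asarray(s, dtype=float) + b
```

Once the scores behave like log-likelihood ratios, the decision threshold follows directly from the target prior and costs, and the actual DCF tracks the minimum DCF much more closely.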
OK, now let's look at what happens when you change the enrollment condition.
In particular, we're looking at what happens in a reverberant environment when you use source data,
that is, no reverberation, a close-talking microphone,
versus using data from a different room,
with reverberation,
to enroll.
What we actually show here are the blue and red bars:
the blue bars are source enrollment tested against room 4 data, whereas the red bars are enrollment on
reverberant room 3 data,
tested against the same data.
We see the red bars are higher than the blue:
this reverberant enrollment
caused up to a 42 percent relative degradation compared with source enrollment.
That depends on the system being benchmarked there, of course,
but it does suggest that speakers should be enrolled using close-talking segments
of clean speech.
Basically, when you have this mismatch of different reverberation between enrollment and test,
the reverberation in the enrollment data does not help;
enrolling on it hurts performance.
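For clarity, the relative degradation figures quoted here and in the rest of the talk follow the usual convention (my notation, not from the slides), with $C$ standing for whatever error metric is being compared, for example the DCF above:

$$\text{relative degradation} = \frac{C_{\text{mismatched}} - C_{\text{reference}}}{C_{\text{reference}}} \times 100\%$$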
Next, the effect of the different background distractors.
We call them distractors because they're intended to distract the system from the true speech of the speaker.
We had TV in the background,
or babble noise in the background.
When enrolling, we enrolled on clean speech with no distractor,
but for verification we had three different types: no distractor,
TV noise, which sometimes includes speech,
and babble noise.
What we found is that the systems that were submitted were reasonably robust to the
effect of TV noise in the background.
However, with babble,
which injects competing speech into the environment of the true speaker,
we saw a 45 to 50 percent relative degradation, so it's quite a
significant drop.
OK, now microphone type.
We had a studio mic placed close to the source for enrollment,
and then three different mic classes, lapel, MEMS, and boundary, for verification at different positions.
We'll look at different distances on the next slide;
here we just want to look at how the different microphones behave.
Quite
consistently across systems,
there is a step down in performance going from boundary to MEMS to lapel microphones.
Then, looking at different distances, we're just looking at the top five systems here,
to constrain the results we have to look at,
with the lapel mics placed at seven distances, for the top five teams.
Note that some of the mics are subject to obstructions and masking effects, which poses a greater challenge.
What was interesting is that the bars that really stand out,
the red,
teal, and blue,
tended to correspond to mics that were partially obscured,
so some of them were actually hidden,
or very far from the
speaker.
Those mics really dragged performance down as well.
This also tends to explain the poor performance of the lapel mics in general,
and of the MEMS mics that we saw on the previous slide.
That was the summary; now let's look at the remaining challenges,
based on what we've seen so far from VOiCES publications and system submissions.
The range of the reverberation characteristics
was two to three times worse in the evaluation set than in the development set.
Now, this was quite important:
the level of reverberation in the evaluation rooms was greater
than in development,
and it was quite clear that this
more severe amount of reverberation contributed to the degraded results compared to
development.
Current speaker recognition technology doesn't tend to address
the impact of reverberation sufficiently:
the error rates are a lot higher for the reverberant conditions than for the source signal.
Reverberation in the presence of noise further degrades the performance,
and
increasing distance
amplifies the impact of reverberation and degrades performance further.
So we need to explore novel speaker modeling techniques that, in this context, are capable of
handling long-term information
and utterances with the kind of reverberation that can happen alongside this noise,
and try to make them robust to multiple noise conditions.
System calibration, as we've seen, is critical for systems deployed in the real world.
The bottom six teams failed to successfully calibrate their systems,
and previous work has shown that there is actually a large degradation in calibration performance when
the distance to the microphone
is significantly different between the calibration training conditions and those in which the system is applied.
So one way that we might be able to mitigate this kind of effect
is to have calibration methods that dynamically consider the conditions of the trial,
the predicted distance, for instance.
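As a minimal sketch of what such trial-dependent calibration could look like, here the raw score is combined with a per-trial condition feature, a predicted microphone distance; the feature choice and model are assumptions for illustration, not a method from the challenge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_condition_aware_calibration(dev_scores, dev_distances, dev_labels):
    """Calibrate using the raw score plus a per-trial condition feature
    (here an assumed predicted microphone distance) instead of the score alone."""
    x = np.column_stack([dev_scores, np.log1p(dev_distances)])
    y = np.asarray(dev_labels)
    model = LogisticRegression(C=1e6).fit(x, y)
    prior_logit = np.log(y.mean() / (1.0 - y.mean()))

    def calibrate(scores, distances):
        feats = np.column_stack([np.atleast_1d(scores),
                                 np.log1p(np.atleast_1d(distances))])
        # decision_function gives the linear log-odds; removing the dev-set prior
        # makes the output behave like a per-trial log-likelihood ratio.
        return model.decision_function(feats) - prior_logit

    return calibrate
```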
Another point: the challenge was based on single-channel audio, but VOiCES was actually
collected with more than one microphone,
with multiple microphones in the room,
and we haven't looked into the effect of, for instance, beamforming.
There are also a number of front-end processing approaches
that we would like to look at,
including speech enhancement
and dereverberation, tailored specifically for the task of speaker recognition.
So we hope you enjoy this special session at Odyssey this year, and that you
continue to drive technology forward in these areas.
We look forward to seeing what comes out of it.
Thank you.