Okay, the third talk we have is on the BEST speaker recognition interim assessment.
Okay. So we have overall results from BEST, and the calibrated test data. First, I'm going to give you the background information to help you interpret what you'll see.
A quick word on the BEST program: it stands for Biometrics Exploitation of Science and Technology. It was an IARPA program that launched in 2009 with the objective of advancing the state of the art in biometrics, including speaker recognition. I won't say more about it here, but if you want more information, here's a link.
Now, NIST managed what we called the BEST BIA, which was the BEST evaluation speaker track, or BEST interim assessment. The objective here was to measure progress in speaker recognition relative to performance prior to the start of the program, and also to measure performance on data new to speaker recognition evaluations.
Before the results, let me check a few terms, since not everyone here is close to this area; you can have anonymity here. Who does not know what speaker detection is? Raise your hand.
Okay. How about a target speaker? Anyone not know what this is, or a non-target speaker? Or a test segment? Anyone not know what this is? Okay, it's just a segment of speech data containing one or more unknown speakers. And lastly, anyone not know what Mixer is? Okay: Mixer is a set of speech corpora collected by the LDC to support speaker recognition.
So, the data used for the evaluation was very big and very complex, bigger than anything NIST had used for speaker recognition prior to that: over a thousand speakers, two thousand audio segments, and forty-one million trials. It made use of previous Mixer collections as well as newly collected Mixer data. That included Mixer 1 and 2, which had five different languages: Arabic, English, Mandarin, Russian, and Spanish.
Mixer 5: Mixer 5 was used in SRE08, for those who are familiar with that collection; here we used the Mixer 5 speakers that were not used in that evaluation. Mixer 6 likewise was used in SRE10, so here we used speakers that were not used in SRE10. We also used Greybeard. Greybeard was used in SRE10 as well, but the key was not released, so that we'd be able to reuse it in this evaluation; that's why we were encouraged not to release the key. And we used Mixer 7, which was newly collected with the objective of addressing new sources of variability.
All of this was analyzed as part of the BEST evaluation. And I think George went over this already: there were nine core conditions meant to focus the research.
First, a telephone condition: train and test on telephone calls. This is the classic common condition that's been used probably since the beginning. A microphone condition: train and test on far-field microphones; this is sort of the interview condition. A channel condition, where we train on microphone and test on telephone, but limit consideration to phone calls. Another channel condition, where we train on far- or near-field mics and test on telephone, also limited to phone calls. A speaking-style condition, where we train on interview and test on phone calls, with these restricted to microphone recordings. A language condition, where we train and test on multiple languages. A second language condition, where we train and test on a single language, over both microphones and telephone. A multisession-train condition, and a second multisession-train condition; the difference between these two is that one was tested on phone calls and the other was tested on interview. And that's all nine.
So, on factors affecting performance: we really wanted to focus the evaluation on measuring system robustness. As we saw yesterday, the factors were conditioned into three categories, intrinsic, extrinsic, and parametric, and these were all actually tested in the BEST evaluation.
Intrinsic: speech style, where there is interview or phone-call speech; and vocal effort, where there's normal vocal effort, recorded over a cell phone, low vocal effort, or high vocal effort. In terms of how the vocal effort was elicited: low and high vocal effort were recorded over headsets, while normal effort was collected from ordinary phone calls. For high vocal effort, noise was played over the headset to encourage the speaker to raise his or her voice; for low vocal effort there was no noise, but a sign encouraging the speaker to lower his or her voice. A question that remains is whether this is a realistic approach, that is, whether that's what actually gets produced in the field; regardless, the effects seem interesting.
As for extrinsic factors: channel, which is microphone versus telephone, with different microphone types, and for telephone, different transmission types and handsets. Something new and relatively interesting was changing the distance between interviewer and subject, to see if there was some additional vocal effort that could be elicited that way.
Reverberation: here we had reverberation that was artificially added, but there were also two different real rooms, one that was meant to be reverberant and one that was meant not to be. So we had both natural reverberation as well as additive reverberation, and additive noise. In terms of the additive noise... sorry, in terms of the reverb: how we did this was using a procedure that was proposed by MITRE but actually implemented by MIT Lincoln Laboratory, independent of the participants. The method was to transform the collected signals to have the reverberation qualities of particular rooms, given a range of dimensions and surface conditions. As George showed, there were seven different reverberation conditions, from roughly 0.16 to roughly 1.3 seconds RT60.
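The MITRE/Lincoln procedure itself isn't described here, but the basic idea of additive (synthetic) reverberation can be sketched with a toy room impulse response whose exponential decay is set by the target RT60 (the time for the reverberant energy to fall by 60 dB), convolved with the clean signal. The function names and parameters below are illustrative, not the actual implementation.

```python
import numpy as np

def toy_rir(rt60, fs=8000):
    """Toy room impulse response: exponentially decaying white noise.

    RT60 is the time (seconds) for the energy to decay by 60 dB, so the
    amplitude envelope is 10 ** (-3 * t / rt60)  (-30 dB amplitude = -60 dB energy).
    """
    rng = np.random.default_rng(1)
    t = np.arange(int(rt60 * fs)) / fs
    return rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / rt60)

def add_reverb(clean, rt60, fs=8000):
    """Convolve a clean signal with the toy RIR and normalize the peak."""
    wet = np.convolve(clean, toy_rir(rt60, fs))[: clean.size]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# The BEST conditions ranged from roughly 0.16 s to 1.3 s RT60.
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
wet = add_reverb(clean, rt60=0.16)
```

A real implementation would use measured or simulated RIRs for rooms with specific dimensions and surface materials; the exponential-decay noise burst only captures the RT60 behavior.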
Additive noise was also implemented by MIT Lincoln Laboratory. There were two noise types. One was HVAC, which is heating, ventilation, and air conditioning; this is a sort of standard office-room background noise. The other was speech-spectrum noise, a Gaussian noise filtered to the spectrum of speech. And there were two different noise levels, one at 15 dB and the other at 6 dB, and these were C-message weighted.
Let me play some examples. Here's speech-spectrum noise at 15 dB: [audio sample plays]. And at 6 dB: [audio sample plays]. And now the HVAC noise: [audio sample plays]. Okay.
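Lincoln's exact noise-addition procedure isn't given in the talk, and the stated levels were C-message weighted, which the sketch below omits; but mixing noise at a nominal SNR reduces to scaling the noise so that the speech-to-noise power ratio hits the target. All names here are illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so 10*log10(P_speech / P_noise) == snr_db, then add.

    Powers are plain mean-square values; the actual BEST levels were
    C-message weighted, which is not modeled here.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a speech signal
hvac = rng.standard_normal(16000) * 0.3    # stand-in for HVAC-like noise
noisy15 = mix_at_snr(speech, hvac, 15.0)   # the two BEST levels
noisy6 = mix_at_snr(speech, hvac, 6.0)
```

Speech-spectrum noise would additionally be filtered to the long-term average spectrum of speech before mixing.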
And in terms of parametric factors: there were five different languages, and there was also aging data from Greybeard; the same data was used in SRE10, and this is the reason we were encouraged not to release the key. And there were multiple training sessions, something that was new in BEST. In the past we had multiple training sessions that were phone calls; in this case we had multiple training sessions over interviews, some with the same speech over a microphone, some with maybe one or two microphones but different speech.
Something we've seen a few times, but maybe I'll explain it some, is the primary metric, which was different for BEST. It was originally conceived in 1996, so it took something like sixteen years to implement: the false alarm rate at a corresponding miss rate of ten percent. One of its advantages is that it is simple and clearly defined, and the false alarm rate may be viewed as representing the cost of the wasted listening effort incurred by using the system at the specified miss rate. In contrast to equal error rate, it does focus on the low-false-alarm region, which is likely to be of interest to a number of applications. And we can use the rule of 30 to determine that, with a ten percent miss rate, we'd only need three hundred target trials or so to measure the required miss rate.
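NIST's actual scoring tools aren't shown in the talk, but as a minimal sketch (assuming higher scores mean "more likely target" and a trial is accepted when its score reaches the threshold), the primary metric can be computed from lists of target and non-target trial scores: pick the threshold at which ten percent of target trials are missed, then report the fraction of non-target trials at or above it.

```python
import numpy as np

def fa_at_miss(target_scores, nontarget_scores, miss_rate=0.10):
    """False-alarm rate at the threshold giving the requested miss rate.

    A trial is accepted when score >= threshold, so the miss rate is the
    fraction of target scores below the threshold; the threshold is the
    miss_rate quantile of the target scores.
    """
    tgt = np.asarray(target_scores, dtype=float)
    thr = np.quantile(tgt, miss_rate)
    non = np.asarray(nontarget_scores, dtype=float)
    return float(np.mean(non >= thr))

# Toy example with well-separated Gaussian score distributions.
rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 300)      # rule of 30: ~300 target trials give
non = rng.normal(-2.0, 1.0, 30000)   # ~30 misses at a 10% miss rate
fa = fa_at_miss(tgt, non)
```

The rule of 30 referenced above says an error rate estimate is reasonably trustworthy once it is based on about 30 observed errors, hence roughly 300 target trials for a 10% miss rate.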
So let's look at just a few general performance trends. I should note here that we are sharing general trends; there were some things that were system-specific, but what we're trying to share here are things that were common across the systems submitted for this evaluation.

So, in terms of language: the green lines are a baseline system, meant to represent the state of the art prior to the start of the evaluation, and the blue is a system submitted for the BEST evaluation. The solid lines are English; the dashed lines are Spanish. Two things to note: the system submitted for the BEST evaluation shows better performance than the baseline across the range of operating points, and performance on English and Spanish data was comparable, so not a big language effect.
This is meant to help calibrate people's eyes to the new metric. Here we see the false alarm rate at a ten percent miss rate: the first is somewhere around 20.8 percent, the second around 0.1 percent, and the third and fourth are both around 0.5 percent.
Here's another look at language; this is for Mixer 1 and 2, a single system. The system performed mostly better on English and Spanish than on the other languages; in fact, some systems performed better on Spanish than on English. Why this would be is sort of a mystery, right? And here's another chance to calibrate your eyes to the new metric: reading across the ten percent miss line, the intersections give the primary metric for BEST.
So, for speaking style, we're looking at one system's performance. One line is train and test on interview; the green line is train on interview, test on phone call; the rest are train on phone call, test on phone call. Interview train-and-test gives the best performance, but there's a confounding factor here: the interview test segments were actually longer, and this could explain why it performed better.
We did a test in SRE10 on vocal effort and found something somewhat surprising: low vocal effort performed better than normal. We expected high vocal effort to perform worse, but we also expected low vocal effort to perform worse than normal vocal effort. Similarly, in the BEST evaluation, high vocal effort stands out across systems as hard, and low vocal effort stands out across systems as easy. This is consistent with, or at least similar to, what we saw in SRE10.
So there were, again, seven reverb conditions plus no reverb, and these varied widely in terms of how much reverb was applied. For those who are not able to immediately imagine what something would sound like based on an RT60, let me play a couple of examples. This was the least reverberant condition: [audio sample plays]. And the most reverberant condition: [audio sample plays].
The thing to notice on this plot is that the degradation corresponds to the RT60 time, and that despite the mismatch, that is, training on non-reverberant speech and testing on reverberant speech, performance seemed to be better with a small amount of reverb than with no reverb at all. That was also somewhat surprising.
A similar thing with additive noise: again, no noise in train but noise in test. Compared with testing without noise, the 15 dB noise had little effect, and the 6 dB noise a bit more. One thing that was kind of interesting to us: if you look at, for example, the red and the dark blue lines, or the cyan and the green lines, there's not a lot of difference. This was a surprise; we expected speech-spectrum noise to be more difficult to deal with than HVAC noise, but as you can see there really was not a great difference between the two spectra. One other point I think worth making: a suggestion has been made that maybe a one percent miss rate is more appropriate under certain circumstances than ten percent. In this condition, performance was good enough that we were not able to measure the primary metric, because the curve never reached a ten percent miss rate.
So, in summary: this was, by a fair amount, the largest and most complex NIST speaker recognition evaluation to date. We examined several factors affecting performance and observed several surprising results, some of which were recreations of the results in SRE10, and the new conditions, for example additive noise and reverberation, also provided some surprises. The use of synthetic data is very exciting to us, because it's oftentimes difficult and expensive to collect data, and difficult to control. So if synthetic data turns out to be a reasonable way to evaluate systems in the future, this would be a very exciting and useful discovery. And finally, there was improvement observed over the baseline in all conditions, so participants should be pleased with that. Thank you.
[applause]

[inaudible question from the audience]

Yeah, I'm not sure.
[inaudible question from the audience]
Yes. So, that was not explored in this evaluation, but this is a fruitful area for future exploration.
Sure.

[inaudible question from the audience]

Yes.
And I should probably admit at this point that we have not done deeper analysis on basically any of this. It was a large and complex evaluation, so in order to be able to cover the main points, we had to leave the deeper analysis aside. We'd actually like to explore this more.
[inaudible question from the audience]
Yes, and I guess I should point out that it's not just in the no-noise case; across all the reverb conditions we saw this effect, just with different magnitudes. Yes, I think it was the same trend, which was surprising, but maybe people can offer some explanation or some intuition as to why this might be.
[inaudible question from the audience]

I'd be interested to look at that. Yes.
I see. So, just to make sure I understand what you're saying: the target scores were too widely distributed? ... Yes. Okay.