Okay, the third talk we have is on the BEST speaker recognition interim assessment.
Okay. So we have overall results from BEST, and the calibrated test data. First, I'm going to give you the background information to help you interpret what you'll see.
A quick word on the BEST program: it stands for Biometrics Exploitation of Science and Technology. It was an IARPA program that launched in 2009 with the objective of advancing the state of the art in biometrics, including speaker recognition. I won't say more about it here, but if you want more information, here's a link.
Now, NIST managed what we called the BEST BIA, which was the BEST evaluation speaker track, or BEST interim assessment. The objective here was to measure progress in speaker recognition relative to performance prior to the start of the program, and also to measure performance on data new to speaker recognition evaluations.
Before the results, let me check a few terms, since not everyone here is close to this area; you can have anonymity here. Who does not know what speaker detection is? Raise your hand.
Okay. How about a target speaker? Anyone not know what this is, or a non-target speaker? Or a test segment? Anyone not know what this is? Okay, it's just a segment of speech data containing one or more unknown speakers. And lastly, anyone not know what Mixer is? Okay: Mixer is a set of speech corpora collected by the LDC to support speaker recognition.
So, the data used for the evaluation was very big and very complex, bigger than anything NIST had used for speaker recognition prior to that: over a thousand speakers, two thousand audio segments, and forty-one million trials. It made use of previous Mixer collections as well as newly collected Mixer data. That included Mixer 1 and 2, which had five different languages: Arabic, English, Mandarin, Russian, and Spanish.
Mixer 5: Mixer 5 was used in SRE08, for those who are familiar with that collection; here we used the Mixer 5 speakers that were not used in that evaluation. Mixer 6 likewise was used in SRE10, so here we used speakers that were not used in SRE10. We also used Greybeard. Greybeard was used in SRE10 as well, but the key was not released, so that we'd be able to reuse it in this evaluation; that's why we were encouraged not to release the key. And we used Mixer 7, which was newly collected with the objective of addressing new sources of variability.
All of this was analyzed as part of the BEST evaluation. And I think George went over this already: there were nine core conditions meant to focus the research.
First, a telephone condition: train and test on telephone calls. This is the classic common condition that's been used probably since the beginning. A microphone condition: train and test on far-field microphones; this is sort of the interview condition. A channel condition, where we train on microphone and test on telephone, but limit consideration to phone calls. Another channel condition, where we train on far- or near-field mics and test on telephone, also limited to phone calls. A speaking-style condition, where we train on interview and test on phone calls, with these restricted to microphone recordings. A language condition, where we train and test on multiple languages. A second language condition, where we train and test on a single language, over both microphones and telephone. A multisession-train condition, and a second multisession-train condition; the difference between these two is that one was tested on phone calls and the other was tested on interview. And that's all nine.
So, on factors affecting performance: we really wanted to focus the evaluation on measuring system robustness. As we saw yesterday, the factors were conditioned into three categories, intrinsic, extrinsic, and parametric, and these were all actually tested in the BEST evaluation.
Intrinsic: speech style, where there is interview or phone-call speech; and vocal effort, where there's normal vocal effort, recorded over a cell phone, low vocal effort, or high vocal effort. In terms of how the vocal effort was elicited: low and high vocal effort were recorded over headsets, while normal effort was collected from ordinary phone calls. For high vocal effort, noise was played over the headset to encourage the speaker to raise his or her voice; for low vocal effort there was no noise, but a sign encouraging the speaker to lower his or her voice. A question that remains is whether this is a realistic approach, that is, whether that's what actually gets produced in the field; regardless, the effects seem interesting.
As for extrinsic factors: channel, which is microphone versus telephone, with different microphone types, and for telephone, different transmission types and handsets. Something new and relatively interesting was changing the distance between interviewer and subject, to see if there was some additional vocal effort that could be elicited that way.
Reverberation: here we had reverberation that was artificially added, but there were also two different real rooms, one that was meant to be reverberant and one that was meant not to be. So we had both natural reverberation as well as additive reverberation, and additive noise. In terms of the additive noise... sorry, in terms of the reverb: how we did this was using a procedure that was proposed by MITRE but actually implemented by MIT Lincoln Laboratory, independent of the participants. The method was to transform the collected signals to have the reverberation qualities of particular rooms, given a range of dimensions and surface conditions. As George showed, there were seven different reverberation conditions, from roughly 0.16 to roughly 1.3 seconds RT60.
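The MITRE/Lincoln procedure itself isn't described here, but the basic idea of additive (synthetic) reverberation can be sketched with a toy room impulse response whose exponential decay is set by the target RT60 (the time for the reverberant energy to fall by 60 dB), convolved with the clean signal. The function names and parameters below are illustrative, not the actual implementation.

```python
import numpy as np

def toy_rir(rt60, fs=8000):
    """Toy room impulse response: exponentially decaying white noise.

    RT60 is the time (seconds) for the energy to decay by 60 dB, so the
    amplitude envelope is 10 ** (-3 * t / rt60)  (-30 dB amplitude = -60 dB energy).
    """
    rng = np.random.default_rng(1)
    t = np.arange(int(rt60 * fs)) / fs
    return rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / rt60)

def add_reverb(clean, rt60, fs=8000):
    """Convolve a clean signal with the toy RIR and normalize the peak."""
    wet = np.convolve(clean, toy_rir(rt60, fs))[: clean.size]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# The BEST conditions ranged from roughly 0.16 s to 1.3 s RT60.
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
wet = add_reverb(clean, rt60=0.16)
```

A real implementation would use measured or simulated RIRs for rooms with specific dimensions and surface materials; the exponential-decay noise burst only captures the RT60 behavior.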
Additive noise was also implemented by MIT Lincoln Laboratory. There were two noise types. One was HVAC, which is heating, ventilation, and air conditioning; this is a sort of standard office-room background noise. The other was speech-spectrum noise, a Gaussian noise filtered to the spectrum of speech. And there were two different noise levels, one at 15 dB and the other at 6 dB, and these were C-message weighted.
Let me play some examples. Here's speech-spectrum noise at 15 dB: [audio sample plays]. And at 6 dB: [audio sample plays]. And now the HVAC noise: [audio sample plays]. Okay.
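Lincoln's exact noise-addition procedure isn't given in the talk, and the stated levels were C-message weighted, which the sketch below omits; but mixing noise at a nominal SNR reduces to scaling the noise so that the speech-to-noise power ratio hits the target. All names here are illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so 10*log10(P_speech / P_noise) == snr_db, then add.

    Powers are plain mean-square values; the actual BEST levels were
    C-message weighted, which is not modeled here.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a speech signal
hvac = rng.standard_normal(16000) * 0.3    # stand-in for HVAC-like noise
noisy15 = mix_at_snr(speech, hvac, 15.0)   # the two BEST levels
noisy6 = mix_at_snr(speech, hvac, 6.0)
```

Speech-spectrum noise would additionally be filtered to the long-term average spectrum of speech before mixing.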
And in terms of parametric factors: there were five different languages, and there was also aging data from Greybeard; the same data was used in SRE10, and this is the reason we were encouraged not to release the key. And there were multiple training sessions, something that was new in BEST. In the past we had multiple training sessions that were phone calls; in this case we had multiple training sessions over interviews, some with the same speech over a microphone, some with maybe one or two microphones but different speech.
Something we've seen a few times, but maybe I'll explain it some, is the primary metric, which was different for BEST. It was originally conceived in 1996, so it took something like sixteen years to implement: the false alarm rate at a corresponding miss rate of ten percent. One of its advantages is that it is simple and clearly defined, and the false alarm rate may be viewed as representing the cost of the wasted listening effort incurred by using the system at the specified miss rate. In contrast to equal error rate, it does focus on the low-false-alarm region, which is likely to be of interest to a number of applications. And we can use the rule of 30 to determine that, with a ten percent miss rate, we'd only need three hundred target trials or so to measure the required miss rate.
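NIST's actual scoring tools aren't shown in the talk, but as a minimal sketch (assuming higher scores mean "more likely target" and a trial is accepted when its score reaches the threshold), the primary metric can be computed from lists of target and non-target trial scores: pick the threshold at which ten percent of target trials are missed, then report the fraction of non-target trials at or above it.

```python
import numpy as np

def fa_at_miss(target_scores, nontarget_scores, miss_rate=0.10):
    """False-alarm rate at the threshold giving the requested miss rate.

    A trial is accepted when score >= threshold, so the miss rate is the
    fraction of target scores below the threshold; the threshold is the
    miss_rate quantile of the target scores.
    """
    tgt = np.asarray(target_scores, dtype=float)
    thr = np.quantile(tgt, miss_rate)
    non = np.asarray(nontarget_scores, dtype=float)
    return float(np.mean(non >= thr))

# Toy example with well-separated Gaussian score distributions.
rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 300)      # rule of 30: ~300 target trials give
non = rng.normal(-2.0, 1.0, 30000)   # ~30 misses at a 10% miss rate
fa = fa_at_miss(tgt, non)
```

The rule of 30 referenced above says an error rate estimate is reasonably trustworthy once it is based on about 30 observed errors, hence roughly 300 target trials for a 10% miss rate.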
So let's look at just a few general performance trends. I should note here that we are sharing general trends; there were some things that were system-specific, but what we're trying to share here are things that were common across the systems submitted for this evaluation.

So, in terms of language: the green lines are a baseline system, meant to represent the state of the art prior to the start of the evaluation, and the blue is a system submitted for the BEST evaluation. The solid lines are English; the dashed lines are Spanish. Two things to note: the system submitted for the BEST evaluation shows better performance than the baseline across the range of operating points, and performance on English and Spanish data was comparable, so not a big language effect.
This is meant to help calibrate people's eyes to the new metric. Here we see the false alarm rate at a ten percent miss rate: the first is somewhere around 20.8 percent, the second around 0.1 percent, and the third and fourth are both around 0.5 percent.
Here's another look at language; this is for Mixer 1 and 2, a single system. The system performed mostly better on English and Spanish than on the other languages; in fact, some systems performed better on Spanish than on English. Why this would be is sort of a mystery, right? And here's another chance to calibrate your eyes to the new metric: reading across the ten percent miss line, the intersections give the primary metric for BEST.
So, for speaking style, we're looking at one system's performance. One line is train and test on interview; the green line is train on interview, test on phone call; the rest are train on phone call, test on phone call. Interview train-and-test gives the best performance, but there's a confounding factor here: the interview test segments were actually longer, and this could explain why it performed better.
We did a test in SRE10 on vocal effort and found something somewhat surprising: low vocal effort performed better than normal. We expected high vocal effort to perform worse, but we also expected low vocal effort to perform worse than normal vocal effort. Similarly, in the BEST evaluation, high vocal effort stands out across systems as hard, and low vocal effort stands out across systems as easy. This is consistent with, or at least similar to, what we saw in SRE10.
So there were, again, seven reverb conditions plus no reverb, and these varied widely in terms of how much reverb was applied. For those who are not able to immediately imagine what something would sound like based on an RT60, let me play a couple of examples. This was the least reverberant condition: [audio sample plays]. And the most reverberant condition: [audio sample plays].
The thing to notice on this plot is that the degradation corresponds to the RT60 time, and that despite the mismatch, that is, training on non-reverberant speech and testing on reverberant speech, performance seemed to be better with a small amount of reverb than with no reverb at all. That was also somewhat surprising.
A similar thing with additive noise: again, no noise in train but noise in test. Compared with testing without noise, the 15 dB noise had little effect, and the 6 dB noise a bit more. One thing that was kind of interesting to us: if you look at, for example, the red and the dark blue lines, or the cyan and the green lines, there's not a lot of difference. This was a surprise; we expected speech-spectrum noise to be more difficult to deal with than HVAC noise, but as you can see there really was not a great difference between the two spectra. One other point I think worth making: a suggestion has been made that maybe a one percent miss rate is more appropriate under certain circumstances than ten percent. In this condition, performance was good enough that we were not able to measure the primary metric, because the curve never reached a ten percent miss rate.
So, in summary: this was, by a fair amount, the largest and most complex NIST speaker recognition evaluation to date. We examined several factors affecting performance and observed several surprising results, some of which were recreations of the results in SRE10, and the new conditions, for example additive noise and reverberation, also provided some surprises. The use of synthetic data is very exciting to us, because it's oftentimes difficult and expensive to collect data, and difficult to control. So if synthetic data turns out to be a reasonable way to evaluate systems in the future, this would be a very exciting and useful discovery. And finally, there was improvement observed over the baseline in all conditions, so participants should be pleased with that. Thank you.
[applause]

[inaudible question from the audience]

Yeah, I'm not sure.
[inaudible question from the audience]
Yes. So, that was not explored in this evaluation, but this is a fruitful area for future exploration.
Sure.

[inaudible question from the audience]

Yes.
And I should probably admit at this point that we have not done deeper analysis on basically any of this. It was a large and complex evaluation, so in order to be able to cover the main points, we had to leave the deeper analysis aside. We'd actually like to explore this more.
[inaudible question from the audience]
Yes, and I guess I should point out that it's not just in the no-noise case; across all the reverb conditions we saw this effect, just with different magnitudes. Yes, I think it was the same trend, which was surprising, but maybe people can offer some explanation or some intuition as to why this might be.
[inaudible question from the audience]

I'd be interested to look at that. Yes.
I see. So, just to make sure I understand what you're saying: the target scores were too widely distributed? ... Yes. Okay.