Okay, I'm going to describe LDC's efforts to create the LRE11 corpus for the NIST 2011 Language Recognition Evaluation.

First I'll review the requirements for data for the corpus and the process by which LDC selected languages for inclusion in the corpus. Then I'll review our data collection procedures for broadcast and telephone speech, and how we selected the segments that would be subject to auditing. I'll spend some time talking about the auditing process, in particular reviewing the steps we took to assess inter-auditor agreement on language classification, and then finally conclude with a summary of the released corpus.

So the requirements for 2011: first, we were to distribute previous LRE datasets to the evaluation participants. This included the previous test sets and also the very large training corpus that was prepared for LRE 2009, which primarily consists of a very large, only partially audited broadcast news collection.


The bulk of our effort for LRE11 was new resource creation. Starting with LRE 2009 there was a departure from the traditional corpus development effort for LREs: in addition to the telephone speech collection, we also included data from broadcast sources, specifically narrowband segments from broadcast sources, in a number of languages, and LRE11 also included broadcast speech.

The target was to collect data from twenty-four languages in LRE11, targeting both genres for most of the languages, with one exception. We have four varieties of Arabic in the LRE11 corpus. For Modern Standard Arabic, because this is a formal variety that is not typically the native language of an individual speaker, we did not collect telephone speech. For the three dialectal Arabic varieties, Iraqi, Levantine, and Maghrebi, we did not collect any broadcast segments, only telephone speech. Otherwise, all languages have both genres.

So as I mentioned, the target was twenty-four languages, or some of them might be called dialects; we just use the term varieties. Our goal was that at least some of these varieties be known to be mutually intelligible, to some extent, by at least some humans.

We targeted four hundred segments for each of the twenty-four languages, with at least two unique sources per language. The way we define source, for the broadcast sources in particular, is that a source is a provider plus a program. So CNN Larry King is a different source than CNN Headline News: the style is different, the speakers are different.

All right, so our goal was twenty-four languages. To select these languages we started with background research, looking at information in the literature. We compiled a list of candidate languages and assigned a confusability index score to each of the candidate languages. There are three possible scores. A zero reflects a language that is not likely to be confusable with any of the other candidate languages on the list. A one is possible confusion with another candidate language on the list: the languages are genetically related, and some systems, if not humans, may confuse the languages to some extent. And a two is for languages that are likely confusable with another candidate language: these are languages where the literature suggests that there is known mutual intelligibility, to some extent, between language pairs.
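This scoring scheme can be sketched as a small lookup; the pairs shown are illustrative examples drawn from this talk, not the full thirty-eight-language candidate list, and the helper name is my own:

```python
# Confusability index used during candidate review:
#   0 = not likely confusable with any other candidate
#   1 = possible confusion (genetically related; systems,
#       if not humans, may confuse them to some extent)
#   2 = likely confusable (known mutual intelligibility
#       reported in the literature)
# Illustrative pairs only -- not the full candidate list.
CONFUSABLE_PAIRS = {
    ("Iraqi Arabic", "Levantine Arabic"): 2,
    ("Hindi", "Urdu"): 2,
    ("American English", "Indian English"): 1,
}

def confusability(lang, pairs=CONFUSABLE_PAIRS):
    """Score for `lang`: the max over all pairs it appears in."""
    scores = [s for (a, b), s in pairs.items() if lang in (a, b)]
    return max(scores, default=0)  # 0 if no confusable partner listed
```

A language with no confusable partner on the list, like Mandarin in the final selection, falls through to the default score of zero.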

Through our review process we ended up with a candidate set of thirty-eight languages, which was whittled down to the twenty-four final evaluation languages with input from NIST and the sponsor, also considering things like how feasible it would actually be for us to collect the data.

Here is a table of the languages that we ended up selecting for LRE11. You can see that all of the Arabic varieties have a confusability score of two, because they were believed to be mutually intelligible with the other Arabic varieties. A language like American English received a confusability score of one, with the assumption that it at least had the potential to be confusable with Indian English. And then there are a few languages that received a confusability score of zero, for instance Mandarin: there are no known confusable languages for it in the selected list, at least as far as we are aware.

All right, moving on to the collection strategy. For broadcast collection we targeted multiple data providers and multiple sources. We had a small amount of data that had been collected previously and partially used for earlier LRE evaluations but had not been exposed, and so we used some data from the Voice of America broadcast collection. But most of the broadcast recordings used for LRE11 were newly collected. We had some archived audio from LDC's local satellite data collection, but also hundreds of hours of new collection at the three of our collection sites in Philadelphia, Tunis, and Hong Kong. We maintain these multiple collection sites in order to get access to programming that is simply not available given the satellite feeds that we can access in Philadelphia.


We began the broadcast collection believing that we would be able to find sufficient data in all twenty-four languages to support LRE11. It became apparent very quickly that this wasn't the case, and so we quickly scrambled to put together an additional collection facility in New Delhi. We actually developed for this collection a portable broadcast collection platform, essentially a small suitcase that contains all of the components required for a partner facility to do essentially plug-and-record. We partnered with a group in New Delhi, who ended up collecting a number of languages for us and were able to ramp up to full-scale collection within about thirty days.

Even with this, we were falling short of our targets for some of the languages, and we decided to pursue collection of streaming radio sources for a number of languages to supplement the collection. In this case we did some sample recordings, using native speakers to verify that a particular source contained sufficient content in the target language, and then ended up collecting data for a week or so per source.

One of the challenges of having all these different input streams for broadcast data is that we end up with a variety of audio formats that need to be reconciled downstream.

For the telephone collection we used what we call a claque-based collection model, where a claque is a native speaker informant. The reason we use this claque-based model is to ease the recruitment burden, as will become apparent in a moment. The claques that we hire for the study also end up serving as auditors, so they also provide language judgments.

For the telephone collection, our target was to identify two claques for each of the LRE languages, and to instruct each claque to make a single call to each of between fifteen and thirty individuals within their existing social network. When we recruited people to be claques for the study, part of the job description was: you know a lot of other people who speak your language, and you can convince them to do a phone call with you and have it recorded for research purposes.

Prior to the call being recorded, the callee hears a message saying this call is going to be recorded for research; if you agree, push one. They push one, and then the recording begins.

Because we were recruiting these claques primarily in Philadelphia, in some cases the multiple claques for a language knew each other, and there was a chance that their social networks would overlap. We wanted the callees to be distinct, and so we took some steps to ensure that the callees did not overlap within a language; wherever there was overlap, we excluded those callee sides from the corpus.

We also required the claques to make at least some of their calls in the US. We permitted them to call overseas, and most of them did, but we also required them to make some of their calls within the US to avoid any bi-uniqueness of channel and language conditions. If all the Thai calls were originating from Thailand, then there would be a particular channel characteristic that could be associated with Thai, and we wanted to obfuscate that.

All of the telephone speech was collected via LDC's existing telephone platform: 8 kHz, 8-bit mu-law.

All right, so now we have collected data and recordings, and we need to process the material for human auditing. We first run all of the selected files through a speech activity detection (SAD) system in order to distinguish speech from silence, music, and other kinds of non-speech. Based on the SAD output, for the telephone speech data we extract two segments per call, each being thirty to thirty-five seconds in duration. For the broadcast data we need to do additional bandwidth filtering: we run a bandwidth detector over the full broadcast recordings, and then from the intersection of the speech regions and the narrowband regions we identify contiguous regions of thirty-three or more seconds.


segments that are the speech and you yeah

that are greater than thirty seconds we identify a single thirty three second segment within

that the same

that region

we do not select multiple segments from the longer region because we want to avoid

having multiple segments of speech from this

a single speaker in the collection

a given the large number of languages and a large number of segments with the

salary and in some cases it was necessary for us to reduce the segment duration

down to as low as ten so

rather than the thirty three seconds

This is just a graphical depiction of that selection process. Given a speech file, we run the SAD system, distinguishing speech from non-speech. Over the speech regions we run the bandwidth detector to identify the narrowband segments, and our goal is specifically regions with at least thirty-three seconds of speech that are narrowband.
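As a rough sketch, the selection just described can be expressed as interval arithmetic over the SAD and bandwidth-detector outputs; the interval representation and the helper names (`intersect`, `pick_segments`) are my own illustration, not LDC's actual tooling:

```python
def intersect(a, b):
    """Intersect two sorted lists of (start, end) intervals, in seconds."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def pick_segments(speech, narrowband, min_dur=33.0):
    """One candidate segment per qualifying region, to avoid taking
    multiple segments from the same speaker."""
    segs = []
    for lo, hi in intersect(speech, narrowband):
        if hi - lo >= min_dur:
            segs.append((lo, lo + min_dur))  # single 33 s cut per region
    return segs
```

Only regions where the speech and narrowband detections overlap for at least the minimum duration survive, and each such region contributes exactly one segment.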

The identified segments are then converted into an auditor-friendly format that works well with the web-based auditing tool that our auditors use: that's 16 kHz, 16-bit for the broadcast data, and 8 kHz, single channel for the telephone speech. Again, we exclude the claque's call side from the auditing process. All of this processed data is then converted to PCM WAV files so that it can be easily rendered in a browser for the auditors. Auditors are presented with entire segments for judgment, so typically they are listening to thirty-three seconds of speech for broadcast, and similar durations for telephone segments.

We did some additional things with the LRE data prior to presenting it to auditors for judgment, with the specific goal of being able to assess inter-auditor agreement for language judgments. What we call the baseline are segments that are expected to be in the auditor's language: a Hindi auditor is being presented with a recording that is expected to be in Hindi, because somebody said they were a Hindi speaker and we collected their speech. For the telephone speech, claque auditors only listened to segments that came from the callees of another claque; this was just to minimize the chance that they would judge the segments based on recognizing the person's voice.

On top of this baseline, the auditors were also given additional distractor segments: up to ten percent additional segments were added to their auditing kit, drawn from a non-confusable language. So let's say I am a Thai auditor; I might have some English or some Mandarin segments thrown into my auditing kit. This was done to keep auditors on their toes, so that occasionally they would get a segment in a completely different language and could not just blindly accept everything as plausible.

We also added up to ten percent dual segments; these are segments that were also assigned to other auditors, so that we would get inter-annotator agreement numbers.

Then, for all the varieties that have another confusable language in the collection, what we call buddy languages, we added additional confusable segments to the auditor's kit. For possibly confusable varieties, like Polish and Slovak, the auditors judged ten percent additional segments over the baseline, drawn from the buddy language. For more confusable varieties, like Lao and Thai, they judged twenty-five percent over the baseline. And for known confusable varieties, like Hindi and Urdu, they judged all of the segments from the buddy language.

An individual kit could vary from this quite a bit, because the collection was happening in a somewhat non-linear fashion; a given kit that an auditor was working on might be all telephone speech, for instance. But this was our target for the auditing kit construction.
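The target kit composition described above can be sketched as follows; the function and tier names are hypothetical, and the percentages are the targets from the talk, not what any individual kit necessarily contained:

```python
import random

# additional buddy-language segments, as a fraction of the baseline
BUDDY_RATE = {"possible": 0.10, "likely": 0.25}

def build_kit(baseline, distractor_pool, dual_pool, buddy_pool,
              tier=None, rng=random):
    """Target composition of one auditor's kit (actual kits varied,
    since collection proceeded non-linearly)."""
    n = len(baseline)
    kit = list(baseline)
    # up to 10% distractors drawn from a non-confusable language
    kit += rng.sample(distractor_pool, min(n // 10, len(distractor_pool)))
    # up to 10% dual segments, also assigned to another auditor
    kit += rng.sample(dual_pool, min(n // 10, len(dual_pool)))
    if tier == "known":            # e.g. Hindi/Urdu: judge all buddy segments
        kit += list(buddy_pool)
    elif tier in BUDDY_RATE:       # e.g. Polish/Slovak, Lao/Thai
        k = int(n * BUDDY_RATE[tier])
        kit += rng.sample(buddy_pool, min(k, len(buddy_pool)))
    rng.shuffle(kit)
    return kit
```

So a baseline of 400 segments for a "possible" tier variety would target 400 + 40 distractors + 40 duals + 40 buddy segments.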

Briefly, the auditors were selected first via a preliminary online screening process. Candidates completed a short survey asking questions about their language background, which then led to an online listening test that included segments in the target language but also some distractor segments and potentially confusable language segments. Some of the feedback that we got from the screening test also helped us identify areas where additional auditor training was needed, or where we needed to verify the language labels, and to make the auditing task clearer.

About a hundred and thirty people took the screening test; those who passed were hired and given additional training. Part of the training consisted of training their ears to distinguish narrowband from wideband speech via signal quality perception.


The goal of the auditing task is to ensure that segments contain speech, are in the target variety, are narrowband, contain only one speaker, and have acceptable audio quality. We also asked questions like: have you heard this person's voice before, in segments that you previously judged? But the reliability on that was so low, given the thousands and thousands of segments people were judging, that we just abandoned that question.

Now a few words about auditor consistency. I'll get right to the bottom line: the numbers reported here are from segments that were assigned during the normal auditing process; all of this dual annotation was not done post hoc, it was done as part of the regular everyday auditing.

Let's look first at within-language agreement. This compares multiple judgments where the expected language of the segment was also the language of the auditors, and asks: what is the language label agreement? This is, for instance, a case where two auditors of the same language are both judging a segment that we expect to be in that language. Naively, we want this number to be close to one hundred percent. Well, it is not always one hundred percent. For the Arabic varieties, which we know are highly confusable with one another, we see very poor agreement: the Modern Standard Arabic judges agreed with one another only forty-two percent of the time on whether a segment was actually Modern Standard Arabic. The dialectal rates are higher; for Levantine Arabic, almost everyone agreed that a segment was Levantine when it was presented to them. Some other highlights: for Hindi and Urdu we also see agreement drop to around ninety percent, which is not surprising given that these language pairs are related.

Now looking at the dual annotation results: this looks at the exact same segments and asks what the agreement is just on the language question. We had 951 cases where both auditors said no, that's a non-target language; about 1,500 cases where both said yes, that's my target language; and 214 cases where one auditor said it's my language and the other said no, it's not. If you break that disagreement down, you will see that it comes mostly from three languages: Modern Standard Arabic, which had very poor dual annotation agreement, and then Hindi and Urdu. It's not surprising that these are the languages causing trouble.
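Using the counts just given (951 both-no, roughly 1,500 both-yes, 214 split decisions), overall agreement on the yes/no language question can be summarized in a short sketch; `dual_agreement` is a hypothetical helper, and the even division of disagreements between the two off-diagonal cells is an assumption made to enable the kappa computation:

```python
def dual_agreement(both_no, both_yes, split):
    """Observed agreement and Cohen's kappa for yes/no dual judgments.
    Assumes the `split` disagreements divide evenly between the two
    off-diagonal cells, which makes both auditors' marginals equal."""
    n = both_no + both_yes + split
    p_obs = (both_no + both_yes) / n
    p_yes = (both_yes + split / 2) / n          # marginal yes-rate
    p_exp = p_yes ** 2 + (1 - p_yes) ** 2       # chance agreement
    return p_obs, (p_obs - p_exp) / (1 - p_exp)

# counts as reported in the talk (the 1,500 figure is approximate)
p_obs, kappa = dual_agreement(951, 1500, 214)
# → observed agreement ≈ 0.92, Cohen's kappa ≈ 0.83
```

On these pooled numbers the raw agreement is high; the per-language breakdown is what exposes Modern Standard Arabic and Hindi/Urdu as the outliers.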

And finally, cross-language agreement. This looks at judgments where a segment was confirmed by one auditor to be in their language, where that language was also the expected language we believed the segment to be in, and the same segment was then judged by an auditor from another language who also said the segment was in their language. So this is like: a Hindi speaker listens to a segment that we think is Hindi and says, yes, that's Hindi; we play that same segment for an Urdu auditor and they say, yes, that's Urdu.


We see some interesting cross-language disagreement here. For segments whose expected language was Modern Standard Arabic, auditors of the other Arabic varieties also claimed them at rates around ninety percent, and we see similar numbers among the dialectal varieties.


And down here we see some confusion between American English and Indian English, which might seem somewhat surprising, but this is actually asymmetrical. What's going on is that when the expected language is American English but the auditor is an Indian English auditor, they are likely to claim that segment as their own language; the reverse does not happen: an American English auditor does not claim an Indian English segment to be American English. We see a similar kind of asymmetry for Hindi and Urdu.

Wrapping up with respect to data distribution: we distributed the data to NIST in six incremental releases. The packages contained the full audio recordings, the auditor versions of the segments, and the audit results for segments meeting particular criteria: is the segment in the target language, does it contain speech, and is all the speech from one speaker; the answers to all of those needed to be yes. For the question of whether the entire segment sounds like a narrowband signal, we delivered both the yes and no segment judgments, along with the full segment metadata tables, so that NIST could subsample the segments for the evaluation.

This table summarizes the deliveries. We hit the four-hundred-segment target for all but two languages, Lao and Ukrainian, where we had a real struggle to find enough data.

In conclusion: we prepared significant quantities of new telephone and broadcast data in twenty-four languages, which included several confusable varieties. We needed to adapt our collection strategies to support the corpus requirements. Our team of auditors made over twenty-two thousand judgments, yielding about ten thousand usable LRE segments. The auditing kits were constructed to support consistency analysis. We found that within-language agreement was typically over ninety-five percent, with a few exceptions. We did see cross-language confusion, particularly for the Arabic varieties, and an asymmetrical confusion, notably between American English and Indian English, Hindi and Urdu, and Farsi and Dari. This corpus supported the LRE 2011 evaluation and will ultimately be published, in coordination with our sponsors. Okay, thank you.



That's right. If we had only one auditor judgment for a segment and it passed, the segment was delivered. If we had multiple judgments and they were all in agreement, it was delivered. If we had discrepant judgments, those segments were withheld from what was delivered to NIST. Those discrepant segments will be included in the eventual general publication of LRE11 when it appears, along with the metadata, since that might be interesting data for research.


Right, so it's somewhat asymmetrical: there are certain varieties that people are more accepting of if they are linguistically similar to their own. The dialect auditors could typically tell; not only could they tell that a segment wasn't Moroccan, say, they could often tell specifically which dialect it was. The real confusion comes in with Modern Standard Arabic, which is really not spoken natively by anyone. Also, the Modern Standard Arabic spoken in the broadcast sources we were collecting may contain some dialectal elements: if you're doing an interview with someone from Iraq, some Iraqi dialect may be present in what was reported to be Modern Standard Arabic. So that's a confounding factor in the analysis.