0:00:31 Okay, I'm going to describe LDC's efforts to create the LRE11 corpus for the NIST 2011 Language Recognition Evaluation.
0:00:59 First I'll review the requirements for data for that corpus and the process by which LDC selected languages for inclusion in the corpus, then review our data collection procedures for broadcast and telephone speech and how we selected the segments that would be subject to auditing. I'll then spend some time on the auditing process, in particular reviewing the steps we took to assess inter-auditor agreement on language classification, and finally conclude with a summary of the released corpus.
0:01:37 So, the requirements for 2011. First, we were to distribute previous LRE datasets to eligible evaluation participants. This included the previous test sets and also the very large training corpus that was prepared for LRE 2009, which primarily consists of a very large, only partially audited broadcast news collection.
0:02:01 The bulk of our effort for LRE11 was new resource creation. Starting with LRE 2009 there was a departure from the traditional corpus development effort of earlier LREs, in that in addition to telephone speech collection we also included data from broadcast sources, specifically narrowband segments from broadcast sources, in a number of languages; LRE11 also included broadcast speech.
0:02:33 The target was to collect data from twenty-four languages in LRE11, targeting both genres for most of the languages, with one exception. We have four varieties of Arabic in the LRE11 corpus. Because Modern Standard Arabic is a formal variety that is not typically the native language of an individual speaker, we did not collect telephone speech for Modern Standard Arabic; and for the three Arabic dialectal varieties, Iraqi, Levantine, and Maghrebi, we did not collect any broadcast segments, only telephone speech. Otherwise, all languages are represented in both genres.
0:03:14 So, as I mentioned, the target was twenty-four languages, or what some might call dialects; we just use the term varieties. Our goal was to have at least some of these varieties be known to be mutually intelligible, to some extent, by at least some humans.
0:03:33 We targeted four hundred segments for each of the twenty-four languages, with at least two unique sources per language. The way we define source, for the broadcast sources in particular, is that a source is a provider plus program: CNN Larry King is a different source than CNN Headline News, because the style is different and the speakers are different.
0:03:58 All right, so our goal was twenty-four languages. To select these languages we started by doing some background research, looking at information in the literature. We compiled a list of candidate languages and assigned a confusability index score to each of them. There are three possible scores. A zero reflects a language that is not likely to be confusable with any of the other candidate languages on the list. A one indicates possible confusion with another candidate language on the list: the languages are genetically related, and some systems, though not humans, confuse the languages to some extent. A two is for languages that are likely confusable with another candidate language: these are languages where the literature suggests there is mutual intelligibility, to some extent, between language pairs.
0:04:56 After our review process we ended up with a candidate set of thirty-eight languages, which was whittled down to the twenty-four final evaluation languages with input from NIST and the sponsor, also considering things like how feasible it would actually be for us to collect the data and find speakers.
0:05:14 Here is a table of the languages that we ended up selecting for LRE11. You can see that all of the Arabic varieties have a confusability score of two, because they were believed to be mutually intelligible with the other Arabic varieties. A language like American English received a confusability score of one, with the assumption that it has at least the potential to be confusable with Indian English. And a few languages received a confusability score of zero, for instance Mandarin: there are no known confusable languages for Mandarin in the selected list.
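The scoring rubric just described can be sketched as a small lookup table. The entries below use only the example languages mentioned in the talk, and the helper function is a hypothetical illustration, not LDC's tooling.

```python
# Confusability index from the talk:
#   0 = not likely confusable with any other candidate language
#   1 = possibly confusable (genetically related; some systems, but not
#       humans, confuse the pair)
#   2 = likely confusable (literature suggests some mutual intelligibility)
CONFUSABILITY = {
    "Modern Standard Arabic": 2,
    "Iraqi Arabic": 2,
    "Levantine Arabic": 2,
    "Maghrebi Arabic": 2,
    "American English": 1,  # potentially confusable with Indian English
    "Indian English": 1,
    "Mandarin": 0,          # no known confusable language on the list
}

def has_confusable_buddy(language: str) -> bool:
    """True when a variety scored 1 or 2, i.e. it has at least one
    potentially confusable language among the candidates."""
    return CONFUSABILITY.get(language, 0) >= 1
```

As described later in the talk, varieties with a nonzero score get extra segments from their confusable counterparts added to the auditing kits.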
0:05:57 All right, moving on to the collection strategy. For broadcast collection we targeted multiple data providers and multiple sources. We had a small amount of data that had been collected previously and partially used for earlier LRE evaluations but had not been exposed, so we used some data from the Voice of America broadcast collection. Most of the broadcast recordings used for LRE11, however, were newly collected. We had some archived audio from LDC's local satellite data collection, but also hundreds of hours of new collection at three of our collection sites, in Philadelphia, in Tunis, and in Hong Kong. We maintain these multiple collection sites in order to get access to programming that is simply not available via the satellite feeds that we can access in Philadelphia.
0:06:58 We began the broadcast collection believing that we would be able to find sufficient data in the twenty-four languages to support LRE11. We found out very quickly that this wasn't the case, and so we quickly scrambled to put together an additional collection facility in New Delhi. We actually developed for this collection a portable broadcast collection platform, essentially a small suitcase that contains all of the components required for a partner facility to do essentially plug-and-record. We partnered with a group in New Delhi, and they ended up collecting several languages for us and ramped up to full-scale collection within about thirty days.
0:07:49 We also found, as the collection went on, that we were falling short of our targets for some of the languages, and decided to pursue collection of streaming radio sources for a number of languages to supplement the collection. In this case we made some sample recordings, using native speakers to verify that a particular source contained sufficient content in the target language, and ended up collecting data from each source for a week or so.
0:08:22 One of the challenges of having all these different input streams for broadcast data is that we end up with a variety of audio formats that need to be reconciled downstream.
0:08:36 For the telephone collection we used what we call a claque-based collection model, where a claque is a native speaker informant. The reason we used this claque-based model was to ease the recruitment burden, as will become apparent in a moment. The claques we hired for the study also end up serving as auditors, so they do the language judgments on the collection.
0:09:01 Our target was to identify two claques for each of the LRE languages and have each claque make a single call to each of between fifteen and thirty individuals within their existing social network. So when we recruited people to be claques for the study, part of the job description was: you know a lot of other people who speak your language, and you can convince them to do a phone call with you and have it recorded for research purposes.
0:09:34 Prior to the call being recorded, the callee hears a message saying that the call is going to be recorded for research; if they agree, they push one, and then the recording begins.
0:09:46 Because we were recruiting these claques primarily in Philadelphia, in some cases the multiple claques for a language knew each other, and there was a chance that their social networks would overlap. We wanted the callees to be distinct, so we took some steps to ensure that the callees did not overlap within a language; where that did occur, we excluded those call sides from the corpus.
0:10:13 We also required the claques to make at least some of their calls in the US. We permitted them to call overseas, and most of them did, but we also required them to make some of their calls within the US to avoid any bi-uniqueness of channel and language conditions: if all of the Thai calls originated from Thailand, then there would be a particular channel characteristic that could be associated with Thai, and we wanted to obfuscate that.
0:10:44 All of the telephone speech was collected via LDC's existing telephone collection platform, at 8 kHz, 8-bit mu-law.
0:10:56 All right, so now we have collected data and have recordings, and we need to process the material for human auditing. We first run all of the selected files through a speech activity detection system in order to distinguish speech from silence, music, and other kinds of non-speech. Based on the SAD output, for the telephone speech data we extract two segments per call, each being thirty to thirty-five seconds in duration.
0:11:25 For the broadcast data we need to do an additional bandwidth-filtering step. Using the Brno bandwidth detector, we run over the full broadcast recordings, and then from the intersection of the speech regions and the narrowband regions we identify contiguous regions of thirty-three or more seconds. From each broadcast region that is both speech and narrowband and greater than thirty seconds, we identify a single thirty-three-second segment within that region. We do not select multiple segments from a longer region, because we want to avoid having multiple segments of speech from a single speaker in the collection. Given the large number of languages and the large number of segments in the target, in some cases it was necessary for us to reduce the segment duration down to as low as ten seconds rather than the thirty-three seconds.
0:12:31 This is just a graphical depiction of that selection process. We take the speech file and run it through the SAD system, distinguishing speech from non-speech; once we have the speech regions, we run the bandwidth detector to identify the narrowband segments; and our goal is to specifically select regions with at least thirty-three seconds of speech that are narrowband.
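The selection just depicted can be sketched as interval arithmetic: intersect the SAD speech regions with the narrowband regions, keep contiguous stretches of at least thirty-three seconds, and take a single window per stretch. This is a minimal illustration with made-up interval values, not the production pipeline.

```python
def intersect(a, b):
    """Intersect two sorted lists of (start, end) intervals, in seconds."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def pick_segments(speech, narrowband, min_dur=33.0):
    """One min_dur-second window per qualifying region; a single window
    per region avoids multiple segments from the same speaker."""
    return [(lo, lo + min_dur)
            for lo, hi in intersect(speech, narrowband)
            if hi - lo >= min_dur]

# speech = SAD output, narrowband = bandwidth-detector output (example values)
speech = [(0.0, 50.0), (60.0, 70.0)]
narrowband = [(5.0, 45.0), (60.0, 100.0)]
# first overlap (5, 45) is 40 s -> yields one segment; second (60, 70) is
# only 10 s -> skipped
```

In practice, as noted above, the thirty-three-second threshold was relaxed down to as little as ten seconds for hard-to-fill languages, which here would just mean a smaller `min_dur`.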
0:13:00 All right, the identified segments are then converted into an auditor-friendly format that works well with the web-based auditing tool that our auditors use: 16 kHz, 16-bit for the broadcast data; 8 kHz, single channel for the telephone speech. Again, we exclude the claque's call side from the auditing process. All of this processed data is then converted to PCM WAV files so that it can be easily rendered in a browser. Auditors are presented with entire segments for judgment, so typically they are listening to thirty-three seconds of speech for broadcast, and similar durations for telephone segments.
0:13:44 We did some additional things with the LRE data prior to presenting it to auditors for judgment, with the specific goal of being able to assess inter-auditor agreement on language judgments. What we call the baseline are segments that are expected to be in the auditor's language: as a Hindi auditor, I am presented with a recording that is expected to be in Hindi, because somebody said they were a Hindi speaker and we collected their speech. For the telephone speech, claque auditors only listened to segments that were from the callees of another claque; this was to minimize the chance that they would accept segments just because they knew the person's voice.
0:14:41 On top of this baseline, the auditors were also given additional distractor segments: up to ten percent additional segments were added to their auditing pipeline, drawn from a non-confusable language. So, say I am the Hindi auditor: I might have some English or some Mandarin segments thrown into my auditing kit. This was done to keep auditors on their toes, so that occasionally they would get a segment in a completely different language and could not just blindly accept everything. We also added up to ten percent dual segments; these are segments that were also assigned to other auditors, so that we would get inter-annotator agreement numbers.
0:15:30 Then, for all the varieties that have another confusable language in the collection, we call these buddy languages, we added additional confusable segments to the auditor's kit. For possibly confusable varieties, like Polish and Slovak, the auditors judged ten percent additional segments over the baseline from the buddy language. For likely confusable varieties, like Lao and Thai, they judged twenty-five percent over the baseline. And for known confusable varieties, like Hindi and Urdu, they judged all the segments from the buddy language. An individual kit could vary from this quite a bit, because the collection was happening in a somewhat nonlinear fashion, so a given kit that an auditor was working on might be all telephone speech segments, for instance; but this was our target for the auditing kit construction.
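The kit targets above (a baseline of expected-language segments, plus up to ten percent distractors, up to ten percent dual-judged segments, and a buddy-language share keyed to confusability) could be tallied as below; the function and tier names are a hypothetical sketch of the stated proportions, not LDC's actual kit builder.

```python
import math

# Buddy-language share of the baseline, per the tiers in the talk.
BUDDY_RATE = {
    "none": 0.00,      # no confusable buddy language in the collection
    "possible": 0.10,  # e.g. Polish / Slovak
    "likely": 0.25,    # e.g. Lao / Thai
    "known": 1.00,     # e.g. Hindi / Urdu: judge all buddy segments
}

def kit_targets(n_baseline: int, tier: str) -> dict:
    """Target segment counts for one auditor's kit."""
    return {
        "baseline": n_baseline,
        "distractor": math.ceil(0.10 * n_baseline),  # non-confusable language
        "dual": math.ceil(0.10 * n_baseline),        # also judged by a second auditor
        "buddy": math.ceil(BUDDY_RATE[tier] * n_baseline),
    }
```

For a four-hundred-segment baseline in a likely-confusable variety, this works out to forty distractors, forty dual segments, and a hundred buddy-language segments on top of the baseline.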
0:16:26 Briefly, the auditors were selected first via a preliminary online screening process. Candidates completed a short survey asking questions about their language background, which led to an online listening test that included segments in the target language but also some of the distractor segments and potentially confusable language segments. Some of the feedback that we got on the screening test helped us identify areas where additional auditor training was needed, or where we needed to verify the language labels, to make the auditing task clearer. About a hundred and thirty people took the screening test; those who passed were hired and given additional training, part of which consisted of training their ears to distinguish narrowband from wideband speech via a signal quality perception exercise.
0:17:26 The goal of the auditing task is to ensure that segments contain speech, are in the target variety, are narrowband, contain only one speaker, and have acceptable audio quality. We also asked questions like "have you heard this person's voice before, in segments that you previously judged?" to probe reliability, but given the thousands and thousands of segments people were judging, we eventually abandoned those questions.
0:17:58 Now a few words about auditing consistency. I'll get right to the bottom line: the numbers reported here are from segments that were assigned during the normal auditing process. All of the dual annotation we conducted was not done post hoc; it was done as part of the regular everyday workflow.
0:18:20 Let's look first at within-language agreement. This is comparing multiple judgments where the expected language of the segment was also the language of the auditors, and asking: what is the language label agreement? This is, for instance, a case where two Hindi speakers are judging a segment that we expect to be Hindi. Naively, we want this number to be close to one hundred percent. Well, it is not always a hundred percent. For the Arabic varieties, which we know are highly confusable with one another, we see very poor agreement: the Modern Standard Arabic judges, for instance, only agreed with one another forty-two percent of the time on whether a segment was actually Modern Standard Arabic. The dialectal rates are higher; for Levantine Arabic almost everyone agreed that a segment was Levantine when it was presented to them. Some other highlights: for Hindi and Urdu we also see lower agreement, around ninety percent, which is not surprising given that these language pairs are related.
0:19:32 Now looking at the dual annotation results: for the exact same segments, what is the agreement just on the language question? We had nine hundred fifty-one cases where both auditors said no, that is not my target language; fifteen hundred cases where both said yes, that is my target language; and two hundred fourteen cases where one auditor said it is my language and the other said no, it is not. If you break this number down, you will see that the disagreement comes mostly from three languages: Modern Standard Arabic had very low dual annotation agreement, and agreement for Hindi and Urdu was also low. So it is not surprising which languages are causing trouble.
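From the counts just quoted, the raw observed agreement on the language question works out to roughly ninety-two percent; a quick check (the variable names are mine):

```python
# Dual-annotation counts reported in the talk.
both_no, both_yes, disagree = 951, 1500, 214

total = both_no + both_yes + disagree          # 2665 dually judged segments
observed_agreement = (both_no + both_yes) / total

print(f"observed agreement: {observed_agreement:.1%}")  # observed agreement: 92.0%
```

Note that a chance-corrected statistic such as Cohen's kappa would additionally need the yes/no direction of each of the 214 disagreements, which the talk does not report.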
0:20:14 And finally, looking at cross-language agreement: this is looking at judgments where a segment was confirmed by one auditor to be in their language, their language being the expected language, the one we believed the segment to be in, and the same segment was then judged by an auditor from another language who also said that the segment was in their language. So this is like: a Hindi speaker listens to a segment that we think is Hindi and says yes, that is Hindi; then we play that same segment for an Urdu auditor and they say yes, that is Urdu.
0:20:47 We see some interesting cross-language disagreement here. For the varieties where the expected language was Modern Standard Arabic, a large share of segments, around ninety percent, were also claimed by auditors of the dialectal varieties, with similar numbers for the other Arabic pairs.
0:21:14 Then down here we see some confusion between American English and Indian English, which one might find somewhat surprising, but this is actually asymmetrical. What is going on is that when the expected language is American English but the auditor is an Indian English auditor, they are likely to claim that segment as their own language; the reverse does not happen: an American English auditor does not claim an Indian English segment to be American English. We see a similar kind of asymmetry for Hindi and Urdu.
0:21:58 Wrapping up with respect to data distribution: we distributed the data to NIST in six incremental releases. The packages contain the full audio recordings, the auditor version of the segments, and the audit results for segments meeting particular criteria: is the segment in the target language, does it contain speech, and is all the speech from one speaker; the answer to all of those needed to be yes. For the question "does the entire segment sound like a narrowband signal," we delivered both yes and no segment judgments, along with the full segment metadata tables, so that NIST could subsample the segments for the evaluation.
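The delivery criteria can be read as a simple predicate over the audit answers; the record layout and field names here are hypothetical, since the talk specifies the questions but not the metadata schema.

```python
def deliverable(audit: dict) -> bool:
    """Keep a segment only if it is in the target language, contains speech,
    and all speech comes from one speaker. The narrowband question is not a
    filter: both yes and no judgments were delivered, with metadata, so that
    NIST could subsample."""
    return (audit["in_target_language"] == "yes"
            and audit["contains_speech"] == "yes"
            and audit["single_speaker"] == "yes")
```

A segment judged narrowband-"no" but passing the three criteria would still be delivered, just flagged as such in the metadata tables.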
0:22:41 This is just a table that summarizes the delivery. We hit the four-hundred-segment target for all but two languages, Lao and Ukrainian, where we had a real struggle to find sources.
0:22:58 In conclusion: we prepared significant amounts of new telephone and broadcast data in twenty-four languages, which included several confusable varieties. We needed to adapt our collection strategies to support the corpus requirements. Our team of auditors made over twenty-two thousand audit judgments, yielding about ten thousand usable LRE segments. The auditing kits were constructed to support consistency analysis, and we found that within-language agreement was typically over ninety-five percent, with a few exceptions as noted. We did see cross-language confusion, particularly for the Arabic varieties and the buddy languages, and an asymmetrical pattern of confusion, with high overlap between American English and Indian English, Hindi and Urdu, and Farsi and Dari. This corpus supported the LRE 2011 evaluation and will ultimately be published in the LDC catalog and distributed to sponsors. Okay, thank you.
0:24:29 That's right. If we had only one auditor judgment for a segment, that segment was delivered. If we had multiple judgments and they were all in agreement, it was delivered. If we had disagreeing judgments, those segments were withheld from what was delivered to NIST. Those disagreed-upon segments will be included in the ultimate general publication of LRE11 when it appears in the LDC catalog, since they might be interesting data for research, along with the metadata.
0:25:25 Right, so it is somewhat asymmetrical; there are certain varieties that people are more accepting of if they are linguistically similar to their own. With the dialect auditors, though, they could typically tell: not only could they tell that a segment was not Moroccan, say, they could tell specifically that it was Iraqi. The real confusion comes in with Modern Standard Arabic, which is really not spoken natively by anyone. Also, Modern Standard Arabic spoken in the broadcast sources that we were collecting may contain some dialectal elements: if a program includes an interview with someone from Iraq, some Iraqi dialect may be present in what was reported to be Modern Standard Arabic. So that is a confounding factor in the analysis.