

okay i'm presenting this on behalf of Kevin Walker, who wasn't able to attend due to a very abnormal diversion into a sixteen-hour plane ride

so this is kind of a departure from the other talks in the session and the conference as a whole, but I think it's of interest to this community nonetheless. I'm going to briefly describe the RATS program and its goals,

and then really delve into the data creation process for RATS. I'll talk a little bit about how we're generating the content that's used in the RATS program, the system that we built to produce degraded-audio-quality recordings for the program, talk a bit about the annotation process, and then focus on some details of the speaker ID evaluations just about to start within RATS

so by way of introduction, RATS is a three-year DARPA program that's targeting speech in extremely noisy and highly distorted channels. Specifically, it's targeting noise in the signal, not background noise; noisy radio transmissions are the kind of data being targeted.

there are four evaluation tasks within RATS: speech activity detection, language ID, speaker ID, and keyword spotting

there are five very challenging languages that we're targeting

in phase one of RATS, the training and the test data is based on material that LDC is providing. Later phases will also test on operational data, although there won't be any training data from the operational environment


in order to produce data that is operationally relevant, LDC needed to understand a little bit about the nature of this data. Talking to the community, we understood the operational data to have a really wide range of noise characteristics

so in terms of the structural properties of the data, what we're thinking about is something like radio chatter from a taxicab driver: these radio channels are always on in the background.

ham radio data is also a good approximation of the structural properties of the data we're targeting in terms of density of talk: the turns are very short, there's very rapid back-and-forth turn-taking, there's lots of intervening silence, and there are also occasional bursts of excited speech

in terms of the types of noise of interest to the program, air traffic control transmissions are a good approximation of the type of noise that we're interested in, so we get things like static and various types of channel distortion,

and also the use of push-to-talk devices, which can introduce squelch

and so in our collection we also want to target data that's more or less understandable by a human, but nonetheless on the challenging side of the range. We want data that's challenging for a human to understand but not impossible; data that's impossible for a human to understand, we can't really pursue beyond that

in terms of the nature of the speech, we wanted it to be communicative and transactional, and ideally goal-oriented.

it may be two-party or multiparty speech, half duplex, full duplex, or even trunked, like the trunked communication that a police department uses
we are targeting narrowband, wideband, and spread spectrum,

and also a real variety of geographical and topographical environments that might affect the radio channel performance and the transmission quality, with lots of background interference as well.

the speakers may be stationary or they may be in motion, and the listening post may also be in motion: you can imagine a drone flying over a surveillance area collecting data.

and speakers may or may not know one another

so I'll skip over the overview and jump into the types of data that we're targeting.

we made some use of found data: there is some data that you can get on the web that has the sorts of noise properties we're targeting, and this is mostly shortwave transmissions. It turns out that a lot of ham radio operators post videos on YouTube of their setup; it's just a stationary image of their setup, but you get the audio track of these shortwave transmissions that they're receiving, which is really interesting.

we're also doing a limited collection of shortwave transmissions at LDC

we made fairly heavy use of existing data sets in the program, primarily because many of these data sets were already richly annotated with the features of interest.

so for instance, we used all of the exposed NIST speaker recognition test sets: this is primarily English, but they have speaker ID verification already, and we know more or less what the language is for these recordings. Similarly, we used the exposed NIST LRE test sets,

as well as several of the existing LDC corpora like CallFriend that exist in various languages and are at least partially verified for language and speaker ID, the Fisher Levantine Arabic corpus of telephone speech that has both language and speaker verification, and also some broadcast recordings where we know the language more or less but don't know, for instance, the speaker

the bulk of the data that LDC is producing for the RATS program is new data collection, either locally in Philadelphia or from vendors around the world, and this is primarily telephone speech, although we're doing some live recordings as well.

we're targeting two types of data: general conversation, and also some scenario-based recordings where people are engaged in some collaborative problem-solving task, like playing a game of twenty questions or engaging in a scavenger hunt with one another

and importantly, a fundamental keystone of our system is that we always would like to have a clean recording for purposes of manual annotation.

the idea is that this clean recording is then retransmitted in order to introduce the kinds of signal degradation that the program targets

so in order to generate that signal degradation, we developed a multi-channel radio communication collection platform. We wanted this platform to be capable of transmitting speech over radio communication links where the transmission itself introduces the type of noise conditions and signal quality variation we are interested in for the program.

the platform that we developed is capable of simultaneous transmission over up to eight different radio channels, with each channel targeting a different type and degree of noise, and again preserving the clean input channel to facilitate the manual annotation process

now there's a wrinkle here, which is that given this need to do annotation on the clean channel, we require a very careful process to align the channels and to project annotations from the clean channel onto the eight degraded channels, and that's a very challenging problem

some other design principles: we wanted the system to be able to be used for either live sessions or retransmission.

we wanted a wide range of channel types with different modulations and bandwidths, and different types of interference.

we also, wherever possible, wanted the actual components of the system to have some operational relevance, so we did some research into the kinds of radios and push-to-talk devices and that sort of thing that might actually be used in an operational environment

as for how the radio channels themselves were configured: first we selected transceivers whose ERP ranged from 0.5 to 12 watts.

the transceivers and receivers are equipped with multiple omnidirectional low-gain antennas.

the transceivers we selected are designed for half-duplex analog communication, because this is what we found was primarily used in the real-world data. And importantly, they operate on a shared-channel model, so they can be in either transmit mode or receive mode, but they can't be in both simultaneously

so this is a summary of the radio channels that we developed, and really this table is just to give you a feel for the range of transmitters and receivers, in particular the bandwidth variation and the different types of modulation that we were targeting. I'm not going to have time to go into these in too much detail

okay, so the image here is fairly complex; this is a diagram of our transmit station.

I'll walk you through the protocol for transmission briefly. We start with the transmit station control computer

there's a daemon running on the transmit station control computer that's querying the database for recordings that are available for retransmission.

when it finds a recording, the control computer initiates a remote recording on the receive station control computer, and it also initiates a local reference recording that we keep just as a baseline.

it also spawns a subprocess to drive a computer-controlled push-to-talk relay bank, which is controlled based on a signal relay output; that's this portion of the device.

when the system is in transmit mode, it begins playing the source recording output over the specified audio devices,

and the depiction of the audio devices is down here

the signal relay is configured for fast attack, long sustain, and gradual release, and there's a very wide margin around utterances; this is just to maximize the amount of speech that gets transmitted through the system.

we also introduced a single power supply and power distribution in order to avoid having battery problems with the various handsets that are part of the transmission system.

we also introduced an isolation transformer bank, which is there essentially to isolate the system from upstream electronic equipment
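The transmit protocol just described (a daemon polling a database for pending recordings, then launching the receive-side recording, a local reference recording, and the push-to-talk relay driver) can be sketched roughly as follows. This is only an illustrative sketch: the `recordings` table layout and the three `start_*` callbacks are hypothetical stand-ins, not LDC's actual components.

```python
# Rough sketch of the transmit-station daemon loop described above.
# Table schema and start_* callbacks are hypothetical, for illustration only.
import sqlite3

def poll_once(db, start_remote_recording, start_reference_recording, start_ptt_relay):
    """Claim one recording awaiting retransmission and launch the transmit-side jobs."""
    row = db.execute(
        "SELECT id, path FROM recordings WHERE status = 'pending' LIMIT 1"
    ).fetchone()
    if row is None:
        return None                       # nothing available for retransmission
    rec_id, path = row
    db.execute("UPDATE recordings SET status = 'transmitting' WHERE id = ?",
               (rec_id,))
    start_remote_recording(rec_id)        # remote recording on the receive station
    start_reference_recording(rec_id)     # local clean baseline recording
    start_ptt_relay(path)                 # subprocess driving the PTT relay bank
    return rec_id
```

A real daemon would run this in a loop with error handling and session bookkeeping; the sketch just shows the claim-then-launch ordering the talk describes.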

and the next slide shows a similar diagram for the receive station, and this is mostly just to indicate the variety of receivers that we have


so after the recordings are generated, they're uploaded to our server, and then we initiate a really lengthy post-processing sequence to align the files and also detect any regions of non-transmission, and so on

so to give you a feel for what the resulting recordings sound like, I'm going to play samples from each of the channels. First, the reference recording.

[audio plays]

okay, so channel B is a single sideband channel; this is one of the more challenging channels for the RATS systems.

[audio plays]

you can hear the distortion on channel B. Then channel H is a narrowband channel.

[audio plays]

and channel F is our frequency-hopping spread spectrum channel.

[audio plays]
these present real challenges for systems. These are actually recordings that were transmitted in their entirety; they're quite intelligible, but they take some getting used to, and there are much more difficult recordings in the data set

so after the clean signal is transmitted, we have nine resulting audio files: the clean channel and the eight degraded channels. We have a log that indicates the retransmission start time and all of the source file parameters.

we also have what is essentially a set of timestamps of the push-to-talk button-on and button-off events for each of the individual channels, and then we have the reference recording

the annotation is done on the clean channel only, and now we need to create annotations on each of the degraded channels, projected from the clean channel, as well as very accurate cross-channel alignments.

ideally we'd also like to be able to flag any segments that are impossible for humans to understand, because it's not really fair to evaluate system performance on segments that a human can't even understand

so in a perfect world this is easy, right? We start with a source recording, we've got perfect alignment on the degraded channel recordings, and we can see the regions of non-transmission very cleanly. But that's not really the way things work

in the real world we have any number of challenges on the retransmission. We have things like channel-specific lag: there is a bit of lag on some of the channels, so there's skew in the segment correspondences, and the lag is not the same offset for each channel, so we have to do some channel-specific manipulation to account for that lag

we also have things like variable duration in the non-transmission regions. These are all regions where the transmitter wasn't engaged, but you can see that for channel A the duration is shorter than for some of the other channels, so we have to account for that

we also have the occasional failure of a particular channel for a session; these are cases where a channel just wasn't engaged during the transmission

and we have the most pernicious problem, which is these channel-specific dropouts, where everything is marching along but one channel, for some reason, just conked out.

so we have to have ways to detect all of these issues, and this has been a real challenge in managing the corpus

what we've done is collaborate with the RATS performers to develop a number of techniques to help better manage the data. So Dan Ellis at Columbia developed alignment algorithms, including skewview, that identify what the initial offset for each channel should be and refine the cross-channel alignment

LDC also developed our own internal process using RMS scans to identify long non-transmission regions on the channels, and this is tuned channel by channel

the RMS scans only allow us to detect longer non-transmission regions, about two seconds or greater, and we'd really like to be able to also detect dropouts that are very short, which happen quite a bit. So the RATS community is working on a robust channel-specific energy detector, a non-transmission region detector, that can detect the shorter dropouts
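A minimal version of such an RMS-based non-transmission scan might look like the following; the frame size, energy threshold, and two-second minimum duration are illustrative parameters, not LDC's actual settings.

```python
# Sketch of an RMS scan: flag spans where frame-level RMS energy stays below
# a threshold for at least min_dur seconds (candidate non-transmission regions).
import numpy as np

def find_dropouts(signal, rate, frame=0.1, threshold=1e-3, min_dur=2.0):
    """Return (start_s, end_s) spans where frame RMS stays below threshold."""
    n = int(frame * rate)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    quiet = np.sqrt((frames ** 2).mean(axis=1)) < threshold
    spans, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i                              # quiet run begins
        elif not q and start is not None:
            if (i - start) * frame >= min_dur:
                spans.append((start * frame, i * frame))
            start = None                           # quiet run ends
    if start is not None and (len(quiet) - start) * frame >= min_dur:
        spans.append((start * frame, len(quiet) * frame))
    return spans
```

Shortening `min_dur` is exactly what makes this approach fragile: brief low-energy stretches of legitimate speech start to look like dropouts, which is why the talk mentions a more robust channel-specific detector for the short cases.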

quickly moving on to the annotation tasks: now that we have our eight channels with better alignment across the channels, we annotate.

there are five core annotation tasks. For speech activity, we're drawing audio segment boundaries on the clean channel. For LID, we're simply listening to the speech segments and judging them as in or out of the target language. For keyword spotting, we're creating a time-aligned transcript for the speech segments.

and then for the speaker ID task, we're listening to portions of all of the in-channel recordings associated with one speaker ID and verifying that it's indeed the same person

we're also, on a portion of the data, the test data in particular, doing intelligibility audits. This is where we're having our annotators, native-speaker annotators, listen to the degraded speech segments and say whether they're actually intelligible or not. This turns out to be a very hard task for humans to do, and agreement among humans on intelligibility is extremely poor

we also do audits of system outputs to identify any real problems in the annotation

the annotation release format is really simple: we've got the file metadata and then, for each of the annotations, what the annotation is and, importantly, what its provenance is. Because we're reusing some existing data and borrowing annotations from previously developed corpora, we indicate whether the annotation is newly created, whether it's a legacy annotation, or whether it's an automatic annotation, for instance from a speech activity detection system
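As an illustration only (the field names below are made up, not the actual LDC release schema), a record carrying per-annotation provenance in this style might look like:

```python
# Hypothetical example of an annotation record with per-annotation provenance,
# in the spirit of the release format described above. Field names are invented.
import json

record = {
    "file": {"id": "example_0001", "channel": "clean"},
    "annotations": [
        {"type": "speech_activity", "start": 12.3, "end": 15.8,
         "provenance": "new"},        # newly created for RATS
        {"type": "language_id", "label": "in_target",
         "provenance": "legacy"},     # borrowed from an existing corpus
        {"type": "speech_activity", "start": 20.1, "end": 22.0,
         "provenance": "automatic"},  # e.g. output of a SAD system
    ],
}

# Such a record round-trips cleanly through JSON for distribution.
assert json.loads(json.dumps(record)) == record
```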

so now we've got our annotations on the clean channel and we've got alignments across the degraded channels; now we need to project the annotations onto those degraded channels. We start out with the clean channel, where green is speech and yellow is non-speech.

we project that onto each of the degraded channels that have already been aligned.

we identify the non-transmission regions as indicated by the push-to-talk logs.

we adjust for the residual lag that happens on specific channels.

we run our RMS scans, find the files that failed transmission entirely, and exclude those from the corpus

and then finally, we run our energy detectors, the non-transmission detectors, and find any segments where the push-to-talk button logs say there was a transmission but there's actually no signal. We flag those, and now we have annotations for each of the degraded channels as well
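The projection steps above can be sketched as follows. This is a simplified illustration under the assumption of a single known per-channel lag and known push-to-talk spans, not LDC's actual pipeline.

```python
# Simplified projection of clean-channel segments onto one degraded channel:
# shift each segment by the channel's lag, then mark segments that fall
# entirely outside the push-to-talk transmission spans as NT (no transmission).
def project_segments(segments, lag, ptt_spans):
    """segments: (start, end, label) tuples in seconds on the clean channel."""
    projected = []
    for start, end, label in segments:
        s, e = start + lag, end + lag
        transmitted = any(s < p_end and e > p_start for p_start, p_end in ptt_spans)
        projected.append((s, e, label if transmitted else "NT"))
    return projected
```

A fuller version would also split segments that straddle a push-to-talk boundary and apply the automatic dropout flags; the sketch just shows the shift-then-mask idea.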

so as a result, for each segment in each file we have one of five values. S means there was a transmission of speech; NS means there was a transmission of non-speech; T means there was a transmission but it hasn't been labeled as to whether it contains speech or not; NT means there was no transmission; and then there's the RX setting, which means we detected a transmission failure automatically
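The five segment values can be captured as a small enumeration; the descriptions here are paraphrased from the talk.

```python
# The five per-segment values described above, as a simple enumeration.
from enum import Enum

class SegmentLabel(Enum):
    S = "transmission containing speech"
    NS = "transmission containing non-speech"
    T = "transmission, not labeled for speech"
    NT = "no transmission (push-to-talk off)"
    RX = "transmission failure detected automatically"
```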

okay, now quickly moving to the speaker ID evaluation in particular. This evaluation is just getting underway; the dry run evaluation is actually happening next week.

for SID we're defining a progress set, which is two hundred fifty speakers with ten sessions for each speaker. Nominally this is fifty speakers per language, although it won't actually play out that way. Six of the sessions per speaker are going to be sequestered by the evaluation team, which is SAIC;

the other four sessions per speaker are used for test

there's a dev test set that has the same characteristics as the progress set,

and then there's an additional general-use data set, which is two hundred fifty speakers that have two sessions each, and the performers can do whatever they like with this general-use data

SID within RATS is being evaluated in an open-set paradigm: systems need to provide independent decisions about each of the target speakers from the candidate speakers, without any knowledge of which impostors are in the test data.

not all speakers in the test will be enrolled; some test samples will be used as impostors for the other trials.

and the performers have agreed to avoid using the enrollment samples for any purpose other than target speaker enrollment, so they can only be used for training in the trials involving that speaker

we also distribute the NIST SRE data as background and modeling data for the performers, and that data has been pushed through the retransmission system

so far we've delivered something like fifteen hundred single-speaker calls; these are people who started out with the goal of making ten calls and dropped out of the collection, and most people do drop out: about ninety percent of the people drop out of the collection. Right now we have a hundred and thirty-seven speakers that have two to nine calls each and a hundred eighty-three speakers that have ten calls each, and our goal is, again, two hundred fifty speakers with ten calls plus another two hundred fifty that have two.

this slide just summarizes the total amount of data processed through the RATS system to date; this is about a month out of date, so I think we can add five hundred to the bottom line here

so we've transmitted over three thousand hours, probably closer to thirty-five hundred hours now, of source data, yielding about sixteen thousand hours or more of degraded audio channels. This includes four hundred hours of data labeled for speech activity detection, seven hundred twenty hours labeled for language ID, and about four hundred hours for keyword spotting


I'll come to the conclusion since I'm running out of time. In summary, over the past year and a half or so, LDC has designed and developed this multi-radio-channel collection platform. We've undertaken a very large-scale data collection, including retransmission and annotation, of five very challenging languages.

we've retransmitted over three thousand hours of clean signal data, generating more than sixteen thousand hours of corresponding degraded channel data.

we've developed, independently and also with lots of input from the RATS performers, several algorithms to improve the overall quality of the transmitted data,

and we've supported lots of requests for new kinds of annotation and collection

the dry run evaluation is starting next week, and people are very nervous; this is really hard data. I'm very eager to see what happens.

and thank you





we would have liked to put the receivers, the listening post, in a moving vehicle and look at that transmission scenario, but we don't have the funding to support that model. So the transmitters and receivers are at LDC, about thirty meters apart, but there are significant structural barriers in between the transmit and the receive station; the core of the building is between the transmit and receive stations. That's the best we could do with the resources available.

we are pursuing for phase two a novel channel selection that may involve placing the listening post in a more remote location, or even doing some of the collection with the listening post in motion