0:00:15speaker that's probably collection system
0:00:38okay i'm presenting this on behalf of kevin walker who
0:00:41wasn't able to attend
0:00:43due to a very normal aversion to a sixteen hour plane ride
0:00:49so i'm going to see how well this is a kind of a departure from
0:00:52the other talks in the session and the conference as a whole but i think it's
0:00:56of interest to this community nonetheless so i'm going to briefly describe the rats program
0:01:01and its goals
0:01:03and then really delve into the data creation process for rats then talk a little
0:01:08bit about how we're generating the content that's used in the rats program the system
0:01:15that we built to produce degraded audio quality recordings for the program
0:01:22talk a bit about the annotation process and then focus on some details of the
0:01:25speaker id evaluation that's just about to start
0:01:28within rats
0:01:31so by way of introduction rats is a three year darpa program
0:01:36that's targeting speech in extremely noisy and highly distorted channels
0:01:42specifically it's targeting noise not background noise
0:01:46but noise in the signal sort of
0:01:51radio transmissions is
0:01:53and of the target kind of
0:01:54there are four evaluation tasks within rats speech activity detection language id speaker id and keyword spotting
0:02:01there are five very challenging languages that we're targeting
0:02:04in phase one of rats the training and the test data is based on material
0:02:09that ldc is providing later phases will also test
0:02:15on operational data although there won't be any training data from the operational environment
0:02:22in order to produce
0:02:25data that is operationally relevant ldc needed to understand a little bit about the nature
0:02:30of this data so talking to the community we understood the operational data to have
0:02:37a really wide range of noise characteristics
0:02:40so from the
0:02:41structural properties of the data what we're thinking about is something like
0:02:46radio chatter from a taxicab driver
0:02:50this radio channels they're always out of the background and
0:02:52you're calibrateds
0:02:54or also sort of ham radio data that's a good approximation of the structural properties
0:02:59of the data we're targeting in terms of density of talk how long the turns
0:03:02are they're very short there's very rapid back and forth turn-taking there's lots of
0:03:08intervening silence and there are also occasional bursts of excited speech
0:03:12in terms of the types of noise of interest to the program air traffic control
0:03:17transmissions is a good approximation of the type of noise that we're
0:03:21interested in so we get things like side and steering various types of channel
0:03:27and also the use of push-to-talk devices which can introduce squelch
0:03:32and so in our collection we also want to target data that's more or less
0:03:37understandable by a human
0:03:39but nonetheless
0:03:41side of the range so we want data that's challenging for a human to understand but
0:03:45not impossible if it's impossible for a human
0:03:48you know we can't really pursue it beyond that
0:03:51in terms of the nature of the speech we want it to be communicative and transactional and
0:03:56ideally goal oriented
0:03:58it may be two party or multiparty speech half duplex full duplex or even try
0:04:04so like a asr stands take communication that a police department use
0:04:10what we are targeting narrowband wideband and spread spectrum
0:04:16and also a real variety of geographical and topographical environments that might affect the radio
0:04:23channel performance and the transmission quality
0:04:26with lots of that
0:04:28around interference as well
0:04:30the speakers may be stationary or they may be in motion and the listening post
0:04:35may also be in motion so you can imagine a drone flying overhead
0:04:39surveillance area collecting data
0:04:41and also speakers may know one another
0:04:44so skipping over the overview to jump into the types of data that we're targeting
0:04:49so we made some use of found data so there is some data
0:04:53that you can get on the web that has the sorts of noise properties we're targeting
0:04:57in rats this is mostly shortwave transmissions
0:05:00in that a lot of ham radio operators
0:05:04post videos on youtube of their setup and so it's just a stationary
0:05:09image of their setup but you get the audio track of these shortwave transmissions
0:05:13that they're receiving
0:05:16they're really interesting
0:05:18we're also doing limited collection of sort of short wave transmissions at ldc
0:05:23we made fairly heavy use of existing data sets
0:05:27in the rats program primarily because many of these data sets were already richly annotated with the
0:05:32features of interest
0:05:34so for instance which are all of the exposed nist speaker recognition test sets this is
0:05:39primarily english but they have speaker id verification already and we
0:05:45know more or less what the language is for these recordings similarly we use the exposed nist
0:05:50lre test sets
0:05:52also several of the existing ldc corpora like callfriend that exist in various languages
0:05:58and it is just partially verified for language and speaker id the fisher levantine arabic
0:06:04corpus of telephone speech that has both language and speaker verification
0:06:09and also some broadcast recordings where we know the language more or less but don't
0:06:15for instance know the speaker
0:06:17the bulk of the data that ldc is producing for the rats program is new data
0:06:21collection either locally in philadelphia or from vendors around the world and this is primarily telephone
0:06:27speech although we're doing some live recordings as well
0:06:31we're targeting two types of data general conversation simulators and also some scenario based recordings
0:06:38where people are engaged in some collaborative problem solving task like playing a game of
0:06:41twenty questions
0:06:43or engaging in a scavenger hunt with one another
0:06:46and importantly a fundamental keystone of our system is that we always would like to
0:06:53have a clean recording for purposes of manual annotation
0:06:59and then our idea is that this clean recording is retransmitted
0:07:03in order to introduce the kinds of signal degradation that the program targets
0:07:08so in order to generate that signal degradation we developed a multi
0:07:13communication channel collection platform we wanted this platform to be capable of transmitting speech over
0:07:19radio communication links where the transmission itself introduces the type of noise conditions and signal
0:07:26quality variation that are of interest in the program
0:07:30the platform that we developed is capable of simultaneous transmission of up to eight different
0:07:36radio channels with each channel targeting a different type and degree of noise
0:07:42and again preserving the clean input channel to facilitate the manual annotation process
0:07:48now there's a wrinkle here which is that given this need to do annotation
0:07:54on the clean channel this requires
0:07:57a very careful process to align
0:07:59and to project annotations from the clean channel onto the eight degraded channels and
0:08:06that's a very challenging problem
0:08:09some other design principles
0:08:12we wanted the system to be able to be used for either live sessions were
0:08:17we want a wide range of channel types with different modulations bandwidths
0:08:22different types of interference
0:08:25we also wherever possible wanted the actual components of the system to have some
0:08:29operational relevance we did some research into the kinds of
0:08:33and you know push-to-talk devices and that sort of thing that might be actually used
0:08:39in operational environment
0:08:41the radio channels themselves were configured well first we selected transceivers
0:08:47whose E R P ranged from point five to twelve watts
0:08:51both the transceivers and receivers are equipped with multiple omnidirectional low gain antennas
0:08:58and the transceivers we selected are designed for half duplex analog communication also because this
0:09:04is what we found was primarily used
0:09:07in the real world data
0:09:09and importantly they operate on a shared channel model so they can either be in
0:09:13transmit mode or receive mode but they can't
0:09:15be in both simultaneously
0:09:18so this is some of the radio channels and that we developed and really that
0:09:24this table is just to give you a feel for
0:09:27the range of transmitters and receivers and in particular the bandwidth variation and the different
0:09:33types of modulation that we were targeting not gonna have time to go into these
0:09:38into too much detail
0:09:40okay so the image here is fairly complex and this is the case are transmit
0:09:47so i'll just run through the protocol for transmission briefly so we start with a
0:09:51wrong transmit control computer
0:09:58there's a daemon running on the transmit station control computer that's querying the database
0:10:02for recordings that are available for retransmission
0:10:06when it finds a recording the control computer initiates a remote recording on the receive station
0:10:13control computer
0:10:15and it also initiates a local reference recording
0:10:18that we have just as a baseline
0:10:22it also spawns a subprocess to drive a computer-controlled push-to-talk relay bank
0:10:29and that is controlled based on a signal relay output so that's this portion of
0:10:36the device
0:10:41when the system's in transmit mode it begins playing the source recording output
0:10:47over the specified audio devices
0:10:49and the depiction of the
0:10:50audio devices is down here
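The transmission protocol described here (find a pending recording, start the remote recording on the receive station, start the local reference recording, start the push-to-talk relay subprocess, then play the source audio) can be sketched roughly as below. This is only an illustration of the ordering of steps as described in the talk. The names `LoggingStation` and `run_once` are hypothetical, the pending list stands in for the database query, and nothing here reflects the actual ldc software.

```python
class LoggingStation:
    """Hypothetical stand-in for the transmit station; it only records calls."""

    def __init__(self):
        self.log = []

    def start_remote_recording(self, rec_id):
        self.log.append(("remote_recording", rec_id))

    def start_reference_recording(self, rec_id):
        self.log.append(("reference_recording", rec_id))

    def start_ptt_relay(self, rec_id):
        self.log.append(("ptt_relay", rec_id))

    def play(self, rec_id):
        self.log.append(("play", rec_id))


def run_once(pending, station):
    """Process at most one pending recording, in the order given in the talk.

    pending is a plain list standing in for the database of recordings that
    are available for retransmission (an assumption of this sketch).
    Returns the id that was processed, or None if nothing was pending.
    """
    if not pending:
        return None
    rec_id = pending.pop(0)
    station.start_remote_recording(rec_id)     # recording on the receive station
    station.start_reference_recording(rec_id)  # local baseline recording
    station.start_ptt_relay(rec_id)            # a subprocess in the real system
    station.play(rec_id)                       # source audio out to the radios
    return rec_id


station = LoggingStation()
queue = ["rec1", "rec2"]
run_once(queue, station)
print(station.log)
```

A real daemon would call something like `run_once` in a polling loop and would also need to stop the recordings and release the relay when playback ends, which this sketch leaves out.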
0:10:53the signal relay is configured for
0:10:55fast attack
0:10:56long sustain gradual release
0:10:59and there's a very wide
0:11:01rather utterances and this is just to sort of maximize the amount of speech that gets transmitted
0:11:06through the system we also introduced a single power supply and power distribution to
0:11:13avoid having battery problems with the various handsets that are part of the transmission system
0:11:19we also introduced an isolation transformer bank
0:11:23which is here essentially to isolate the system from upstream electronic equipment
0:11:30and the next slide shows you sort of a similar diagram for the receive station
0:11:34and this is mostly just to indicate the variety of receivers that we have
0:11:41so after recordings are generated
0:11:45essentially they're uploaded to our server and then we initiate this really lengthy post
0:11:49processing sequence
0:11:50to align the files and also detect any regions of non transmission a compact and
0:11:56so to give you a feel for what the resulting recordings sound like i'm going to play
0:12:02some samples from each of the channels
0:12:04so first we start with channels can be is evaluated have channels
0:12:10oh and the reference recording first
0:12:19there's channel i
0:12:38okay so channel D is our single sideband channel this one is one of the more
0:12:42challenging channels for the rats systems
0:12:54the distortion channel B and then channel H is a narrowband that
0:13:07channel is another
0:13:19and then are
0:13:31okay and channel F is our frequency hopping spread
0:13:42channel i
0:13:53right so there are real challenges here these are actually recordings that were transmitted in their entirety
0:13:59these are
0:14:00quite intelligible but they take some getting used to there are much more difficult
0:14:06recordings in the set of data
0:14:09so after
0:14:11the clean signal's transmitted we have nine resulting audio files
0:14:15the clean channel and the eight degraded channels we have a right
0:14:18slide that indicates the retransmission start time
0:14:21and all the source file parameters
0:14:23we also have what we call a slot which is essentially timestamps on the push
0:14:27to talk button on and off of that some for each of the individual channels
0:14:32and then we have the reference
0:14:34annotation but on the clean channel only and now we need to create annotation
0:14:39on each of the degraded channels
0:14:42projected from the clean channel as well as very accurate cross channel alignments
0:14:47ideally we'd also like to be able to flag any segments that are impossible for
0:14:52humans to understand
0:14:53it's not really fair
0:14:54to evaluate system performance on
0:14:57segments that a human can't even understand
0:15:00so in a perfect world this is easy right so we start with a
0:15:04source recording
0:15:06we've got perfect alignment on the degraded channel recordings
0:15:12and we see the regions of non transmission very cleanly
0:15:15but that's not really the way things work
0:15:21in the real world we have any number of challenges on the retransmission so we
0:15:25have things like channel specific lag
0:15:28there is a bit of lag on
0:15:30some of the channels
0:15:31so there's a delay in the segment correspondences
0:15:34and it's not
0:15:35delayed by the same
0:15:37offset for each channel and so we have to do some channel specific
0:15:41manipulation to account for that lag
0:15:43we also have things like
0:15:46to read in the non transmission regions
0:15:48so these are all regions where the transmitter wasn't engaged but you can see
0:15:54that for channel and a the duration is shorter
0:15:58than for some of the other channels
0:15:59so we have to account for that
0:16:01we also have the occasional failure on a particular channel for sessions so here are cases
0:16:06where
0:16:08channels just weren't engaged during the transmission
0:16:11and we have the most pernicious problem which are these channel specific dropouts
0:16:17where everything's marching along just fine and for some reason a channel just conked out
0:16:23and so we have to have ways to detect all of these issues
0:16:26this is a real challenge in managing the corpus
0:16:29what we've done is collaborate with the rats performers to develop a number of techniques
0:16:34to help better manage the data so dan ellis at columbia has developed two
0:16:39algorithms skewview and find skew that identify what the initial offsets for each channel should be
0:16:47for the cross-channel alignment
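To make the idea of an initial per-channel offset concrete, here is a minimal sketch that estimates the lag of a degraded channel against the clean reference by brute-force cross-correlation. This is a generic textbook technique chosen for illustration and is not the algorithm developed at columbia. The function name `best_offset` and the toy signals are assumptions of the sketch.

```python
def best_offset(reference, degraded, max_lag):
    """Return the lag in samples, searched over -max_lag..+max_lag, at which
    degraded best matches reference under raw cross-correlation.

    A positive result means the degraded signal is delayed relative to the
    reference. Both inputs are plain sequences of numbers.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += r * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


clean = [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]
delayed = [0, 0, 0, 0, 0, 1, 2, 3, 0, 0]  # same pulse, three samples later
print(best_offset(clean, delayed, max_lag=5))
```

On real radio channels the degraded waveform is heavily distorted, so a practical tool would more likely correlate smoothed energy envelopes or spectral features, and would track drift over time as well as a single initial offset.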
0:16:52ldc also developed our own internal process using rms scans
0:16:59to identify long non transmission regions on the channels
0:17:03and this
0:17:07this is sort of two and channel four channel
0:17:09the rms scan only allows us to detect longer non transmission regions about two seconds or greater
0:17:16and we'd really like to be able to also detect
0:17:19dropouts that are very short the sound quite a bit and so the rats community
0:17:25is working on a robust
0:17:27channel specific energy detector non transmission region detector
0:17:31that can detect these shorter dropouts
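A rough sketch of what an rms scan for long non transmission regions might look like is given below. It computes rms energy over fixed frames and reports runs of low-energy frames that last at least two seconds, matching the limit mentioned in the talk. The frame length, the threshold, and the function names are assumptions of this sketch, and the actual ldc process is not described in enough detail to reproduce.

```python
import math


def rms_frames(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame of frame_len samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def find_non_transmission(samples, sample_rate, frame_sec=0.1,
                          threshold=0.01, min_dur_sec=2.0):
    """Return (start_sec, end_sec) spans whose frame RMS stays below
    threshold for at least min_dur_sec."""
    frame_len = max(1, round(sample_rate * frame_sec))
    energies = rms_frames(samples, frame_len)
    regions, run_start = [], None
    # The trailing infinity is a sentinel that closes a run ending at the
    # last frame of the file.
    for i, energy in enumerate(energies + [float("inf")]):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            start, end = run_start * frame_sec, i * frame_sec
            if end - start >= min_dur_sec:
                regions.append((start, end))
            run_start = None
    return regions


# Toy signal at 100 samples per second: 1 s of signal, 3 s of silence,
# 1 s of signal, 0.5 s of silence (too short to report), 1 s of signal.
signal = [0.5] * 100 + [0.0] * 300 + [0.5] * 100 + [0.0] * 50 + [0.5] * 100
print(find_non_transmission(signal, sample_rate=100))
```

The half second gap in the toy signal is deliberately ignored, which illustrates why a separate detector is needed for the short dropouts. On a real receiver the non transmission regions contain channel noise rather than digital silence, so a fixed threshold like the one here would need per-channel tuning.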
0:17:34quickly moving on to the annotation tasks that are right channels better annotation sre better
0:17:41alignment across the channels now we annotate
0:17:44so there are five core annotation tasks
0:17:47for speech activity we're creating an audio segmentation on the clean channel for lid
0:17:53we're simply listening to the speech segments and judging them as in or out of
0:17:57the target language for keyword we're creating a time aligned
0:18:01transcript for the speech segments
0:18:03and then for the speaker id task we're listening to portions of all internal in
0:18:08channel recordings associated with one speaker id and verifying that it's indeed the same person
0:18:14we're also on a portion of the data the test data in particular
0:18:18doing an intelligibility audit so this is where we're having our annotators native speaker annotators
0:18:23listened to the degraded recording segments
0:18:26the speech segments and saying whether they're actually intelligible or not and this turns out
0:18:31to be a very hard task for humans to do and agreement among humans on
0:18:35intelligibility is extremely poor
0:18:37we also do adjudication of system outputs to identify any real problems in the annotation
0:18:45annotation release format is really simple we've got the final metadata and then for each
0:18:50of the annotations what the annotation is
0:18:53and then importantly what its provenance is because reusing some existing data and sort of
0:18:58borrowing annotations from previously developed corpora we indicate whether the annotation is newly created whether
0:19:07it's a legacy annotation or whether it's an automatic annotation for instance from a speech
0:19:11activity detection system
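One way to picture the release format described here is as a record that carries the annotation together with its provenance. The sketch below is only illustrative: the field names, the `Provenance` enum, and the example values are assumptions, and the talk does not specify the actual file layout.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    """The three provenance categories mentioned in the talk."""
    NEW = "newly created"
    LEGACY = "legacy annotation from an existing corpus"
    AUTOMATIC = "automatic, for example from a speech activity detector"


@dataclass(frozen=True)
class Annotation:
    file_id: str          # hypothetical identifier for the audio file
    start_sec: float
    end_sec: float
    value: str            # what the annotation is, for example a label
    provenance: Provenance


def by_provenance(annotations, provenance):
    """Return only the annotations with the given provenance."""
    return [a for a in annotations if a.provenance is provenance]


records = [
    Annotation("file_001", 0.0, 2.5, "speech", Provenance.NEW),
    Annotation("file_001", 2.5, 4.0, "non-speech", Provenance.AUTOMATIC),
    Annotation("file_002", 0.0, 3.0, "speech", Provenance.LEGACY),
]
print(len(by_provenance(records, Provenance.NEW)))
```

Keeping provenance on every record lets a user of the corpus decide, for instance, to score only against newly created manual annotations.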
0:19:14so now we've got our annotations on the clean channel we've got our alignments across the
0:19:18degraded channels now we need to project the annotations onto those degraded channels we start
0:19:23out with green is
0:19:25speech yellow is non-speech
0:19:28we project that onto each of the degraded channels that have already been aligned
0:19:33we identify the non transmission regions as indicated by the push to talk log
0:19:39we adjust for the lag that happens on
0:19:43specific channels
0:19:46we run our rms scans and find the files that failed transmission entirely and exclude
0:19:51those from the corpus
0:19:53and then finally we run our energy detectors our non transmission detectors and find
0:19:59any segments where
0:20:01the push to talk button log says there was a transmission but actually there's
0:20:05no signal
0:20:06and so we select those and now we have annotations for each of the degraded
0:20:11channels as well
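The projection step can be sketched as shifting each clean-channel segment by the channel's lag and keeping only the parts that fall inside the intervals where the push to talk log says the transmitter was engaged. This is a simplified illustration under assumed data shapes (tuples of seconds and a single constant lag per channel). The function name `project_segments` is hypothetical.

```python
def project_segments(clean_segments, channel_lag, transmit_spans):
    """Project clean-channel segments onto one degraded channel.

    clean_segments: list of (start_sec, end_sec, label) on the clean channel
    channel_lag:    seconds by which this channel lags the clean channel
    transmit_spans: list of (start_sec, end_sec) on the degraded channel's
                    timeline during which the transmitter was engaged
    Returns the shifted segments clipped to the transmit spans.
    """
    projected = []
    for start, end, label in clean_segments:
        shifted_start, shifted_end = start + channel_lag, end + channel_lag
        for t_start, t_end in transmit_spans:
            lo, hi = max(shifted_start, t_start), min(shifted_end, t_end)
            if lo < hi:  # keep only non-empty overlaps
                projected.append((lo, hi, label))
    return projected


clean = [(0.0, 2.0, "S"), (2.0, 5.0, "NS")]
spans = [(0.0, 3.0), (4.0, 10.0)]
print(project_segments(clean, channel_lag=0.5, transmit_spans=spans))
```

Anything on the degraded channel's timeline that falls outside the transmit spans is simply dropped here. In the labeling scheme used in the corpus those gaps receive their own no transmission label.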
0:20:12so as a result for each file for each segment we have one of five values
0:20:19we have S for speech
0:20:21there was a transmission of speech N S is there was a transmission of non-speech T
0:20:27is there is a transmission but it hasn't been labeled as to whether it contains speech
0:20:31or not
0:20:31N T is there was no transmission and then this R X
0:20:35setting which is
0:20:36we detected a transmission failure automatically
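The five segment values can be written down as a small enumeration with a function that picks one value per segment. The two-letter names for the non-speech and no transmission values are a reading of the garbled audio, and the precedence used below (no transmission first, then automatic failure, then unlabeled, then the clean-channel label) is an assumption of this sketch and not something stated in the talk.

```python
from enum import Enum


class Label(Enum):
    S = "transmission, speech"
    NS = "transmission, non-speech"
    T = "transmission, not labeled for speech"
    NT = "no transmission"
    RX = "transmission failure detected automatically"


def label_segment(transmitting, failure_detected, clean_label):
    """Choose one of the five values for a segment on a degraded channel.

    transmitting:     the push to talk log says the transmitter was engaged
    failure_detected: an automatic detector found no signal in the segment
    clean_label:      "S", "NS", or None if the clean channel has no
                      speech annotation for this segment
    """
    if not transmitting:
        return Label.NT
    if failure_detected:
        return Label.RX
    if clean_label is None:
        return Label.T
    return Label[clean_label]


print(label_segment(True, False, "S").name)
```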
0:20:41okay now quickly moving to the sid eval in particular this evaluation
0:20:46is just getting underway the dry run evaluation is actually happening next week
0:20:52for sid we're defining a progress set which is two hundred fifty speakers
0:20:57with ten sessions for each speaker nominally this is fifty speakers per language although it
0:21:02won't actually play out that way six of the sessions per speaker are going to be
0:21:07sequestered by the evaluation team which is S A I C and will be used for enrollment
0:21:11the other four sessions per speaker are used for test
0:21:15there's a dev test set that has the same characteristics as the progress set
0:21:20and then there's this additional general use dataset which is two hundred fifty speakers that
0:21:24have at least two sessions each
0:21:26and the performers can do whatever they like with this general use set
0:21:31sid within rats is being evaluated as an open set
0:21:34paradigm systems need to provide independent decisions about each of the target speakers
0:21:39from the candidate ten percent candidate speakers without any
0:21:44of the impostors are in the test data
0:21:48all speakers in the test will be enrolled in the test some samples will be
0:21:52used as impostors
0:21:54for the other trials
0:21:56and the performers need to have agreed to avoid using the enrollment samples for any
0:22:00purpose other than the target speaker enrollment so they can't be used for training
0:22:05in the trials involving that speaker
0:22:08we also distribute the nist sre data as sort of background modeling data
0:22:14for the performers and that data has been pushed through the retransmission system
0:22:19so far we've delivered something like fifteen hundred
0:22:23single speaker calls these are people who started out with the goal of making calls
0:22:28and dropped out of the collection so most people drop out
0:22:31ninety percent of the people drop out of the collection after one call we've got a hundred and
0:22:36thirty seven speakers that have two to nine calls each
0:22:41hundred eighty three speakers that have ten calls each and our goal is again two
0:22:47hundred fifty speakers with at least two calls and another two hundred fifty that have ten
0:22:54the slide
0:22:55just summarizes the total amount of data we've processed through the rats system to
0:22:58date so this is about a month out of date so i think
0:23:03we can add five hundred to the bottom line here
0:23:05so we've transmitted over three thousand hours probably closer to thirty five hundred hours now
0:23:11of source data yielding about sixteen thousand hours or more of degraded audio channels and
0:23:17this includes
0:23:19four hundred hours of data labeled for so
0:23:21i seven hundred twenty labeled language id and about four hundred hours of keyword spotting
0:23:28i'll come to the conclusion since i'm running out of time so in summary over
0:23:33the past
0:23:34you have i guess ldc has designed and deployed this multi radio channel collection
0:23:39platform we've undertaken a very large scale data collection including retransmission an annotation
0:23:46of five very challenging languages
0:23:48we've retransmitted over three thousand hours of data yielding more than sixteen thousand hours
0:23:52of degraded signal
0:23:55see that over fifteen thousand hours of clean signal data and generated corresponding degraded channel
0:24:02we've developed independently and also with lots of input from the rats performers several
0:24:07algorithms to improve the overall quality of the transmitted data
0:24:11we've supported lots of new requests for new kinds of annotation and collection
0:24:16this is dry run evaluation is starting next week and people are very nervous and
0:24:21this is really our data
0:24:23i'm very eager to see what else
0:24:27and thank you
0:24:56we would like to
0:24:59the receivers the listening post in a moving vehicle looking at that time assessment or
0:25:04something but we don't have the funding to support that model so the transmitters and
0:25:10receivers are at ldc they're about thirty meters apart but there are significant structural barriers
0:25:16in between the transmit and the receive station
0:25:20so there's
0:25:21like the core of the building is between the transmit and receive station that's the best
0:25:25we could do with the resources available we are pursuing for phase two a
0:25:29novel channel collection that may involve placing the listening post
0:25:35in a more remote location
0:25:37or even doing some extra collection with the listening post in motion