I'm Mary Harper, and in October 2010 I went to IARPA to develop this program, Babel. I say "Babel," but there are lots of ways to pronounce Babel, and you can actually go to this website and find out about all the ways of saying it; a lot of people like this example. For me, Babel is /ˈbæbəl/: I grew up in Buffalo, New York, where that was the dialectal variant of choice. Of course, there's also the original Hebrew word and a variety of other ways of pronouncing it as well. And Morgan pointed out yet another pronunciation, one that, for some reason, I hadn't heard before.
Okay, so every program, whether it's at DARPA or IARPA, has a sort of back story; you have to have a motivation, sort of an elevator speech. My challenge is this: you're in a situation where you're dealing with a crisis. It might be an event where you have to work through a lot of noisy speech in order to resolve the situation, and you have thousands of hours of audio and no time to listen. You might have one or two people who could listen to it, but you're certainly not going to get through it in any time frame that would be reasonable for helping people. And if you have no existing speech technology for that language, you have a problem. But if you could rapidly develop that technology, say in a day or two, you actually might be able to do something.
It sort of addresses two gaps. It's hard to build up the human capital in a language, because that can take years, and typically we have only one or two people who know a given language; we see, even just in developing the resources, that we don't have this language capital. And there's also a technology gap. This slide was done a number of years ago, but of the three hundred ninety-three languages that have a million or more speakers, we've touched very few. We've really only studied a handful repeatedly; I mean, we study English all the time because it's easy, there are corpora, and so on. It can take way too much time, months to years, to build a new language, especially if you have to transcribe the audio. And the systems developed for English don't always carry over well to other languages: they can help with the bootstrap, but they certainly don't give you the kind of error rates that someone might want to see.
So the basic idea underlying Babel: rather than just evaluate word error rate, because the director of IARPA was very adamant that she wanted a real task, not just transcription, we settled on the keyword search task, which fortunately had had an evaluation in 2006. The basic idea is that you use speech recognition, or phone recognition, or something else to index the thousands of hours of audio, and then you have some way of putting in a query. For Babel we use orthographic queries, and those who are doing low-resource work do other things with the data to accommodate the fact that we use orthographic queries. Then we evaluate whether or not the keyword was correctly identified in the audio.
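To make the task concrete, here is a minimal sketch of that pipeline in Python. It builds an inverted index from time-stamped word hypotheses with posterior scores and answers an orthographic query by chaining word hits; the data layout and function names are illustrative assumptions, not any Babel system's actual interface (real systems indexed lattices or confusion networks, but the shape of the task is the same).

```python
from collections import defaultdict

def build_index(hypotheses):
    """hypotheses: iterable of (utt_id, word, start_sec, dur_sec, posterior)
    taken from the recognizer's output."""
    index = defaultdict(list)
    for utt, word, start, dur, post in hypotheses:
        index[word.lower()].append((utt, start, dur, post))
    return index

def search(index, query, gap_sec=0.5):
    """Orthographic query; multi-word queries require the word hits to occur
    in order, with at most gap_sec between consecutive words."""
    words = query.lower().split()
    hits = [(u, s, s + d, p) for (u, s, d, p) in index.get(words[0], [])]
    for w in words[1:]:
        extended = []
        for utt, start, end, score in hits:
            for u2, s2, d2, p2 in index.get(w, []):
                if u2 == utt and 0.0 <= s2 - end <= gap_sec:
                    extended.append((utt, start, s2 + d2, score * p2))
        hits = extended
    return hits  # putative occurrences: (utt_id, start_sec, end_sec, score)
```

Each putative hit is then thresholded into a yes/no decision, and it is those decisions that the metric described below scores.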
Our approach is really to work with a wide variety of languages, not just European ones; I think it's really important to study languages that cover a wide variety of properties, in real recording conditions as much as possible. Obviously the collections are going to suffer from the fact that when you go into these countries you may not be able to record in a highly reverberant room or something, but the hope is that you can get these sort of real-world recording situations. Then we constrain the resources in various ways: we actually collect a lot of data, but we create a wide variety of conditions for people to evaluate, and they can create conditions as well to answer questions that they think are important, like getting by without a lexicon. So we gradually reduce the amount of transcribed speech we give them for training, but we also give them the audio untranscribed. We are also reducing the amount of time that they have to work on the surprise language, and I think that's critical, as is not starting off with something that's impossible at the outset; actually getting people to the point where they can develop the technology is extremely important here. And we set the targets to be roughly a three-times improvement over what you get with phonetic search, which I think was critical. That was based on the STD '06 BBN results on Cantonese and Mandarin, where they got roughly 0.3 ATWV, so we set that as the target level.
So the goal is to improve speech technology with limited amounts of ground-truth data, only the speech; building systems for non-English languages is extremely important. It's about improving speech recognition through innovative use of the technology and different approaches, across a wide variety of languages, so that you can get fast development of keyword search systems to tackle this problem.
Just to give you a sense of the layout of the program: other than the base year, which ran a little longer than nine months because it was a fifteen-month period, the teams have roughly nine months to work with the data, and the collections are not necessarily all there on day one. Then the evaluation starts, where they have one month to do keyword search on the practice languages. We evaluate everything; it's really important to understand what progress is being made on the different languages, because the languages are all different. And then we give them a surprise language, where we deliver the pack of data (I'll talk about that a little bit later), and they have a certain number of weeks to build their system, which decreases over the periods: in the base period it was four weeks, and in the option one period they have three weeks. Then they have one week to return their keyword search results. You might ask why a whole week: there's a lot of research on search and evaluation methods and on how people handle the keywords, so it is important to leave a sufficient amount of time there as well.
The measure that we're using for performance is the Actual Term-Weighted Value, which was developed by NIST, I think in coordination with a number of sponsors of that evaluation. It captures a use case where you've got people who would like to be able to find things and who don't tolerate a great number of false alarms, so you wouldn't want to use F-score. The other consideration is rare terms, given the Zipfian nature of language, and the fact that rare terms may be very useful for finding things that are critical. I mean, "tsunami" might be a very common term in the traffic you're collecting, but it may not have been there in your training data, and you want to be able to find the "tsunami" instances in your audio. What you have to realize is that the metric is term-weighted: it's averaged over all terms regardless of their frequency, so a singleton term counts the same in the score as something that is highly frequent. Then there's the beta in the TWV formula: by and large, you've got this heavy weighting on the probability of false alarm. Systems typically have very low probability of false alarm, so you can see that there's a tradeoff between those two quantities, but missing something really does hurt the score when there are singletons. That's something you want to keep in mind as you look at the results that I'm going to go through.
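For reference, the metric just described can be written down compactly. The sketch below follows the published NIST definition: TWV is one minus the average, over all keywords that actually occur in the audio, of the miss probability plus beta times the false alarm probability, with beta = 999.9; ATWV is this value at the system's actual decision threshold. The function name and dictionary layout are mine, for illustration only.

```python
BETA = 999.9  # NIST's weighting, derived from a cost/value ratio of 0.1
              # and an assumed keyword prior of 1e-4

def atwv(keyword_counts, t_speech_sec):
    """keyword_counts: dict kw -> (n_true, n_correct, n_false_alarm),
    counted at the system's actual yes/no threshold. Keywords that never
    occur in the audio are excluded from the average, which is why a missed
    singleton costs as much as missing every token of a frequent term."""
    terms = []
    for n_true, n_corr, n_fa in keyword_counts.values():
        if n_true == 0:
            continue
        p_miss = 1.0 - n_corr / n_true
        # Non-target trials are approximated as one per second of speech.
        p_fa = n_fa / (t_speech_sec - n_true)
        terms.append(1.0 - (p_miss + BETA * p_fa))
    return sum(terms) / len(terms) if terms else 0.0
```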
The Babel program has a number of dimensions in terms of the people working on it. Obviously the program wouldn't exist without data, and Appen has been the data collector from day one; I actually talked with them and proposed the notion of the data collection when I came on the job. Then we have the test and evaluation team, which is what T&E stands for. It's important to realize that you can have NIST run an evaluation, but they need technological support to set up an evaluation and an approach like this. So we have a team that actually builds systems so we can do forced alignments and things like that; MITRE works on some of the logistics; and CASL provides needed help with linguistics. They advise me on a number of dimensions, certainly on getting good phonetic coverage across the languages and getting good diversity of languages; it would be really hard for me to do that on my own, since I don't know all these languages. The other thing is that there is a sort of teaming between the T&E team and Appen to ensure that the quality of the data is appropriate for the task we're doing: keyword search is not something Appen had supported before, and transcription or lexicon problems really do make keyword search challenging to evaluate, so we also do T&E quality checks offline; I'll talk a little bit about that. And then we have four teams, with the primes on the left: CMU, IBM, ICSI, and BBN are the primes, and you can see all the people who participated in the base period. Sometimes there's some reconfiguration, but this is the picture as it was at the time of the base period, so Mobile Technologies is still in there.
So, lots of work. I think there are sixteen papers here that were supported by Babel, and if you go back through ICASSP and Interspeech over the past couple of years, I think there are probably a hundred papers or so that have been sponsored by Babel, all with great work. I want to point out that as I go through this I don't have time to touch on all the work or all the cool things people are doing; I'm just going to present a selection, sort of the interesting lessons learned, and there are a lot of other things people are doing that are quite interesting. I'm also going to point out how we changed things for the option period, and the kinds of things that look like real glimmers of hope. I'm not going to exhaust the research; you'll be able to see it at future conferences.
The data collection is actually quite daunting. We're collecting the data in batches, with roughly seven languages being collected at a time. We only needed four practice languages and one surprise language for the base period, but we collected seven, and it was a good thing we did: what we had planned to use as the development language and the surprise language were Assamese and Bengali, with Assamese supposed to be the surprise, but things went wrong with those collections, and so we basically had to use the other five languages. So it is really important to over-collect relative to your needs at any particular time. The amount of time it takes to collect seven languages, given that you stagger the kickoffs, is roughly two years, so you can see there are these two-year overlapped periods. It is really interesting: right now we're working on the next set of languages and getting ready to send funds for the set after that, so this really is the critical period for making sure the rest of the program plays out. You can see there is an increasing number of languages in each period; subtract one for the surprise language and you can see how many are being used for practice. So you can imagine that by the time you hit the end of the program, multilingual systems are going to be really well supported. We have a variety of criteria for selecting languages, which I'll talk about a little more on the next slide. Most of these languages are multi-dialectal, and they also represent a wide variety of recording conditions; starting in the option period we also began collecting a microphone channel, and the evaluation data include surprise environments or channels. So there is always something unexpected in the evaluation. It's not a large fraction of the data, but it is there so people can assess whether their methods are working on these things.
We pick languages from a variety of language families with different features: phonotactic, morphological, syntactic, and so on, and whether or not there are tones. They are collected in-country, which I think is really important, so you're dealing with a wide variety of telecommunications situations. There's dialectal variation and a wide variety of environments. The easiest environment tends to be the home or office one, where there's a landline or a mobile phone; it's not always a landline in some of these countries now, since the landline is disappearing in some of the collections we're doing. Probably the hardest place in Babel is the car: the car collect tends to be one of the harder ones. And then there are others; obviously you want to have non-telephone-channel data in there as well. As for metadata balance, we do provide the metadata with each of the audio files, so the collection could ultimately be used to support dialect ID, language ID, or other things. You want to collect this data in such a way that it can be used for a variety of purposes.
We start off doing a risk assessment; obviously you don't want to go into a country where there's a likelihood that people will die while doing the collection, so you have to take that into consideration. We also have to consider whether or not there is the potential to get transcribers and people who know something about the language; all of those things are certainly taken into account. Then we begin the work of vetting a language, where we work on what Appen calls a language specific peculiarities document. It typically involves providing the phoneme set that is going to be used by Appen, and a variety of other things: something about the dialects, which primary dialect they would standardize on, and, for example, conventions that some people use and some people don't. It's a living part of the process, so we keep it going; it provides the start of the lexicon, and also some sample sentences, which are very useful. Then there's a small database of transcribed conversational speech that they send to us, which is reviewed by CASL and others to make sure the transcription quality is reasonable. Sometimes we also get a lexicon to look at and provide feedback on, and that affects things. Then we receive an interim delivery, which is about three hours of conversation, and we start looking for spelling variants, words whose written forms are diverging, because you can use the lexicon to help you spot these together with some language experts. We try to clean that up, so spelling normalization is something we do. Perhaps it introduces a certain amount of artificiality, but it certainly is important to do, and I can tell you it's not going to be a hundred percent accurate; it's being done with a certain amount of limitation on the resources available.
Finally we get the big delivery, and that's reviewed and partitioned into training, dev, and eval. Every collection is treated as if it were a surprise language, where we use seventy-five hours for the evaluation set; for the development languages, the practice languages, we only use fifteen. So in many cases we have a lot of leftover audio that we just don't pass on. We also develop keywords using a certain amount of the data; we have them annotated by Appen so that we can assign types and so on, giving us a certain notion of balance among the keywords: we make sure we come up with a certain number of names and so on, so that there's balance in the test. We also look at the segmentation: the segments Appen provides can be very large, so we re-segment using voice activity detection, and those segments are passed back to Appen for a quality judgment, where they are compared to the original segments. Then we do forced alignments on the dev and eval sets and give the forced alignments to the performers.
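The program's actual re-segmentation tooling isn't described here, but the idea is simple enough to sketch: cut long recordings into speech-bearing chunks. Below is a toy energy-based version in Python; all thresholds and names are illustrative assumptions, not the pipeline's real parameters.

```python
import numpy as np

def energy_vad_segments(samples, rate, frame_ms=25, hop_ms=10,
                        threshold_db=-35.0, min_speech_s=0.3, max_gap_s=0.5):
    """Toy energy-based VAD. samples: mono float audio in [-1, 1].
    Returns (start_sec, end_sec) pairs covering the speech regions."""
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    segments, current = [], None
    for i in range(0, max(len(samples) - frame, 0), hop):
        window = samples[i:i + frame].astype(np.float64)
        level_db = 10 * np.log10(np.mean(window ** 2) + 1e-12)
        t = i / rate
        if level_db > threshold_db:
            if current is None:
                current = [t, t + frame / rate]   # open a new segment
            else:
                current[1] = t + frame / rate     # extend the current one
        elif current is not None and t - current[1] > max_gap_s:
            if current[1] - current[0] >= min_speech_s:
                segments.append(tuple(current))
            current = None
    if current is not None and current[1] - current[0] >= min_speech_s:
        segments.append(tuple(current))
    return segments
```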
These are the period one languages. We began with Cantonese, Pashto, Tagalog, and Turkish, which were pretty risk-free languages, and then we tested on Vietnamese. Remember, Vietnamese was not supposed to be the surprise language, and it ended up being somewhat challenging, in that for Cantonese the collection provided word boundaries, but for Vietnamese it was just the syllables, so things tend to be short words. They also did a not-so-bang-up job of including all the dialectal variants of the pronunciations, which I think probably also caused problems. But as a resource it's a great one if you're interested in understanding the Vietnamese dialects. You can see the number of dialects per language: Cantonese has five, Pashto four, Tagalog three, Turkish seven, Vietnamese four. For instance, the Cantonese dialects were probably pretty hard for some people to understand, so at the beginning, when we used the data, there was some question about whether those dialects were really Cantonese, but they were.
When we evaluated this, NIST developed an evaluation plan, and there were three conditions for the language resources that could be used. There's the basic language pack condition, base LR: I use the resources I'm given and nothing else. There's the Babel LR condition, where you can use any of the Babel language packs you have available, which is very nice for multilingual work. And then, if you want to bring in other, non-Babel resources, the other LR condition lets you do that: for example, if you want to bring in web text or a pronunciation lexicon, or if you have some found data. Then there's the amount of training data used within those conditions: you could either use the eighty hours of conversational training together with the scripted data, or you could use the limited condition, which uses ten hours of transcription sub-selected from the eighty hours, so it's a proper subset of the eighty-hour set. And then there are two conditions for handling the keywords. In the no test audio reuse condition, you build your keyword search system without knowledge of the keywords and then basically run the search on those keywords; you're not allowed to re-decode or retrain or anything like that with knowledge of the keywords. Obviously you're going to decode, but you cannot take knowledge of the keywords into consideration. The test audio reuse condition means you have knowledge of the keywords: you can do things like automatically add them to the lexicon and do creative things with the language model, and if you're in the other LR condition you could even go out and look for language model data, and so on. So there's a lot of variability here. In the option period we've actually changed things up a lot, so that people can declare the resources they use, and there are a lot of interesting new conditions that performers can come up with and are coming up with. This was the start, but I think there is certainly going to be a lot more variability in the experiments people do in the future.
Another innovation that came out of the program: since we're evaluating so many languages, and we don't want to prevent people from running experimental conditions, NIST developed a scoring server. This allows researchers to submit and get evaluated against the test data. We don't release all the test data after the test; we release some portion of it, and there is a sequestered part, but through the server you can still evaluate against the full test set. I think that's really important: if you're writing a paper ten months after the evaluation and you want to go back and re-evaluate, or you've discovered something new and want to test your hypothesis on the past languages, you can do that and still get scores on the full test set. I think that's very important, and I really think it's going to make a lot of difference in terms of the pure science the program can support.
Jon Fiscus put together this plot for the open evaluation. It shows submissions over the weeks of the program, and you can see where there are spikes, rapid increases in the cumulative number of submissions. But notice that even after the evaluation is over, especially for Vietnamese, people kept submitting, because Vietnamese was somewhat challenging and some people wanted to continue the work, and the same holds for a number of other languages. The results get back to you as soon as NIST confirms everything is okay; there's an intermediate point where they make sure everything is working properly, so it usually takes about a week before the first results come back, but after that they come quickly. Then people can report them openly.
In the first period people attacked the data and did a lot of creative things. People submitted primary and contrast systems, and for the most part the primary submissions were system combinations; we'll talk a little about system combination, because it really does seem to help, except for Swordfish. All performers were able to make the program targets in all languages, including the surprise language, using the full language pack, and that in the base language resource condition with no audio reuse. And of course there are other conditions where you could potentially do better. Program targets were even exceeded with only ten hours of training on several of the five languages by some teams, usually using system combination. System combination reduces the token error rate and increases ATWV compared to single systems, but even single-system, full language pack systems made the program target. All systems, of course, have very low probability of false alarm, so lowering the miss rate plays a significant role in increasing ATWV; that's something you want to keep in mind.
There were several collection factors that actually affected ATWV: language, dialect, environment, and gender, and I'm going to show you some pooled results that I think are sort of interesting. I don't think we've shown these even to the performers; I actually put this together for my program review. I'm not sure whether these slides will be posted; actually, they're probably not posted. You can see that the base LR, full language pack results are all marked in red, and while not everybody submits to every condition (the only required one was the full language pack, base LR), you can see people made their targets in all the languages.
Gender affects ATWV and, what was kind of interesting, word error as well. In this set of collections the systems did better with female speech, which is kind of interesting, though not in all the languages, and sometimes by a lot: look at Tagalog, for example, where the males are so much worse. I don't know why; I mean, we collect around two thousand speakers per language, and I'm sure there are interactions with other factors. Environment is important too. You can see that overall, pooling over all systems, you get an average of 0.51 ATWV. The car and the unexpected environments are sort of the worst; the landline and mobile in the home office setting are sort of the best; and the public place and street are somewhere in between, and typically those are probably collected with cell phones.
But when you look across languages, and this is kind of a messy slide, the car data is significantly worse for Pashto and some of the others. Pashto was obviously a harder language overall, but there's something going on there. And it is kind of interesting: you look at Turkish and the landline is wonderful; well, they probably have a much more stable landline environment, while in some of these countries landlines may be rare, so maybe for Pashto the cell phone was the predominant thing. What I didn't give you is the breakout of the distributions. Dialect and ATWV also interacted. This is Pashto for the four teams, and you can see Northeast, Northwest, Southeast, and Southwest. Southwest was really under-represented; that became clear partway through the collection. But you can see people could still do something with it. Some of these dialects are related, but certainly the ones with the most data did the best, and the ones with the least data were sort of the worst, and that was true across the board. Certainly dialect adds a dimension of challenge to the data.
(I think it's something specific to this room; somehow or another I'm getting an echo.)
So what helps? Well, early on it was clear, especially with the Cantonese data, that you've got to re-segment the data and do silence modeling to get rid of the silence, or you kind of screw things up. Robust multilingual MLP features were really important, I think; they really played a major role. Deep learning started to shine in the program very early, and I think there's lots and lots of room for it to keep shining and for very interesting experiments. Pitch features across languages were useful, at least for most people, and what's kind of cool about that is that it gives hope for more universal feature extraction. One of the things that was really extremely important was to develop methods for preserving potential hits, the search alternatives, and there are a variety of ways of doing that, including denser lattices and smarter ways of doing the queries; there are a number of papers here, and at other venues, that you can read on this topic. Then, combining systems, especially with limited training data, really matters a lot; it matters whether you build the systems differently or just randomly seed them differently, but system combination is very useful. Semi-supervised training is very helpful for acoustic models and features. And score normalization plays a big role: if you do nothing else, score normalization gives you a lot.
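Since score normalization comes up repeatedly in what follows, here is a minimal sketch of the keyword-specific, sum-to-one style of normalization used in this literature: each keyword's detection scores are rescaled relative to the other detections of the same keyword, so one global decision threshold behaves sensibly across frequent and rare keywords alike. The sharpening exponent gamma and the data layout are illustrative assumptions.

```python
def normalize_keyword_scores(hits, gamma=1.0):
    """hits: dict keyword -> list of (location, raw_posterior).
    Rescales each keyword's scores to sum to one (after an optional
    sharpening exponent), so a single global threshold can be applied."""
    normalized = {}
    for kw, occurrences in hits.items():
        total = sum(score ** gamma for _, score in occurrences)
        normalized[kw] = [
            (loc, (score ** gamma) / total if total > 0 else 0.0)
            for loc, score in occurrences
        ]
    return normalized
```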
So, I could report a number of things; I just picked a smattering. Typically the reason I picked something was not an endorsement per se, but largely because there was some picture that fit with the point I was trying to make. Several of these have papers appearing here, so I cite those where I could line up the result; some of these results are things I got from site visits rather than from papers, because I had to prepare the talk a while ago.
But you can see: stacked bottleneck features versus first-stage bottleneck features get you an eight percent reduction in word error, and a concomitant improvement in ATWV. Adding fundamental frequency and probability of voicing reduces word error; this was on Vietnamese, I believe. Regenerating the neural network targets added a percent, and semi-supervised training helped a lot too. And those were all additive, so: very cool.
So features are very important, and deep learning is very helpful. We have a comparison here between shallow and deep networks, and you can see the shallow versus the deep ATWV: a two to three percent absolute improvement. This was using the Kaldi tandem SAT fMPE full language pack models.
Pitch helps even for non-tonal languages. This result is from Dan Povey, who has been playing around with pitch features because he was very unhappy with how Kaldi performed with pitch on Cantonese and Vietnamese, so he's done a lot of interesting work there. You can see that when they add the SAcC pitch features, the score sometimes goes up, and it goes down a little bit for Bengali; but his method, which he incorporated into Kaldi, gives an improvement on all those languages. Vietnamese and Cantonese are tonal, but you can see the non-tonal languages, Assamese and the like, benefit as well, and certainly a lot of other people with similar problems have this kind of result.

Large lattices help, up to a point.
This is a DET-style plot, where random is up in the upper right corner and the further down you go the better; the curve shows the operating performance in terms of the tradeoff between probability of false alarm and probability of miss, so being further down really matters. You can see the green line is produced with small lattices and the purple line with larger, even enormous, lattices. Eventually there are diminishing returns, but certainly preserving the stuff you want to find is extremely important.
Knowledge of the keywords helps. You can see it helps even more with the limited language pack, where you might not know about those words from the ten-hour subset. If you know about the keywords, you can leverage that knowledge in interesting ways, like not pruning things away; you always want to keep the probabilities right, but you might want to set specific beams for specific words. One team developed a white list approach under the audio reuse condition, and you can see the effect here: with knowledge of the keywords before decoding, they get a recall of keywords of about ninety-two percent; without knowledge of the keywords it's seventy-four percent. You can see there's a big difference in ATWV, the number of hits per keyword is much higher, and the number of keywords without hits is much lower. But even if you simply look at infrequent words that may be important, just boosting them in the language model actually gives you something in terms of preserving those keywords, with recall somewhere in between. And that's beneficial: it's preserving things so that you don't prune them out.
And when you look at system combination (I think system combination is about preserving stuff, too), you get big gains. This slide shows the best single systems and the combined systems, on both a full language pack and a limited language pack, and you can see that, except for Pashto, system combination gets you to about 0.3 ATWV, which is pretty amazing: high word error rates, but you can actually make the target. Amazing.
Here's another picture of system combination, where you can see the individual systems built with various features and models, DNNs, bottleneck features, and so on, and then the combination. Note that these are limited language pack results as well, so you're going to see much more modest scores.
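The mechanics of this kind of combination are easy to sketch: merge the per-keyword detection lists from several systems, treat detections that overlap in time as the same putative hit, and sum their (optionally weighted) scores, so hits found by several systems float to the top. This is an illustrative CombSUM-style merge under assumed data layouts, not any particular team's recipe.

```python
def combine_systems(system_hits, weights=None, overlap_sec=0.5):
    """system_hits: list of dicts keyword -> [(utt_id, time_sec, score), ...],
    one dict per system, with scores already normalized per keyword.
    Returns a merged dict in the same format."""
    weights = weights or [1.0] * len(system_hits)
    combined = {}
    for hits, w in zip(system_hits, weights):
        for kw, occurrences in hits.items():
            merged = combined.setdefault(kw, [])
            for utt, t, score in occurrences:
                for entry in merged:
                    # Same utterance, close in time: same putative hit.
                    if entry[0] == utt and abs(entry[1] - t) <= overlap_sec:
                        entry[2] += w * score
                        break
                else:
                    merged.append([utt, t, w * score])
    return {kw: [tuple(e) for e in v] for kw, v in combined.items()}
```

Note that, in line with the normalization result discussed below, the per-system scores here are assumed to be normalized before the merge rather than after.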
Right: good normalization. This is the BBN result on dev and eval per language, for Cantonese, Pashto, Turkish, Tagalog, and Vietnamese, and you can see normalization gives you a significant improvement. It's not always the same size on the dev and the eval set, so there's some impact of the particular set, but you can see that normalization, and doing it well, is certainly a big part of the program, and there are a lot of methods people are working on now, including rescoring approaches.
The other interesting result, which I believe appears here as a poster (I couldn't fit all the authors' names and keep it readable, so I put "et al."), is that when you normalize is very important. You've got the contrast between the no audio reuse and the audio reuse conditions, but you can look at either row. If I normalize after system combination, I only get so far; but if I normalize before I do system combination, I do really well. And if I normalize after the best tokenization, before score combination, I can basically build a single system that is better than what you produce by normalizing late rather than early. So if you're doing combinations of various representations, it's important to get the scores into the same space first. It really is important; it makes a big difference. And quite frankly, a single system is going to be much easier to run, so that's a useful thing to know.
Another paper that appears here touches on analysis: the effect of thresholds on ATWV is an interesting thing to look at. You can compare a fair threshold, one set purely from my notion of what I can do based on the data I have for development, versus setting the threshold to the optimal value for each keyword. And then, if I play around and make sure that I keep the things that matter and throw away the things that don't, I basically push the probability of a hit toward one and my probability of a miss toward zero, and you can see that the probability space is also playing a major role in your ability to get the keywords. It's not just a matter of calibration: getting better probabilities seems to be an important aspect as well.
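The per-keyword threshold referred to here falls straight out of the ATWV definition: a detection with posterior p is worth keeping when its expected gain, p divided by the keyword's true count, exceeds its expected cost, beta times (1 - p) divided by the number of non-target trials. Solving for the break-even p gives the rule sketched below; since the true count is unknown at search time, a common trick in this literature is to estimate it as the sum of the system's posteriors for that keyword. Names are illustrative.

```python
BETA = 999.9  # same weighting as in the ATWV definition

def keyword_threshold(posteriors, t_speech_sec):
    """posteriors: the system's detection scores for one keyword.
    Estimate n_true as the sum of posteriors, then return the
    break-even decision threshold for that keyword."""
    n_true = sum(posteriors)  # expected number of true occurrences
    if n_true == 0:
        return 1.0  # nothing expected; accept nothing
    return (BETA * n_true) / (t_speech_sec - n_true + BETA * n_true)
```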
So there are a lot of interesting things people can look at, and analysis, I think, is a very important aspect of the program: understanding why something works and why something doesn't. Learning why something doesn't work is not such a bad thing; it basically buys you a piece of knowledge that really matters for solving the problem.
We also held an open keyword search evaluation in 2013 on Vietnamese, and we had a lot of participants: the four Babel performers plus eight outside teams who ended up submitting systems, and I list them here. We had eight wonderful volunteers who actually participated in the OpenKWS meeting, and the results are all over the place; I put them up there anyway, and they are posted, so you can go look at the OpenKWS results. If people want to participate in the next one, maybe they won't feel so shy about the possibility of submitting something that may not be super; certainly the Babel people have a lot more practice with the data. But you can see the scores were all over the place, people really did a lot of interesting things, and there were high-resource approaches as well as low-resource approaches.
In period two we added six languages: five practice languages and one surprise. The teams only get sixty hours of transcribed training, although they do have the remaining twenty hours untranscribed; there's also the ten-hour training set, and they have to exceed the program targets now on both conditions, because they got so close. Also, approaches that use things like morphology and so on might help more in the ten-hour condition; maybe the sixty hours, or the eighty, is a little too large to show it. And then they'll have three weeks to build the surprise language. The languages are Bengali and Assamese, which were collected in the first period and don't have a second channel, they're pure telephony; then we have Zulu, Haitian Creole, and Lao; and of course we have a surprise language, which I'm not going to reveal here. Assamese and Bengali are, I think, somewhat okay, but Zulu appears to be quite challenging and Haitian Creole appears to be quite simple, and these are aspects of the languages; I don't think they're aspects of the collection.
And then Lao will have its own challenges, because again we couldn't annotate the compounds reliably, and so the Lao words, as opposed to the borrowed words, are not multisyllabic; they are single syllables.
CASL put together some of the challenges of these languages and presented them, and I thought that was interesting. There's the notion of shared language models, where you can share between Bengali and Assamese. Assamese doesn't have much of a web presence, so that's an interesting issue, as is borrowing from French resources for the Haitian Creole. On the phonology, there are tones in Lao too; Lao has tone kind of like Cantonese and Vietnamese, but its tone system is very different. Unfortunately, tone marking in the lexicon could not be done reliably, so it didn't make sense to put it in the resource. You also have some segmental phonology issues in Bengali, and morphology issues in Zulu, big time, maybe more so than in the Bengali: the Zulu OOV rate is higher than in any of the languages we've seen, including Turkish, which didn't really have a terrible OOV rate. And then there are other aspects that linguists might be interested in looking at, like the scripts. The Bengali and Assamese scripts are very similar; strictly speaking Assamese has its own script, but it really is nearly the same as the Bengali one. And then you have Lao, which has yet another script. There's a lot of code switching in Haitian Creole, and in the Zulu as well, certainly, so those can be problems. And then there are a lot of short words in Haitian Creole and Lao; I guess the shorter words could hurt Haitian Creole... well, maybe not.
So, exciting directions people are going in. One of the things we want is more analysis, and so we revised the evaluation plan; it's posted at the OpenKWS site, and you can take a look if you want. The idea is that people can evaluate a lot more conditions and then share those conditions with each other, so that others can evaluate likewise. There's a lot of work going on in multilingual processing; it's very intriguing and very interesting. And the deep learning work: those neural net models certainly seem to play a role in the progress people are making. Machine learning got a somewhat slow start, because you're trying to integrate that community into the speech community, but they're beginning to take off too, so stay tuned; I think a lot of interesting things are going to happen.
Smart lattices and consensus networks were beginning to play a role at the end of the last period, and I think they're actually making much more progress now. The thing is, a lot of work was needed to make consensus networks work with the keyword search task. Originally they were developed by Lidia Mangu and colleagues basically to do a last pass right before you gave your one-best output, and they were great for that, but there were things you had to do to make them work a little better for keyword search.
And then morphology: again, this is community integration, people who largely come from text processing working with the speech community. There are a lot of tradeoffs around whether you want to break words up into little pieces, which might be great if you're doing text but isn't so great if you're doing speech. A lot of the integration across the teams is beginning to bear fruit there as well, so it's quite interesting. A big thing that I think is really important is getting by with less: ten hours of training or less (I haven't seen results with less, but I certainly think that would be cool), and no pronunciation lexicon.
Everybody promised to do ablation studies, but to a large extent the program targets unfortunately seem to sometimes drive the research toward the targets rather than toward actually exploring the space of experiments; there is a tradeoff between having annual evaluations and getting people to do research. But I really do hope people will explore these conditions, because I think they're really important.
So I'm ending with a slide about the OpenKWS. The slide is busy, but you can see the timescale: registration is going to close at the end of January, so if you're interested at all, please do consider it. The Vietnamese language pack will be available for those of you who have not participated before, and the OpenKWS participants who have taken part before can keep the data as long as they participate again. So if you just keep participating, you can actually keep all the surprise languages, and hopefully NIST will open up some of those languages by evaluating on them too. So there's lots of data, and it's very useful; there are a lot of things you could do with that data beyond supporting basic speech recognition and other types of speech research. And hopefully by the time we hit the end of the program this will be released publicly to everybody, since we own all the data.
You can see the surprise language build: the data will be sent a week or so before the evaluation begins, and then we send a password, so there won't be any problem with the download. The download is going to be a little bit harder this time, since we have the microphone channel data, which is not downsampled in any way; we figured handling that is an aspect of dealing with the data. Then people have the three weeks. We send out the evaluation pack ahead of time as well, and it's larger: it's seventy-five hours, and some of that is channel data. We'll send the password for it on April 28th, at which point people have a week to complete their submissions, and you can submit many things. NIST will keep an eye on things to make sure the submissions are sound and there are no problems, and there is a point of contact and so on, so it should not be a very painful thing.
The other thing is that there will be an OpenKWS meeting in which everybody is expected to participate, so there's a bit of a burden there for people who might take part. But I think the meeting last time was very valuable, and the Babel folks were really very generous in sharing their insights, so it's a great opportunity to hear about the work and to be able to ask questions and interact with the Babel participants. I think the OpenKWS is a really good thing.
And last but not least, this is the get-off-the-stage slide; this is one of the things you have to do in the pitch for the program. I put a little asterisk there on "all languages covered," because obviously it's nice to be able to say "all," but really there's the caveat that this has to be a language that has an orthographic transcription. And I have to say, even just having an orthographic transcription does not make it easy to create a language pack; some languages are much more normalized than others. As much as we have done a lot of work on normalizing English, and there are still a lot of spelling variants, it's a lot harder to do in these other languages, where there really aren't well-studied conventions. So I'll star that caveat, because you really do have to have the capability of cleaning up the language, even when it has a presence on the web.
And the tone issues, we talked about those as well. We're moving down to ten to forty hours, working with variable recording conditions, with systems developed in a week. The big immediate impact has been language data: we've shared language data and held open evals, and that impacts the community and also helps the government. New methods in speech search and speech systems are sort of the medium-term impact, and getting effective keyword search in new languages, delivered quickly, is the ultimate deliverable. Learning how to do that, learning how to solve the problem of "here is a new language, now build the system," is really the core principle of the program, and everything really needs to be projected in that direction.
Ultimately there are lots of other questions to ask, like: what if I only have a certain amount of time to transcribe? We find that we can't mandate that very well programmatically, but people can certainly investigate it, considering the time to transcribe and clean things up when selecting the data they work with. The nice thing is that the eighty hours of audio are there regardless of how much transcribed data you use, so there's a lot of room to investigate a wide variety of ways of getting by with less, including getting by with no lexicon or without transcripts at all. There certainly is more work like that going on in the program. It may not reach the performance the best systems achieve, but I would say it's all equally important and vital to the program; having a wide variety of things going on, I think, is really important.
I'm done, so if you have questions...