Thank you very much, thanks for having me here and giving me the chance to talk about our work.
So I would like to talk about how we can build systems with very low resources. And, very much as Lori also did, I would like to give a brief definition of what we think low-resource languages are. Surprisingly, if you ask linguists you get different answers, but the definitions roughly agree and say: an under-resourced language is a language which lacks electronic resources for speech and language processing, lacks presence on the web, lacks a writing system or a stable orthography, and may lack linguistic expertise. A language qualifies as under-resourced if just one of these criteria applies, but in many cases several of them apply. Some people call it under-resourced, some call it low-resource, or less-resourced, or whatever, but it all means the same thing.
However, what is really different is minority languages: a minority language is not necessarily low-resourced, and a low-resource language is not necessarily a minority language, so these have to be discriminated. We were trying to put together some definitions, so there will be a special issue of Speech Communication about under-resourced languages and how to build ASR systems for them, which I was lucky to put together with Laurent Besacier, Etienne Barnard, and Alexey Karpov. It will be out in January of next year, so twenty fourteen.
So the ideal case, I think for many people, is when you have plenty of resources. That means you know the phone set of the language, you have plenty of audio and the corresponding transcripts, you have the pronunciation rules and experts who build the pronunciation dictionaries for you, and you have plenty of text data. Then you can build ASR systems, NLP or machine translation systems, and also TTS systems for that matter. So that would be the ideal world.
However, the world is not ideal, and I was asked to talk about low resources, so: what do we do if we have very few or no data? I would like to present a few of the solutions we are currently working on. The first one addresses the case where you do not have any transcripts at all in the language you would like to work on. The second is when you do not have any pronunciation dictionary: how can you dig, for example, into the web to get pronunciations? The third is some work we are currently doing for the case where a language has no writing system at all. And then, what can you do if you do not have any linguistic expertise, so no language experts at hand?
For all the solutions I show today, my general underlying assumption is that I would like to leverage existing knowledge and data resources which I have gathered from many languages already. I believe that if you have seen a lot of things in many languages, it should help you to build the resources in the next language. So to me, the holy grail of rapid adaptation to new languages, and probably also to new domains, is this: you have plenty of data given in many different languages, you derive a sort of global inventory of models from it, and then, with a tiny bit of data or maybe next to zero data, you adapt to that particular language.
When I started my thesis with Alex Waibel, he pushed me very much: if you want to do something with many languages, it would be good to actually have many languages. So I started collecting a large corpus of data, the GlobalPhone corpus. I wanted something uniform, in the sense that it appears in many different languages but has the same content and the same style. I started doing this in ninety-five, and ever since we keep collecting data, so we have now accumulated twenty-one languages; for each language we have about twenty hours spoken by one hundred speakers. Altogether it is a small amount compared to what people use today, but it covers many languages. Over the years we have built recognizers in all these languages, so we have a pool of knowledge in the acoustics, in the dictionaries, and in the text, and I would like to show you how to leverage this in situations where we do not have any resources.
The underlying idea, when I started in my thesis, was to share acoustic models across languages in order to get a sort of global phone inventory; that was still Gaussian-mixture-model time, so pre-DNNs. At the time, and that is something that was also mentioned earlier, we used the IPA, the International Phonetic Alphabet, to share data across languages: whenever two phones have the same IPA representation, they are jointly trained in a language-independent fashion.
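The IPA-based sharing can be sketched as follows. This is a minimal sketch, not the actual GlobalPhone setup: the phone-to-IPA table and the data items are made-up toy examples.

```python
# Sketch of IPA-based acoustic model sharing: phones from different
# languages that map to the same IPA symbol are pooled, so their data
# can train one language-independent model.  The phone-to-IPA entries
# below are toy examples, not the real GlobalPhone mappings.
TO_IPA = {
    ("german",  "SH"): "ʃ",
    ("english", "SH"): "ʃ",
    ("english", "TH"): "θ",
    ("spanish", "S"):  "s",
    ("german",  "S"):  "s",
}

def pool_by_ipa(utterances):
    """Group (language, phone, data) triples by their shared IPA symbol."""
    pools = {}
    for lang, phone, data in utterances:
        ipa = TO_IPA[(lang, phone)]
        pools.setdefault(ipa, []).append(data)
    return pools

pools = pool_by_ipa([
    ("german", "SH", "d1"), ("english", "SH", "d2"), ("english", "TH", "d3"),
])
# "ʃ" now has training data pooled from both German and English
```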
One observation we made: this graph shows what happens when we do this for twelve languages, sharing based on IPA. The global phone inventory just keeps going straight up, and that was only twelve languages. So my expectation was that this would keep growing: the more languages we see, the more diversity we get, and we are not anywhere near an asymptotic behavior of that curve.
Then, as we heard yesterday in these wonderful talks from Morgan and others about the DNNs: what I am really excited about with the DNNs is that, for the very first time, they not only work well for bootstrapping new languages, but something happens which I had never seen before in acoustic modeling. Whenever I shared across languages, I would get worse performance on the training languages; I never managed to get better performance on the languages used for training, I only got slightly better performance on new languages. Now, with the DNNs, I really hope that for the very first time, if you train a multilingual network, you can even outperform the monolingual systems on the languages which are in the training set. That makes me very happy, and I think it gives hope for using them for bootstrapping to new languages as well.
So the first thing I would like to talk about is what can be done if there are no transcriptions at all: you have a new language, you do have, let's say, a pronunciation dictionary and a good amount of speech, but you do not have any transcriptions. For this experiment we took Czech, which does not mean that I consider Czech to be a low-resource language. The reason we took Czech is twofold: first, the conference was in the Czech Republic, so we thought this was an appealing idea; second, we had four other Slavic languages which are related to Czech, and we wanted to figure out whether we can gain from having related languages. So we had two sets of experiments. We took four Slavic languages, Croatian, Russian, Bulgarian, and Polish, in the hope that the relatedness might help for bootstrapping Czech. And then we took another situation, which is probably more realistic, namely that you already have ASR systems in resource-rich languages, that is English, French, German, and Spanish. We basically said: okay, we have the speech, we have a dictionary, and we have a language model in Czech; what we are lacking are the transcripts. And the idea is that we want to leverage the knowledge which we have already gained in many other languages.
The idea goes as follows. We take the source languages, in this case Polish, Croatian, Bulgarian, and Russian, and the recognizers we have in these languages. Now we take our target Czech dictionary and language model, which we assumed to be given, and we use a phone mapping to express the Czech dictionary in terms of Polish phones. It sounds odd, but it basically allows us to use the context-dependent acoustic models from our Polish ASR system. We do the very same thing for Croatian, Bulgarian, and Russian, so we create four different dictionaries, each expressed in the phone set of the respective source language: they are still Czech words, but expressed in Croatian phones, Russian phones, and so on.
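The dictionary mapping step can be sketched like this; the Czech-to-Polish phone table is an invented toy example, not the real IPA-derived mapping.

```python
# Sketch of expressing a target-language (Czech) dictionary in a source
# language's (Polish) phone set, so the source acoustic models can be
# reused directly.  The mapping table is a made-up toy example.
CS_TO_PL = {          # hypothetical Czech -> closest Polish phone
    "ř": "ż",
    "ě": "e",
    "a": "a",
    "h": "x",
    "p": "p",
    "r": "r",
}

def map_pronunciation(czech_phones, table=CS_TO_PL):
    """Replace every Czech phone by its closest Polish phone."""
    return [table[p] for p in czech_phones]

# The word stays a Czech word, but its pronunciation is now spelled with
# Polish phones and can be decoded by the Polish recognizer.
print(map_pronunciation(["p", "r", "a", "h", "a"]))   # ['p', 'r', 'a', 'x', 'a']
```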
Now we can run the Polish recognizer on Czech data, and it gives us Czech words; we have the Russian dictionary and the Russian ASR, the Polish one, and so on, so basically we get four hypotheses of the Czech words.
Then what we do is leverage a confidence score which had been proposed by Thomas Kemp and Thomas Schaaf, called A-stabil. We basically calculate how often the same word appears in the hypotheses from the different recognizers: this is the Czech hypothesis from the Polish recognizer, this is the Croatian one, the Bulgarian, the Russian. Whenever two or more languages agree on the same word, we assume this word to be correctly identified, and it can then be used for training. So the whole idea is to leverage the vote across the different languages. The reason why we believe this works quite well: if you take the usual confidence score and apply it with just one language, we know that the confidence score is rather unreliable. This plot shows the word error rate over the confidence score threshold, and with just one language, where you really start out with very poor performance, even a very good confidence score is not very reliable. But if you do this for several languages, so that at least two languages have to agree on the same words at the same position in the hypothesis, then you get a much more reliable behavior of the confidence score. And we learned that a threshold of roughly one divided by N, plus an offset, where N is the number of languages applied, is a good value.
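A very simplified stand-in for the multilingual voting idea might look like this. The real A-stabil measure works on aligned hypotheses and lattices; this toy version just counts in how many of the N hypotheses a word occurs and applies the 1/N-plus-offset threshold from the talk.

```python
def multilingual_vote(hypotheses, offset=0.2):
    """Toy multilingual agreement score: a word is trusted if enough of
    the N source-language recognizers agree on it.  This ignores word
    order and alignment, which the real measure does not."""
    n = len(hypotheses)
    threshold = 1.0 / n + offset          # the 1/N + offset rule
    counts = {}
    for hyp in hypotheses:
        for word in set(hyp):             # one vote per recognizer
            counts[word] = counts.get(word, 0) + 1
    return {w for w, c in counts.items() if c / n >= threshold}

# Four source-language hypotheses of the same Czech utterance (toy data):
hyps = [["praha", "je", "krasna"],
        ["praha", "je", "krasna"],
        ["praha", "city"],
        ["praha", "je"]]
trusted = multilingual_vote(hyps)   # words with enough cross-lingual votes
```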
The whole framework then works as follows: you create your source recognizers with the mapped dictionaries, you run them on your Czech audio data to produce transcriptions, and once you have the transcriptions, you use the multilingual A-stabil score to find which parts of the transcriptions match. Then you adapt the acoustic models of your source recognizers using the transcriptions you just derived: the Polish recognizer gets tuned towards Czech, the Russian recognizer gets tuned towards Czech, the Croatian one, and so on. Then you throw everything away, keep only the adapted acoustic models, and iterate. So we iterate the whole process, always throwing the transcriptions away, and once we have a certain amount of automatically transcribed data this way, we use a multilingual phone inventory to bootstrap the Czech system.
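The iterative loop above can be sketched schematically; `decode`, `vote`, and `adapt` are placeholder callables standing in for the real recognizer components, not an actual implementation.

```python
# Schematic of the iterative bootstrapping loop: decode the untranscribed
# audio with every source recognizer, keep only words on which enough
# recognizers agree, adapt each recognizer with those words, and repeat.
def bootstrap(recognizers, audio, decode, vote, adapt, iterations=3):
    for _ in range(iterations):
        hypotheses = [decode(r, audio) for r in recognizers]
        trusted = vote(hypotheses)               # e.g. multilingual A-stabil
        recognizers = [adapt(r, audio, trusted)  # tune every source system
                       for r in recognizers]     # towards the target language
        # the transcriptions are thrown away; only adapted models are kept
    return recognizers
```

With trivial stub functions this just runs the adapt step three times per recognizer, which is all the sketch is meant to show.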
We were quite happy with the performance we saw. What you see here is the first iteration, the second, the third, then the bootstrapped system, and the final step after iterating one more time, and this is the performance you get with the Bulgarian, the Croatian, the Polish, and the Russian acoustic models. The take-away message is: if you use the related languages, the best system we can come up with has twenty-two point seven percent word error rate, which is very close to the baseline we get if we assume that twenty hours of transcriptions are given. So this is the supervised case, and this is the fully unsupervised case, for the related setup, the good case where you have related languages. And this little one here is the one for the non-related, resource-rich languages; the best number we got there is twenty-three point three. So we are still within range, and even though we are not as close as in the related case, I believe that the difference is marginal.
You may wonder about two things. One is: does this work for conversational speech? That is something we are currently trying; this here is more planned speech, and we are in the process of doing the same for conversational speech. The other question you may have is why the results are so good at all; please keep in mind that we only have twenty-three hours of training data in this setup and a rather low perplexity, because it is newspaper articles.
We of course wondered whether we would gain something by increasing the number of source languages, so we did the same thing, this time for Vietnamese, and we looked at the difference between using two, four, and six source languages. What you see here in the bars is the amount of data we could extract for transcription over the iterations, and what you see in the curves is the quality of the transcriptions. We found that the quality of the transcriptions seems to go slightly up if we use more languages, but above all we can derive more transcription data, so the amount of data we can extract is larger. And if you look at the performance: finally, with six languages, we get sixteen point eight percent on Vietnamese, which compares to fourteen point three if we build the system in a supervised fashion. Now the comparison is: if we have an expert who is a native speaker and knows how to build systems, he or she could do something in the range of eleven point eight. So the gain you get by doing language-specific things is enormous, I would say, and significantly larger than the gap between supervised and unsupervised. If you have the choice, having somebody who knows something about the language and does, for example, tuning and tweaking like tone modeling for Vietnamese, pitch features, using multi-syllables and so on, gives larger gains than the discrepancy between transcribed and untranscribed data.
Okay, so the second thing I would like to discuss: that was what we do if we have no transcriptions; now, what do you do if you have no pronunciation dictionaries? Again the idea is to leverage what we already have from other languages: assuming we have seen a lot of scripts and a lot of pronunciation rules, we would now like to generate the dictionary in a new language.
Let me first talk briefly about writing systems. First of all, one of course wonders how many languages are written at all: does it even make sense to look into languages that do not have a writing system? If you consider Omniglot, which is a site maintained by Simon Ager, it lists about seven hundred fifty languages which have a script, and he estimates that the true number of languages with a script is close to two thousand. That basically means that the majority of the remaining languages, out of roughly six thousand, do not have a written form. So it seems to be a rather serious issue.
If you look at the writing systems, there are different types. There are the logographic ones, where the characters of the writing system are based on semantic units: the graphemes represent meaning rather than sound; a typical example is the Chinese Hanzi. Then there are scripts which are called phonographic: here the graphemes represent sounds rather than meaning. And there are different forms of phonographic scripts: segmental ones, where one grapheme roughly corresponds to one sound, which is very convenient for building dictionaries; syllabic ones, where a grapheme may represent an entire syllable; and featural ones, like Korean, where one grapheme may represent an articulatory feature. If you look at the world, most parts of the world agree in a sense: there are big regions with Roman scripts, and another big chunk with Cyrillic script, and both are phonographic segmental scripts. This seems to be very promising for automatic pronunciation dictionary generation, and these are the ones we looked at first.
So we assume that there is a correlation, a relationship, between graphemes and sounds, which we call grapheme-to-phoneme or letter-to-sound. But we also know that there is a lot of variation: there are languages which have a very close grapheme-to-phoneme relationship, and then there are others, like English, which are more pathological. Sorry, I don't want to offend anybody.
Okay, so of course the first thing we wanted to look at is how much data we need to create a pronunciation dictionary. What you see here, for ten GlobalPhone dictionaries, is the phoneme error rate on the y-axis and the number of phonemes on the x-axis. What that means: we took pronunciations we had in our dictionary, we trained a G2P model on them with Sequitur from Max Bisani, then we used the G2P to generate more entries, and we compared the generated entries to the true entries in the dictionary; that gave us the phoneme error rate.
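The phoneme error rate behind such a plot is just a normalized Levenshtein distance between the generated and the reference pronunciation; a minimal sketch:

```python
# Minimal phoneme-error-rate computation: Levenshtein distance between
# the G2P output and the reference pronunciation, normalized by the
# reference length.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def phoneme_error_rate(hyp, ref):
    return levenshtein(hyp, ref) / len(ref)

# Toy word: one substitution out of three reference phonemes -> PER 1/3.
per = phoneme_error_rate(["g", "u", "d"], ["g", "U", "d"])
```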
So if you have, for example, five K, five thousand phonemes, that means we took roughly one thousand word pronunciations, and you have five thousand phonemes which you can learn on.
What we learned from this graph: first of all, the G2P quality depends on the language, which is not really surprising, we knew this already. We have this bunch down here, these are the Slavic languages, so you guys are really lucky; Spanish is also down there, then we have Portuguese and some others, and the worst case in our scenario is English. So there is a strong dependency on the language. The other thing we learned is that if you have a close G2P relationship, then five K phonemes might already be enough to build a decent G2P model, and it saturates roughly after fifteen K examples. But we also see that a language like German, for example, needs six times more examples than a language like Portuguese, so depending on the relationship you have to work harder to get examples in order to build a good G2P model.
Now, where do you get those examples from? The student who did this work basically looked into Wiktionary, which is a resource where you find pronunciations put in by volunteers, and checked how many languages are already covered in Wiktionary with at least one K pronunciations, that is one thousand words, which roughly corresponds to five thousand phonemes. We found thirty-seven languages which were covered; that was in twenty eleven, when he did this. He also found, if you look at the growth of Wiktionary entries over the years, this is twenty ten, this is twenty eleven, that there seems to be a lot going on: the community seems to have discovered it and keeps putting in more words, so the hope would be that Wiktionary will in the future cover both more languages and more words.
Now, of course, the question is how to use this. We built an interface where you can upload a vocabulary file; it then goes and searches Wiktionary for the entries it can find, uses all the entries it finds to train a G2P model, and then generates whatever you want to generate. It goes into the pages and looks for the IPA-based pronunciations. There is a lot of cleaning to do, because the pronunciations do not always refer to the word you are looking for: sometimes you look for one word and get the IPA of some related or linked word. So you have to do a lot of checking to make sure that you are not learning the wrong things.
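The kind of filtering involved could be sketched like this; the tab-separated page format and the headword check are invented for illustration, since real Wiktionary markup is much messier.

```python
import re

# Sketch of the filtering needed when harvesting IPA pronunciations from
# a Wiktionary-like source: keep only entries that actually belong to
# the headword being queried and that look like an /IPA/ transcription.
IPA_RE = re.compile(r"/([^/]+)/")

def extract_pronunciation(headword, page_lines):
    for line in page_lines:
        word, _, rest = line.partition("\t")
        if word != headword:      # IPA of some other, linked word --
            continue              # exactly the mismatch the talk warns about
        m = IPA_RE.search(rest)
        if m:
            return m.group(1)
    return None

page = ["whatsit\tIPA: /ˈwɒtsɪt/",   # wrong headword, must be skipped
        "what\tIPA: /wɒt/"]
```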
That is one of the reasons why the Wiktionary-based G2P performs worse than a G2P that starts from a good dictionary. This plot compares, for six languages, the G2P performance based on what we learned from Wiktionary with what we had already on GlobalPhone, and you can see, if you compare the same colors, that there are some significant distances. That is probably due partly to the errors in what we found on Wiktionary; the other thing is that there may be many contributors putting pronunciations into Wiktionary, which means you may have inconsistencies, and those are always an issue for G2P models.
Okay. If you have nothing, but you have a related language, you can also do a very brute-force, very simple approach, and that is what we tried with Ukrainian. We wanted to do Ukrainian, we had a student who was collecting data, and we said: okay, we already have Russian, Bulgarian, German, English, we have many languages, why don't you try to build the dictionary from what we already have? It is a very simple approach with four steps. In the first step you do a grapheme-to-grapheme mapping: you have Russian, and you map the Ukrainian graphemes to Russian graphemes. Once you have done this, in the second step you can apply the Russian grapheme-to-phoneme model. Then you have Russian phonemes, and you have to map them back, so to speak, with a Ukrainian mapping, so you see there is a lot of mapping going on. And then you do some post-processing.
If you do this for the different source languages, let's look at Russian, for example: he would first map the Ukrainian graphemes to Russian graphemes, which amounted to forty-three rules, so we had to come up with forty-three of them. Then he would run the G2P, and after that he would create fifty-six rules to convert the Russian phonemes back to Ukrainian phonemes, and then do some post-processing, which required another fifty-seven rules. So altogether we had to create about one hundred sixty rules, which can be done in a day or two.
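The four-step pipeline can be sketched as a chain of rule tables; all tables here are invented single-character toy rules, and the Russian G2P is stubbed out as an identity function rather than a trained model.

```python
# Toy version of the four-step Ukrainian-from-Russian pipeline:
# grapheme->grapheme mapping, Russian G2P, phoneme->phoneme mapping
# back, then post-processing.  All rule tables are made up.
G2G = {"і": "и", "и": "ы"}     # Ukrainian -> Russian graphemes (toy)
P2P = {"ы": "и"}               # Russian -> Ukrainian phonemes (toy)
POST = {"ɡ": "ɦ"}              # post-processing fix-ups (toy)

def russian_g2p(graphemes):
    """Stand-in for a trained Russian G2P model; here: identity."""
    return graphemes

def ukrainian_pronunciation(word):
    rus_graphemes = [G2G.get(c, c) for c in word]       # step 1
    rus_phones = russian_g2p(rus_graphemes)             # step 2
    ukr_phones = [P2P.get(p, p) for p in rus_phones]    # step 3
    return [POST.get(p, p) for p in ukr_phones]         # step 4
```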
What comes out is a dictionary, and when we plug this into the Ukrainian ASR it gives a twenty-one point six percent error rate. If we use the simplest approach, just using the graphemes for ASR, we get something like twenty-three point eight percent. If we ask the student to sit down and write rules by hand in order to create the dictionary, hand-crafted rules which piled up to about eight hundred, he gets twenty-two point four. In the process we wondered why the Russian-based one was better than the hand-crafted one, and found that we had some issues with our rules; we fixed the rules and then basically ended up with the same performance. So that means: if you don't have much time and you want to reduce the number of rules an expert has to write, then this brute-force method is probably an option. But that of course depends on having a reasonably related language, because if you do it with non-related languages you get significantly worse results; if we do it with English or German, it does not pan out.
Okay, so I have fifteen minutes. The other thing you can do if you don't have pronunciations is ask the crowd. We created a small toolkit which gives you a keyboard, and on the keyboard you have the IPA sounds; you have to get used to it, but you can press a symbol and listen to how it sounds. Then you get a word, you are asked to produce the pronunciation of this word, and once you are done you go to the next word. We fed this into Mechanical Turk and let it run for twelve days to find out whether this would give good results or not.
The first thing we learned: we ran it for twelve days and looked at how long people spend building pronunciations; the average time was fifty-three seconds. Most of the work, the majority, we had to reject. Out of these nineteen hundred pronunciations, more than fifty percent were not really useful, because we found that people would game the task: they would spend a second and give a single phoneme. So people were very fast but very sloppy, and we did not find a good way to give incentives for very good answers. We did this in parallel, with many people working on the same words, in order to find out whether we get good pronunciations, but it did not pan out nicely.
So the second thing we did was reach out to our friends and volunteers. We worked harder to improve the interface: we had a nice welcome page and a tutorial, we gave the people quality feedback so that they had an incentive to work harder, and we also put it out as a game, so people would compete with each other in a friendly manner. One thing we found: in the beginning they would spend a lot of time to get the words right, up to six minutes on one word, which I found quite high, and finally they would be down to roughly one and a half minutes per word. You can see here the number of users; it is really hard to keep the users on track, but certainly the median time per word went down significantly, and also the errors we found per session went down. But it is hard to keep people in, so getting really massive amounts of pronunciations out of crowdsourcing we found rather challenging.
Okay, so now I would like to talk about a new approach we are working on for the case where there is no writing system at all; that means you would like to build a system for languages which have no written form. This is sort of how we think of it. You have nothing; we assume the language is not written, even if that is not true, and you can ask someone to speak for you, say a sentence in Klingon. You can ask him to say "I am sick", and he says some phrase; I don't speak Klingon, but maybe some people do. Then you can ask him for "I am healthy", and he also produces a phrase. What people then do, maybe subconsciously, is try to identify sounds: we perceive what he says as a sequence of sounds. And if he says something like "shi drop" for the first and "shi pi" for the second, then we would probably assume that "shi" might be something like "I am", because this is the repeating part in what we asked him to say, and we could derive that "drop" seems to be the word for sick and "pi" seems to be the word for healthy.
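The intuition can be illustrated with a longest-common-prefix toy; the Klingon-ish phone strings are invented, and real systems of course use proper alignment rather than a shared prefix.

```python
# Toy version of the intuition: two elicited phone sequences that share
# a repeated part ("I am ...") let us hypothesize a word boundary.
def split_by_common_prefix(seq_a, seq_b):
    i = 0
    while i < min(len(seq_a), len(seq_b)) and seq_a[i] == seq_b[i]:
        i += 1
    shared = seq_a[:i]            # candidate word for the repeated phrase
    return shared, seq_a[i:], seq_b[i:]

shared, sick, healthy = split_by_common_prefix(
    ["jh", "i", "h", "r", "o", "p"],      # heard for "I am sick"
    ["jh", "i", "h", "p", "i", "ch"],     # heard for "I am healthy"
)
# shared -> hypothesized word for "I am"; the remainders -> sick/healthy
```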
The idea is to take this process that we are doing in our minds and put it on the machine.
There has already been substantial work done on this. The first work I found was from Laurent Besacier, who was using monolingual unsupervised segmentation in order to get from phone sequences to words. Then there was the work from Sebastian Stüker and Alex Waibel, who were using a cross-lingual word-to-word alignment based on GIZA++, and in a combination of these works the monolingual and the GIZA++ approaches were combined in order to find the phoneme-to-word alignments. Then Felix Stahlberg in our lab started using a cross-lingual word-to-phoneme alignment, for which he extended that approach; but let me first roughly explain how it works.
You basically have two sentences: a source-language sentence, in this case in German, and an English phone sequence which you derived from what the person was saying, so somewhere in there are the phones of words like "language". Ideally you would have these little word boundaries already in there, but of course, if you create a phone sequence with a phone recognizer, you may have a lot of errors, and you do not have any word boundaries. The question now is how you can get the alignment between the words and the corresponding phones.
What Felix did is extend the GIZA++ model: he uses the IBM Model 3 and changed it such that it can deal with the different lengths, putting in a word-length probability in order to counterbalance the fact that in a word-to-phoneme alignment you have way more phonemes to be aligned to a single word. The model he came up with basically skips the word-level lexical translation; it uses the fertility, it uses the distortion, but it puts in the word length, and then the words are sort of placeholders which are translated to the phone sequence via the lexical translation.
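To illustrate just the word-length idea (not Felix's actual Model 3P, which also has distortion and lexical translation), here is a toy Poisson length term with a hypothetical mean of four phones per word; it prefers balanced segmentations over degenerate ones.

```python
import math

# Toy word-length probability: each candidate word span is scored with a
# Poisson log-probability over its number of phones, so that absurdly
# long or short spans are penalized.  The mean of 4 phones per word is
# an assumed, illustrative value.
def length_log_prob(n_phones, mean=4.0):
    return n_phones * math.log(mean) - mean - math.lgamma(n_phones + 1)

def score_segmentation(segment_lengths):
    return sum(length_log_prob(n) for n in segment_lengths)

# A balanced segmentation of 8 phones into two words beats a skewed one:
balanced = score_segmentation([4, 4])
skewed = score_segmentation([7, 1])
```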
The first results he did were on BTEC English-Spanish. He compared the GIZA++ word-to-phoneme alignment to his Model 3P alignment, and you can see that consistently, independent of the phoneme error rate of the original phone sequence, he could outperform the GIZA++ word-to-phoneme alignment, and in terms of F-score he could also outperform the monolingual one. This is for the pair from English to Spanish, and this for the pair from Spanish to English. An F-score here of, for example, ninety percent means that he could get the segmentation, the word boundaries in the phone sequences, right with about ninety percent accuracy, which is quite good. That does not mean everything is solved, we are not anywhere near that, but it is something: we do have a correspondence between the phones and the words.
We could now think of the next step: taking the phone sequences and generating a pronunciation dictionary. And now we run into the problem which Lori already mentioned in her talk, that some words can have different meanings: for example, the German word "Sprache" can mean both speech and language. So what will happen is that you get a lot of different pronunciations: some sound like "speech", even though they may be noisy variants of it, but others are significantly different, and those are the "language" words. So the first thing we did is cluster all the occurring phone sequences of the different word meanings; we used DBSCAN clustering for that. If we are lucky, we end up with one cluster which represents everything that goes to "language", and another cluster which represents everything that goes to "speech".
Then, if you have several variants which sound like "speech", Felix would use the n-best variants within a cluster in order to generate one single result for "speech", and he would put those two results into the dictionary. Since we do not really have a written form, we cannot give it word labels, so in our case we just used numbers: there is one word which gets the ID one and has the pronunciation of "language", and there is another word which gets the ID two and has the pronunciation of "speech".
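The clustering step can be sketched with an edit-distance grouping; the talk used DBSCAN, while this greedy single-link version with toy, letter-based "pronunciations" is only meant to show the idea of separating the "speech"-like variants from the "language"-like ones.

```python
# Sketch of the clustering step: extracted pronunciation variants of one
# word ID are grouped by edit distance (DBSCAN in the real system), and
# a representative of each cluster would go into the dictionary.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def cluster_variants(variants, eps=3):
    """Greedy single-link grouping: join a variant to the first cluster
    containing a member within eps edits (eps tuned for this toy data)."""
    clusters = []
    for v in variants:
        for c in clusters:
            if any(edit_distance(v, m) <= eps for m in c):
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

variants = ["spich", "speech", "speach", "langwij", "langwidge"]
clusters = cluster_variants(variants)   # two groups: speech-like, language-like
```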
Okay, and we tried this out, not with BTEC: we moved to pronunciation extraction on Bible data, because Felix found a very nice page, Bible Gateway, which provides translations of the Bible in many different languages. He took fifteen translations in ten languages, and they even provide audio in English, Portuguese, and some other languages to play around with. What he did is verse-align the Bible texts, and he extracted more than thirty thousand verse alignments for training his approach.
and so what you see here is basically the distribution of the absolute phoneme errors in the extracted pronunciations: he compared the extracted pronunciations to the real word pronunciations. so, for example, this point here means that if you use the spanish bible to create english pronunciations, there are about three thousand nine hundred words, out of the roughly fourteen thousand which were extracted, which have no phoneme error at all, so they correspond exactly to what was found in the dictionary. in total the dictionary had something like fourteen thousand words. so this is one of the english bible translations, and this plot gives you the distributions of the phoneme errors across all the different source languages he started from to extract the pronunciations.
one thing we did in this case: in a first step, we assumed that the target phone sequence is the canonical pronunciation. that means this has not been done with a phone recognizer on audio, but under the assumption that the phone sequence is correct and only the word boundaries are unknown. so the next step is to redo this with a real phone recognizer and see how well it carries over.
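the comparison behind such a distribution can be sketched as an edit-distance histogram; the dictionaries below are toy stand-ins, not the bible data:

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def phoneme_error_histogram(extracted, reference):
    """Count how many extracted pronunciations have 0, 1, 2, ... phoneme
    errors against the reference dictionary."""
    hist = Counter()
    for word, pron in extracted.items():
        if word in reference:
            hist[edit_distance(pron, reference[word])] += 1
    return hist

# toy data standing in for the extracted and reference dictionaries
reference = {"speech":   ["s", "p", "i", "ch"],
             "language": ["l", "a", "ng", "g", "w", "i", "dj"]}
extracted = {"speech":   ["s", "p", "i", "ch"],            # exact: 0 errors
             "language": ["l", "a", "ng", "w", "i", "dj"]}  # one deletion
hist = phoneme_error_histogram(extracted, reference)
```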
okay, and then the next step, which we have not done yet: we now have transcribed audio, which is basically transcribed in word ids. we could create a dictionary based on word ids, or whatever you want to take here, and then you have a language model, and then we could put all these things together to build a speech recognition system based on data which has never been transcribed by a human. but this last step we haven't done yet.
so, four minutes left, three minutes left, okay. the third scenario i will run through really quickly. the question here is what can be done if there are not many linguistic experts, and i think everybody has seen this problem: you want to build a system in a language and there is simply nobody there who could help. lori was just talking about finding a korean speaker for transcribing.
so what we did: the work started several years ago, it was an NSF project actually, funded by mary harper, and i did it together with alan black. the idea was: how can you bridge the gap between the technology experts and the language experts? so we built a lot of web-based tools, which allow somebody who doesn't know anything, or not very much, about speech recognition or tts to work on the language by him- or herself. you basically have handrails which tell you what to do in the next step in order to build a system, and the person can go step by step through it to build the system.
we have done quite substantial work on this in the past. the system is used every year in a seminar, a multilingual seminar run across CMU and KIT, and we always adopt the languages of the students who take the course, so we have really built a lot of asr and tts systems in the respective languages, and whenever we learn something we plug it into the tool. we use this for education, but we also use it to ramp up resources and reach out to people when we don't have time at hand. for example, one time we wanted to build something in konkani, and there was only a single speaker of konkani at CMU, so he basically sent the web page to his people back in india and got a lot of speech data out of it, by being able to do the fieldwork virtually.
okay, so i think i should close here. i wanted to show you some solutions we are working on when we have few resources, and i tried to give one example each for the case where we have no transcripts, where we have no pronunciations, and where we have no system at all. i also wanted to stress that we keep working on our rapid language adaptation toolkit, in order to leverage fieldwork and also be able to do outreach to the community, and to finally bridge the gap between those people who know the technology and those people who know the language. thank you very much.
we have time for a few questions.
i just had a clarification question about your first study, where you had no transcriptions: how do you obtain the pronunciation dictionary and the other models?
let me just go back. the idea is that you have a dictionary in the target language, czech, so you have a czech dictionary given in the czech phone set. but the source recognizer you want to run on the data is in, let's say, polish. so what you need to do is take the polish phones and replace the czech phones in the dictionary with the polish phones. you basically map them in the czech dictionary, so you get the polish variant of a czech dictionary.
oh, i forgot to say, i'm sorry: through globalphone, everything we use is expressed in ipa, so doing such a mapping is rather straightforward. you can do this with maybe fifty mapping rules, i would say, nothing too fancy. the phonetic context cannot really be captured this way, but after the iterations the acoustic models get map-adapted anyway.
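a mapping along these lines could be sketched as follows; the specific ipa rules here are invented for illustration and are not the actual globalphone rules:

```python
# invented illustrative czech-to-polish rules; a real system would use
# on the order of fifty ipa-based mappings between the two inventories
cs_to_pl = {
    "r̝": "ʐ",    # czech fricative trill has no polish equivalent
    "ɦ": "x",    # voiced h approximated by the polish velar fricative
    "ɛː": "ɛ",   # czech long vowels shortened for polish
}

def map_lexicon(lexicon, rules):
    """Replace every czech phone that has a rule with its polish counterpart;
    phones shared by both inventories pass through unchanged."""
    return {word: [rules.get(p, p) for p in pron]
            for word, pron in lexicon.items()}

czech_lexicon = {"řeka": ["r̝", "ɛ", "k", "a"]}   # toy one-entry dictionary
polish_variant = map_lexicon(czech_lexicon, cs_to_pl)
```

the result is the "polish variant of a czech dictionary" mentioned above, ready to be used with a polish phone recognizer.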
so, for a recognizer for languages without a writing system, regarding the things you presented in the last part: what would be your recommendation for presenting the output? if there is no writing system in the target language, you probably cannot show the users just word ids, numerical ids. so what do you do?
so i think, if you have a system for a language which is not written at all, tts might be a good approach: you basically speak the output to the people so that they can listen to it. that would be one option, and then all you have to do, if you get word number one, is basically run tts on its phones. and the other thing, of course, the simple straightforward one, is to apply some sort of phoneme-to-grapheme mapping to get it back into a readable form. but that's all we could come up with. actually, i don't like the number idea either, so we have basically switched now to something with underscores, so that you can at least read it.
you could potentially then put translation or understanding on top of it, as another way to present it.
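such a phoneme-to-grapheme fallback can be sketched as a simple rule table; the rules and phone symbols below are invented for illustration:

```python
# invented phoneme-to-grapheme rules, for display purposes only
p2g = {"ʃ": "sh", "tʃ": "ch", "iː": "ee", "æ": "a"}

def spell(pron):
    """Render a phone sequence as a rough, readable letter string;
    phones without a rule pass through as-is."""
    return "".join(p2g.get(p, p) for p in pron)

# instead of showing the user a bare numeric word id, show a pseudo-spelling
display = spell(["s", "p", "iː", "tʃ"])
```

this gives the user something pronounceable to look at, even though it is not a real orthography.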
sure, i mean, what is there is that you have the word id and the target-word-to-source-word alignment; you could put this into a translation system and then translate it into a language that is properly written.
i'm more interested in the asr, as always. just one query: when you were doing the bible work and looking at the words and the error differences, could you actually see how many of the generated words are not real words? because you do have the list, right, and i didn't really catch that.
so, one thing is, for the english case, you can see that the number of generated words is roughly the number of real words, so it's not overgenerating and it's not undergenerating. but you can also see that swedish, for example, tends to undergenerate, and czech overgenerates, and what we found is that the generation is also related to how many word forms the source language has: swedish has very few and czech has many, so they undergenerate or overgenerate accordingly. and of course the name of the game would be to sort of keep this in balance, if we assume that it's roughly one to one.
does anyone else have a question? one last quick question then.
a practical question: say you have an application and you want to build a speech-recognition-based interface. looking at all these numbers, can you advise me whether what you are doing is the way to go, or should i just go transcribe audio and build the speech recognizer the usual way?
so, for the rapid language adaptation server, that's exactly what we were wondering. we are now doing error blaming on the components and trying to advise the user what is better: is it better to work on the pronunciations, or is it better to work on the audio? for the first hour, the result is usually that it's better to work on getting some audio data, one hour at least, and then on getting the pronunciations right. but it highly depends on how much data you have and where you are, so it's difficult to answer. still, i think generating an automatic answer from such an error blaming, to tell the user what to do next and how to best invest his or her time, is the right way to go.
a question for all three speakers this morning, for lori and for mary: i think the big elephant in the room is getting textual data for languages that are only spoken, and you touched on this a little bit. if you don't have web data and only have recorded conversational data, you have to make the transcripts, and in the programs we developed, by far the largest cost was always getting the text data and the translation data. acoustic modeling is, by comparison, relatively straightforward and cheap to do. so my question is: the hero here in the room is mary, building this gigantic database of conversational data, and i'm wondering how, as a community, we can make faster progress on that problem.
i've never heard a question addressed to a speaker who hasn't spoken yet. well, maybe this is something to think about, a good question we can come back to, so let's bring that up again later today. okay, let's thank our speaker one more time.