Thank you very much, thanks for having me here and giving me the chance to talk about our work.
So I would like to talk about how we can build systems with very low resources. And, very much as Lori also did, I would like to give a brief definition of what we think low-resource languages are. Surprisingly, if you ask linguists you get different answers, but the definitions roughly agree and say: an under-resourced language is a language which lacks electronic resources for speech and language processing, lacks presence on the web, lacks a writing system or a stable orthography, and may lack linguistic expertise. A language qualifies as under-resourced if just one of these criteria applies, but in many cases several of them apply. Some people call it under-resourced, some call it low-resource, or less-resourced, or whatever, but it all means the same thing.
However, what is really different is minority languages: a minority language is not necessarily low-resourced, and a low-resource language is not necessarily a minority language, so these have to be discriminated. We were trying to put together some definitions, so there will be a special issue of Speech Communication about under-resourced languages and how to build ASR systems for them, which I was lucky to put together with Laurent Besacier, Etienne Barnard, and Alexey Karpov. It will be out in January of next year, so twenty fourteen.
So the ideal case, I think for many people, is when you have plenty of resources. That means you know the phone set of the language, you have plenty of audio and the corresponding transcripts, you have the pronunciation rules and experts who build the pronunciation dictionaries for you, and you have plenty of text data. Then you can build ASR systems, NLP or machine translation systems, and also TTS systems for that matter. So that would be the ideal world.
However, the world is not ideal, and I was asked to talk about low resources, so: what do we do if we have very few or no data? I would like to present a few of the solutions we are currently working on. The first one addresses the case where you do not have any transcripts at all in the language you would like to work on. The second is when you do not have any pronunciation dictionary: how can you dig, for example, into the web to get pronunciations? The third is some work we are currently doing for the case where a language has no writing system at all. And then, what can you do if you do not have any linguistic expertise, so no language experts at hand?
For all the solutions I show today, my general underlying assumption is that I would like to leverage existing knowledge and data resources which I have gathered from many languages already. I believe that if you have seen a lot of things in many languages, it should help you to build the resources in the next language. So to me, the holy grail of rapid adaptation to new languages, and probably also to new domains, is this: you have plenty of data given in many different languages, you derive a sort of global inventory of models from it, and then, with a tiny bit of data or maybe next to zero data, you adapt to that particular language.
When I started my thesis with Alex Waibel, he pushed me very much: if you want to do something with many languages, it would be good to actually have many languages. So I started collecting a large corpus of data, the GlobalPhone corpus. I wanted something uniform, in the sense that it appears in many different languages but has the same content and the same style. I started doing this in ninety-five, and ever since we keep collecting data, so we have now accumulated twenty-one languages; for each language we have about twenty hours spoken by one hundred speakers. Altogether it is a small amount compared to what people use today, but it covers many languages. Over the years we have built recognizers in all these languages, so we have a pool of knowledge in the acoustics, in the dictionaries, and in the text, and I would like to show you how to leverage this in situations where we do not have any resources.
The underlying idea, when I started in my thesis, was to share acoustic models across languages in order to get a sort of global phone inventory; that was still Gaussian-mixture-model time, so pre-DNNs. At the time, and that is something that was also mentioned earlier, we used the IPA, the International Phonetic Alphabet, to share data across languages: whenever two phones have the same IPA representation, they are jointly trained in a language-independent fashion.
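The IPA-based sharing can be sketched as follows. This is a minimal sketch, not the actual GlobalPhone setup: the phone-to-IPA table and the data items are made-up toy examples.

```python
# Sketch of IPA-based acoustic model sharing: phones from different
# languages that map to the same IPA symbol are pooled, so their data
# can train one language-independent model.  The phone-to-IPA entries
# below are toy examples, not the real GlobalPhone mappings.
TO_IPA = {
    ("german",  "SH"): "ʃ",
    ("english", "SH"): "ʃ",
    ("english", "TH"): "θ",
    ("spanish", "S"):  "s",
    ("german",  "S"):  "s",
}

def pool_by_ipa(utterances):
    """Group (language, phone, data) triples by their shared IPA symbol."""
    pools = {}
    for lang, phone, data in utterances:
        ipa = TO_IPA[(lang, phone)]
        pools.setdefault(ipa, []).append(data)
    return pools

pools = pool_by_ipa([
    ("german", "SH", "d1"), ("english", "SH", "d2"), ("english", "TH", "d3"),
])
# "ʃ" now has training data pooled from both German and English
```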
One observation we made: this graph shows what happens when we do this for twelve languages, sharing based on IPA. The global phone inventory just keeps going straight up, and that was only twelve languages. So my expectation was that this would keep growing: the more languages we see, the more diversity we get, and we are not anywhere near an asymptotic behavior of that curve.
Then, as we heard yesterday in these wonderful talks from Morgan and others about the DNNs: what I am really excited about with the DNNs is that, for the very first time, they not only work well for bootstrapping new languages, but something happens which I had never seen before in acoustic modeling. Whenever I shared across languages, I would get worse performance on the training languages; I never managed to get better performance on the languages used for training, I only got slightly better performance on new languages. Now, with the DNNs, I really hope that for the very first time, if you train a multilingual network, you can even outperform the monolingual systems on the languages which are in the training set. That makes me very happy, and I think it gives hope for using them for bootstrapping to new languages as well.
So the first thing I would like to talk about is what can be done if there are no transcriptions at all: you have a new language, you do have, let's say, a pronunciation dictionary and a good amount of speech, but you do not have any transcriptions. For this experiment we took Czech, which does not mean that I consider Czech to be a low-resource language. The reason we took Czech is twofold: first, the conference was in the Czech Republic, so we thought this was an appealing idea; second, we had four other Slavic languages which are related to Czech, and we wanted to figure out whether we can gain from having related languages. So we had two sets of experiments. We took four Slavic languages, Croatian, Russian, Bulgarian, and Polish, in the hope that the relatedness might help for bootstrapping Czech. And then we took another situation, which is probably more realistic, namely that you already have ASR systems in resource-rich languages, that is English, French, German, and Spanish. We basically said: okay, we have the speech, we have a dictionary, and we have a language model in Czech; what we are lacking are the transcripts. And the idea is that we want to leverage the knowledge which we have already gained in many other languages.
The idea goes as follows. We take the source languages, in this case Polish, Croatian, Bulgarian, and Russian, and the recognizers we have in these languages. Now we take our target Czech dictionary and language model, which we assumed to be given, and we use a phone mapping to express the Czech dictionary in terms of Polish phones. It sounds odd, but it basically allows us to use the context-dependent acoustic models from our Polish ASR system. We do the very same thing for Croatian, Bulgarian, and Russian, so we create four different dictionaries, each expressed in the phone set of the respective source language: they are still Czech words, but expressed in Croatian phones, Russian phones, and so on.
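The dictionary mapping step can be sketched like this; the Czech-to-Polish phone table is an invented toy example, not the real IPA-derived mapping.

```python
# Sketch of expressing a target-language (Czech) dictionary in a source
# language's (Polish) phone set, so the source acoustic models can be
# reused directly.  The mapping table is a made-up toy example.
CS_TO_PL = {          # hypothetical Czech -> closest Polish phone
    "ř": "ż",
    "ě": "e",
    "a": "a",
    "h": "x",
    "p": "p",
    "r": "r",
}

def map_pronunciation(czech_phones, table=CS_TO_PL):
    """Replace every Czech phone by its closest Polish phone."""
    return [table[p] for p in czech_phones]

# The word stays a Czech word, but its pronunciation is now spelled with
# Polish phones and can be decoded by the Polish recognizer.
print(map_pronunciation(["p", "r", "a", "h", "a"]))   # ['p', 'r', 'a', 'x', 'a']
```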
Now we can run the Polish recognizer on Czech data, and it gives us Czech words; we have the Russian dictionary and the Russian ASR, the Polish one, and so on, so basically we get four hypotheses of the Czech words.
Then what we do is leverage a confidence score which had been proposed by Thomas Kemp and Thomas Schaaf, called A-stabil. We basically calculate how often the same word appears in the hypotheses from the different recognizers: this is the Czech hypothesis from the Polish recognizer, this is the Croatian one, the Bulgarian, the Russian. Whenever two or more languages agree on the same word, we assume this word to be correctly identified, and it can then be used for training. So the whole idea is to leverage the vote across the different languages. The reason why we believe this works quite well: if you take the usual confidence score and apply it with just one language, we know that the confidence score is rather unreliable. This plot shows the word error rate over the confidence score threshold, and with just one language, where you really start out with very poor performance, even a very good confidence score is not very reliable. But if you do this for several languages, so that at least two languages have to agree on the same words at the same position in the hypothesis, then you get a much more reliable behavior of the confidence score. And we learned that a threshold of roughly one divided by N, plus an offset, where N is the number of languages applied, is a good value.
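A very simplified stand-in for the multilingual voting idea might look like this. The real A-stabil measure works on aligned hypotheses and lattices; this toy version just counts in how many of the N hypotheses a word occurs and applies the 1/N-plus-offset threshold from the talk.

```python
def multilingual_vote(hypotheses, offset=0.2):
    """Toy multilingual agreement score: a word is trusted if enough of
    the N source-language recognizers agree on it.  This ignores word
    order and alignment, which the real measure does not."""
    n = len(hypotheses)
    threshold = 1.0 / n + offset          # the 1/N + offset rule
    counts = {}
    for hyp in hypotheses:
        for word in set(hyp):             # one vote per recognizer
            counts[word] = counts.get(word, 0) + 1
    return {w for w, c in counts.items() if c / n >= threshold}

# Four source-language hypotheses of the same Czech utterance (toy data):
hyps = [["praha", "je", "krasna"],
        ["praha", "je", "krasna"],
        ["praha", "city"],
        ["praha", "je"]]
trusted = multilingual_vote(hyps)   # words with enough cross-lingual votes
```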
The whole framework then works as follows: you create your source recognizers with the mapped dictionaries, you run them on your Czech audio data to produce transcriptions, and once you have the transcriptions, you use the multilingual A-stabil score to find which parts of the transcriptions match. Then you adapt the acoustic models of your source recognizers using the transcriptions you just derived: the Polish recognizer gets tuned towards Czech, the Russian recognizer gets tuned towards Czech, the Croatian one, and so on. Then you throw everything away, keep only the adapted acoustic models, and iterate. So we iterate the whole process, always throwing the transcriptions away, and once we have a certain amount of automatically transcribed data this way, we use a multilingual phone inventory to bootstrap the Czech system.
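The iterative loop above can be sketched schematically; `decode`, `vote`, and `adapt` are placeholder callables standing in for the real recognizer components, not an actual implementation.

```python
# Schematic of the iterative bootstrapping loop: decode the untranscribed
# audio with every source recognizer, keep only words on which enough
# recognizers agree, adapt each recognizer with those words, and repeat.
def bootstrap(recognizers, audio, decode, vote, adapt, iterations=3):
    for _ in range(iterations):
        hypotheses = [decode(r, audio) for r in recognizers]
        trusted = vote(hypotheses)               # e.g. multilingual A-stabil
        recognizers = [adapt(r, audio, trusted)  # tune every source system
                       for r in recognizers]     # towards the target language
        # the transcriptions are thrown away; only adapted models are kept
    return recognizers
```

With trivial stub functions this just runs the adapt step three times per recognizer, which is all the sketch is meant to show.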
We were quite happy with the performance we saw. What you see here is the first iteration, the second, the third, then the bootstrapped system, and the final step after iterating one more time, and this is the performance you get with the Bulgarian, the Croatian, the Polish, and the Russian acoustic models. The take-away message is: if you use the related languages, the best system we can come up with has twenty-two point seven percent word error rate, which is very close to the baseline we get if we assume that twenty hours of transcriptions are given. So this is the supervised case, and this is the fully unsupervised case, for the related setup, the good case where you have related languages. And this little one here is the one for the non-related, resource-rich languages; the best number we got there is twenty-three point three. So we are still within range, and even though we are not as close as in the related case, I believe that the difference is marginal.
You may wonder about two things. One is: does this work for conversational speech? That is something we are currently trying; this here is more planned speech, and we are in the process of doing the same for conversational speech. The other question you may have is why the results are so good at all; please keep in mind that we only have twenty-three hours of training data in this setup and a rather low perplexity, because it is newspaper articles.
We of course wondered whether we would gain something by increasing the number of source languages, so we did the same thing, this time for Vietnamese, and we looked at the difference between using two, four, and six source languages. What you see here in the bars is the amount of data we could extract for transcription over the iterations, and what you see in the curves is the quality of the transcriptions. We found that the quality of the transcriptions seems to go slightly up if we use more languages, but above all we can derive more transcription data, so the amount of data we can extract is larger. And if you look at the performance: finally, with six languages, we get sixteen point eight percent on Vietnamese, which compares to fourteen point three if we build the system in a supervised fashion. Now the comparison is: if we have an expert who is a native speaker and knows how to build systems, he or she could do something in the range of eleven point eight. So the gain you get by doing language-specific things is enormous, I would say, and significantly larger than the gap between supervised and unsupervised. If you have the choice, having somebody who knows something about the language and does, for example, tuning and tweaking like tone modeling for Vietnamese, pitch features, using multi-syllables and so on, gives larger gains than the discrepancy between transcribed and untranscribed data.
Okay, so the second thing I would like to discuss: that was what we do if we have no transcriptions; now, what do you do if you have no pronunciation dictionaries? Again the idea is to leverage what we already have from other languages: assuming we have seen a lot of scripts and a lot of pronunciation rules, we would now like to generate the dictionary in a new language.
Let me first talk briefly about writing systems. First of all, one of course wonders how many languages are written at all: does it even make sense to look into languages that do not have a writing system? If you consider Omniglot, which is a site maintained by Simon Ager, it lists about seven hundred fifty languages which have a script, and he estimates that the true number of languages with a script is close to two thousand. That basically means that the majority of the remaining languages, out of roughly six thousand, do not have a written form. So it seems to be a rather serious issue.
If you look at the writing systems, there are different types. There are the logographic ones, where the characters of the writing system are based on semantic units: the graphemes represent meaning rather than sound; a typical example is the Chinese Hanzi. Then there are scripts which are called phonographic: here the graphemes represent sounds rather than meaning. And there are different forms of phonographic scripts: segmental ones, where one grapheme roughly corresponds to one sound, which is very convenient for building dictionaries; syllabic ones, where a grapheme may represent an entire syllable; and featural ones, like Korean, where one grapheme may represent an articulatory feature. If you look at the world, most parts of the world agree in a sense: there are big regions with Roman scripts, and another big chunk with Cyrillic script, and both are phonographic segmental scripts. This seems to be very promising for automatic pronunciation dictionary generation, and these are the ones we looked at first.
So we assume that there is a correlation, a relationship, between graphemes and sounds, which we call grapheme-to-phoneme or letter-to-sound. But we also know that there is a lot of variation: there are languages which have a very close grapheme-to-phoneme relationship, and then there are others, like English, which are more pathological. Sorry, I don't want to offend anybody.
Okay, so of course the first thing we wanted to look at is how much data we need to create a pronunciation dictionary. What you see here, for ten GlobalPhone dictionaries, is the phoneme error rate on the y-axis and the number of phonemes on the x-axis. What that means: we took pronunciations we had in our dictionary, we trained a G2P model on them with Sequitur from Max Bisani, then we used the G2P to generate more entries, and we compared the generated entries to the true entries in the dictionary; that gave us the phoneme error rate.
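The phoneme error rate behind such a plot is just a normalized Levenshtein distance between the generated and the reference pronunciation; a minimal sketch:

```python
# Minimal phoneme-error-rate computation: Levenshtein distance between
# the G2P output and the reference pronunciation, normalized by the
# reference length.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def phoneme_error_rate(hyp, ref):
    return levenshtein(hyp, ref) / len(ref)

# Toy word: one substitution out of three reference phonemes -> PER 1/3.
per = phoneme_error_rate(["g", "u", "d"], ["g", "U", "d"])
```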
So if you have, for example, five K, five thousand phonemes, that means we took roughly one thousand word pronunciations, and you have five thousand phonemes which you can learn on.
What we learned from this graph: first of all, the G2P quality depends on the language, which is not really surprising, we knew this already. We have this bunch down here, these are the Slavic languages, so you guys are really lucky; Spanish is also down there, then we have Portuguese and some others, and the worst case in our scenario is English. So there is a strong dependency on the language. The other thing we learned is that if you have a close G2P relationship, then five K phonemes might already be enough to build a decent G2P model, and it saturates roughly after fifteen K examples. But we also see that a language like German, for example, needs six times more examples than a language like Portuguese, so depending on the relationship you have to work harder to get examples in order to build a good G2P model.
Now, where do you get those examples from? The student who did this work basically looked into Wiktionary, which is a resource where you find pronunciations put in by volunteers, and checked how many languages are already covered in Wiktionary with at least one K pronunciations, that is one thousand words, which roughly corresponds to five thousand phonemes. We found thirty-seven languages which were covered; that was in twenty eleven, when he did this. He also found, if you look at the growth of Wiktionary entries over the years, this is twenty ten, this is twenty eleven, that there seems to be a lot going on: the community seems to have discovered it and keeps putting in more words, so the hope would be that Wiktionary will in the future cover both more languages and more words.
Now, of course, the question is how to use this. We built an interface where you can upload a vocabulary file; it then goes and searches Wiktionary for the entries it can find, uses all the entries it finds to train a G2P model, and then generates whatever you want to generate. It goes into the pages and looks for the IPA-based pronunciations. There is a lot of cleaning to do, because the pronunciations do not always refer to the word you are looking for: sometimes you look for one word and get the IPA of some related or linked word. So you have to do a lot of checking to make sure that you are not learning the wrong things.
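The kind of filtering involved could be sketched like this; the tab-separated page format and the headword check are invented for illustration, since real Wiktionary markup is much messier.

```python
import re

# Sketch of the filtering needed when harvesting IPA pronunciations from
# a Wiktionary-like source: keep only entries that actually belong to
# the headword being queried and that look like an /IPA/ transcription.
IPA_RE = re.compile(r"/([^/]+)/")

def extract_pronunciation(headword, page_lines):
    for line in page_lines:
        word, _, rest = line.partition("\t")
        if word != headword:      # IPA of some other, linked word --
            continue              # exactly the mismatch the talk warns about
        m = IPA_RE.search(rest)
        if m:
            return m.group(1)
    return None

page = ["whatsit\tIPA: /ˈwɒtsɪt/",   # wrong headword, must be skipped
        "what\tIPA: /wɒt/"]
```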
That is one of the reasons why the Wiktionary-based G2P performs worse than a G2P that starts from a good dictionary. This plot compares, for six languages, the G2P performance based on what we learned from Wiktionary with what we had already on GlobalPhone, and you can see, if you compare the same colors, that there are some significant distances. That is probably due partly to the errors in what we found on Wiktionary; the other thing is that there may be many contributors putting pronunciations into Wiktionary, which means you may have inconsistencies, and those are always an issue for G2P models.
Okay. If you have nothing, but you have a related language, you can also do a very brute-force, very simple approach, and that is what we tried with Ukrainian. We wanted to do Ukrainian, we had a student who was collecting data, and we said: okay, we already have Russian, Bulgarian, German, English, we have many languages, why don't you try to build the dictionary from what we already have? It is a very simple approach with four steps. In the first step you do a grapheme-to-grapheme mapping: you have Russian, and you map the Ukrainian graphemes to Russian graphemes. Once you have done this, in the second step you can apply the Russian grapheme-to-phoneme model. Then you have Russian phonemes, and you have to map them back, so to speak, with a Ukrainian mapping, so you see there is a lot of mapping going on. And then you do some post-processing.
If you do this for the different source languages, let's look at Russian, for example: he would first map the Ukrainian graphemes to Russian graphemes, which amounted to forty-three rules, so we had to come up with forty-three of them. Then he would run the G2P, and after that he would create fifty-six rules to convert the Russian phonemes back to Ukrainian phonemes, and then do some post-processing, which required another fifty-seven rules. So altogether we had to create about one hundred sixty rules, which can be done in a day or two.
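The four-step pipeline can be sketched as a chain of rule tables; all tables here are invented single-character toy rules, and the Russian G2P is stubbed out as an identity function rather than a trained model.

```python
# Toy version of the four-step Ukrainian-from-Russian pipeline:
# grapheme->grapheme mapping, Russian G2P, phoneme->phoneme mapping
# back, then post-processing.  All rule tables are made up.
G2G = {"і": "и", "и": "ы"}     # Ukrainian -> Russian graphemes (toy)
P2P = {"ы": "и"}               # Russian -> Ukrainian phonemes (toy)
POST = {"ɡ": "ɦ"}              # post-processing fix-ups (toy)

def russian_g2p(graphemes):
    """Stand-in for a trained Russian G2P model; here: identity."""
    return graphemes

def ukrainian_pronunciation(word):
    rus_graphemes = [G2G.get(c, c) for c in word]       # step 1
    rus_phones = russian_g2p(rus_graphemes)             # step 2
    ukr_phones = [P2P.get(p, p) for p in rus_phones]    # step 3
    return [POST.get(p, p) for p in ukr_phones]         # step 4
```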
What comes out is a dictionary, and when we plug this into the Ukrainian ASR it gives a twenty-one point six percent error rate. If we use the simplest approach, just using the graphemes for ASR, we get something like twenty-three point eight percent. If we ask the student to sit down and write rules by hand in order to create the dictionary, hand-crafted rules which piled up to about eight hundred, he gets twenty-two point four. In the process we wondered why the Russian-based one was better than the hand-crafted one, and found that we had some issues with our rules; we fixed the rules and then basically ended up with the same performance. So that means: if you don't have much time and you want to reduce the number of rules an expert has to write, then this brute-force method is probably an option. But that of course depends on having a reasonably related language, because if you do it with non-related languages you get significantly worse results; if we do it with English or German, it does not pan out.
Okay, so I have fifteen minutes. The other thing you can do if you don't have pronunciations is ask the crowd. We created a small toolkit which gives you a keyboard, and on the keyboard you have the IPA sounds; you have to get used to it, but you can press a symbol and listen to how it sounds. Then you get a word, you are asked to produce the pronunciation of this word, and once you are done you go to the next word. We fed this into Mechanical Turk and let it run for twelve days to find out whether this would give good results or not.
The first thing we learned: we ran it for twelve days and looked at how long people spend building pronunciations; the average time was fifty-three seconds. Most of the work, the majority, we had to reject. Out of these nineteen hundred pronunciations, more than fifty percent were not really useful, because we found that people would game the task: they would spend a second and give a single phoneme. So people were very fast but very sloppy, and we did not find a good way to give incentives for very good answers. We did this in parallel, with many people working on the same words, in order to find out whether we get good pronunciations, but it did not pan out nicely.
So the second thing we did was reach out to our friends and volunteers. We worked harder to improve the interface: we had a nice welcome page and a tutorial, we gave the people quality feedback so that they had an incentive to work harder, and we also put it out as a game, so people would compete with each other in a friendly manner. One thing we found: in the beginning they would spend a lot of time to get the words right, up to six minutes on one word, which I found quite high, and finally they would be down to roughly one and a half minutes per word. You can see here the number of users; it is really hard to keep the users on track, but certainly the median time per word went down significantly, and also the errors we found per session went down. But it is hard to keep people in, so getting really massive amounts of pronunciations out of crowdsourcing we found rather challenging.
Okay, so now I would like to talk about a new approach we are working on for the case where there is no writing system at all; that means you would like to build a system for languages which have no written form. This is sort of how we think of it. You have nothing; we assume the language is not written, even if that is not true, and you can ask someone to speak for you, say a sentence in Klingon. You can ask him to say "I am sick", and he says some phrase; I don't speak Klingon, but maybe some people do. Then you can ask him for "I am healthy", and he also produces a phrase. What people then do, maybe subconsciously, is try to identify sounds: we perceive what he says as a sequence of sounds. And if he says something like "shi drop" for the first and "shi pi" for the second, then we would probably assume that "shi" might be something like "I am", because this is the repeating part in what we asked him to say, and we could derive that "drop" seems to be the word for sick and "pi" seems to be the word for healthy.
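The intuition can be illustrated with a longest-common-prefix toy; the Klingon-ish phone strings are invented, and real systems of course use proper alignment rather than a shared prefix.

```python
# Toy version of the intuition: two elicited phone sequences that share
# a repeated part ("I am ...") let us hypothesize a word boundary.
def split_by_common_prefix(seq_a, seq_b):
    i = 0
    while i < min(len(seq_a), len(seq_b)) and seq_a[i] == seq_b[i]:
        i += 1
    shared = seq_a[:i]            # candidate word for the repeated phrase
    return shared, seq_a[i:], seq_b[i:]

shared, sick, healthy = split_by_common_prefix(
    ["jh", "i", "h", "r", "o", "p"],      # heard for "I am sick"
    ["jh", "i", "h", "p", "i", "ch"],     # heard for "I am healthy"
)
# shared -> hypothesized word for "I am"; the remainders -> sick/healthy
```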
The idea is to take this process that we are doing in our minds and put it on the machine.
There has already been substantial work done on this. The first work I found was from Laurent Besacier, who was using monolingual unsupervised segmentation in order to get from phone sequences to words. Then there was the work from Sebastian Stüker and Alex Waibel, who were using a cross-lingual word-to-word alignment based on GIZA++, and in a combination of these works the monolingual and the GIZA++ approaches were combined in order to find the phoneme-to-word alignments. Then Felix Stahlberg in our lab started using a cross-lingual word-to-phoneme alignment, for which he extended that approach; but let me first roughly explain how it works.
You basically have two sentences: a source-language sentence, in this case in German, and an English phone sequence which you derived from what the person was saying, so somewhere in there are the phones of words like "language". Ideally you would have these little word boundaries already in there, but of course, if you create a phone sequence with a phone recognizer, you may have a lot of errors, and you do not have any word boundaries. The question now is how you can get the alignment between the words and the corresponding phones.
What Felix did is extend the GIZA++ model: he uses the IBM Model 3 and changed it such that it can deal with the different lengths, putting in a word-length probability in order to counterbalance the fact that in a word-to-phoneme alignment you have way more phonemes to be aligned to a single word. The model he came up with basically skips the word-level lexical translation; it uses the fertility, it uses the distortion, but it puts in the word length, and then the words are sort of placeholders which are translated to the phone sequence via the lexical translation.
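To illustrate just the word-length idea (not Felix's actual Model 3P, which also has distortion and lexical translation), here is a toy Poisson length term with a hypothetical mean of four phones per word; it prefers balanced segmentations over degenerate ones.

```python
import math

# Toy word-length probability: each candidate word span is scored with a
# Poisson log-probability over its number of phones, so that absurdly
# long or short spans are penalized.  The mean of 4 phones per word is
# an assumed, illustrative value.
def length_log_prob(n_phones, mean=4.0):
    return n_phones * math.log(mean) - mean - math.lgamma(n_phones + 1)

def score_segmentation(segment_lengths):
    return sum(length_log_prob(n) for n in segment_lengths)

# A balanced segmentation of 8 phones into two words beats a skewed one:
balanced = score_segmentation([4, 4])
skewed = score_segmentation([7, 1])
```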
The first results he did were on BTEC English-Spanish. He compared the GIZA++ word-to-phoneme alignment to his Model 3P alignment, and you can see that consistently, independent of the phoneme error rate of the original phone sequence, he could outperform the GIZA++ word-to-phoneme alignment, and in terms of F-score he could also outperform the monolingual one. This is for the pair from English to Spanish, and this for the pair from Spanish to English. An F-score here of, for example, ninety percent means that he could get the segmentation, the word boundaries in the phone sequences, right with about ninety percent accuracy, which is quite good. That does not mean everything is solved, we are not anywhere near that, but it is something: we do have a correspondence between the phones and the words.
We could now think of the next step: taking the phone sequences and generating a pronunciation dictionary. And now we run into the problem which Lori already mentioned in her talk, that some words can have different meanings: for example, the German word "Sprache" can mean both speech and language. So what will happen is that you get a lot of different pronunciations: some sound like "speech", even though they may be noisy variants of it, but others are significantly different, and those are the "language" words. So the first thing we did is cluster all the occurring phone sequences of the different word meanings; we used DBSCAN clustering for that. If we are lucky, we end up with one cluster which represents everything that goes to "language", and another cluster which represents everything that goes to "speech".
Then, if you have several variants which sound like "speech", Felix would use the n-best variants within a cluster in order to generate one single result for "speech", and he would put those two results into the dictionary. Since we do not really have a written form, we cannot give it word labels, so in our case we just used numbers: there is one word which gets the ID one and has the pronunciation of "language", and there is another word which gets the ID two and has the pronunciation of "speech".
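The clustering step can be sketched with an edit-distance grouping; the talk used DBSCAN, while this greedy single-link version with toy, letter-based "pronunciations" is only meant to show the idea of separating the "speech"-like variants from the "language"-like ones.

```python
# Sketch of the clustering step: extracted pronunciation variants of one
# word ID are grouped by edit distance (DBSCAN in the real system), and
# a representative of each cluster would go into the dictionary.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def cluster_variants(variants, eps=3):
    """Greedy single-link grouping: join a variant to the first cluster
    containing a member within eps edits (eps tuned for this toy data)."""
    clusters = []
    for v in variants:
        for c in clusters:
            if any(edit_distance(v, m) <= eps for m in c):
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

variants = ["spich", "speech", "speach", "langwij", "langwidge"]
clusters = cluster_variants(variants)   # two groups: speech-like, language-like
```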
Okay, and we tried this out, not with BTEC: we moved to pronunciation extraction on Bible data, because Felix found a very nice page, Bible Gateway, which provides translations of the Bible in many different languages. He took fifteen translations in ten languages, and they even provide audio in English, Portuguese, and some other languages to play around with. What he did is verse-align the Bible texts, and he extracted more than thirty thousand verse alignments for training his approach.
and so what you see here is basically the distribution of the absolute phoneme errors in the extracted pronunciations: he compared the extracted pronunciations to the real word pronunciations. so, for example, this point here means that if you use the spanish bible to create english pronunciations, there are about three thousand nine hundred words, out of the roughly fourteen thousand which were extracted, which have no phoneme error at all, so they correspond exactly to what was found in the dictionary. in total the dictionary had something like fourteen thousand words. so this is one of the english bible translations, and this plot gives you the distributions of the phoneme errors across all the different source languages he started from to extract the pronunciations.
one thing we did in this case: in a first step, we assumed that the target phone sequence is the canonical pronunciation. that means this has not been done with a phone recognizer on audio, but under the assumption that the phone sequence is correct and only the word boundaries are unknown. so the next step is to redo this with a real phone recognizer and see how well it carries over.
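the comparison behind such a distribution can be sketched as an edit-distance histogram; the dictionaries below are toy stand-ins, not the bible data:

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def phoneme_error_histogram(extracted, reference):
    """Count how many extracted pronunciations have 0, 1, 2, ... phoneme
    errors against the reference dictionary."""
    hist = Counter()
    for word, pron in extracted.items():
        if word in reference:
            hist[edit_distance(pron, reference[word])] += 1
    return hist

# toy data standing in for the extracted and reference dictionaries
reference = {"speech":   ["s", "p", "i", "ch"],
             "language": ["l", "a", "ng", "g", "w", "i", "dj"]}
extracted = {"speech":   ["s", "p", "i", "ch"],            # exact: 0 errors
             "language": ["l", "a", "ng", "w", "i", "dj"]}  # one deletion
hist = phoneme_error_histogram(extracted, reference)
```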
okay, and then the next step, which we have not done yet: we now have transcribed audio, which is basically transcribed in word ids. we could create a dictionary based on word ids, or whatever you want to take here, and then you have a language model, and then we could put all these things together to build a speech recognition system based on data which has never been transcribed by a human. but this last step we haven't done yet.
so, four minutes left, three minutes left, okay. the third scenario i will run through really quickly. the question here is what can be done if there are not many linguistic experts, and i think everybody has seen this problem: you want to build a system in a language and there is simply nobody there who could help. lori was just talking about finding a korean speaker for transcribing.
so what we did: the work started several years ago, it was an NSF project actually, funded by mary harper, and i did it together with alan black. the idea was: how can you bridge the gap between the technology experts and the language experts? so we built a lot of web-based tools, which allow somebody who doesn't know anything, or not very much, about speech recognition or tts to work on the language by him- or herself. you basically have handrails which tell you what to do in the next step in order to build a system, and the person can go step by step through it to build the system.
we have done quite substantial work on this in the past. the system is used every year in a seminar, a multilingual seminar run across CMU and KIT, and we always adopt the languages of the students who take the course, so we have really built a lot of asr and tts systems in the respective languages, and whenever we learn something we plug it into the tool. we use this for education, but we also use it to ramp up resources and reach out to people when we don't have time at hand. for example, one time we wanted to build something in konkani, and there was only a single speaker of konkani at CMU, so he basically sent the web page to his people back in india and got a lot of speech data out of it, by being able to do the fieldwork virtually.
okay, so i think i should close here. i wanted to show you some solutions we are working on when we have few resources, and i tried to give one example each for the case where we have no transcripts, where we have no pronunciations, and where we have no system at all. i also wanted to stress that we keep working on our rapid language adaptation toolkit, in order to leverage fieldwork and also be able to do outreach to the community, and to finally bridge the gap between those people who know the technology and those people who know the language. thank you very much.
we have time for a few questions.
i just had a clarification question about your first study, where you had no transcriptions: how do you obtain the pronunciation dictionary and the other models?
let me just go back. the idea is that you have a dictionary in the target language, czech, so you have a czech dictionary given in the czech phone set. but the source recognizer you want to run on the data is in, let's say, polish. so what you need to do is take the polish phones and replace the czech phones in the dictionary with the polish phones. you basically map them in the czech dictionary, so you get the polish variant of a czech dictionary.
oh, i forgot to say, i'm sorry: through globalphone, everything we use is expressed in ipa, so doing such a mapping is rather straightforward. you can do this with maybe fifty mapping rules, i would say, nothing too fancy. the phonetic context cannot really be captured this way, but after the iterations the acoustic models get map-adapted anyway.
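a mapping along these lines could be sketched as follows; the specific ipa rules here are invented for illustration and are not the actual globalphone rules:

```python
# invented illustrative czech-to-polish rules; a real system would use
# on the order of fifty ipa-based mappings between the two inventories
cs_to_pl = {
    "r̝": "ʐ",    # czech fricative trill has no polish equivalent
    "ɦ": "x",    # voiced h approximated by the polish velar fricative
    "ɛː": "ɛ",   # czech long vowels shortened for polish
}

def map_lexicon(lexicon, rules):
    """Replace every czech phone that has a rule with its polish counterpart;
    phones shared by both inventories pass through unchanged."""
    return {word: [rules.get(p, p) for p in pron]
            for word, pron in lexicon.items()}

czech_lexicon = {"řeka": ["r̝", "ɛ", "k", "a"]}   # toy one-entry dictionary
polish_variant = map_lexicon(czech_lexicon, cs_to_pl)
```

the result is the "polish variant of a czech dictionary" mentioned above, ready to be used with a polish phone recognizer.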
so, for a recognizer for languages without a writing system, regarding the things you presented in the last part: what would be your recommendation for presenting the output? if there is no writing system in the target language, you probably cannot show the users just word ids, numerical ids. so what do you do?
so i think, if you have a system for a language which is not written at all, tts might be a good approach: you basically speak the output to the people so that they can listen to it. that would be one option, and then all you have to do, if you get word number one, is basically run tts on its phones. and the other thing, of course, the simple straightforward one, is to apply some sort of phoneme-to-grapheme mapping to get it back into a readable form. but that's all we could come up with. actually, i don't like the number idea either, so we have basically switched now to something with underscores, so that you can at least read it.
you could potentially then put translation or understanding on top of it, as another way to present it.
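such a phoneme-to-grapheme fallback can be sketched as a simple rule table; the rules and phone symbols below are invented for illustration:

```python
# invented phoneme-to-grapheme rules, for display purposes only
p2g = {"ʃ": "sh", "tʃ": "ch", "iː": "ee", "æ": "a"}

def spell(pron):
    """Render a phone sequence as a rough, readable letter string;
    phones without a rule pass through as-is."""
    return "".join(p2g.get(p, p) for p in pron)

# instead of showing the user a bare numeric word id, show a pseudo-spelling
display = spell(["s", "p", "iː", "tʃ"])
```

this gives the user something pronounceable to look at, even though it is not a real orthography.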
sure, i mean, what is there is that you have the word id and the target-word-to-source-word alignment; you could put this into a translation system and then translate it into a language that is properly written.
i'm more interested in the asr, as always. just one query: when you were doing the bible work and looking at the words and the error differences, could you actually see how many of the generated words are not real words? because you do have the list, right, and i didn't really catch that.
so, one thing is, for the english case, you can see that the number of generated words is roughly the number of real words, so it's not overgenerating and it's not undergenerating. but you can also see that swedish, for example, tends to undergenerate, and czech overgenerates, and what we found is that the generation is also related to how many word forms the source language has: swedish has very few and czech has many, so they undergenerate or overgenerate accordingly. and of course the name of the game would be to sort of keep this in balance, if we assume that it's roughly one to one.
does anyone else have a question? one last quick question then.
a practical question: say you have an application and you want to build a speech-recognition-based interface. looking at all these numbers, can you advise me whether what you are doing is the way to go, or should i just go transcribe audio and build the speech recognizer the usual way?
so, for the rapid language adaptation server, that's exactly what we were wondering. we are now doing error blaming on the components and trying to advise the user what is better: is it better to work on the pronunciations, or is it better to work on the audio? for the first hour, the result is usually that it's better to work on getting some audio data, one hour at least, and then on getting the pronunciations right. but it highly depends on how much data you have and where you are, so it's difficult to answer. still, i think generating an automatic answer from such an error blaming, to tell the user what to do next and how to best invest his or her time, is the right way to go.
a question for all three speakers this morning, for lori and for mary: i think the big elephant in the room is getting textual data for languages that are only spoken, and you touched on this a little bit. if you don't have web data and only have recorded conversational data, you have to make the transcripts, and in the programs we developed, by far the largest cost was always getting the text data and the translation data. acoustic modeling is, by comparison, relatively straightforward and cheap to do. so my question is: the hero here in the room is mary, building this gigantic database of conversational data, and i'm wondering how, as a community, we can make faster progress on that problem.
i've never heard a question addressed to a speaker who hasn't spoken yet. well, maybe this is something to think about, a good question we can come back to, so let's bring that up again later today. okay, let's thank our speaker one more time.