0:00:23 Thank you very much, thanks for having me here and giving me the chance to talk about our work. I would like to talk about how we can build systems with very low resources, and, very much as Lori also did, I would like to give a brief definition of what we think low-resource languages are. It's surprising, as was mentioned: if you ask linguists, each would give a different answer, and no two linguists say the very same thing. So for under-resourced languages we say that it is a language which lacks electronic resources for speech and language processing: it lacks presence on the web, it lacks a writing system or a stable orthography, and it may lack linguistic expertise. A language qualifies as under-resourced if just one of these criteria applies, but in many cases several of them apply. Some people call it under-resourced, some people call it low-resource, or whatever, but it all means the same thing.
0:01:28 However, what is really different is if you look into minority languages: minority languages are not necessarily low-resourced, and low-resource languages are not necessarily minority languages, so these have to be kept apart. We were trying to put together some definitions, so there will be a special issue of Speech Communication, which I was lucky to put together with Laurent Besacier and Alexey Karpov, about under-resourced languages and how to build ASR systems for them, and it will be out in January next year, so twenty fourteen.
0:02:05 So the ideal case, I consider, and I think many people agree, is if you have plenty of audio: you know the phone set of the language, you have the corresponding transcripts, you have the pronunciation rules, you have experts who build the pronunciation dictionaries for you, and you have plenty of text data. Then you can build ASR systems, you can build NLP or machine translation systems, and also TTS systems, for that matter. So that would be the ideal world.
0:02:39 However, the world is not ideal, and I was asked to talk about low resources: what do we do if we have very few or no data? So I would like to present a few of the solutions we are currently working on. The first one addresses the case where you do not have any transcripts at all in the language you would like to work on. The second is when you do not have any pronunciation dictionary: how can you dig, for example, into the web to get pronunciations? The third is some work we are currently doing for the case where a language has no writing system at all. And then: what can you do if you do not have any linguistic expertise, so no language experts at hand?
0:03:24 For all the solutions I would like to show today, my general underlying assumption is that I would like to leverage existing knowledge and data resources which I have gathered from many languages already. So I sort of believe that if you have seen a lot of things in one language, and in many languages, it should help you to build the resources in the next language.
0:03:47 To me, the holy grail of rapid adaptation to new languages, and probably also to new domains, is this: you have plenty of data given in many different languages, you sort of derive a global inventory of models, and then, with a tiny bit of data, or maybe next to zero data, you basically adapt to that particular language.
0:04:14 When I started my thesis with Alex Waibel, he basically pushed me very much: if you want to do something with many languages, it would be good to have many languages. So I started out collecting a large corpus of data. I wanted to have something which is uniform, in the sense that it appears in many different languages but has the same form and the same style and so on. I started doing this in ninety-five, and ever since we keep collecting data, so we have now accumulated twenty-one languages; for each language we have twenty hours spoken by one hundred speakers. Altogether it is a small amount compared to what people use today, but it covers many languages. Over the years we have built recognizers in all these languages, so we do have a pool of knowledge, in the acoustics and also in dictionaries and in text, and I would like to show you how to leverage this for situations where we do not have any resources.
0:05:21 The underlying idea when I started, in my thesis, and that was still Gaussian-mixture-model time, so pre-DNN, was to share acoustic models across languages in order to get some sort of global phone inventory. At the time, and that's something which you also alluded to, we used the IPA, the International Phonetic Alphabet, to share data across languages: whenever two phones have the same IPA representation, they are joined and trained in a language-independent fashion.
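The pooling step described here can be sketched in a few lines. This is a hypothetical illustration, not the actual GlobalPhone tooling; all phone names and IPA mappings below are invented examples.

```python
# Toy sketch of IPA-based sharing: phones from different languages that carry
# the same IPA symbol are pooled, so their acoustic models can be trained
# in a language-independent fashion.
from collections import defaultdict

def build_global_inventory(lang_phone_maps):
    """lang_phone_maps: {language: {phone_name: ipa_symbol}}.
    Returns {ipa_symbol: [(language, phone_name), ...]}: every bucket
    lists the language-specific phones whose training data would be pooled."""
    inventory = defaultdict(list)
    for lang, phones in lang_phone_maps.items():
        for phone, ipa in phones.items():
            inventory[ipa].append((lang, phone))
    return dict(inventory)

maps = {
    "german":  {"A": "a", "SCH": "ʃ"},
    "polish":  {"a": "a", "sz": "ʃ"},
    "spanish": {"a": "a"},
}
inventory = build_global_inventory(maps)
# inventory["a"] pools the /a/ phones of all three languages
```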
0:05:57 One observation we made was the following. If you do this, and this is a graph where we did it for twelve languages, so we shared twelve languages based on IPA, the curve doesn't really flatten; it just keeps going straight up, and that was only twelve languages. So our expectation, my expectation, was that this would keep growing: the more languages we see, the more diversity we will get, and we are not anywhere near an asymptotic behaviour of this score.
0:06:29 And then, as we heard yesterday in these wonderful talks, from Morgan and also the others, about the DNNs: what I am really happy about, or why I am excited about the DNNs, is that for the very first time we see not only that they work well for bootstrapping new languages, but also something I had never seen before in acoustic modeling. Whenever I would share across languages, I would get worse performance on the training languages; I never managed to get better performance on the languages used for training, I only got slightly better performance when I used the models for new languages. Now, with the DNNs, for the very first time, I mean I hope so, if you train a multilingual network you can even outperform on the languages which are in the training set, which makes me very happy and, I think, gives hope for using them for bootstrapping to new languages.
0:07:25 So the first thing I would like to talk about is what can be done if there are no transcriptions at all. You have a new language, you do have, let's say, a pronunciation dictionary and a good amount of speech, but you do not have any transcriptions. For this experiment we took Czech, which doesn't mean that I consider Czech to be a low-resource language. The reason we took Czech is twofold: first, the conference was in the Czech Republic, so we thought this was an appealing idea; and second, we had four other Slavic languages which are related to Czech, and we wanted to figure out whether we can gain from having related languages. So we had two sets of experiments. We took four Slavic languages, namely Croatian, Russian, Bulgarian and Polish, in the hope that the relation might help for bootstrapping Czech. And then we took another situation, which is probably more realistic, namely that you already have ASR systems in resource-rich languages, that is English, French, German and Spanish. We basically said, okay, we have given speech, we have given a dictionary, and we have given a language model in Czech, but we are lacking the transcripts. And the idea is that we want to leverage knowledge which we have already gained in many other languages.
0:08:50 The idea here goes as follows. We take the source languages, in this case Polish, Croatian, Bulgarian and Russian, and we take the recognizers we have in these languages. Now we take our target Czech dictionary and language model, which we assumed to be given, and we express the Czech dictionary in terms of Polish sounds. That basically allows us to use the context-dependent acoustic models from our Polish ASR system. We do the very same thing for Croatian, Bulgarian and Russian, so we create four different dictionaries which are expressed in the phone sets of the source languages: they are Czech words, but expressed in Croatian phones, in Russian phones, and so on.
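This dictionary-mapping step can be sketched as a simple phone substitution. The table below is an invented stand-in for the real cross-lingual phone mapping, purely for illustration.

```python
# Hedged sketch: re-express Czech pronunciations in the Polish phone set so
# that Polish context-dependent acoustic models can decode Czech audio.
CZECH_TO_POLISH = {"ř": "ʐ", "ɦ": "x"}   # Czech phones with no Polish match (toy)

def map_dictionary(target_dict, phone_map):
    """Replace every target-language phone by its source-language counterpart;
    phones shared by both languages pass through unchanged."""
    return {word: [phone_map.get(p, p) for p in phones]
            for word, phones in target_dict.items()}

mapped = map_dictionary({"řeka": ["ř", "ɛ", "k", "a"]}, CZECH_TO_POLISH)
# "řeka" is now spelled in Polish phones and decodable by the Polish ASR
```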
0:09:38 And now we can do something: we can run the Polish recognizer on Czech audio, and it will give us Czech words; we have the Russian dictionary, we have the Russian ASR, and the Polish, and so on, so basically we get four hypotheses of the Czech words.
0:09:54 Then what we are doing is leveraging a confidence score, called A-stabil, which had been proposed by Kemp and Schaaf. What we do is basically calculate how often the same word appears in the hypotheses from the different recognizers: this is the Polish recognizer's Czech hypothesis, this is the Croatian, the Bulgarian, the Russian. Whenever two or more languages agree on the same word, we assume this word to be correctly identified, and it can then be used for training. So the whole idea is that we leverage a vote over the models of different languages, and the reason why we feel this works quite well is the following.
0:10:41 If you take the usual confidence score and you apply it with basically one language, we know that the confidence score is rather unreliable. This plot gives you the confidence score threshold over the word error rate, and basically, if you have just one language and you start out with very poor performance, then even a very good confidence score is not very reliable. But if you do this for several languages, then, whenever at least two languages agree on the same words, the same islands in the hypotheses, you get a much more reliable behavior of the confidence score. And we learned that a threshold of roughly one divided by n, plus an offset, where n is the number of languages applied, is a good value.
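The voting idea can be sketched as follows. This is a deliberately simplified toy: the hypotheses are assumed to be position-aligned word lists of equal length (glossing over the real alignment of matching "islands"), and the offset value is invented.

```python
# Sketch of the multilingual voting confidence: a word hypothesized at the
# same position by several source-language recognizers is trusted for training.
def vote_confidence(hypotheses):
    """hypotheses: one word sequence per source recognizer, position-aligned.
    Returns [(majority_word, agreement_fraction), ...] per position."""
    n = len(hypotheses)
    confident = []
    for i in range(len(hypotheses[0])):
        words = [h[i] for h in hypotheses]
        best = max(set(words), key=words.count)
        confident.append((best, words.count(best) / n))
    return confident

def select_for_training(scored, n, offset=0.15):
    threshold = 1.0 / n + offset   # the 1/n + offset rule of thumb from the talk
    return [w for w, c in scored if c >= threshold]

hyps = [
    ["praha", "je", "krasna"],   # Polish recognizer's Czech hypothesis (toy)
    ["praha", "ve", "krasna"],   # Croatian
    ["brno",  "je", "krasna"],   # Bulgarian
    ["praha", "je", "mesto"],    # Russian
]
scored = vote_confidence(hyps)
kept = select_for_training(scored, n=len(hyps))
```

With four recognizers the threshold is 0.4, so only words on which at least two systems agree survive into the automatically transcribed training data.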
0:11:32 The whole framework then works as follows. You create your source recognizers with the mapped dictionaries and run them on your Czech audio data to produce transcriptions. Once you have the transcriptions, you look, with the multilingual A-stabil, for the parts of the transcriptions that match, and then you adapt the acoustic models of your source recognizers using the transcriptions you just derived. That means the Polish recognizer is tuned towards Czech, the Russian recognizer is tuned towards Czech, the Croatian one, and so on, so everything gets more and more tuned in to Czech. Then you throw everything away, you just keep the adapted acoustic models, and you iterate. So we basically iterate the whole process, in which we always throw the transcriptions away, and once we have a certain amount of data transcribed automatically in this way, we use a multilingual phone inventory to bootstrap the Czech system.
0:12:36 We were quite happy with the performance we saw. What you see here is the first iteration, the second, the third, and then the bootstrap and the final step after iterating one more time, and this is the performance you get with the Bulgarian, the Croatian, the Polish and the Russian acoustic models. The take-away message is: if you use the related languages, the best system we can come up with has twenty-two point seven percent, which is very close to the baseline we have if we assume that twenty hours of transcriptions are given. So this is the supervised case, and this is the fully unsupervised case for the related, so to say the good, case where you have related languages. And this little one here is the one for the non-related, resource-rich languages, and the best number we got with this is twenty-three point three. So we are still within range, even though we are not as close as in the related case, but I believe the difference is marginal.
0:13:42 You may wonder about two things. One thing is: does that work for conversational speech? This is something we are currently trying; what I showed is more planned speech, and we are in the process of doing this for conversational speech. The other question you may have is why the results are so good; please keep in mind that we only have twenty-three hours of training data in this setup, and we have a rather low perplexity, because it's newspaper articles.
0:14:13 We of course wondered whether we would gain something by increasing the number of source languages, so we did the same thing, this time for Vietnamese, and we looked at the difference between using two, four and six source languages. What we discovered, over the iterations, is this: what you see here in the bars is the amount of data we could extract for transcription, and what you see in the curves is the quality of the transcriptions. We found that the quality of the transcriptions seems to go slightly up if we use more languages, but above all we can derive more transcription data, so the amount of data we can extract is larger. And if you look at the performance, finally, with six languages we get sixteen point eight on Vietnamese, which compares to fourteen point three if we build it in a supervised fashion. Now, the comparison here is: if we have an expert, a native speaker who knows how to build systems, he or she could do something in the range of eleven point eight. So the gap, or the gain you get by doing language-specific things, is enormous, I would say, and significantly larger than the gap between supervised and unsupervised. If you have the choice of having somebody who knows something about the language and does tuning and tweaking, like, for Vietnamese, tone modeling, pitch features, using multi-syllables and so on, that gives larger gains than the discrepancy between transcribed and untranscribed.
0:15:57 Okay, so this was about what we can do if we don't have transcriptions. The second thing I would like to discuss is what you do if you have no pronunciation dictionary. Again, the idea would be to leverage what we already have from other languages: we assume we have seen a lot of scripts, we have seen a lot of pronunciation rules, and we would now like to generate the dictionary in a new language.
0:16:24 Let me first talk briefly about writing systems. First of all, of course, we might wonder how many languages are written at all: does it, for example, make sense to look into languages that do not have a writing system? If you consider Omniglot, which is a site maintained by Simon Ager, it lists about seven hundred fifty languages which have a script, and he thinks that the true number of languages with a script is close to two thousand. That basically means that the majority of the other six thousand languages do not have a written form, so this seems to be a rather serious issue.

0:17:05 Then, if you look at the writing systems, there are different types. There are the logographic ones, which means the characters of the writing system are based on semantic units, and the graphemes represent meaning rather than sound; a typical example is the Chinese Hanzi. Then there are scripts which are called phonographic, which means the graphemes represent sounds rather than meaning. And there are different forms of phonographic scripts: segmental ones, where one grapheme roughly corresponds to one sound, which is very convenient for building dictionaries; syllabic ones, where a grapheme may represent an entire syllable; and featural ones, like Korean, where one grapheme may represent an articulatory feature. If you look at the world, most parts of the world are green, which means they use Roman or Latin scripts, and another big chunk here is the one with the Cyrillic script, and both are phonographic segmental scripts. So this seems to be very nice for automatic pronunciation dictionary generation, and these are the ones we looked at.
0:18:25 So we assume that there is a correlation, a relationship, between graphemes and sounds, which we call grapheme-to-phoneme or letter-to-sound. But we also know that there are languages which have a very close grapheme-to-phoneme relationship, and then there are others, like English, which are more pathological; I don't want to offend anybody, but okay.
0:18:51 The first thing we wanted to look at is how much data we need to create a pronunciation dictionary. What you see here, on ten GlobalPhone dictionaries, is the phoneme error rate on the y-axis, and on the x-axis the number of phonemes. What it means is: we took pronunciations we had in our dictionary, we trained a G2P model based on Sequitur, from Max Bisani, and then we used the G2P to generate more entries for the dictionary. Then we compared the generated entries to the true entries in the dictionary, and that gave us the phoneme error rate.
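The phoneme error rate used here is, in essence, an edit distance between the generated and the reference pronunciations, normalized by the total reference length; a minimal sketch:

```python
# Minimal sketch of the phoneme error rate: Levenshtein distance between
# hypothesis and reference phone sequences, summed over all dictionary
# entries and divided by the total number of reference phones.
def edit_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(a)][len(b)]

def phoneme_error_rate(pairs):
    """pairs: list of (hypothesis_phones, reference_phones) tuples."""
    errors = sum(edit_distance(h, r) for h, r in pairs)
    total = sum(len(r) for _, r in pairs)
    return errors / total

# toy example: one substitution over six reference phones
pairs = [(list("kat"), list("kat")), (list("kot"), list("kat"))]
per = phoneme_error_rate(pairs)
```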
0:19:34 So if you have, for example, five K, five thousand phonemes, that means we took roughly one thousand word pronunciations, and you have five thousand phonemes which you can learn on. What we learned from this graph is, first of all, that the G2P performance depends on the language; that's not really surprising, we know this already. We have this bunch down here, these are the Slavic languages, so you guys are really lucky, and Spanish is also below there; then we have Portuguese, and this one is German, and the worst case in our scenario here is English. So there is a strong dependency on the language. The other thing we learned is: if you have a close G2P relationship, then five K phonemes might already be enough to build a decent G2P model, and it saturates roughly after fifteen K examples. But we also see that a language like German, for example, needs six times more examples than a language like Portuguese, so depending on the relationship you have to work harder to get examples in order to build a good G2P.
0:20:46 Now, where do you get those examples from? The student who did this work basically looked into Wiktionary, which is a resource where you find pronunciations put in by volunteers in that language, and looked at how many languages are already covered in Wiktionary with at least one K pronunciations, that means one thousand words, which roughly corresponds to five thousand phonemes. We found thirty-seven languages which are covered; that was in twenty eleven, when he did this. He also found, and this here is the growth of Wiktionary entries over several years, so this is twenty ten, this is twenty eleven, that there seems to be a lot going on: the community seems to have discovered it and keeps putting in more words. So the hope would be that Wiktionary will in the future cover both more languages and more words.
0:21:49 But now of course the question is how to use this. So we built an interface where you can basically upload a vocabulary file, and it will go away and search Wiktionary for the entries it can find; it then uses all the entries it finds to train a G2P model, and with that it will generate whatever you want to generate. It goes into the pages and looks for the IPA-based pronunciations. There is a lot of cleaning you have to do, because the pronunciations do not always refer to what you were searching for: sometimes you look for one word and you get an IPA for a different word. So you have to do a lot of checking to make sure that you are not learning the wrong things.
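One conceivable sanity check for scraped entries, along the lines described above, is to reject a pronunciation whose length is implausible for the word, since it probably belongs to a different word. This is purely an invented illustration; the real cleaning pipeline and thresholds are not specified in the talk.

```python
# Crude plausibility filter for scraped (word, IPA) pairs: the phone count
# should be roughly proportional to the letter count. Ratio bounds are toy
# values, not the ones actually used.
def plausible(word, ipa, min_ratio=0.4, max_ratio=2.5):
    phones = [c for c in ipa if c not in "ˈˌ.ː "]   # strip stress/length marks
    if not phones:
        return False
    ratio = len(phones) / len(word)
    return min_ratio <= ratio <= max_ratio

ok = plausible("cat", "kæt")              # a plausible scraped entry
bad = plausible("cat", "ˌɪntəˈnæʃənəl")   # IPA that belongs to another word
```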
0:22:35 That is one of the reasons why the Wiktionary G2P performs worse than if you start from a good dictionary. This one compares, for six languages, the G2P performance based on what we learned from Wiktionary to what we had already in GlobalPhone, and you can see, if you compare the same colours, that there are some significant distances. This is probably due, for one, to the errors in what we found in Wiktionary, and the other thing is that there may be many contributors putting pronunciations into Wiktionary, which means you may have inconsistencies, and these are always an issue for G2P models.
0:23:23 If you have nothing, but you have a related language, you can also take a very brute-force, very simple approach, and that's what we tried with Ukrainian. We had a student who was collecting data, and we said, okay, we already have Russian, Bulgarian, German, English, we have many languages; why don't you try to build the dictionary from what we already have? It's a very simple approach with four steps. In the first step you do a grapheme-to-grapheme mapping: you have Russian, and you map the Ukrainian graphemes to Russian graphemes. Once you have done this, in the second step, you can apply the Russian grapheme-to-phoneme model. Then you have Russian phonemes, and in the third step you have to map them back, with a Ukrainian mapping so to speak; you see there is a lot of mapping going on. And then, fourth, you do some post-processing.
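The four steps can be illustrated with a toy pipeline. All tables below, including the tiny rule-based stand-in for the Russian G2P model, are invented placeholders, not the actual Ukrainian/Russian rules from the talk.

```python
# Toy sketch of the pipeline: grapheme mapping -> source-language G2P ->
# phoneme back-mapping -> post-processing (omitted here).
G2G = {"і": "и", "и": "ы"}    # Ukrainian -> Russian grapheme rules (toy)
P2P = {"ɨ": "ɪ"}              # Russian -> Ukrainian phoneme rules (toy)

def russian_g2p(word):
    """Stand-in for the real Russian grapheme-to-phoneme model."""
    table = {"и": "i", "ы": "ɨ", "т": "t"}
    return [table.get(ch, ch) for ch in word]

def ukrainian_pron(word):
    mapped = "".join(G2G.get(ch, ch) for ch in word)   # step 1: grapheme map
    phones = russian_g2p(mapped)                        # step 2: Russian G2P
    phones = [P2P.get(p, p) for p in phones]            # step 3: map phonemes back
    return phones                                       # step 4: post-processing omitted

pron = ukrainian_pron("іти")
```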
0:24:18 He did this for different source languages; let's look at Russian, for example. He first mapped the Ukrainian graphemes to Russian graphemes, which amounted to forty-three rules, so we had to come up with forty-three rules. Then he would run the G2P, and after that he created fifty-six rules to convert the Russian phonemes back to Ukrainian phonemes, and then did some post-processing, which required another fifty-seven rules. So altogether we had to create about one hundred sixty rules. What comes out in the end is the dictionary, and when we plug this into the Ukrainian ASR it gives a twenty-one point six percent error rate. If we use the simplest approach, where we just use the graphemes for ASR, we get something like twenty-three point eight percent. If we ask the student to sit down and build rules to create the dictionary by hand, so hand-crafted rules, which piled up to about eighty rules, he gets twenty-two point four. In the process we wondered why the Russian-based one is better than the hand-crafted one, and found that we had some issues with our rules; so we fixed the rules and then basically ended up with the same performance. That means, if you don't have much time and you need to reduce the number of rules you want an expert to write, then this method is probably an option. But that of course depends on having a reasonably related language, because if you do it with non-related languages you get significantly worse results; if we do it with English or German, it doesn't pan out.
0:26:11okay so i have fifteen minutes
0:26:14 The other thing you can do if you don't have pronunciations is to ask the crowd. We created a small toolkit which basically gives you a keyboard, and on the keyboard you have the IPA sounds; you have to get used to it, but you can press a key and listen to how it sounds. Then you get a word, you are asked to produce the pronunciation of this word, and once you are done you can go to the next word. We basically fed this into Mechanical Turk and let it run for twelve days, to find out whether this would give good results or not. So we ran it for twelve days and calculated how long people spend building pronunciations: the average time was fifty-three seconds, and the majority of the work we basically had to reject. Out of the nineteen hundred pronunciations we got, more than fifty percent were not really useful, because, as we found out, people would game the task: they would basically spend a second giving one phoneme. So people were very fast but very sloppy, and we couldn't really find a good way to provide incentives for very good answers. We did this in parallel, with many people working on the same words, in order to find out whether we get good pronunciations, but it didn't really pan out nicely.
0:27:53 So the second thing we did was to reach out to our friends and volunteers. We worked harder to improve the interface: we had a nice welcome page, a tutorial, we gave the people quality feedback so that they had an incentive to work harder, and we also put it out as a game, so people would compete with each other in a friendly manner. One thing we found is that in the beginning they would spend a lot of time to get the words right; they would spend six minutes on one word, which I found quite high, and finally they would be down to roughly one and a half minutes per word. And you can see here the number of users; it's really hard to keep the users on track, but certainly the median time per word went down significantly, and the errors we found per session would also go down. But it's hard to keep people in the loop, so getting really massive amounts of pronunciations out of crowdsourcing proved rather challenging.
0:28:58 Okay, so now I would like to talk about a new approach we are working on for the case where there is no writing system at all, which means you would like to build a system for a language which has no written form. This is sort of how we think of it: you have nothing, we assume the language is not written, even if that's not true, and you can ask someone to speak a sentence for you, say in Klingon. So you can simply ask him to say 'I am sick', and he would say 'jIrop'; I don't speak Klingon, but maybe people do. And you can ask him 'I am healthy', and he would also produce a phrase. What people then do, subconsciously, is try to identify sounds, so we would perceive this as a sequence of phones.
0:29:52 And then, if he says 'jIrop' for the one and 'jIpIv' for the other, we would probably assume that 'jI' might be something like 'I', because this is the repeating part in the phrases we asked him to produce, and then we could basically derive that 'rop' seems to be the word for sick and 'pIv' seems to be the word for healthy. So the idea would be to recreate this process that we are doing in our minds, and to put it on a machine.
0:30:25 There is already substantial work done here. The first work I found was using monolingual unsupervised segmentation in order to segment the phone sequences into words. Then there was the work from Sebastian Stüker and Alex Waibel, who were using a cross-lingual word-to-word alignment based on GIZA++, and in a combination of the two, the monolingual and the GIZA++ approach were joined in order to find the phoneme-to-word alignments. Felix Stahlberg in our lab then started using a cross-lingual word-to-phoneme alignment, and for that he extended the GIZA++ approach; but let me first roughly explain how it works.
0:31:22not a german
0:31:24it says a different each digit date
0:31:27and then you have english phone sequence which you happen to have derived from what
0:31:33the person was saying so you have language words and things
0:31:37for you
0:31:38i mean ideally so and you have these little word boundaries already in there but
0:31:42of course if you create a phone sequence of a with the phone recognizer fingers
0:31:48all you have a lot of errors you might have a lot of errors here
0:31:51and of course you don't have any word boundaries
0:31:53and the question now is how can you
0:31:56get be alignment between the words and the corresponding phone
0:32:02and one of the issues, okay, so what Felix did is he extended the
0:32:07GIZA++ model, or rather he is using the IBM Model 3,
0:32:12and he
0:32:15changed it such that it can be applied for different word lengths, so he
0:32:20was putting in a word length probability
0:32:22in order to counterbalance the fact that, if you do word-to-phoneme alignments,
0:32:27you have way more phonemes
0:32:28to be aligned to one word
0:32:31so the model he came up with basically skips the lexical translation: it uses the
0:32:37fertility, it uses the distortion, but then puts in the word length,
0:32:41and basically then the words are sort of placeholders which, via the lexical
0:32:46translation, would be translated to the phone sequence
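The role of such a word-length term can be illustrated with a toy scorer. This is a minimal sketch, assuming a Poisson length model whose mean grows with the word's letter count (a hypothetical parameterisation, not the exact form used in Model 3P), and treating the lexical translation as uniform:

```python
import math

def word_length_prob(n_phones, n_letters, rate=0.9):
    """Poisson probability that a word with n_letters letters aligns to
    n_phones phones; the mean is assumed proportional to the letter count."""
    lam = rate * n_letters
    return math.exp(-lam) * lam ** n_phones / math.factorial(n_phones)

def score_segmentation(words, phone_spans):
    """Log-score a candidate word-to-phone-span segmentation using only
    the word-length model (lexical translation treated as uniform here)."""
    return sum(math.log(word_length_prob(len(span), len(word)))
               for word, span in zip(words, phone_spans))
```

A balanced segmentation then scores higher than one that crams most of the phones into a single word, which is exactly the counterbalancing effect described above.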
0:32:51and the first results he got were on BTEC English-Spanish
0:32:56so what he found is, he compared the GIZA++ word-to-phoneme alignment
0:33:01to his Model
0:33:033P alignment
0:33:04and you can see that consistently,
0:33:06independent of the phoneme error rate of the original phone sequence,
0:33:11he could outperform the GIZA++ word-to-phoneme alignment
0:33:15and in terms of F-scores he could also outperform
0:33:21the monolingual one
0:33:24so this is on the pair from English to Spanish and this is the
0:33:27pair from Spanish to English
0:33:29and an F-score here of ninety percent, for example, means that he could get
0:33:34the segmentation, the word boundaries in the phone sequences, right to
0:33:38about ninety percent accuracy, which is quite good
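For concreteness, such a segmentation F-score can be computed over boundary positions; this is a minimal sketch of my own, not the paper's evaluation code:

```python
def boundary_f_score(ref_boundaries, hyp_boundaries):
    """F-score over word-boundary positions in a phone sequence.
    Both arguments are sets of indices where a word boundary falls."""
    true_pos = len(ref_boundaries & hyp_boundaries)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(hyp_boundaries)
    recall = true_pos / len(ref_boundaries)
    return 2 * precision * recall / (precision + recall)
```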
0:33:44so this is somewhat solved
0:33:47i mean we're not anywhere near perfect, but the point is we do have
0:33:52a correspondence between the phones and the words
0:33:57so we could now think of the next step:
0:33:59the next step would be that we take the phone sequences and generate a pronunciation dictionary
0:34:06and now we run into this problem which Lori already mentioned in her talk,
0:34:09that some words might have different meanings
0:34:12so for example the word "Sprache" in German can mean both speech and language
0:34:18so what will happen is you will have a lot of different pronunciations, and some
0:34:22will sound like speech even though they might, you know,
0:34:26vary somewhat,
0:34:27but others are significantly different, and these are the language words
0:34:30so the first thing we did is we clustered
0:34:34all the
0:34:35occurring phone sequences into the different
0:34:38word meanings
0:34:39and we used the DBSCAN clusterer for that, so if
0:34:43we're lucky we end up with one cluster which represents everything which goes
0:34:48to language
0:34:49and the other cluster represents everything which goes to speech
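A minimal sketch of this clustering step, assuming Levenshtein distance over phone sequences as the metric (the talk does not name the distance function, and the phone strings below are made up):

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over phone sequences; returns one cluster label per
    point, with -1 marking noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = [j for j in range(len(points))
                      if edit_distance(points[i], points[j]) <= eps]
        if len(neighbours) < min_pts:
            labels[i] = -1           # noise (may become a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: claim it, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = [k for k in range(len(points))
                    if edit_distance(points[j], points[k]) <= eps]
            if len(more) >= min_pts:
                queue.extend(k for k in more if labels[k] is None)
    return labels
```

With made-up pronunciation variants, the variants of one word end up in one cluster and the variants of the other word in another.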
0:34:53and then if you have several words which sound like speech, then what Felix was
0:34:57using was an n-best structure in order to
0:35:01basically generate one result for speech
0:35:05and then he would put those results into the dictionary, so
0:35:10since we don't have
0:35:11a real written form, we cannot give it a written word form, so in
0:35:16our case we just did numbers
0:35:19so there is one word which gets the number one and has the pronunciation language,
0:35:24and there's another word which gets the ID two and has the pronunciation speech
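That numbering scheme amounts to something like the sketch below (the `word_%d` identifier format is my assumption; the talk later mentions switching to IDs with underscores):

```python
def build_id_dictionary(cluster_pronunciations):
    """Assign a numeric word ID to each pronunciation cluster, since the
    language has no written form. Input: one canonical pronunciation per
    cluster; output: a dict mapping word ID -> phone sequence."""
    return {"word_%d" % (i + 1): phones
            for i, phones in enumerate(cluster_pronunciations)}
```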
0:35:30okay, and so we tried this out, not with BTEC, but we moved to
0:35:35pronunciation extraction on Bible data, because Felix found a very nice page,
0:35:42Bible Gateway, which provides
0:35:46translations of the Bible in many different languages, and he downloaded fifteen
0:35:50translations in ten languages, and they even provide audio in English, Portuguese and some others
0:35:57in order to play around with
0:35:59and so what he did is he basically verse-aligned the Bible texts, so he
0:36:05extracted more than thirty thousand
0:36:07verse alignments for training
0:36:08his approach
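Verse alignment is straightforward because verses carry stable (book, chapter, verse) references across translations; a minimal sketch, with the data layout assumed:

```python
def verse_align(bible_a, bible_b):
    """Pair up verses of two Bible translations that share the same
    (book, chapter, verse) key; verses missing on either side are dropped."""
    shared = sorted(set(bible_a) & set(bible_b))
    return [(bible_a[k], bible_b[k]) for k in shared]
```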
0:36:10and so what you see here is basically the distribution of the absolute phoneme errors
0:36:15in the extracted pronunciations
0:36:18so basically he compared the extracted pronunciations to the real word pronunciations
0:36:23and so for example this bar here means that
0:36:27if you use the Spanish Bible
0:36:31to create English pronunciations,
0:36:33there are about three thousand nine hundred words,
0:36:36out of the fourteen thousand which were extracted,
0:36:40and these three thousand nine hundred words have no phoneme error, so they really correspond
0:36:45exactly to what
0:36:46had been found in the dictionary
0:36:50so in total the dictionary had something like fourteen thousand words. So this is
0:36:54the easy case, the English one, with the English Bible translations,
0:36:59and this one gives you
0:37:01the distributions
0:37:02of the phoneme errors across all the different languages he started from to extract them
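Counting the zero-error bar, i.e. the words whose extracted pronunciation matches the dictionary exactly, is a small exercise; a sketch using difflib as an approximate phone matcher (the phone symbols here are made up):

```python
from difflib import SequenceMatcher

def phoneme_errors(hyp, ref):
    """Approximate absolute phoneme error count between an extracted and a
    reference pronunciation: total phones minus twice the matched phones.
    (difflib's matching is an approximation of a true edit distance.)"""
    matched = sum(block.size for block in
                  SequenceMatcher(None, hyp, ref).get_matching_blocks())
    return len(hyp) + len(ref) - 2 * matched

def zero_error_count(extracted, reference):
    """Number of words whose extracted pronunciation matches the dictionary."""
    return sum(1 for word, phones in extracted.items()
               if word in reference and phoneme_errors(phones, reference[word]) == 0)
```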
0:37:10one thing: what we did in this case is we assumed,
0:37:14in a first step, that the target phone sequence
0:37:18is the canonical pronunciation. That means this has not been done with a phone
0:37:22recognizer on audio, but it has been done
0:37:25under the assumption that the phone
0:37:27sequence is correct
0:37:28and only the word boundaries were removed. And so the next step is now
0:37:32to basically do this with a real phone recognizer and see how well it will
0:37:38carry over
0:37:39okay, so then the next step, but we haven't done that yet,
0:37:43is that we now have transcribed audio, which is basically transcribed in word IDs;
0:37:49we could create a dictionary based on word IDs, or whatever you want to
0:37:53take here, and then you have a language model, and then we could put all
0:37:57these things together to build a speech recognition system for a language which has no
0:38:04written form. So this last step we haven't done yet
0:38:07so, I have four minutes left, three minutes left, okay. So the third part I will run
0:38:13through really quickly. The question here is what
0:38:16could be done if there are not many linguistic experts, and I think everybody has
0:38:21seen this problem: here she wants to build this
0:38:24system in a language and there's simply nobody there, no coach who could help;
0:38:28Lori was just talking about finding a Korean for transcribing, right?
0:38:32and so,
0:38:34so what we did:
0:38:36the work started several years ago, it was an NSF project actually that
0:38:40was funded by Mary Harper,
0:38:43and I did it together with Alan Black. The idea was:
0:38:46how can you bridge the gap between the technology experts and the language experts? And so
0:38:51we built a lot of web-based tools,
0:38:54and the web-based tools allow somebody who doesn't know anything, or not very much,
0:38:59about speech recognition or TTS,
0:39:02they allow this person to
0:39:05work on the language for him- or herself. So you basically have handrails
0:39:09which tell you what to do as the next step
0:39:12in order to build a system, and the person can go step by step through
0:39:16it to build the system
0:39:17and we had done
0:39:19quite substantial work on this in the past. So the system is
0:39:22used, for example, in a seminar, which is a multilingual seminar held
0:39:27across CMU and KIT,
0:39:30and we always adopt the languages of the students who take the course, so
0:39:34we have really done a lot of systems, ASR and TTS systems, in
0:39:39the respective languages,
0:39:40and whenever we learn something we plug this into
0:39:43the tools
0:39:45and we are using this for education, but we are also using it for
0:39:50collecting resources and reaching out
0:39:53to people when we don't have speakers at hand
0:39:55for example, we at one
0:39:58time wanted to build something in Konkani, and there was only a single speaker
0:40:02of Konkani at CMU, so he was basically sending the web page
0:40:07to his people back in India and got a lot of speech and work out
0:40:11of this,
0:40:12being able to do fieldwork virtually
0:40:15okay, so I think I
0:40:17should close here. So I wanted to basically show you some
0:40:23solutions we are working on
0:40:24if we have low resources, and I tried to give one example each: if we have no
0:40:28transcripts, if we have no
0:40:31pronunciations, or if we have no
0:40:33language expert at all
0:40:35and I also wanted to stress that we keep working on our Rapid Language Adaptation tools
0:40:41in order to leverage fieldwork and also be able to do outreach
0:40:46to the community,
0:40:48in order to then finally bridge the gap between those people who know the technology
0:40:52and those people who know the language
0:40:54thank you very much
0:41:04time for a few questions
0:41:17i just had a clarification question about your first
0:41:21study, where you had no transcriptions:
0:41:24how do you obtain the pronunciation dictionary and the other models
0:41:30that you get?
0:41:30i could, let me just go back. So the idea is that you have
0:41:35a dictionary in the target language, let's say Czech. So the idea would be: you have
0:41:39a Czech dictionary given in the Czech phone set,
0:41:43but the source recognizer you are about to run on the data is in, let's say, Polish. So
0:41:48what you need to do is you need to take the Polish phones
0:41:52and replace the Czech phones in the dictionary with the Polish phones
0:41:56so you basically map them in the Czech dictionary,
0:41:59so you get the Polish variant of a Czech dictionary
0:42:06oh, we do, I forgot that, I'm sorry. So throughout GlobalPhone
0:42:10everything is expressed in IPA, so doing a mapping is rather straightforward;
0:42:14you can do this with fifty mapping rules, I would say, nothing too
0:42:18fancy, because
0:42:19the context
0:42:20cannot really be captured this way, but after the iterations the
0:42:24acoustic models would be MAP-adapted anyway
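Such an IPA-level mapping can be sketched as a simple substitution table; the rules below are made-up examples, not the actual GlobalPhone mapping:

```python
# Hypothetical Czech-to-Polish IPA substitutions; real rules would be
# written by inspecting both phone sets.
CS_TO_PL = {"ɦ": "x", "r̝": "ʐ", "c": "tɕ"}

def map_pronunciation(phones, rules):
    """Rewrite a dictionary pronunciation phone by phone; phones shared by
    both languages pass through unchanged."""
    return [rules.get(p, p) for p in phones]
```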
0:42:32for the recognizer for languages without a writing system, as in the things you
0:42:36presented in the last part:
0:42:39what would be a recommendation to present the output? Because if there is no writing
0:42:43system in your target language, you cannot probably ask
0:42:47the users to just read IDs, numerical IDs. So what exactly would you do?
0:42:53so i think,
0:42:55if you have a system for a language which is not written at
0:42:59all, then TTS might be a good approach: you basically
0:43:02speak the output to the people so that they can listen to it
0:43:06so that would be one option, and then all you have to do is, if
0:43:09you get
0:43:11word number one, you basically do TTS on the phones
0:43:15and then the other thing, of course, the simple straightforward one, is to apply some
0:43:20sort of
0:43:21phoneme-to-grapheme mapping to get it back
0:43:25but that's
0:43:27all that we could come up with, actually. I don't like the number idea either,
0:43:30so we basically switched now to something with underscores so that you can at least
0:43:34read it
0:43:37you could potentially add translation,
0:43:39or understanding eventually too,
0:43:43as ways to use it
0:43:44sure, I mean, actually, what's out there is: you have the word ID
0:43:49and the target-word-to-source-word alignment;
0:43:53you could push this into a translation system and then
0:43:57translate it into a
0:43:59language that's written.
0:44:00entirely correct, but
0:44:01i'm more interested in the ASR side, as always
0:44:10just one question: when you were doing the Bible work and finding,
0:44:14say, the words and word differences: did you actually check how many of
0:44:18the words that are generated are not real words?
0:44:21because you do have the list, right, and I didn't really
0:44:24get to see that
0:44:25so one thing is, for the English case,
0:44:29you can see that the number of
0:44:32generated words is roughly the number of reference
0:44:34words, so it's not overgenerating and it's not undergenerating
0:44:38but you can also see that Swedish, for example, seems to undergenerate, and Hungarian and Czech
0:44:42overgenerate, and what we found is that
0:44:45the generation is also related to how many word forms the source language has: Swedish
0:44:51has very few and Czech has many, so they overgenerate and undergenerate,
0:44:57and of course the name of the game would be to sort of keep this in
0:45:00balance, if we assume that it's roughly one-on-one
0:45:04does that answer your question?
0:45:09one last no
0:45:10quick question
0:45:12you already asked
0:45:24could tell
0:45:27a practical question:
0:45:30so say you have an application and you want to build a
0:45:35speech recognition based interface
0:45:39looking at all these numbers, can you advise me that,
0:45:43you know, what you're doing is the way to go, or should I go transcribe
0:45:47audio and do the speech recognition the usual way then?
0:45:52so for the Rapid Language Adaptation server, that's exactly what we were wondering. So we
0:45:57are now doing error blaming on the components and try to advise the user
0:46:02what's better: is it better to work on the pronunciations, or is it better to work on
0:46:06the audio?
0:46:07and for the first
0:46:08results, it's usually better to work on the audio,
0:46:12getting some audio data, so one hour at least,
0:46:15and then getting the pronunciations right. But it highly depends on how much data you
0:46:20have and where you are, so it's
0:46:22difficult to answer. But I think generating an automatic answer from
0:46:27the error blaming, to tell the user what to do next, how to invest
0:46:32his or her time best,
0:46:33i think that's
0:46:35the right way to go
0:46:38a question for all three speakers this morning, for Lori and for Mary: I think
0:46:45the big elephant in the room is getting textual data for languages that are
0:46:50only spoken, and you touched on this a little bit
0:46:54and if you don't have web data and only have conversational data that's recorded, making
0:47:00the transcripts, I mean, in the systems that we developed, really the
0:47:04big cost was always getting the text data and the translations,
0:47:10so acoustic modeling is by comparison relatively straightforward and cheap to do
0:47:15so my question is, how do you, I mean, the hero here in the room
0:47:19is Mary, who did this gigantic database of conversational data, and I'm wondering how, as
0:47:25a community, we can make faster progress on that problem
0:47:34i've never heard of a question to a speaker who hasn't spoken
0:47:39well, maybe this is something to think about, a good question we can come back to,
0:47:47so let's bring that up again later today
0:47:50okay, let's thank Tanja one more time