0:00:21 | (Host) Is this better? |
0:00:23 | Okay. Thank you very much, thanks for having me here and for giving me the chance to talk about our work. |
0:00:29 | I would like to talk about how we can build systems with very low resources. |
0:00:34 | Very much as Lori also did, I would like to give a brief definition of what we think low-resource languages are. It is surprising, as was mentioned: if you ask linguists, they will each give a different answer, but the definitions keep coming back to the very same points. |
0:00:49 | An under-resourced language, as they describe it, is a language which lacks electronic resources for speech and language processing, |
0:00:58 | lacks presence on the web, |
0:01:00 | lacks a writing system or a stable orthography, and may lack linguistic expertise. |
0:01:06 | A language qualifies as under-resourced if just one of these criteria applies, but in many cases several of them apply. |
0:01:19 | Some people call it under-resourced, some call it low-resource or less-resourced or whatever, but it all means the same thing. |
0:01:28 | However, what is really different is minority languages: minority languages are not necessarily low-resourced, and low-resource languages are not necessarily minority languages, so these have to be discriminated. |
0:01:42 | We were trying to put together some definitions, so there will be a special issue of Speech Communication, which I was lucky to put together with Laurent Besacier, Etienne Barnard, and Alexey Karpov, and it is about under-resourced languages and how to build ASR systems for them. |
0:02:00 | It will be out in January of next year, so 2014. |
0:02:05 | The ideal case, I consider, and I think many people agree, is that you have plenty of resources. |
0:02:12 | That means you know the phone set of the language, you have plenty of audio with corresponding transcripts, |
0:02:20 | you have the pronunciation rules, you have experts who build the pronunciation dictionaries for you, and you have plenty of text data. |
0:02:28 | Then you can build ASR systems, you can build NLP or machine translation systems, and also TTS systems, for that matter. |
0:02:35 | So that would be the ideal world. |
0:02:39 | However, the world is not ideal, and I was asked to talk about low resources: what do we do if we have very few or no data? |
0:02:47 | I would like to present a few of the solutions we are currently working on. |
0:02:52 | The first one addresses the case where you do not have any transcripts at all in the language you would like to work on. |
0:02:59 | The second is when you do not have any pronunciation dictionary: how can you dig, for example, into the web to get pronunciations? |
0:03:08 | Then I would like to show some of the work we are currently doing for the case where a language has no writing system at all, |
0:03:17 | and then what you can do if you do not have any linguistic expertise, so no language experts at hand. |
0:03:24 | For all the solutions I would like to show today, my general underlying assumption is that I want to leverage existing knowledge and data resources which I have gathered from many languages already. |
0:03:37 | I sort of believe that if you have seen a lot of things in one language, and in many languages, it should help you to build the resources in the next language. |
0:03:47 | So to me, the holy grail of rapid adaptation to new languages, and probably also to new domains, is that you have plenty of data given in many different languages, from which you derive a sort of global inventory of models, |
0:04:04 | and then, with a tiny bit of data, or maybe next to zero data, you basically adapt to that particular language. |
0:04:14 | When I started my thesis with Alex Waibel, he pushed me very much: if you want to do something with many languages, it would be good to have many languages. |
0:04:25 | So I started out collecting a large corpus of data. |
0:04:31 | I wanted something which is uniform, in the sense that it appears in many different languages but has the same format, the same style, and so on. |
0:04:43 | I started doing this in ninety-five, and ever since we have kept collecting data, so we have now accumulated twenty-one languages. |
0:04:49 | Basically, for each language we have twenty hours spoken by one hundred speakers. |
0:04:55 | Altogether it is a small amount compared to what people use today, but it covers many languages. |
0:05:03 | Over the years we have built recognizers in all these languages, so we have a pool of knowledge in the acoustics, and also in dictionaries and in text, |
0:05:13 | and I would like to show you how to leverage this for situations where we do not have any resources. |
0:05:21 | The underlying idea when I started this in my thesis, which was still Gaussian mixture model times, so pre-DNNs, |
0:05:30 | was to share acoustic models across languages in order to get some sort of global phone inventory. At the time, and that is something others also used, we employed the IPA, the International Phonetic Alphabet, to share data across languages: |
0:05:48 | whenever two phones have the same IPA representation, they are jointly trained in a language-independent fashion. |
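The IPA-based sharing described above can be sketched as a simple pooling step. This is a toy Python illustration (the function and data layout are my own, not from the talk): training frames are grouped by IPA symbol with the language label dropped, so one language-independent model could then be trained per pool.

```python
from collections import defaultdict

def pool_by_ipa(samples):
    """Pool acoustic training samples across languages.

    samples: iterable of (language, ipa_symbol, feature_vector) tuples.
    Frames whose phones share an IPA symbol end up in one pool, so a
    single language-independent model can be trained per IPA unit.
    """
    pools = defaultdict(list)
    for lang, ipa, feats in samples:
        pools[ipa].append(feats)  # the language label is deliberately dropped
    return pools
```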
0:05:56 | One observation we had was, and this is a graph where we do this for twelve languages, so we share twelve languages based on IPA, |
0:06:03 | that this curve just keeps going straight up, and that was only twelve languages. |
0:06:15 | So my expectation was that this would keep growing: the more languages we see, the more diversity we get, and we are not anywhere near an asymptotic behavior of this curve. |
0:06:29 | Then, as we heard yesterday in these wonderful talks from Morgan and others about DNNs: what I am really happy about, or why I am excited about DNNs, is that for the very first time we see not only that they work well for bootstrapping new languages, |
0:06:51 | but also something I had never seen before in acoustic modeling. Whenever I shared across languages, I would get worse performance on the training languages; I never managed to get better performance on the languages used for training, I only got slightly better performance when I used the models for new languages. |
0:07:07 | Now, with DNNs, for the very first time, if you train a multilingual network, you can even outperform on the languages which are in the training set, |
0:07:18 | which makes me very happy, and I think it gives hope for using them for bootstrapping new languages as well. |
0:07:25 | The first thing I would like to talk about is what can be done if there are no transcriptions at all: you have a new language, you do have, let's say, a pronunciation dictionary and a good amount of speech, but you do not have any transcriptions. |
0:07:42 | For this experiment we took Czech, which does not mean that I consider Czech to be a low-resource language. The reason we took Czech is two things: first of all, the conference was in the Czech Republic, so we thought this was an appealing idea, and the other reason is that we had four other Slavic languages which are related to Czech, and we wanted to figure out whether we can gain from having related languages. |
0:08:05 | So we had two sets of experiments. We took four Slavic languages, Croatian, Russian, Bulgarian, and Polish, in the hope that the relation might help for bootstrapping Czech. |
0:08:19 | Then we took another scenario, which is probably more realistic, namely that you already have ASR systems in resource-rich languages, that is, English, French, German, and Spanish. |
0:08:34 | We basically said, okay: we have given speech, we have given a dictionary, and we have given a language model in Czech, but we are lacking the transcripts. |
0:08:43 | The idea is basically that we wanted to leverage knowledge which we had already gained in many other languages. |
0:08:50 | The idea goes as follows. We take the source languages, in this case Polish, Croatian, Bulgarian, and Russian, and we take the recognizer we have in each of these languages. |
0:09:01 | Now we take our target Czech dictionary and language model, which we assumed to be given, and we use the dictionary to express the Czech dictionary in terms of Polish phones. |
0:09:13 | It sounds funny, but it basically allows us to use the context-dependent acoustic models from our Polish ASR system. |
0:09:21 | We do the very same thing for Croatian, Bulgarian, and Russian, so we basically create four different dictionaries, each expressed in the phone set of the respective source language. |
0:09:32 | So it is still Czech words, but expressed in Croatian phones, in Russian phones, and so on. |
0:09:38 | Now we can run the Polish recognizer on the Czech data and it gives us Czech words; likewise we have the Russian dictionary with the Russian ASR, the Polish, and so on, so basically we get four hypotheses of the Czech words. |
0:09:54 | Then we leverage a confidence score which had been proposed by Thomas Kemp and Thomas Schaaf, called "stabil"; our multilingual variant is called A-stabil. What we do is basically calculate how often the same word appears in the hypotheses from the different recognizers: the Polish recognizer delivers a Czech hypothesis, and so do the Croatian, the Bulgarian, and the Russian. |
0:10:19 | Whenever two or more languages agree on the same word, we assume this word to be correctly identified, and it can then be used for training. |
0:10:29 | So the whole idea is that we let the models of the different languages vote. |
0:10:37 | The reason why we feel this works quite well is the following. If you take the usual confidence score and apply it with basically one language, we know that the confidence score is rather unreliable: this plot shows the word error rate over the confidence score threshold, and if you have just one language and you start out with very poor performance, then even a very good confidence score is not very reliable. |
0:11:05 | But if you do this for several languages, then if at least two languages agree on the same words, the same islands in the hypotheses, you get a much more reliable behavior of the confidence score. |
0:11:22 | And we learned that a threshold of roughly one divided by n, plus an offset, where n is the number of languages applied, is a good value. |
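As a rough illustration of this voting idea, here is a word-level simplification in Python; the real A-stabil score works on time-aligned hypotheses, and the names below are invented for the sketch.

```python
from collections import Counter

def cross_lingual_confidence(hypotheses):
    """Score each word by the fraction of recognizers whose hypothesis
    contains it (a word-level stand-in for the A-stabil idea)."""
    n = len(hypotheses)
    votes = Counter()
    for hyp in hypotheses:
        votes.update(set(hyp))  # each recognizer votes at most once per word
    return {word: count / n for word, count in votes.items()}

def select_reliable_words(hypotheses, offset=0.1):
    """Keep words whose agreement exceeds the 1/n + offset threshold
    mentioned in the talk."""
    threshold = 1.0 / len(hypotheses) + offset
    confidences = cross_lingual_confidence(hypotheses)
    return {word for word, c in confidences.items() if c > threshold}
```

With four source recognizers, the threshold is 0.25 + 0.1 = 0.35, so any word that at least two of the four recognizers agree on passes the filter.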
0:11:32 | The whole framework then works as follows. You create your source recognizers with the mapped dictionaries, you run them on your Czech audio data to produce transcriptions, and once you have the transcriptions, you use the multilingual A-stabil to find the matching parts of the transcriptions. |
0:11:54 | Then you adapt the acoustic model of each source recognizer towards the target, using the transcriptions you just derived: the Polish recognizer gets tuned towards Czech, the Russian recognizer gets tuned towards Czech, the Croatian one, and so on, so everything basically gets tuned in to Czech. |
0:12:13 | Then you throw everything else away, keep only the adapted acoustic models, and iterate. We basically iterate the whole process, always throwing the transcriptions away, |
0:12:25 | and once we have a certain amount of data transcribed automatically this way, we use a multilingual phone inventory to bootstrap the Czech system. |
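The iterative framework just described could be sketched roughly like this; the recognizer objects and the agreement filter here are hypothetical stand-ins, not the actual systems from the talk.

```python
def unsupervised_bootstrapping(source_recognizers, target_audio,
                               select_reliable, n_iterations=3):
    """Iterate: decode, filter by cross-lingual agreement, adapt.

    source_recognizers: objects with .decode(audio) and .adapt(audio, labels)
    target_audio:       untranscribed target-language speech
    select_reliable:    the multilingual agreement filter
    """
    reliable = set()
    for _ in range(n_iterations):
        # 1. Each source recognizer decodes the untranscribed target audio.
        hypotheses = [asr.decode(target_audio) for asr in source_recognizers]
        # 2. Keep only the words on which enough recognizers agree.
        reliable = select_reliable(hypotheses)
        # 3. Adapt every source acoustic model towards the target language,
        #    throw the transcriptions away, and repeat.
        for asr in source_recognizers:
            asr.adapt(target_audio, reliable)
    # The final reliable transcriptions seed the target-language system.
    return reliable
```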
0:12:36 | We were quite happy with the performance we saw. What you see here is basically the first iteration, the second, the third, then the bootstrapped system, and the final step after iterating one more time, |
0:12:49 | and this is the performance you get with the Bulgarian, the Croatian, the Polish, and the Russian acoustic models. |
0:12:58 | The take-away message is: if you use the related languages, the best system we can come up with has a performance of 22.7 percent, |
0:13:07 | which is very close to the baseline we have if we assume that twenty hours of transcriptions are given. |
0:13:15 | So this is the supervised case, and this is the fully unsupervised case for the related setup, the good case where you have related languages. |
0:13:24 | This one here is the one for the non-related, resource-rich languages, and the best number we got with those is 23.3. |
0:13:33 | So we are still within range, even though we are not as close as in the related case, but I believe the difference is marginal. |
0:13:42 | You may wonder about two things. One is: does this work for conversational speech? This here is more planned speech, and this is something we are currently trying; we are in the process of doing this for conversational speech. |
0:13:59 | The other question you may have is why all the error rates are so low; please keep in mind that we only have twenty-three hours of training data in this setup, and a rather low perplexity, because it is newspaper articles. |
0:14:13 | We of course wondered whether we would gain something by increasing the number of source languages, so we did the same thing, this time for Vietnamese, and we looked at the difference between using two, four, and six source languages. |
0:14:29 | What we discovered over the iterations is the following: what you see in the bars is the amount of data we could extract for transcription, and what you see in the curves is the quality of the transcriptions. |
0:14:43 | The quality of the transcriptions seems to go slightly up if we use more languages, but above all we can derive more transcription data, so the amount of data we can extract is larger. |
0:14:58 | What we found, if you look at the performance here, is that finally, with six languages, we get 16.8 on Vietnamese, which compares to 14.3 if we build the system in a supervised fashion. |
0:15:13 | The comparison here is: if we have an expert who is a native speaker and knows how to build systems, he or she can do something in the range of 11.8. So the gain you get by doing language-specific things is enormous, I would say, and significantly larger than the gap you have between supervised and unsupervised. |
0:15:37 | So if you have the choice of having somebody who knows something about the language and, for example, does tuning and tweaking, like tone modeling for Vietnamese, pitch features, using multi-syllables and so on, that gives larger gains than the discrepancy between transcribed and untranscribed data. |
0:15:57 | Okay, so the second thing I would like to discuss, after what we do without transcriptions, is what you do if you have no pronunciation dictionaries. |
0:16:09 | Again, the idea is to leverage what we already have from other languages: we assume we have seen a lot of scripts and a lot of pronunciation rules, and we would now like to generate the dictionary in a new language. |
0:16:24 | Let me first talk briefly about writing systems. First of all, of course, we might wonder how many languages are written at all; does it, for example, make sense to look into languages which do not have a writing system? |
0:16:38 | If you consider Omniglot, which is a site maintained by Simon Ager, it lists about seven hundred fifty languages which have a script, and he thinks that the true number of languages with a script is close to two thousand, |
0:16:54 | which basically means that the majority of the other six thousand languages do not have a written form. So this seems to be a rather serious issue. |
0:17:05 | If you look at the writing systems, there are different types. There are the logographic ones, where the characters of the writing system are based on semantic units: the graphemes represent meaning rather than sound. A typical example is Chinese Hanzi. |
0:17:25 | Then there are scripts which are called phonographic, where the graphemes represent sounds rather than meaning. |
0:17:34 | There are different forms of phonographic scripts: in segmental ones, one grapheme roughly corresponds to one sound, which is very convenient for building dictionaries; in the syllabic case, a grapheme may represent an entire syllable; |
0:17:50 | and then there are featural ones, like Korean, where one grapheme may represent an articulatory feature. |
0:17:58 | But if you look at the world map, most parts of the world are green, which means they have scripts which are Roman or Latin scripts, and another big chunk here is the one with a Cyrillic script. Both are phonographic segmental scripts, so this seems very nice for pronunciation dictionary generation, and these are the ones we looked at first. |
0:18:25 | So we assume that there is a correlation, a relationship, between graphemes and sounds, which we call grapheme-to-phoneme or letter-to-sound. |
0:18:35 | But we also know that there are languages which have a very close grapheme-to-phoneme relationship, and then there are others which, like English, are more pathological. Sorry, I don't want to offend anybody. |
0:18:51 | Okay, so of course the first thing we wanted to look at is how much data we need to create a pronunciation dictionary. |
0:18:59 | What you see here, for ten GlobalPhone dictionaries, is the phoneme error rate on the y-axis, and on the x-axis the number of phonemes. |
0:19:09 | What this means is: we took pronunciations we had in our dictionary, we trained a G2P model on them based on Sequitur, which is from Maximilian Bisani, and then we used the G2P to generate more entries for the dictionary. |
0:19:27 | Then we compared the generated entries to the true entries in the dictionary, and that gave us the phoneme error rate. |
0:19:34 | So if you have, for example, 5K, five thousand, phonemes, that means we took roughly one thousand word pronunciations, so you have five thousand phonemes which you can learn on. |
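The phoneme error rate used here is the edit distance between a generated pronunciation and the reference dictionary entry, normalized by the reference length. A minimal sketch:

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences, normalized by
    the reference length: the metric used to score generated
    pronunciations against the true dictionary entries."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)
        prev = cur
    return prev[n] / m
```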
0:19:51 | What we learned from this graph is, first of all, that G2P quality depends on the language; that is not really surprising, we knew this already. We have this bunch down here, these are the Slavic languages, so you guys are really lucky, and Spanish is down there as well; then we have Portuguese, and this one is German, and the worst case in our scenario here is English. |
0:20:12 | So there is a strong dependency on the language. The other thing we learned is that if you have a close G2P relationship, then 5K phonemes might already be enough to build a decent G2P model, and it saturates roughly after 15K examples. |
0:20:31 | But we also see that a language like German, for example, needs six times more examples than a language like Portuguese. So depending on the relationship, you have to work harder to get examples in order to build a good G2P model. |
0:20:46 | Now, where do you get those examples from? Tim Schlippe, who did this work, basically looked into Wiktionary, which is a resource where you find pronunciations put in by volunteers in a given language, and he looked at how many languages are already covered in Wiktionary with at least 1K pronunciations, that means one thousand words, which roughly corresponds to five thousand phonemes. |
0:21:17 | We found thirty-seven languages which were covered; that was in 2011, when he did this. |
0:21:24 | And if you look at this one, which shows the growth of Wiktionary entries over several years, so this is 2010, this is 2011, there seems to be a lot going on. The community seems to have discovered it and keeps working in more words, so the hope would be that Wiktionary in the future covers both more languages and more words. |
0:21:49 | But now, of course, the question is how useful this is. So we built an interface where you can basically upload a vocabulary file, and it will go away and search Wiktionary for whether it finds the entries; it then uses all the entries it finds to train a G2P model, and then it basically generates whatever you want to generate. |
0:22:12 | It goes into the pages and looks for the IPA-based pronunciations. There is a lot of cleaning you have to do, because the pronunciations do not always refer to the word you are looking for: sometimes you look for one word and you get the IPA of a different word. So you have to do a lot of checking to make sure that you are not learning the wrong things. |
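A minimal sketch of the extraction step, assuming pronunciations appear in a `{{IPA|<lang>|/.../}}` template; real Wiktionary markup varies considerably, which is exactly where the cleaning effort mentioned above goes.

```python
import re

# Matches templates of the (assumed) form {{IPA|<lang code>|/<ipa>/...}}
IPA_TEMPLATE = re.compile(r"\{\{IPA\|([a-z-]+)\|/([^/]+)/")

def extract_pronunciations(wikitext, lang):
    """Return all IPA strings tagged with the requested language code,
    ignoring pronunciations that belong to other languages on the page."""
    return [ipa for code, ipa in IPA_TEMPLATE.findall(wikitext)
            if code == lang]
```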
0:22:35 | That is one of the reasons why the Wiktionary-based G2P performs worse than if you start from a good dictionary. This plot compares, for six languages, the G2P performance based on what we learned from Wiktionary to what we already had in GlobalPhone, |
0:22:54 | and you can see, if you compare the same colors, that there are some significant distances. |
0:23:01 | This is probably due, for one, to the errors left in what we found in Wiktionary; the other thing is that there may be many contributors putting pronunciations into Wiktionary, which means you might have inconsistencies, and those are always an issue for G2P models. |
0:23:22 | Okay. If you have nothing, but you have a related language, you can also take a very brute-force, very simple approach, and that is what we tried with Ukrainian. We worked on Ukrainian, we had a student who was collecting data, and we said: okay, we already have Russian, Bulgarian, German, English, we have many languages, why don't you try to build the dictionary from what we already have? |
0:23:46 | It is a very simple approach with basically four steps. In the first step, you do a grapheme-to-grapheme mapping: you have Russian, and you map the Ukrainian graphemes to Russian graphemes. |
0:24:00 | Once you have done this, in the second step you can basically apply the Russian grapheme-to-phoneme model. |
0:24:07 | Then you have Russian phonemes, and you have to map them back with, so to speak, a phoneme-to-phoneme Ukrainian mapping, so you see there is a lot of mapping going on, |
0:24:16 | and then you do some post-processing. |
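The four mapping steps can be sketched as a small pipeline; the mapping tables and the toy G2P in the example are invented placeholders, not the actual Ukrainian and Russian rules from the talk.

```python
def build_pronunciation(word, g2g, source_g2p, p2p, postprocess):
    """Cross-lingual dictionary generation in four steps."""
    # Step 1: map target-language graphemes to source-language graphemes.
    source_spelling = "".join(g2g.get(ch, ch) for ch in word)
    # Step 2: apply the source-language grapheme-to-phoneme model.
    source_phones = source_g2p(source_spelling)
    # Step 3: map source phonemes back to target-language phonemes.
    target_phones = [p2p.get(p, p) for p in source_phones]
    # Step 4: language-specific post-processing rules.
    return postprocess(target_phones)
```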
0:24:18 | If you do this for the different source languages, and let's look at Russian for example, he would first map the Ukrainian graphemes to Russian graphemes, which amounted to forty-three rules; so we had to come up with forty-three rules. |
0:24:35 | Then he would run the G2P, and after that create fifty-six rules to convert the Russian phonemes back to Ukrainian phonemes, and then do some post-processing, which required another fifty-seven rules. So altogether we had to create about one hundred sixty rules, which can be done within a day. |
0:24:57 | What comes out at the end is the dictionary, and when we plug this into the Ukrainian ASR, it gives a 21.6 percent error rate. |
0:25:07 | If we use the simplest approach, where we just use the graphemes for ASR, we get something like 23.8 percent. |
0:25:15 | If we ask the student to sit down and build rules in order to create the dictionary by hand, so basically hand-crafted rules, which piled up to about eight hundred rules, then he gets 22.4. In the process we wondered why the Russian-based one is better than the hand-crafted one, and we found that we had some issues with our rules; we fixed the rules and then basically ended up with the same performance. |
0:25:43 | So that means: if you do not have much time and you want to reduce the number of rules an expert has to write, then the crude method is probably an option. |
0:25:56 | But that of course depends on having a reasonably related language, because if you do it with non-related languages, you get significantly worse results; if we do it with English or German, it does not pan out. |
0:26:11 | Okay, so I have fifteen minutes. |
0:26:14 | The other thing you can do if you do not have pronunciations is ask the crowd. We created a small toolkit which basically gives you a keyboard, and on the keyboard you have the IPA sounds. You have to get used to it, but you can press a key and listen to how it sounds. |
0:26:32 | Then you get a word, and you are asked to produce the pronunciation of this word, and once you are done, you can go to the next word. |
0:26:41 | We basically fed this into Mechanical Turk and had it run for twelve days to find out whether this would give good results or not. |
0:26:53 | so the first thing we learned, so we ran it for twelve days and we
---|
0:26:57 | then
---|
0:26:59 | looked at how long people spend
---|
0:27:01 | building pronunciations
---|
0:27:03 | the average time was fifty three seconds
---|
0:27:06 | most of the work, so the majority, we basically had to reject
---|
0:27:10 | and we had
---|
0:27:12 | out of these nineteen hundred pronunciations we got, more than fifty percent were not
---|
0:27:18 | really useful, because what we found out is people would game the
---|
0:27:23 | task, they would basically
---|
0:27:25 | spend a second giving one phoneme
---|
0:27:28 | and so people were very fast but very sloppy, and we couldn't really
---|
0:27:33 | we didn't find a good way to
---|
0:27:36 | basically have incentives to provide very good answers. so doing this,
---|
0:27:41 | i mean we did this in parallel, with many
---|
0:27:44 | many people working on the same words, in order to find out whether
---|
0:27:48 | we get good pronunciations, but it didn't really
---|
0:27:51 | pan out nicely
---|
0:27:53 | so the second thing we did is we reached out to our friends and
---|
0:27:57 | volunteers. we worked harder to improve the interface, we had a nice welcome page
---|
0:28:02 | and tutorial
---|
0:28:03 | we gave the people quality feedback so that they basically have an incentive to work
---|
0:28:07 | harder, and we also basically put it out as a game so people would
---|
0:28:12 | compete with each other
---|
0:28:14 | in a friendly manner
---|
0:28:15 | so one thing we found is that in the beginning they would spend a
---|
0:28:19 | lot of time to get the words right, so they would spend six minutes on
---|
0:28:22 | one word, which i found quite high
---|
0:28:24 | and then finally they would be down to one and a half minutes
---|
0:28:28 | roughly per word
---|
0:28:31 | and you can see here the amount of users, so it's really hard
---|
0:28:35 | to keep the users on track, but certainly the median time per word
---|
0:28:40 | would drop significantly
---|
0:28:42 | and also the errors which we found per session would go down
---|
0:28:48 | but it's hard to keep
---|
0:28:50 | the people in the game, so getting really massive amounts of pronunciations out of this with
---|
0:28:55 | crowdsourcing proved rather challenging
---|
0:28:58 | okay, so now i would like to talk about an approach which we work on
---|
0:29:04 | if there is no writing system at all
---|
0:29:08 | so that means you would like to build a system for languages which have
---|
0:29:14 | no written form
---|
0:29:16 | so this is
---|
0:29:17 | sort of how we think of it: you have nothing at all
---|
0:29:21 | we assume that klingon is not written, even if that's not true, and you
---|
0:29:24 | can ask somebody to speak for you, like a sentence in klingon, and so you
---|
0:29:30 | can simply say "i'm sick" and he would say "jIrop", i don't
---|
0:29:34 | i don't speak klingon but maybe people do
---|
0:29:37 | and you can ask him "i'm healthy" and he would also produce a phrase
---|
0:29:40 | and what people do subconsciously is we would try to identify sounds
---|
0:29:47 | and we would perceive this as a sequence of phones
---|
0:29:52 | and then, let's say, if he says "jIrop" and
---|
0:29:57 | "jIpIv", then we would probably assume that "jI" might be something like "i", because
---|
0:30:01 | this is what was
---|
0:30:03 | the repeating part in what we asked him to say
---|
0:30:07 | and then we could basically derive that "jI" means "i", and "rop" seems to be
---|
0:30:11 | the word for sick and
---|
0:30:13 | "pIv" seems to be the word for healthy
---|
0:30:16 | and so the idea would be to
---|
0:30:18 | recreate this process that we're doing in our minds, that we put this on
---|
0:30:24 | the machine
---|
0:30:25 | and there has already been substantial work done. the first one probably that
---|
0:30:31 | i found was work from laurent besacier
---|
0:30:34 | so he was using a monolingual unsupervised segmentation in order to segment the phone sequences
---|
0:30:39 | into words
---|
0:30:41 | and then there was the work from, there is the work from sebastian stüker and
---|
0:30:44 | alex waibel, they were using a cross lingual word to word alignment based
---|
0:30:50 | on giza++
---|
0:30:52 | and then in a combination work they basically combined the monolingual and the giza++
---|
0:30:56 | approach
---|
0:30:57 | in order to find the phoneme to word alignments
---|
0:31:02 | and so felix stahlberg in our lab, he started using a cross lingual word
---|
0:31:08 | to phoneme alignment, and for that he improved, we extended the
---|
0:31:12 | giza++ approach, but let me first roughly explain how it works
---|
0:31:18 | so you basically have two sentences. you have a source language, in this case it's
---|
0:31:22 | german
---|
0:31:24 | and a written sentence in that language
---|
0:31:27 | and then you have the english phone sequence which you happen to have derived from what
---|
0:31:33 | the person was saying, so you have the english words as phones
---|
0:31:37 | for you
---|
0:31:38 | i mean ideally so, and you have these little word boundaries already in there, but
---|
0:31:42 | of course if you create a phone sequence with a phone recognizer instead
---|
0:31:48 | you might have a lot of errors here
---|
0:31:51 | and of course you don't have any word boundaries
---|
0:31:53 | and the question now is how can you
---|
0:31:56 | get the alignment between the words and the corresponding phones
---|
0:32:02 | and one of the issues, okay, so what felix did is he extended the
---|
0:32:07 | giza++ model, or rather he's using the ibm model three
---|
0:32:12 | and he
---|
0:32:14 | basically
---|
0:32:15 | changed it such that it can be applied for the different word lengths, so he
---|
0:32:20 | was putting in a word length probability
---|
0:32:22 | in order to counterbalance the fact, in word to phoneme alignments, that
---|
0:32:27 | you have way more phonemes
---|
0:32:28 | to be aligned to one word
---|
0:32:31 | so the model he came up with basically skips the fertility, it uses the
---|
0:32:37 | distortion, but then puts in the word length
---|
0:32:41 | and basically then the words are sort of placeholders, and then via the lexical
---|
0:32:46 | translation they would be translated to the phone sequence
---|
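As a rough, hypothetical sketch of the word-length idea described above (not Felix's actual implementation): an alignment score can combine a lexical translation term, a distortion term, and a Poisson word-length term that penalizes segments whose phone count is far from the expected word length. All names and probability values below are toy assumptions.

```python
import math

def poisson(k, lam):
    """Poisson probability mass, used here as a toy word-length model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def alignment_score(segments, translation, distortion, avg_len):
    """Score one candidate segmentation/alignment of a phone sequence.

    segments:    list of (source_word, phone_tuple) pairs
    translation: toy lexical translation probs, translation[word][phone_tuple]
    distortion:  toy distortion probs per segment position
    avg_len:     expected number of phones per word (word-length model)
    """
    score = 1.0
    for i, (word, phones) in enumerate(segments):
        score *= translation.get(word, {}).get(tuple(phones), 1e-6)  # lexical translation
        score *= distortion.get(i, 0.1)                              # distortion
        score *= poisson(len(phones), avg_len)                       # word length
    return score

# Toy example: one Klingon-style word aligned to five phones
translation = {"jIrop": {("jh", "ih", "r", "aa", "p"): 0.5}}
distortion = {0: 0.5}
s = alignment_score([("jIrop", ("jh", "ih", "r", "aa", "p"))],
                    translation, distortion, avg_len=4.0)
```

In a real decoder one would search over all segmentations of the phone sequence for the highest-scoring one; this sketch only evaluates a single candidate.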
0:32:51 | and what he found, so the first results he did on the btec english spanish data
---|
0:32:56 | so what he found is that he compared the giza++ word to phoneme alignment
---|
0:33:01 | to his model
---|
0:33:03 | three P alignment
---|
0:33:04 | and you can see that consistently
---|
0:33:06 | independent of the phoneme error rate of the original phone sequence
---|
0:33:11 | he could outperform the giza++ word to phoneme alignment
---|
0:33:15 | and he could also, in terms of f-scores, he could also outperform
---|
0:33:21 | the monolingual one
---|
0:33:24 | so this is on the pair from english to spanish and this is the
---|
0:33:27 | pair from spanish to english
---|
0:33:29 | and an f-score here of eighty percent for example means that he could get
---|
0:33:34 | the segmentation, the word boundaries in the phone sequences, right to
---|
0:33:38 | about ninety percent accuracy, which is quite good
---|
0:33:44 | so it's not that this would be solved
---|
0:33:47 | i mean we're not anywhere near that, but if we do have
---|
0:33:52 | a correspondence between the phones and the
---|
0:33:56 | words
---|
0:33:57 | we could now think of the next step
---|
0:33:59 | the next step would be to take the phone sequences and generate a pronunciation
---|
0:34:05 | dictionary
---|
0:34:06 | and now we run into this problem which lori already mentioned in her talk
---|
0:34:09 | that some words might have different meanings
---|
0:34:12 | so for example the word "sprache" in german can mean both speech and language
---|
0:34:18 | so what will happen is you will have a lot of different pronunciations, and some
---|
0:34:22 | will sound like "speech" even though they might be, you know,
---|
0:34:26 | variants
---|
0:34:27 | but others are significantly different, and those are the "language" words
---|
0:34:30 | so the first thing we did is we clustered
---|
0:34:34 | all the
---|
0:34:35 | occurring phone sequences into the different
---|
0:34:38 | word meanings
---|
0:34:39 | and we used the dbscan clusterer for that. so if
---|
0:34:43 | we're lucky we end up with one cluster which represents everything which goes
---|
0:34:48 | to "language"
---|
0:34:49 | and the other cluster represents everything which goes to "speech"
---|
0:34:53 | and then if you have several words which sound like "speech", then what felix was
---|
0:34:57 | using was an n-best selection in order to
---|
0:35:01 | basically generate one result for "speech"
---|
0:35:05 | and then he would put those results into the dictionary, so
---|
0:35:10 | since we don't have
---|
0:35:11 | a really written form, we cannot give it word labels, so in
---|
0:35:16 | our case we just used numbers
---|
0:35:19 | so there is one word which gets the id one and has the pronunciation
---|
0:35:23 | "language"
---|
0:35:24 | and there's another word which gets the id two and has the pronunciation "speech"
---|
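A minimal sketch of this clustering-plus-selection step, assuming edit distance between phone sequences and a tiny pure-Python DBSCAN; the real system used a full DBSCAN implementation with tuned parameters, and the sample pronunciations below are invented.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def dbscan(seqs, eps=1, min_pts=2):
    """Tiny DBSCAN over phone sequences: returns one cluster label per
    sequence, -1 for noise. Distance is edit distance."""
    labels = [None] * len(seqs)
    cluster = 0
    for i in range(len(seqs)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(seqs)) if edit_distance(seqs[i], seqs[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:                # expand the cluster from its core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: claim it, don't expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            n2 = [k for k in range(len(seqs)) if edit_distance(seqs[j], seqs[k]) <= eps]
            if len(n2) >= min_pts:
                queue.extend(n2)
        cluster += 1
    return labels

def representative(seqs, labels, cluster):
    """Most frequent sequence in a cluster (n-best-style selection)."""
    return Counter(s for s, l in zip(seqs, labels) if l == cluster).most_common(1)[0][0]
```

Per word, each cluster then contributes one representative pronunciation to the dictionary, keyed by a numeric word id as described above.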
0:35:30 | okay, and so we tried this out, not with btec, but we moved to
---|
0:35:35 | pronunciation extraction on bible data, because felix found a very nice page
---|
0:35:42 | bible gateway, which provides
---|
0:35:46 | translations of the bible in many different languages, and he took fifteen
---|
0:35:50 | translations in ten languages, and they even provide audio in english, portuguese and some other
---|
0:35:56 | languages
---|
0:35:57 | in order to play around with
---|
0:35:59 | and so what felix did, sorry, he basically verse aligned the bible texts, so he
---|
0:36:05 | extracted more than thirty thousand
---|
0:36:07 | verse alignments for training
---|
0:36:08 | his approach
---|
0:36:10 | and so what you see here is basically the distribution of the absolute phoneme errors
---|
0:36:15 | in the extracted pronunciations
---|
0:36:18 | so basically he compared the extracted pronunciations to the real word pronunciations
---|
0:36:23 | and so for example this bar here means that
---|
0:36:27 | if you use the spanish bible
---|
0:36:31 | to create english pronunciations
---|
0:36:33 | there are about three thousand nine hundred words
---|
0:36:36 | out of the fourteen thousand which were extracted
---|
0:36:40 | and these three thousand nine hundred words have no phoneme error, so they really correspond
---|
0:36:45 | exactly to what
---|
0:36:46 | had been found in the dictionary
---|
0:36:50 | so in total the dictionary had something like fourteen thousand words. so this is
---|
0:36:54 | the, so this is the english one, one of the english bible translations
---|
0:36:59 | and this one gives you
---|
0:37:01 | the distributions
---|
0:37:02 | of the phoneme errors across all the different languages he started from to extract the
---|
0:37:07 | pronunciations
---|
0:37:10 | one thing, so what we did in this case is we assumed
---|
0:37:14 | in a first step, we assumed that the target phone sequence
---|
0:37:18 | is the canonical pronunciation. so that means this has not been done with a phone
---|
0:37:22 | recognizer on audio, but it has been done
---|
0:37:25 | under the assumption that the phone
---|
0:37:27 | sequence is correct
---|
0:37:28 | only what was removed was the word boundaries. and so the next step is now
---|
0:37:32 | to basically do this with a real phone recognizer and see how well it will
---|
0:37:38 | carry over
---|
0:37:39 | okay, so then the next step, but we haven't done that yet
---|
0:37:43 | is that we now have transcribed audio which is basically transcribed in word I
---|
0:37:48 | Ds
---|
0:37:49 | we could create a dictionary based on word I Ds or whatever you want to
---|
0:37:53 | take here, and then you have a language model, and then we could put all
---|
0:37:57 | these things together to build a speech recognition system based on data which has not
---|
0:38:02 | been transcribed
---|
0:38:04 | but this last step we haven't done yet
---|
0:38:07 | so i have four minutes left, three minutes left, okay. so the third one i'll run
---|
0:38:13 | through really quickly. here the question is what
---|
0:38:16 | could be done if there are not many linguistic experts, and i think everybody has
---|
0:38:21 | seen this problem: here she wants to build a
---|
0:38:24 | system in a language and there's simply nobody there, no coach who could help
---|
0:38:28 | lori was just talking about finding a korean speaker for transcribing, right
---|
0:38:32 | and so |
---|
0:38:34 | so what we did
---|
0:38:36 | the work started several years ago, it was an nsf project actually that
---|
0:38:40 | was funded by mary harper
---|
0:38:43 | and i did it together with alan black. the idea was
---|
0:38:46 | how can you bridge the gap between the technology experts and the language experts, and so
---|
0:38:51 | we built a lot of web based tools
---|
0:38:54 | and the web based tools allow somebody who doesn't know anything, or not very much
---|
0:38:59 | about speech recognition or tts
---|
0:39:02 | it allows this person to
---|
0:39:04 | work
---|
0:39:05 | on the language for him or herself. so you basically have hand rails
---|
0:39:09 | which tell you what to do as the next step
---|
0:39:12 | in order to build a system, and the person can go step by step through
---|
0:39:16 | it to build the system
---|
0:39:17 | and we have done
---|
0:39:19 | quite substantial work in the past, so the system is
---|
0:39:22 | every year it is used in a seminar, which is a multilingual seminar also
---|
0:39:27 | across cmu and kit
---|
0:39:30 | and we always adopt the languages of the students who take the course, so
---|
0:39:34 | we have really done a lot of systems, asr and tts systems, in
---|
0:39:39 | the respective languages
---|
0:39:40 | and whenever we learn something we plug this into
---|
0:39:43 | the tool
---|
0:39:45 | and we are using this for education, but we are also using this for
---|
0:39:50 | purposes to build up resources and reach out
---|
0:39:53 | to people if we don't have them at hand
---|
0:39:55 | for example we had one
---|
0:39:58 | student
---|
0:39:58 | who one time wanted to build something in konkani, and there was only a single speaker
---|
0:40:02 | of konkani at cmu, so he was basically sending the web page
---|
0:40:07 | to his people back in india and got a lot of speech and work out
---|
0:40:11 | of this
---|
0:40:12 | by
---|
0:40:12 | being able to do it virtually
---|
0:40:15 | okay, so i think i
---|
0:40:17 | i should close here. so i wanted to basically show you some
---|
0:40:23 | solutions we are working on
---|
0:40:24 | if we have few resources, and i tried to give one example each for if we have no
---|
0:40:28 | transcripts, if we have no
---|
0:40:31 | pronunciations, or if we have no
---|
0:40:33 | writing system at all
---|
0:40:35 | and i also wanted to stress that we keep working on our rapid language adaptation
---|
0:40:40 | toolkit
---|
0:40:41 | in order to leverage fieldwork and also be able to do outreach
---|
0:40:46 | to the community
---|
0:40:48 | in order to then finally bridge the gap between those people who know the technology
---|
0:40:52 | and those people who know the language
---|
0:40:54 | thank you very much |
---|
0:41:04 | we have time for a few questions
---|
0:41:17 | i just had a clarification question about your first
---|
0:41:21 | study where you had no transcriptions
---|
0:41:24 | so how do you obtain the pronunciation dictionary and the other models
---|
0:41:30 | that you use
---|
0:41:30 | okay, let me just go back. so the idea is that you have
---|
0:41:35 | a dictionary in the target language, czech. so the idea would be you have
---|
0:41:39 | a czech dictionary given in the czech phone set
---|
0:41:43 | but the source recognizer you want to run on the data is in, let's say, polish, so
---|
0:41:48 | what you need to do is you need to take the polish phones
---|
0:41:52 | and replace the czech phones in the dictionary with the polish phones
---|
0:41:56 | so you basically map them in the czech dictionary
---|
0:41:59 | so you get the polish variant of a czech dictionary
---|
0:42:06 | oh, we do, i forgot that, i'm sorry. so we are using, through global phone
---|
0:42:10 | everything is expressed in ipa, so doing a mapping is rather straightforward
---|
0:42:14 | like you can do this with fifty mapping rules i would say, nothing too
---|
0:42:18 | fancy, because
---|
0:42:19 | the context
---|
0:42:20 | cannot be used this way, but in the later iterations the
---|
0:42:24 | acoustic models would be MAP adapted anyway
---|
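The mapping described in this answer can be sketched roughly like this; the two rules below are a hypothetical fragment, not the actual fifty GlobalPhone rules, and the sample word is only for illustration.

```python
# Hypothetical fragment of an IPA-based Czech -> Polish phone mapping;
# the real system used on the order of fifty such rules.
CZ_TO_PL = {
    "ɦ": "x",    # Czech voiced h -> closest Polish phone
    "r̝": "ʐ",    # Czech ř       -> Polish ż-like fricative
}

def map_pron(phones):
    """Rewrite one Czech pronunciation in the Polish phone set;
    phones shared between the inventories pass through unchanged."""
    return [CZ_TO_PL.get(p, p) for p in phones]

def map_dictionary(cz_dict):
    """Produce the 'polish variant of the czech dictionary'."""
    return {word: map_pron(pron) for word, pron in cz_dict.items()}

# e.g. a toy entry for Czech "hrad" (castle)
polish_style = map_dictionary({"hrad": ["ɦ", "r", "a", "t"]})
```

Because both inventories are expressed in IPA, the mapping is a simple symbol substitution, and any remaining mismatch is absorbed by the later acoustic model adaptation.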
0:42:31 | so
---|
0:42:32 | for the recognizer for languages with no writing system, actually the thing you
---|
0:42:36 | presented in the last part
---|
0:42:39 | what would be your recommendation to present the output? because if there is no writing
---|
0:42:43 | system, the output probably cannot
---|
0:42:47 | be, for the users, just I Ds, numerical ids, so what exactly would you do
---|
0:42:53 | so i think
---|
0:42:55 | if you have a system for a language which is not written at
---|
0:42:59 | all, tts might be a good approach, that you basically
---|
0:43:02 | speak the output to the people so that they can listen to it
---|
0:43:06 | so that would be one option, and then all you have to do is, if
---|
0:43:09 | you get word
---|
0:43:11 | number one, you basically do tts on the phones
---|
0:43:15 | and then the other thing, of course the simple straightforward one, is to apply some
---|
0:43:20 | sort of
---|
0:43:21 | phoneme to grapheme mapping to get it back, so
---|
0:43:25 | but that's
---|
0:43:27 | all that we could come up with. actually i don't like the number idea either
---|
0:43:30 | so we basically switched now to something with underscores so that you can at least
---|
0:43:34 | read it
---|
0:43:37 | you could potentially add translation
---|
0:43:39 | or understanding on top of it
---|
0:43:43 | as ways to use it
---|
0:43:44 | sure, i mean actually with what's out there, you have the word id
---|
0:43:49 | and the target word to source word alignment
---|
0:43:53 | you could put this into a translation system and then
---|
0:43:57 | translate it into a
---|
0:43:59 | language that's
---|
0:44:00 | written, correct
---|
0:44:01 | but i'm more interested in the asr side, as always
---|
0:44:10 | just one here: when you were doing the bible work and finding
---|
0:44:14 | the words and the error differences, did you actually count how many of
---|
0:44:18 | the words that are generated are not real words
---|
0:44:21 | "'cause" you do get a list, right, and i didn't really
---|
0:44:24 | get to see that
---|
0:44:25 | so one thing is, for the english case
---|
0:44:29 | you can see that the number of
---|
0:44:32 | generated words is roughly the number of
---|
0:44:34 | real words, so it's not overgenerating and it's not undergenerating
---|
0:44:38 | but you can also see swedish for example tends to undergenerate, and hungarian and czech
---|
0:44:42 | are overgenerating, and what we found is that
---|
0:44:45 | the generation is also related to how many words the source language has: swedish
---|
0:44:51 | just has very few and, say, czech has many, so they'll be undergenerating or over
---|
0:44:56 | generating
---|
0:44:57 | and of course the name of the game would be to sort of keep this in
---|
0:45:00 | balance, if we assume that it's roughly one on one
---|
0:45:04 | does that answer the question
---|
0:45:09 | one last no |
---|
0:45:10 | quick question |
---|
0:45:12 | you already asked |
---|
0:45:24 | could tell |
---|
0:45:27 | a practical question
---|
0:45:30 | so say you have an application and you want to build a
---|
0:45:35 | speech recognition based interface
---|
0:45:39 | looking at all of these numbers, can you advise me that
---|
0:45:43 | you know, what you're doing is the way to go, or should i go transcribe
---|
0:45:47 | audio and build the speech recognition the usual way, then
---|
0:45:52 | so for the rapid language adaptation server that's exactly what we were wondering, so we
---|
0:45:57 | are now doing error blaming on the components and try to advise the user
---|
0:46:02 | what's better: is it better to work on the pronunciations or is it better to work on
---|
0:46:06 | the audio
---|
0:46:07 | and for the first
---|
0:46:08 | one, our results are usually that it's better to work on
---|
0:46:12 | getting some audio data, so one hour at least
---|
0:46:15 | and then getting the pronunciations right, but it highly depends on how much data you
---|
0:46:20 | have and where you are, so
---|
0:46:22 | it's difficult to answer, but i think generating an automatic answer from the
---|
0:46:27 | error blaming to tell the user what to do next, how to invest
---|
0:46:32 | his or her time best
---|
0:46:33 | i think that's
---|
0:46:35 | the right way to go
---|
0:46:38 | a question for all three speakers this morning, for lori and for mary: i think
---|
0:46:45 | the big elephant in the room is getting textual data for languages that are
---|
0:46:50 | only spoken, and you touched on this a little bit
---|
0:46:54 | and if you don't have web data and only have conversational data that's recorded, making
---|
0:47:00 | the transcripts, i mean with the systems that we developed, really by
---|
0:47:04 | far the largest cost was always getting the text data and the translation
---|
0:47:09 | data
---|
0:47:10 | so acoustic modeling is by comparison relatively straightforward and cheap to do
---|
0:47:15 | so my question is, how do you, i mean the hero here in the room
---|
0:47:19 | is mary, who did this gigantic database of conversational data, and i'm wondering how, as
---|
0:47:25 | a community, we can make faster progress at that problem
---|
0:47:34 | i've never heard of a question to a speaker who hasn't spoken
---|
0:47:37 | i
---|
0:47:39 | well, maybe this is something to think about, a good question we can come back to
---|
0:47:47 | so let's bring that up again today
---|
0:47:50 | okay, let's thank tanja one more time
---|