0:00:15 | so good morning everyone |
---|
0:00:17 | i'm going to do a sort of |
---|
0:00:20 | historical |
---|
0:00:22 | passage over some of the work we did before |
---|
0:00:27 | and i hope this will link together and be easy to follow |
---|
0:00:30 | so basic only to be talking about semi supervised and unsupervised acoustic model training with |
---|
0:00:35 | limited linguistic resources |
---|
0:00:38 | and as most of us know there has been a |
---|
0:00:44 | lot of activity on this over the last decade or so of |
---|
0:00:48 | research |
---|
0:00:49 | and so i'm gonna talk about some experience we've had at limsi |
---|
0:00:52 | with lightly supervised and unsupervised training |
---|
0:00:56 | i'll give a couple of case studies |
---|
0:00:58 | and the first case study will be on english which we'll get to shortly |
---|
0:01:01 | then i'll talk |
---|
0:01:03 | very briefly about some different types of lexical units for modeling such as graphemic units versus |
---|
0:01:09 | phonemic units in babel |
---|
0:01:11 | and as already mentioned just briefly i added a slide on acoustic model interpolation because we're |
---|
0:01:16 | talking about how to deal with all this heterogeneous data |
---|
0:01:20 | and i'll finish with some comments |
---|
0:01:24 | so |
---|
0:01:27 | over the last |
---|
0:01:28 | decade or two we've seen a lot of advances in speech processing technologies a lot of the |
---|
0:01:33 | technologies are getting out there from industrial companies and they're kind of commonplace for |
---|
0:01:39 | a lot of people right now and so people expect that this stuff really |
---|
0:01:42 | works and i think |
---|
0:01:44 | it's great that we're seeing our work really get out there but |
---|
0:01:47 | people's expectations are really |
---|
0:01:49 | high and we still have problems our systems are pretty much developed for a |
---|
0:01:54 | given |
---|
0:01:55 | task and a given language |
---|
0:01:57 | and we still have a lot of work to do to get good performance on |
---|
0:02:00 | other tasks and languages we only cover a few tens maybe fifty or so |
---|
0:02:05 | languages now as a community |
---|
0:02:07 | and many times language variants are actually even considered different languages it's just easier to |
---|
0:02:13 | develop a system for each variant |
---|
0:02:15 | and we still rely on language resources a lot |
---|
0:02:18 | but over the last decade or two we've been reducing the reliance on human |
---|
0:02:23 | intervention so we can build systems |
---|
0:02:25 | with a little bit less human work |
---|
0:02:28 | so i guess this is sort of just |
---|
0:02:31 | everybody knows this or maybe not if there are some people here that aren't working |
---|
0:02:35 | on speech recognition our holy grail is still technology that works |
---|
0:02:40 | on anything that's independent of the speaker and of the task |
---|
0:02:43 | where noise is no problem and changing your microphone is no problem |
---|
0:02:46 | and i guess some say maybe fortunately for us researchers this remains |
---|
0:02:51 | the dream for us |
---|
0:02:53 | but we do have |
---|
0:02:55 | a lot lower error rates than we had a decade or two ago |
---|
0:02:58 | we can process many more types of data with different speaking styles and different conditions |
---|
0:03:03 | originally the work that we were doing always required read speech who needs |
---|
0:03:08 | to recognise something that was read from a text it doesn't |
---|
0:03:11 | seem logical now looking back at it |
---|
0:03:13 | we cover more languages and we have a fair amount of work on enriching the output |
---|
0:03:19 | to make the transcripts more usable by humans or machines which is not |
---|
0:03:23 | exactly the same thing you might want different information if you're doing downstream processing |
---|
0:03:27 | by machines |
---|
0:03:28 | versus if you're producing it for humans to read |
---|
0:03:32 | so what's a low resource language i don't really have an answer but i think |
---|
0:03:36 | many of us in this community |
---|
0:03:39 | typically mean that there are too few e-resources so we don't find information online |
---|
0:03:44 | because that's what we're using now to develop systems |
---|
0:03:47 | if you speak to linguists i think you may get very different answers |
---|
0:03:50 | and i don't really want to get into that |
---|
0:03:53 | but |
---|
0:03:55 | basically |
---|
0:03:58 | we need to be able to find the data if we want to develop systems |
---|
0:04:01 | and the type of thing i'm going to talk about are languages that are low resource |
---|
0:04:06 | in the sense that the ldc doesn't have resources that they distribute |
---|
0:04:10 | google probably has them |
---|
0:04:13 | but can we get data online can we develop systems with data that we |
---|
0:04:16 | find online |
---|
0:04:17 | i'm not going to really talk about the babel type languages or other rare languages |
---|
0:04:23 | where |
---|
0:04:24 | you really don't even have writing conventions where you don't necessarily have |
---|
0:04:29 | any information about the language except maybe from some linguists that have spoken to some people |
---|
0:04:33 | or have visited |
---|
0:04:34 | and i guess we'll go a little bit more in that direction |
---|
0:04:38 | in the |
---|
0:04:39 | next part of the talk |
---|
0:04:40 | and of course there are frameworks |
---|
0:04:43 | trying to do speech translation for languages with really no |
---|
0:04:48 | resources |
---|
0:04:49 | so we have little or essentially no |
---|
0:04:51 | available audio data |
---|
0:04:53 | you have |
---|
0:04:54 | probably nothing in terms of dictionaries you don't even necessarily have word lists for these languages in |
---|
0:04:58 | general there's very limited knowledge about the language |
---|
0:05:01 | but you can also consider that many types of data for well resourced languages or language |
---|
0:05:05 | variants |
---|
0:05:06 | are almost low resource because we just don't have |
---|
0:05:09 | much available data for them |
---|
0:05:11 | so let me take a little step back in time to the late nineties and |
---|
0:05:14 | early two thousands |
---|
0:05:16 | and one of the questions that you get all the time from funding agencies is |
---|
0:05:20 | how much data do you need |
---|
0:05:22 | okay |
---|
0:05:23 | we try to answer this but i don't think anybody knows the answer |
---|
0:05:26 | we say well it depends where you want to be it depends what you want to do |
---|
0:05:30 | the funding agencies were more or less always complaining that data collection is |
---|
0:05:35 | costly |
---|
0:05:37 | and why are you always asking to fund data so this is a recurrent |
---|
0:05:41 | question |
---|
0:05:42 | and so these are curves that we did back in two thousand showing with |
---|
0:05:46 | supervised training on broadcast news data in english |
---|
0:05:50 | how the word error rate goes as a function of the training data do we have a pointer |
---|
0:05:56 | the red one |
---|
0:06:01 | no okay well anyway you start with |
---|
0:06:04 | the really high number on the left which comes from |
---|
0:06:07 | bootstrapping the system with a very small amount of audio data |
---|
0:06:10 | and a well trained language model |
---|
0:06:13 | the second point there is at about thirty three with a set of ten minutes |
---|
0:06:18 | the next one is one and a half hours where you see that the word error |
---|
0:06:21 | rate is about thirty three percent and then as we add more data we go |
---|
0:06:25 | down and we see that once we get to fifty or a hundred hours |
---|
0:06:28 | it sort of starts to plateau so we're getting diminishing returns from the additional data and |
---|
0:06:33 | so this is one thing we can say |
---|
0:06:36 | the red |
---|
0:06:38 | we do |
---|
0:06:40 | okay so once we get to you know a hundred hours of data |
---|
0:06:43 | basically |
---|
0:06:45 | you don't want to spend a lot of money for that additional data because you're just |
---|
0:06:48 | not getting much return |
---|
0:06:49 | once again this is on broadcast news data we had a reasonably well trained |
---|
0:06:54 | language model so we're seeing this asymptotic behavior of the error rate something we observed |
---|
0:07:00 | in the community at large is that when you start a new task you get |
---|
0:07:04 | rapid progress it's really fun because the error rates are dropping we're getting |
---|
0:07:07 | twenty percent thirty percent a year and everyone is happy |
---|
0:07:10 | but once you get to some reasonable level |
---|
0:07:12 | we're getting about six percent per year and we did some counts if you |
---|
0:07:17 | look over say ten or fifteen years of progress |
---|
0:07:19 | it seems like the average |
---|
0:07:21 | improvement we're getting is about six percent per year |
---|
0:07:25 | so this argues |
---|
0:07:26 | that if we don't want to do that |
---|
0:07:27 | additional data should cost less |
---|
0:07:29 | and we need to learn how to use it unannotated this is sort of what |
---|
0:07:32 | i was reminding people of back in two thousand which is still i think quite |
---|
0:07:36 | relevant today |
---|
0:07:38 | so |
---|
0:07:41 | you can think about different levels of supervision so back at the time people |
---|
0:07:46 | were saying we should use phonetic |
---|
0:07:47 | or phone level transcriptions for training our phone models it's logical |
---|
0:07:52 | it gives you more information it's better than using words |
---|
0:07:54 | and |
---|
0:07:55 | people did that |
---|
0:07:57 | our experience at limsi when we did some tests on this using timit |
---|
0:08:01 | type data switchboard and bref a read speech corpus in french |
---|
0:08:04 | is that humans liked the manual phonetic transcriptions better but |
---|
0:08:10 | the systems liked the automatic ones better |
---|
0:08:12 | so basically if you use the word level transcription with a dictionary that covers a |
---|
0:08:16 | reasonable amount of variants the systems were better than training them on the phonetic transcriptions maybe |
---|
0:08:21 | that would not be true nowadays i don't know we haven't redone it |
---|
0:08:24 | but that sort of satisfied us to say okay we can go ahead we can |
---|
0:08:26 | do this approach where we do the standard alignment of word level transcriptions with the |
---|
0:08:31 | audio so then if you go to the next step you can say okay we can |
---|
0:08:35 | have a large amount of annotated data where large is around a hundred hours or |
---|
0:08:40 | greater than a hundred hours |
---|
0:08:42 | we can have some |
---|
0:08:44 | amount of annotated data but a lot of unlabeled data with approximate transcriptions i'm going to |
---|
0:08:48 | give some results on this |
---|
0:08:50 | we can have no annotated data |
---|
0:08:52 | okay but we can find some sort of related text related information |
---|
0:08:56 | or we can have some small amount and then use this to bootstrap our systems |
---|
0:09:00 | that would be sort of semi supervised and this is what we heard about a |
---|
0:09:03 | little bit yesterday this is what people have been doing so you basically transcribe raw data |
---|
0:09:07 | you say this is ground truth and you do your standard training to build your |
---|
0:09:11 | models this works |
---|
0:09:13 | there are a lot of variants that have been published you can filter you |
---|
0:09:16 | can use confidence measures you can use consensus networks you can do rover you can |
---|
0:09:20 | do |
---|
0:09:20 | lattices lots of different variants |
---|
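the loop described here — decode raw audio with a seed system, keep hypotheses whose confidence clears a threshold, and treat the survivors as ground truth for retraining — can be sketched minimally as follows; this is an illustrative toy, where the decoder output tuples, segment ids and the 0.7 threshold are all made up for the example, not values from the talk:

```python
def filter_hypotheses(hyps, threshold=0.7):
    """Keep (segment_id, transcript) pairs whose confidence >= threshold."""
    return [(seg, text) for seg, text, conf in hyps if conf >= threshold]

# Toy decoder output: (segment id, 1-best transcript, confidence score).
# A real system would produce these from lattices or consensus networks.
decoded = [
    ("seg01", "good morning everyone", 0.91),
    ("seg02", "semi supervised training", 0.83),
    ("seg03", "[unintelligible]", 0.32),
]

# Only the confident segments survive and go into the next training round.
training_set = filter_hypotheses(decoded, threshold=0.7)
```

the filtered pairs would then be aligned with the audio and used exactly like manually transcribed data in the next training iteration.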
0:09:23 | and i listed some of the early work but in my recollection it was people |
---|
0:09:27 | involved in the ears project and people involved in projects in europe |
---|
0:09:31 | and if i forgot people i'm sorry i don't mean to but these are |
---|
0:09:35 | who come to mind as the |
---|
0:09:37 | early adopters of this |
---|
0:09:39 | type of activity |
---|
0:09:42 | so |
---|
0:09:43 | if we just go back to supervised training and i think most people in this |
---|
0:09:46 | room know this so i'm not gonna stay on it for a long time |
---|
0:09:49 | you normalize transcriptions but you do that for the language model anyway so that's not so |
---|
0:09:53 | bad |
---|
0:09:54 | you need to do things like creating a word list you need to come up with |
---|
0:09:57 | phonemic transcriptions and in the old days when we |
---|
0:10:02 | saw errors in the transcripts we actually spent time correcting them because we only had |
---|
0:10:05 | thirty hours fifty hours and we thought it would gain us something i think young people |
---|
0:10:09 | today wouldn't even think about this but we spent a lot of time on |
---|
0:10:13 | that |
---|
0:10:14 | and then you do your standard training |
---|
0:10:18 | so |
---|
0:10:20 | this is showing the results of using what we called semi supervised training |
---|
0:10:25 | so you had a language model that was |
---|
0:10:27 | trained on a certain amount of data and if i remember right |
---|
0:10:32 | the word error rate with manual transcripts was eighteen percent if we had a fully trained system |
---|
0:10:36 | we used closed captions as the language model and i'm showing here we've done different |
---|
0:10:41 | variants |
---|
0:10:42 | so it's a sort of approximate transcription that we had |
---|
0:10:44 | and we took |
---|
0:10:47 | in these numbers we started with |
---|
0:10:49 | i think |
---|
0:10:52 | ten hours of original data |
---|
0:10:54 | and we then transcribed varying amounts of |
---|
0:10:57 | unlabeled data so this is the raw unlabeled data |
---|
0:11:00 | and we said okay we can use it unfiltered and that is |
---|
0:11:05 | close to unsupervised training or else we can do a semi supervised version where we say |
---|
0:11:09 | where there's not too much of an error rate difference between the transcript we |
---|
0:11:14 | generate and what the caption was that's good |
---|
0:11:17 | and so we did this at a sort of phrase level or segment |
---|
0:11:21 | level we just kept and trained on the segments where the word error rate of the |
---|
0:11:24 | captions compared to the automatic transcriptions was less than x and i don't remember what |
---|
0:11:30 | the x was |
---|
0:11:30 | probably less than twenty or thirty percent error rate |
---|
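the segment-level filter just described — keep only segments where the caption and the automatic transcript disagree by less than some threshold x — might look like this in outline; the word error rate is the standard word-level edit distance divided by the reference length, and the 0.3 threshold below is a placeholder since the talk does not recall the actual value:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # rolling row of the Levenshtein DP table
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                         # deletion
                       d[j - 1] + 1,                     # insertion
                       prev + (r[i - 1] != h[j - 1]))    # substitution/match
            prev = cur
    return d[-1] / len(r)

def keep_segments(segments, max_wer=0.3):
    """Keep segments where caption and hypothesis agree closely enough."""
    return [s for s in segments
            if wer(s["caption"], s["hypothesis"]) < max_wer]

segments = [
    {"caption": "the president said today", "hypothesis": "the president said today"},
    {"caption": "the president said today", "hypothesis": "a present set to say"},
]
kept = keep_segments(segments)  # only the well-matched first segment survives
```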
0:11:34 | and you can see that we get pretty close |
---|
0:11:36 | we get within ten percent absolute of the manual transcriptions in both |
---|
0:11:41 | cases and what we do mostly now is we just don't bother filtering it's |
---|
0:11:46 | easy you just train on everything |
---|
0:11:49 | it seems to give about the same type of results |
---|
0:11:52 | a measure that was introduced by bbn is called the word error |
---|
0:11:57 | rate recovery where basically you look at the difference between |
---|
0:12:02 | how much you get from supervised training and how much you get from unsupervised training |
---|
0:12:06 | from your initial starting point and so what we get here is about eighty five |
---|
0:12:10 | to ninety percent so we're recovering most of what we could have gotten had we |
---|
0:12:14 | done supervised training |
---|
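the word error rate recovery measure can be written out explicitly as the fraction of the supervised gain that unsupervised training recovers, starting from the bootstrap system; the numbers below are only loosely borrowed from the ten-minute bootstrap experiment mentioned later in the talk and are not an exact reproduction:

```python
def wer_recovery(wer_bootstrap, wer_unsup, wer_sup):
    """Fraction of the supervised-training gain recovered by unsupervised
    training, relative to the same bootstrap starting point."""
    return (wer_bootstrap - wer_unsup) / (wer_bootstrap - wer_sup)

# Illustrative: 65% WER bootstrap system, 37.4% after unsupervised
# training, 30% with fully supervised training on the same data.
recovery = wer_recovery(65.0, 37.4, 30.0)  # about 0.79
```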
0:12:17 | one problem with this work is that there is some knowledge in the system |
---|
0:12:20 | because we did |
---|
0:12:22 | have prior knowledge from the dictionary and we did have a pretty good language model that was |
---|
0:12:25 | close to the data it wasn't exactly the same data but it was close |
---|
0:12:30 | so we discussed this i think it was in an ears meeting or maybe a |
---|
0:12:34 | conference mostly it was with rich schwartz we were discussing it and said well you know |
---|
0:12:38 | let's take it to an extreme let's see if we can use |
---|
0:12:44 | one hour of training data or ten minutes of training data |
---|
0:12:47 | and we were crazy enough to do this at the time it was a lot |
---|
0:12:50 | of computation because every time you evaluate with a different language model every |
---|
0:12:53 | time you use different amounts of data you have to decode |
---|
0:12:56 | multiple times |
---|
0:12:58 | these days it would be very easy to do one of these experiments but at the time |
---|
0:13:01 | it took time to do that |
---|
0:13:03 | and so here we see that if we start with a ten minute bootstrapped system |
---|
0:13:07 | we've got a word error rate of sixty five percent we actually did this not |
---|
0:13:11 | thinking it would work |
---|
0:13:14 | but it was okay and then we just take some data so three to four hours |
---|
0:13:17 | of |
---|
0:13:18 | data |
---|
0:13:19 | and iterate |
---|
0:13:21 | in fact we threw away those ten minutes it was just more complicated |
---|
0:13:25 | to build models merging it so okay we take the three to four |
---|
0:13:28 | hours of automatic transcripts |
---|
0:13:29 | and we go to fifty four hours and we go on and we stopped at about |
---|
0:13:33 | a hundred forty hours and we got thirty seven point four percent if you use the |
---|
0:13:38 | same language model with the full training data supervised |
---|
0:13:42 | you only get to about thirty percent |
---|
0:13:44 | so we're getting pretty |
---|
0:13:45 | close to where we need to get |
---|
0:13:48 | so we're happy with that |
---|
0:13:50 | and at about this time |
---|
0:13:53 | thomas pellegrini came to do his thesis at limsi and we |
---|
0:13:56 | didn't really apply this method to his work as we |
---|
0:14:01 | didn't have enough audio data but we did try and look at |
---|
0:14:05 | questions that we've been asking for a long time |
---|
0:14:08 | as to how much |
---|
0:14:09 | data you need to train models |
---|
0:14:10 | what improvement in performance can you expect when you have limited resources what's more important audio |
---|
0:14:17 | data or text data |
---|
0:14:19 | and how can you improve the language models when you have very little so |
---|
0:14:21 | he came in around two thousand four if i remember correctly and we had available |
---|
0:14:26 | what i guess we considered a reasonably small amount of data we had thirty seven hours |
---|
0:14:31 | of audio |
---|
0:14:32 | the quality was good but it's not a lot |
---|
0:14:35 | and we had about five million words of transcripts |
---|
0:14:38 | and we knew nothing about amharic |
---|
0:14:42 | and so one of the first things he did was to look at what's the |
---|
0:14:46 | influence of the audio data transcripts versus other text on the out-of-vocabulary rate |
---|
0:14:52 | and so on the left we have the out-of-vocabulary rate |
---|
0:14:54 | and here we're showing for two hours of transcripts ten hours of transcripts about seven |
---|
0:14:59 | k words twenty k words and fifty k words with the thirty five hours |
---|
0:15:02 | and how does your oov rate |
---|
0:15:05 | go down as you add more transcripts and so you can see here the |
---|
0:15:08 | top curve |
---|
0:15:10 | what happens if you add in more text |
---|
0:15:13 | so adding more text data so the curves are the amount of transcripts |
---|
0:15:17 | we have |
---|
0:15:18 | and on the |
---|
0:15:20 | bottom x axis we're adding different amounts of text data from the roughly five million word text sources |
---|
0:15:26 | and so if you start with |
---|
0:15:28 | just two hours of audio data if you add ten k |
---|
0:15:33 | you don't really lower your oov rate too much |
---|
0:15:36 | if you add a hundred k it drops a little bit more et cetera et cetera |
---|
0:15:40 | if you have ten hours of data |
---|
0:15:43 | there isn't much of an effect and so you see that the effect of adding |
---|
0:15:46 | the text data is less than adding the audio data |
---|
0:15:49 | we attribute that to the fact that we probably have some sort of mismatch we know |
---|
0:15:52 | the audio data is the same type of data we're trying to work with |
---|
0:15:55 | and the text data is related but not really the same |
---|
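the out-of-vocabulary rate being plotted here is just the fraction of test tokens missing from the vocabulary; a minimal sketch, with a toy vocabulary and token list rather than the actual amharic data:

```python
def oov_rate(vocab, tokens):
    """Fraction of running test tokens not covered by the vocabulary."""
    return sum(1 for w in tokens if w not in vocab) / len(tokens)

vocab = {"the", "cat", "sat"}
tokens = "the cat sat on the mat".split()
rate = oov_rate(vocab, tokens)  # "on" and "mat" are OOV: 2 of 6 tokens
```

note this is a token-based rate (each occurrence counts), which is what matters for recognition, as opposed to a type-based rate over distinct words.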
0:16:00 | so then here's another curve where we're trying to |
---|
0:16:03 | look at the amount of audio data |
---|
0:16:07 | versus text data for the |
---|
0:16:09 | language model it's a little bit complicated the top curves here use just two |
---|
0:16:14 | hours of audio data in the acoustic model |
---|
0:16:16 | and for the bottom ones the green once again is ten hours and the red |
---|
0:16:20 | is thirty five |
---|
0:16:21 | and |
---|
0:16:22 | you can see that even if you add in more text data you're not |
---|
0:16:26 | really improving the word error rate much |
---|
0:16:29 | and everyone said okay is it coming from the acoustic data or is it coming from |
---|
0:16:33 | the transcripts we know the transcripts are closer |
---|
0:16:35 | and so we added in the purple and the blue curves which are using the |
---|
0:16:40 | transcripts from the ten hours and the thirty five hours and so we can see |
---|
0:16:44 | that if you only have two hours of audio data it's just not going to |
---|
0:16:47 | do very well and you need more once you get to ten hours |
---|
0:16:51 | it seems like the improvement you get is a |
---|
0:16:54 | little less and this is interesting because this is what was being used recently in the |
---|
0:16:57 | babel project where we're actually |
---|
0:16:59 | working with ten hours for some of the conditions |
---|
0:17:02 | let me say a few words about some other work we did for |
---|
0:17:08 | language modeling and it's work on word decompounding for amharic amharic has |
---|
0:17:14 | a very rich morphology and high out-of-vocabulary rates |
---|
0:17:21 | a problem also for language modeling is that it is not very well |
---|
0:17:25 | modeled so therefore it's interesting to use word decompounding |
---|
0:17:30 | and when you look at the literature you see mixed results across languages you get |
---|
0:17:33 | nice gains in some and you don't always get a gain you don't necessarily |
---|
0:17:37 | get one in word error rate |
---|
0:17:39 | and so |
---|
0:17:41 | one |
---|
0:17:43 | idea that we had in this work was to try and avoid when you generate |
---|
0:17:46 | units creating confusable units and |
---|
0:17:49 | so that's what i'm going to give a couple of ideas about and to do |
---|
0:17:53 | this we built matched conditions for these types of units we trained language models and retrained acoustic models |
---|
0:17:58 | for all the |
---|
0:17:59 | conditions |
---|
0:18:01 | so this is showing |
---|
0:18:03 | here we used the morfessor algorithm |
---|
0:18:07 | which was relatively recent at the time |
---|
0:18:10 | so basic morfessor is the first condition |
---|
0:18:13 | then there's |
---|
0:18:15 | a method referred to as harris where basically you look at the number of |
---|
0:18:19 | letters that can succeed another letter and that gives you an idea of what |
---|
0:18:23 | the perplexity is if you have a lot of different letters that can |
---|
0:18:27 | follow it's likely to be the beginning of a new word or morph and if it's |
---|
0:18:31 | not it's likely to be within the same |
---|
0:18:33 | morph |
---|
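the harris-style criterion just described — count how many distinct letters can follow each prefix, and hypothesize a morph boundary where that count spikes — can be sketched like this; the toy english lexicon is purely illustrative, not the amharic data from the talk:

```python
from collections import defaultdict

def successor_variety(words):
    """For each prefix seen in the word list, count how many distinct
    letters can follow it. A high count suggests a morph boundary
    (Harris-style segmentation)."""
    succ = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            succ[w[:i]].add(w[i])
    return {prefix: len(letters) for prefix, letters in succ.items()}

# After the stem "play" three different letters can follow ('e', 'i', 's'),
# while inside the stem only one letter ever follows — hinting at a
# boundary between stem and suffix.
sv = successor_variety(["played", "playing", "plays", "player"])
```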
0:18:35 | we also tried to use distinctive feature properties to bring some speech |
---|
0:18:40 | information into the decomposition and looked at using phonemic confusion constraints that were generated using phone |
---|
0:18:46 | alignments and so basically here if you have two candidate sequences |
---|
0:18:52 | that |
---|
0:18:53 | are easily confusable with each other we don't allow the split and if they were |
---|
0:18:57 | not easily confusable it was okay to split |
---|
0:19:00 | the idea of these |
---|
0:19:03 | constraints is relatively language independent but of course you do need to know the phonemes in |
---|
0:19:07 | the language or |
---|
0:19:08 | have a universal set of phonemes |
---|
0:19:10 | so this is looking at what happens in terms of the number of tokens you |
---|
0:19:15 | get after the splitting for different configurations |
---|
0:19:18 | and |
---|
0:19:20 | the length of them the length was represented in phonemes |
---|
0:19:24 | everything was in phonemes so that's where you see things at two four six |
---|
0:19:28 | eight et cetera |
---|
0:19:28 | and basically the main point is that |
---|
0:19:31 | your baseline the full words is in black |
---|
0:19:36 | and once you start cutting the |
---|
0:19:39 | units get shorter as you expect that's the goal of it |
---|
0:19:42 | and then if you use the confusion constraints those are the green and |
---|
0:19:46 | the purple and in general they are a little bit less shifted to the left so we're |
---|
0:19:49 | creating slightly |
---|
0:19:50 | fewer very short units |
---|
0:19:53 | and that was the goal of what we were trying to do and then here's a table |
---|
0:19:57 | that we probably don't want to go |
---|
0:20:00 | into too much but if you look at the baseline system we had twenty two |
---|
0:20:03 | point six percent |
---|
0:20:04 | all the numbers are relatively close okay but |
---|
0:20:07 | if you split anything |
---|
0:20:09 | the error rate in general gets worse those are the black ones so even using |
---|
0:20:13 | the distinctive features didn't really help |
---|
0:20:17 | the |
---|
0:20:18 | only ones that did better were those that used this phoneme confusion constraint and so here |
---|
0:20:22 | all of these do slightly better |
---|
0:20:24 | than the baseline and those are the only ones and so it's really important to |
---|
0:20:28 | avoid adding |
---|
0:20:30 | confusion and homophones to your system |
---|
0:20:32 | so that's sort of the take-home message for this |
---|
0:20:37 | so we |
---|
0:20:38 | the other thing is we got a fifty percent reduction in oov rate which was |
---|
0:20:43 | good except we were introducing errors and confusions on the little affixes |
---|
0:20:48 | that were compensating we were losing about as much as we recovered |
---|
0:20:54 | and basically we did some studies and looked at the previous oov words |
---|
0:20:59 | and basically about half of them were correctly recognized using this method but we would |
---|
0:21:03 | swap that gain |
---|
0:21:04 | with errors newly introduced |
---|
0:21:06 | on the different affixes so that's that work |
---|
0:21:09 | so just another slide sort of on all of this work it was more logical |
---|
0:21:13 | to put it here in the talk |
---|
0:21:15 | we've used unsupervised decomposition once again usually based on morfessor or some |
---|
0:21:20 | modifications of it for finnish hungarian and german |
---|
0:21:24 | and russian |
---|
0:21:25 | for the first three languages we got reasonable gains of between one and three percent |
---|
0:21:30 | and we could reduce our vocabulary sizes from seven hundred thousand to two million words |
---|
0:21:35 | down to around three hundred thousand |
---|
0:21:36 | which is a little bit more usable for the system and probably more |
---|
0:21:39 | easy to train reliable estimates for |
---|
0:21:43 | for some of them we did acoustic model retraining we did not for |
---|
0:21:47 | german |
---|
0:21:47 | for finnish we tried both with |
---|
0:21:51 | acoustic model retraining and without |
---|
0:21:52 | and we |
---|
0:21:53 | got about a three percent difference using the morphologically decomposed system whether or not we retrained |
---|
0:21:58 | the acoustic model |
---|
0:21:59 | interestingly for us morfessor worked well for finnish i think in part because the authors |
---|
0:22:04 | were finnish |
---|
0:22:04 | and so |
---|
0:22:05 | the output was maybe designed for that |
---|
0:22:08 | we also tried to do this on russian cts where we only had at the time |
---|
0:22:12 | about ten hours of training data so conversational telephone speech |
---|
0:22:16 | cts yes for some people that might not know the acronym |
---|
0:22:19 | and we got a reduction in the oov rate and we were able to use a smaller |
---|
0:22:22 | vocabulary but we didn't get a gain in word error rate |
---|
0:22:26 | but once again this is very preliminary |
---|
0:22:28 | work |
---|
0:22:29 | so now i'm going to shift gears where am i on time |
---|
0:22:36 | fourteen minutes gone |
---|
0:22:37 | okay i'd better go a little faster |
---|
0:22:39 | so let me speak a few minutes about |
---|
0:22:42 | finnish which was one of the first languages where we didn't |
---|
0:22:46 | have any |
---|
0:22:49 | transcribed audio only untranscribed audio and so |
---|
0:22:52 | we found some online data with approximate transcripts that come from a site |
---|
0:22:57 | initially aimed at foreigners learning finnish |
---|
0:23:00 | and there was no transcribed development data either and so how are we gonna do |
---|
0:23:04 | this for many companies it's easy to hire someone to transcribe some data for us |
---|
0:23:08 | it's not so easy it takes time to find the person |
---|
0:23:11 | and for government research labs too |
---|
0:23:14 | this is complicated so can we get ahead |
---|
0:23:17 | by doing something simpler and so what we did is we |
---|
0:23:19 | used these approximate transcriptions also for the development not just for the |
---|
0:23:26 | unsupervised training |
---|
0:23:28 | and once again as i said before we used morphological decomposition for this |
---|
0:23:32 | so here's a curve showing the |
---|
0:23:35 | estimated word error rate as we increase the amount of unsupervised training data |
---|
0:23:40 | so we have two hours five hours and then it sort of stabilises once we |
---|
0:23:45 | get to around ten to fifteen hours |
---|
0:23:47 | we're stabilising so we see we get a gain here and it's approximate but it's going in the |
---|
0:23:51 | right direction then about two months later we had somebody who came in and |
---|
0:23:57 | transcribed data for us it was a two or three hour set so it's not a |
---|
0:24:00 | lot it still took a while first to get the person then for them to do it |
---|
0:24:03 | and you can see the human measured error rate |
---|
0:24:05 | is |
---|
0:24:06 | following exactly the same curve |
---|
0:24:09 | our estimated error rates are lower we're underestimating here because what we did is we selected regions |
---|
0:24:13 | as is done for the unsupervised training where there was a good match between |
---|
0:24:17 | these sort of approximate transcriptions |
---|
0:24:19 | and |
---|
0:24:19 | what the system did and we measured on those but that's okay because it allowed us |
---|
0:24:23 | to develop without necessarily having to wait for this data to become |
---|
0:24:27 | available |
---|
0:24:30 | so the message on that is that the unsupervised acoustic model training worked reasonably well |
---|
0:24:35 | using these approximate transcripts |
---|
0:24:37 | and since then it has |
---|
0:24:38 | also worked on other languages |
---|
0:24:42 | it worked for the language models too so we could improve our language models using these sort of approximate |
---|
0:24:46 | transcriptions it worked |
---|
0:24:48 | we then added into the system some cross lingual mlps |
---|
0:24:51 | so we tried both french and english |
---|
0:24:53 | and we got about ten percent improvement |
---|
0:24:55 | and as i said before we used the morphological decomposition |
---|
0:24:59 | so now i'm gonna talk about not a language which also is consider somewhat low |
---|
0:25:05 | resourced so that in |
---|
0:25:06 | and this was work that was done
---|
0:25:08 | with a colleague whose native language was russian, so it was sort of
---|
0:25:12 | an interesting language for him; basically his words were that people just know
---|
0:25:16 | nothing about it
---|
0:25:17 | for this language there are no distributed corpora, but you can find text and audio on the
---|
0:25:22 | net, so it was something we could reasonably do
---|
0:25:25 | it's a baltic language with not so many speakers, about one point five million; it's
---|
0:25:29 | a complicated language, it uses a lot of cases, i now forget half of it
---|
0:25:34 | and the pronunciations are reasonably straightforward
---|
0:25:37 | so this is sort of the overview of the language models. we found a
---|
0:25:41 | fair amount of data:
---|
0:25:44 | one point six million words
---|
0:25:46 | of in-domain data and a hundred forty two million words of newspapers. in-domain
---|
0:25:50 | means it comes from radio and TV stations, and the rest is just newspapers
---|
0:25:54 | we used about a five hundred thousand word vocabulary, just keeping words that occurred more
---|
0:25:58 | than three times
---|
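The vocabulary selection mentioned here, keeping words seen more than three times, is just a frequency cutoff. A minimal sketch, with a toy token list standing in for the hundred-odd million words of text:

```python
from collections import Counter

def build_vocab(tokens, min_count=4):
    # keep words occurring more than three times, i.e. count >= 4
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

tokens = "a a a a b b b c".split()
vocab = build_vocab(tokens)  # only "a" clears the threshold
```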
0:26:00 | the text processing is rather standard
---|
0:26:02 | however, this is really important stuff: if you don't do the text processing carefully
---|
0:26:06 | you have problems when trying to do unsupervised training, at least that seems to be our
---|
0:26:10 | experience. it was pretty much standard language models, and we threw in some neural network
---|
0:26:15 | language models at the end, so given yesterday's talk that was interesting
---|
0:26:20 | so this is the figure showing the
---|
0:26:24 | word error rate; these curves here are the word error rate
---|
0:26:27 | as a function of the iteration
---|
0:26:30 | and the circles show you roughly the size of the acoustic data; we were roughly
---|
0:26:34 | doubling at each step
---|
0:26:35 | the amount of audio data used in an unsupervised manner
---|
0:26:40 | for the systems |
---|
0:26:41 | at this level here we added in the mlp from russian
---|
0:26:45 | our initial seed models here
---|
0:26:49 | came from a mix of three languages, english, french, russian. the audio data varied from about sixty
---|
0:26:54 | hours at this stage to about seven hundred, eight hundred hours at this stage, raw,
---|
0:26:58 | so you're only using about half
---|
0:27:01 | when you go to build models
---|
0:27:02 | and of course something that's important is to increase the number of contexts and the
---|
0:27:05 | states that you model at the same time
---|
0:27:07 | it doesn't suffice to just add more data and keep the model topology fixed; you
---|
0:27:11 | don't get much of a gain from that
---|
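The iteration scheme described here, roughly doubling the unsupervised audio and growing the number of tied context-dependent states along with it, could be written down as a schedule like the following. The `states_per_hour` and `max_states` numbers are made-up illustrative values, not figures from the talk.

```python
def training_schedule(seed_hours, target_hours, states_per_hour=25, max_states=10000):
    # double the audio at each iteration and grow the model with it,
    # since adding data under a fixed topology gives little gain
    schedule, hours = [], seed_hours
    while hours < target_hours:
        hours = min(hours * 2, target_hours)
        n_states = min(int(hours * states_per_hour), max_states)
        schedule.append((hours, n_states))
    return schedule

plan = training_schedule(60, 400)
# e.g. [(120, 3000), (240, 6000), (400, 10000)]
```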
0:27:15 | afterwards we did some additional tuning of parameters and two-pass decoding, and
---|
0:27:20 | used the four-gram lm, and you can see the improvement over the original
---|
0:27:24 | so CI is
---|
0:27:26 | case insensitive and CS is case sensitive, not context independent and context dependent
---|
0:27:31 | here we're looking at the word error rate if you take into account case, because
---|
0:27:34 | what people want to read really has to have the case correct
---|
0:27:37 | and even for different search engines sometimes it's important to have the case correct
---|
0:27:40 | because you want to know if it's a proper name or not
---|
0:27:44 | and for people that are fans of neural net language models, we got about a one and a half
---|
0:27:47 | to two percent gain
---|
0:27:49 | by
---|
0:27:49 | adding them. this is on dev data, and then we validated and got pretty
---|
0:27:53 | much similar results
---|
0:27:54 | so we were happy with that; it's completely unsupervised, and we developed the system
---|
0:27:58 | in less than a month
---|
0:28:01 | then at the end we were trying for hungarian roughly the same
---|
0:28:05 | thing
---|
0:28:06 | we used seed data from
---|
0:28:08 | five languages; we had less audio data, so we only went to
---|
0:28:12 | about three hundred hours
---|
0:28:13 | and we used originally an mlp trained on english
---|
0:28:17 | and then we used the transcripts at this level to then train an mlp on the language
---|
0:28:23 | itself
---|
0:28:24 | using the unsupervised transcription; we got another two point eight or so percent absolute
---|
0:28:29 | so just to give an
---|
0:28:31 | overview, these are some results from the program, which some of you are aware of i'm
---|
0:28:36 | sure, and some of you are less
---|
0:28:37 | the systems up to and including this one
---|
0:28:41 | are trained on supervised data
---|
0:28:43 | and the supervised data varies from fifty to a hundred fifty, two hundred hours
---|
0:28:48 | depending upon the language
---|
0:28:49 | the ones on the right are all trained unsupervised
---|
0:28:52 | the green, sorry, the lower
---|
0:28:55 | line is the average error rate across the test data, about three hours
---|
0:29:00 | and so you see it's sort of going up, and the ones on this side are in
---|
0:29:02 | general a little bit higher; the ones on the left not so much, they're pretty good
---|
0:29:08 | bulgarian and a couple of others are a little bit higher here
---|
0:29:11 | and then there's luxembourgish, which i'll come back to in a few minutes
---|
0:29:15 | if you look at the
---|
0:29:17 | lowest word error rate on any one of the segments we had from TV and radio
---|
0:29:20 | they're pretty low; in fact even some of the unsupervised are lower than the
---|
0:29:25 | supervised
---|
0:29:26 | and finally, since the worst-case word error rate is still
---|
0:29:29 | pretty high, we still got a fair amount of work to do
---|
0:29:32 | these data are mixed news and conversations
---|
0:29:36 | and some languages are more interactive than others, things like that
---|
0:29:41 | so i'm going to skip the next slide, which has too much stuff; it was to
---|
0:29:44 | show the amount of data we used. if people are interested, come find me later
---|
0:29:49 | and i want to say two or three words about dictionaries. one of the
---|
0:29:54 | things that makes this development pass very costly to do is the dictionaries, and so
---|
0:29:58 | there's been
---|
0:30:00 | more recently a growing interest in using graphemic type units rather than phonetic units
---|
0:30:06 | in our systems
---|
0:30:07 | and the first work that i found was kanthak and ney;
---|
0:30:11 | maybe people are aware of earlier work
---|
0:30:13 | doing this that i'm not aware of
---|
0:30:15 | and this avoids the production of the pronunciation dictionary
---|
0:30:19 | basically the G2P problem becomes a text normalisation problem: you can have numbers
---|
0:30:24 | and things like that; you have to convert dates and times and all those
---|
0:30:27 | types of things into words in order to do this, and then you have your units
---|
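The point that the G2P problem turns into a text normalisation problem can be made concrete with a toy sketch. The digit table and cleanup rule here are simplified stand-ins for real normalisation, which also has to handle dates, times, abbreviations and so on:

```python
import re

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    # expand digit strings into words so that every token is spellable
    out = []
    for tok in text.lower().split():
        if tok.isdigit():
            out.extend(DIGITS[d] for d in tok)
        else:
            out.append(re.sub(r"[^\w']", "", tok))
    return [t for t in out if t]

def graphemic_lexicon(words):
    # a graphemic "dictionary": each word is pronounced as its letters,
    # so no phonetic expertise and no G2P rules are needed
    return {w: list(w) for w in set(words)}

lex = graphemic_lexicon(normalize("Meeting at 10 o'clock"))
```

Once the text is normalised, the lexicon is trivial; all the linguistic effort moves into the normalisation step.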
0:30:32 | so we did this at limsi for turkish, tagalog and pashto within the babel
---|
0:30:37 | program
---|
0:30:38 | and, like
---|
0:30:40 | other previous studies, we got about comparable results
---|
0:30:44 | in general |
---|
0:30:45 | but for some languages we actually do better with the graphemic systems than the
---|
0:30:50 | phonemic systems. in fact i should mention that back in the gale days there was
---|
0:30:54 | work using graphemic systems
---|
0:30:56 | and basically these are some results we got on pashto
---|
0:31:03 | in the babel program we had a two-pass system using the dnn voice activity
---|
0:31:06 | detection and the bottleneck features, and we used both graphemic and phonemic systems
---|
0:31:11 | and we can see that they're about the same: the phonemic is about one
---|
0:31:15 | point higher than the graphemic, but if we do a two-pass system where we
---|
0:31:19 | do phonemic then graphemic, we actually got a reasonable
---|
0:31:22 | gain from that
---|
0:31:23 | we believe that one of the problems in the past was actually having poor
---|
0:31:29 | pronunciation generation, so therefore they're bad,
---|
0:31:32 | or a lot of variability you don't cover; that's also where the graphemic systems can actually
---|
0:31:36 | outperform the phonemic
---|
0:31:38 | so let me now speak about luxembourgish, just because it's fun. this
---|
0:31:41 | is work done with
---|
0:31:42 | martine, who is from luxembourg, for those of you who know
---|
0:31:46 | her. it's a little country with not too many people, but it's really
---|
0:31:50 | a multilingual environment:
---|
0:31:53 | the people,
---|
0:31:55 | when they go to school, their first language is german, and then i believe it's
---|
0:31:59 | french and english that they study, but at home they speak their local language, luxembourgish
---|
0:32:04 | apparently, even though it's this tiny little country, if you close your eyes you
---|
0:32:07 | can guess where somebody is from, maybe i've exaggerated a little bit
---|
0:32:11 | but you even have multiple dialects in different regions
---|
0:32:16 | so originally, the first studies we did were
---|
0:32:20 | just to try and look at segmentation experiments to see
---|
0:32:24 | which languages are favoured by the luxembourgish data; we had basically no transcribed data
---|
0:32:29 | at the time
---|
0:32:29 | and we transcribed ten or fifteen minutes of some of
---|
0:32:33 | the data. what we did was some approximate mappings, saying that
---|
0:32:37 | if you take a
---|
0:32:39 | luxembourgish sound, it may map pretty well to french, english and german, but another one
---|
0:32:43 | doesn't really exist in english, though you can see it in german and
---|
0:32:47 | french, and so for english we'll use something close to it
---|
0:32:51 | to get a mapping, so that we have sort of the same number of
---|
0:32:54 | phonemes in each one
---|
0:32:56 | and basically we said okay, we build models, we put them in parallel
---|
0:33:00 | so we have a superset of models, and we try and align
---|
0:33:03 | the luxembourgish data with this. so we had some transcriptions of a small
---|
0:33:07 | amount, and we see which languages are
---|
0:33:10 | preferred. when we do this we allowed,
---|
0:33:12 | so,
---|
0:33:13 | i don't know that much about the language myself, but basically you can have french
---|
0:33:16 | words inserted in the middle, so apart from being related, the languages are also
---|
0:33:21 | mixed in
---|
0:33:21 | and so we allowed the language to change at word boundaries: you had to
---|
0:33:26 | use the phones from a given seed model
---|
0:33:29 | within the word, changing language at word boundaries. and basically we found, as you can
---|
0:33:33 | expect since luxembourgish is a germanic language, that in general the segmentation preferred german
---|
0:33:39 | second was english, which is closest, but there's about ten percent that went to french
---|
0:33:43 | and this was typically nasal sounds, an effect; for english
---|
0:33:45 | typically diphthong sounds remained; we don't really know exactly why
---|
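The experiment just described, running the seed models of several languages in parallel and letting the language change only at word boundaries, amounts to picking the best-scoring language per word and tallying the shares. This toy version assumes per-word log-likelihoods are already available; in the real system they would come from the aligner, and the numbers below are invented for illustration:

```python
def segment_languages(word_scores):
    # word_scores: one dict per word mapping language -> log-likelihood
    # under that language's seed models; within a word all phones come
    # from one model, so we just take the argmax per word
    choice = [max(scores, key=scores.get) for scores in word_scores]
    share = {lang: choice.count(lang) / len(choice) for lang in set(choice)}
    return choice, share

scores = [{"de": -10.0, "en": -12.0, "fr": -15.0},
          {"de": -11.0, "en": -10.5, "fr": -14.0},
          {"de": -9.0, "en": -13.0, "fr": -12.0}]
choice, share = segment_languages(scores)
# choice == ["de", "en", "de"]: german is preferred on two of the three words
```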
0:33:50 | so based on that, now a couple of years later, we
---|
0:33:54 | got some transcribed broadcast news data, and we took
---|
0:33:58 | our seed models, which are context independent, they're tiny, they're not gonna perform well
---|
0:34:03 | and we just decoded the two or three hours of training data
---|
0:34:07 | and you can see that the word error rates are high, as you'd expect,
---|
0:34:10 | in the right range for the amount of data and for the fact that they're context
---|
0:34:13 | independent
---|
0:34:13 | but the german models were preferred
---|
0:34:15 | we actually built models that were pooled, estimated on the data all together, and
---|
0:34:20 | those were
---|
0:34:20 | a little less good than the german. however, before we knew this, because we
---|
0:34:25 | didn't have the data when we had started this, we had used the pooled models to
---|
0:34:28 | do the automatic decoding. and once again we did our standard techniques; you can see
---|
0:34:33 | that we're going from about thirty five to about twenty nine percent
---|
0:34:37 | word error rate by doubling data, increasing the context, adding mlp features,
---|
0:34:42 | et cetera
---|
0:34:44 | and we were able to model more contexts, but
---|
0:34:47 | the error rate is still kind of high compared to some of the other languages, so martine
---|
0:34:51 | looked at a classification of the errors, and you can see
---|
0:34:55 | basically there's a lot of confusions between homophones
---|
0:35:00 | some of this is pretty interactive data, so it's not the same as bn
---|
0:35:03 | data. we have the types of human production variants you'd expect: people did false
---|
0:35:08 | starts, repetitions, mispronounced something
---|
0:35:10 | or said a word indistinctly
---|
0:35:13 | and then a large percentage, we estimated somewhere between fifteen and twenty percent, are writing variants, because
---|
0:35:18 | luxembourgish is really just sort of a
---|
0:35:21 | spoken language
---|
0:35:22 | are these really errors or not? so this is an example of some of
---|
0:35:25 | the writing variants of the word saturday, and i'm not going to
---|
0:35:28 | try to say them out loud, that's probably not really how you pronounce them, okay. all
---|
0:35:32 | these written forms are allowable, so basically you can,
---|
0:35:37 | depending on what regional variant, say it one way or another,
---|
0:35:40 | you can change the vowel
---|
0:35:44 | and all of these are accepted in the written form; we can find them in the
---|
0:35:47 | texts
---|
0:35:48 | and in what they say
---|
0:35:50 | so |
---|
0:35:51 | even though, i don't know if people really consider this a low resource language
---|
0:35:56 | or not, there's not much data, almost none available. the language is used in
---|
0:36:00 | speaking; it's not really used in writing. how much time do i have? i think it's
---|
0:36:04 | good
---|
0:36:07 | i'm going to speak |
---|
0:36:08 | one minute about korean: we're trying to do this on korean, but we once again don't
---|
0:36:11 | have transcribed dev data. we were trying to do a study where we look
---|
0:36:16 | at the size of the language model to use for decoding in unsupervised training, using
---|
0:36:22 | a hundred twenty thousand word, two hundred thousand word, two million word,
---|
0:36:27 | or two K character
---|
0:36:29 | language model
---|
0:36:29 | just for the decoding. here we looked at using phone units and half-syllable acoustic
---|
0:36:34 | units
---|
0:36:35 | and the only data we had was from ldc; there's about a ten hour dataset
---|
0:36:39 | we did a standard training on it; we took the last two
---|
0:36:43 | files and held them out for dev because we didn't have anything else
---|
0:36:46 | we got a word error rate of about thirty percent and a character error rate
---|
0:36:49 | of about twenty percent
---|
0:36:50 | on this data; it's probably optimistic because the dev data comes from the same set as the training data,
---|
0:36:55 | just the last two files, so it's very close to it
---|
0:37:00 | and so what we're doing here is increasing the amount of
---|
0:37:03 | data we use from the web and looking at the influence on the word error rate
---|
0:37:07 | and the character error rate using different size language models, and you can see that
---|
0:37:10 | for the two hundred thousand and the two million
---|
0:37:13 | it's about the same. sorry, these results are all decoded with the same size
---|
0:37:17 | language model; they're all decoded with the same final language model
---|
0:37:21 | so it's only what we use for the unsupervised training that's changing
---|
0:37:25 | and we can see that the results are basically the same as we go
---|
0:37:27 | with the same data
---|
0:37:29 | but the character language model, with which we skip the word segmentation step,
---|
0:37:33 | is doing slightly better in terms of character error rate than the others
---|
0:37:37 | we don't know if it's real; we need to look into it some more, since this is just
---|
0:37:40 | really recent stuff we're doing. and for people who think it's easy to get transcribers:
---|
0:37:46 | it's been a month and a half we're looking for someone in france that has the
---|
0:37:49 | right to work who can transcribe the korean for us
---|
0:37:52 | we finally found someone, and they can't start working until february
---|
0:37:56 | okay, so yes, it seems an easy thing to do, but
---|
0:37:59 | not necessarily; depending on your constraints for hiring, things are not so easy
---|
0:38:03 | so we're gonna follow up on this more; we hope to have some clearer results
---|
0:38:07 | in the end
---|
0:38:09 | two words about acoustic model interpolation. because, as we spoke about, we have this
---|
0:38:15 | heterogeneous data:
---|
0:38:16 | how do we combine the data from different sources? rich made the statement that you
---|
0:38:19 | want to use all of it, you don't want to throw it away, but you need
---|
0:38:22 | data weightings
---|
0:38:23 | one way of doing it is you can just
---|
0:38:26 | add more weight to some of the data and remove it for others, and so on. a colleague
---|
0:38:32 | at limsi is working on acoustic model interpolation
---|
0:38:36 | and had a paper
---|
0:38:38 | at interspeech i think
---|
0:38:40 | looking at whether, instead of doing pooling, you can train on different
---|
0:38:44 | sets and then interpolate them. we used this on european portuguese
---|
0:38:50 | where the baseline pooling gave you thirty one point seven percent, and the interpolation gives
---|
0:38:54 | you about the same result, but it's much easier to deal with
---|
0:38:57 | because you can train your models on smaller sets
---|
0:39:00 | and then interpolate them; it's the same idea as what's been done for language modeling for years
---|
0:39:04 | now
---|
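By analogy with language model interpolation, one simple way to picture acoustic model interpolation is as a convex combination of models trained on the different subsets. This is a simplified sketch over paired Gaussian components, an assumption for illustration, not the actual published method:

```python
def interpolate_gmms(models, lambdas):
    # models: list of (weights, means) pairs for GMMs with paired
    # components; lambdas: interpolation coefficients summing to one
    assert abs(sum(lambdas) - 1.0) < 1e-9
    n = len(models[0][0])
    # combined component weight: convex combination of the subset weights
    weights = [sum(lam * w[k] for lam, (w, _) in zip(lambdas, models))
               for k in range(n)]
    # combined component mean: weighted by lambda times the component weight
    means = [sum(lam * w[k] * mu[k] for lam, (w, mu) in zip(lambdas, models)) / weights[k]
             for k in range(n)]
    return weights, means

m1 = ([0.5, 0.5], [0.0, 2.0])  # model trained on subset 1
m2 = ([0.5, 0.5], [1.0, 3.0])  # model trained on subset 2
w, mu = interpolate_gmms([m1, m2], [0.5, 0.5])
```

The appeal, as in language modeling, is that each subset model can be trained separately on a small, homogeneous set, and the mixing weights can be tuned afterwards.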
0:39:05 | then we also looked at using the different accent variants for
---|
0:39:11 | english, and this is some work that was published back
---|
0:39:15 | in two thousand and ten, and basically
---|
0:39:18 | we gain a little bit for some of these:
---|
0:39:21 | we don't degrade for any of the variants with respect to the original pooled
---|
0:39:24 | model, whereas with the map adaptation we actually degraded a little bit on one
---|
0:39:27 | or two of the variants, i don't remember which ones
---|
0:39:31 | so let me finish up |
---|
0:39:34 | i guess the take-away message, let's say, is that unsupervised acoustic model training is successful
---|
0:39:40 | it's been applied a lot to broadcast news type data, and more recently in the
---|
0:39:45 | babel project to a
---|
0:39:48 | wider range of data. i think it's really exciting that we can do that
---|
0:39:50 | we still have to find the data
---|
0:39:52 | but it's really nice that we can do this type of stuff
---|
0:39:56 | but the error rates are still kind of high
---|
0:39:59 | sorry, even though the error rates are still kind of high, we're
---|
0:40:03 | going in the right direction in general
---|
0:40:06 | i'm sure rich or other people can say some more about this
---|
0:40:10 | during the meeting
---|
0:40:11 | something that's interesting, and this will make people from
---|
0:40:15 | yesterday happy: it seems that the
---|
0:40:18 | mlps are more robust to the fact that the transcriptions are imperfect, and they
---|
0:40:22 | take less of a hit than the hmms
---|
0:40:24 | that's sort of an interesting
---|
0:40:25 | observation
---|
0:40:28 | and so we can use this untranscribed, or rather automatically transcribed, data
---|
0:40:33 | to produce
---|
0:40:34 | references for training the mlps; that's really nice
---|
0:40:38 | the hope is this type of approach will allow us to extend to different types of
---|
0:40:42 | tasks more easily, since you don't have to lose the time of collecting the data
---|
0:40:46 | and transcribing the data you collect
---|
0:40:48 | we still have to collect it
---|
0:40:50 | i didn't speak about multilingual acoustic modeling
---|
0:40:53 | which is something that in general we
---|
0:40:57 | chose not to do; for bootstrapping we're just taking models from other languages
---|
0:41:03 | is it better to use multilingual models, can we do better?
---|
0:41:06 | i think it depends on what you have on hand. it would be nice if we
---|
0:41:08 | could, but everything we've tried in babel has gotten worse, so we're a little bit
---|
0:41:13 | disappointed with what we've been doing
---|
0:41:16 | then of course, something i didn't talk about: what do you do with languages that have no
---|
0:41:20 | standard written form? i touched upon it with luxembourgish, for example. i don't really
---|
0:41:25 | know; we're trying to do some work on this, and others are too; there's even
---|
0:41:29 | a paper from around two thousand five, i think, on trying to automatically discover lexical
---|
0:41:34 | units
---|
0:41:35 | but one of the main problems is knowing whether the units are meaningful
---|
0:41:40 | i've said 'i mean' a bunch of times, and i hear myself saying it; our systems are
---|
0:41:44 | going to learn that. there are people that say 'like' all the time, and you know you
---|
0:41:47 | can either keep it or not:
---|
0:41:48 | the word 'like', even if you get it, is meaningful in some cases and
---|
0:41:51 | not meaningful in other cases, and so how do you deal with that, how do you know what's
---|
0:41:54 | useful
---|
0:41:56 | but i think it's really exciting, it's been fun stuff. i hope
---|
0:42:00 | those of you that have worked on unsupervised training will continue, and those of you
---|
0:42:04 | that haven't, maybe give it a try
---|
0:42:06 | and so |
---|
0:42:08 | thank you for giving me the opportunity to speak
---|
0:42:11 | these are all the people i've worked with closely on this work, and there's probably
---|
0:42:15 | other people i've forgotten, sorry
---|
0:42:19 | so thank you |
---|
0:42:28 | thanks lori we have some time for questions |
---|
0:42:39 | you showed a gradual improvement as more data comes in; do you have any idea what is being
---|
0:42:45 | improved? i mean, which words are getting better and which ones stay the same?
---|
0:42:51 | probably not, no, we haven't looked at that; that's an interesting thing to
---|
0:42:55 | look at
---|
0:42:56 | something that we |
---|
0:42:58 | what we would like to do, that we haven't done yet, is to not
---|
0:43:01 | just continue incrementally increasing the amount of data; it would be interesting to even
---|
0:43:06 | change the datasets, just use different random portions; i think we could
---|
0:43:10 | cover it better
---|
0:43:12 | because when your models like something, they're gonna continue liking it
---|
0:43:16 | and having a look at the words would be interesting
---|
0:43:20 | so the question is,
---|
0:43:22 | you know, if you talk to a machine learning person that works in semi-supervised
---|
0:43:25 | learning, they get really nervous when you say self-training or semi-supervision, because there's
---|
0:43:29 | this thing where, if you're starting with something which isn't working that well, it can
---|
0:43:34 | actually go unstable in the opposite direction. so there's this sort of sensitivity:
---|
0:43:39 | if you're starting with a baseline recognizer trained on a small amount that works reasonably well,
---|
0:43:44 | you can improve; if you're starting with something that's working really badly, it can get worse
---|
0:43:49 | and i noticed that a lot of the results that you
---|
0:43:53 | had in this talk were on broadcast news, where you're starting with
---|
0:43:56 | reasonably performing systems. is that true of all these results, for all languages? for some of them we had nothing, we started from zero
---|
0:44:03 | zero hours of transcribed data?
---|
0:44:05 | in language, or in any language?
---|
0:44:08 | okay, so for all these languages here to the right
---|
0:44:13 | we have zero
---|
0:44:14 | in-language data
---|
0:44:15 | and we started with seed models that were context independent, where you use, you know,
---|
0:44:19 | models from another language, reduced to the max. and those seed models give you
---|
0:44:23 | roughly sixty to eighty percent word error rate when you start
---|
0:44:26 | on your data
---|
0:44:28 | so we're starting really high, but
---|
0:44:30 | our language models, even though they're trained on
---|
0:44:34 | newspaper text, newswire text and things like that,
---|
0:44:36 | are pretty representative of the task; there can be very strong constraints in there
---|
0:44:41 | which is why i find the babel work really exciting, even though we haven't
---|
0:44:46 | done it ourselves on the unsupervised part
---|
0:44:49 | it's really exciting because there you don't have this constraint coming from a language
---|
0:44:53 | model
---|
0:44:54 | all you have is a small amount of
---|
0:44:56 | transcriptions, so ten hours of transcriptions. so here the information is coming into the
---|
0:45:00 | system from the text
---|
0:45:02 | and that,
---|
0:45:03 | i personally believe,
---|
0:45:05 | is why it works
---|
0:45:07 | just something to note: if you don't normalise
---|
0:45:10 | correctly... so we had certain situations where people said, i don't know how to pronounce numbers,
---|
0:45:14 | i'm just gonna keep them as numbers
---|
0:45:17 | then convergence is a lot harder; it doesn't work very well. if you say, i'm gonna throw
---|
0:45:21 | away the numbers, which is what some of the people doing the language
---|
0:45:25 | modeling did,
---|
0:45:26 | it also doesn't work so well
---|
0:45:27 | so you really need to have something that represents the language pretty well,
---|
0:45:31 | or close enough, it seems,
---|
0:45:32 | from what we've seen. we also had some languages where,
---|
0:45:36 | we think the problem is, when you take texts that are
---|
0:45:40 | online, sometimes there are
---|
0:45:42 | texts in other languages, and you actually have to filter out the texts that are
---|
0:45:45 | not in the language you're targeting
---|
0:45:49 | did you want to come up, or shall these issues come up during the questions?
---|
0:45:53 | okay
---|
0:45:54 | during the questions then
---|
0:46:06 | and the last question
---|
0:46:12 | depending on the application, at the end of the day you may want to have readable text,
---|
0:46:16 | like for translation or broadcast news
---|
0:46:20 | and at that point, how you handle, think of, names
---|
0:46:25 | is probably more important than
---|
0:46:27 | the word error rate, and also punctuation and case
---|
0:46:30 | let me just say all our systems do case and punctuation
---|
0:46:35 | but we're not measuring the word error rate on punctuation and case here; all the
---|
0:46:39 | systems produce punctuated, case-sensitive output
---|
0:46:44 | the named entities are not specifically detected, but hopefully, if they're proper nouns, they will be
---|
0:46:51 | uppercased if we did the language model right
---|
0:46:53 | so this is something that actually we've
---|
0:46:56 | tried, and that's why i had this slide with the case insensitive and case
---|
0:47:01 | sensitive word error rates; this is about a two percent
---|
0:47:04 | difference
---|
0:47:06 | the punctuation is a lot harder to evaluate. in some work that we
---|
0:47:09 | did with a former colleague,
---|
0:47:14 | trying to evaluate the punctuation produced in the program, it's very difficult: if
---|
0:47:19 | you take two humans, they don't agree on how to punctuate things, maybe apart
---|
0:47:23 | from
---|
0:47:26 | full stops
---|
0:47:28 | those are closer, maybe there's eighty percent inter-annotator agreement, but if you go to
---|
0:47:32 | commas it's way down
---|
0:47:36 | but what you really want is something that's acceptable; you don't really care about a ground
---|
0:47:41 | truth. i think it's sort of like translation: you don't really care exactly what it is
---|
0:47:44 | as long as it's reasonable,
---|
0:47:47 | a reasonably correct punctuation. multiple forms are possible, just like you can
---|
0:47:51 | translate something in multiple ways; if you get one of them that's correct, that's good
---|
0:47:54 | enough. so i think punctuation falls in the same category: whether you put a comma or not after
---|
0:47:59 | something is not very important as long as you can understand it correctly
---|
0:48:03 | so i think it's a really hard problem to evaluate
---|
0:48:07 | in fact even more so than doing something that seems reasonable
---|
0:48:12 | we do the punctuation as a separate pass, in a postprocessing
---|
0:48:16 | step
---|
0:48:18 | and other sites have done it too; bbn has done punctuation, and the other sites also,
---|
0:48:24 | over the years, you know
---|
0:48:28 | but you're right
---|
0:48:30 | okay |
---|