Good morning, everyone. I'm going to give a sort of retrospective, revisiting some of the work we did before; I hope the links between the pieces will come through. Basically, I'll be talking about semi-supervised and unsupervised acoustic model training with limited linguistic resources. As most of us know, there has actually been a lot of research on this over the last decade. So I'm going to talk about some of the experience we've had at LIMSI with lightly supervised and unsupervised training, and give a couple of case studies. The first case study will be on English, which isn't exactly low resource. Then I'll talk very briefly about some different types of lexical units for modeling, such as graphemic versus phonemic units in Babel. And, as already mentioned, I added a slide on acoustic model interpolation, since we were talking about how to deal with all this heterogeneous data. I'll finish with some comments.
Over the last decade or two we've seen a lot of advances in speech processing technologies. A lot of the technology is getting out there from industrial companies and has become commonplace for many people, so people now expect that this stuff really works. It's great that the technology is really getting out there, but people's expectations are really high. We still have the problem that our systems are pretty much developed for a given task and a given language, and we still have a lot of work to do to get good performance on other tasks and languages: as a community we only cover a few tens, maybe fifty or so, languages now. Many times language variants should actually even be considered different languages; it's just easier to develop a system for each variant. We still rely on language resources a lot, but over the last decade or two we've been reducing the reliance on human intervention, so we can build systems with a little bit less human work.
I guess everybody knows this, or maybe not everybody, if there are some people here who aren't working on speech recognition: our holy grail is technology that works on anything, independent of the speaker and of the task; noise is no problem, changing your microphone is no problem. As someone said, maybe fortunately for us researchers, this remains a dream.
But we do have much lower error rates than we had a decade or two ago. We can process many more types of data, with different speaking styles and different conditions. Originally, the work we were doing always required read speech; who needs to recognize something that was read from a text? It doesn't seem that logical now, looking back at it. We cover more languages, and we have a fair amount of work on enriching the output to make the transcripts more usable by humans or machines, which is not exactly the same thing: you may want quite different information if you're doing downstream processing by machines versus producing something for a human to read.
So what's a low-resource language? I don't really have an answer, but I think many of us in this community typically mean that there aren't many electronic resources, so we don't find information online, because that's what we're using now to develop systems. If you speak to linguists, I think you may get very different answers, and I don't really want to get into that. Basically, we need to be able to find data if we want to develop systems. The languages I'm going to talk about are low resource in the sense that the usual agencies like the LDC don't have resources that they distribute (Google probably has them): can we get data online, and can we develop systems with the data that we find online?
I'm not going to really talk about the Babel-type languages or other rare languages where you don't even have established writing conventions, where you don't necessarily have any information about the language except maybe from some linguists who have spoken with some speakers or visited. I guess we'll go a little bit more in that direction with Marianne in the next part of the talk. And of course there are frameworks, pushed from outside, aiming to do speech translation for these sorts of languages with really no resources. So we have little or essentially no available audio data, probably nothing in terms of dictionaries, you don't even necessarily have word lists, and in general very limited knowledge about the language. But you can also consider that many types of data for well-resourced languages, or for language variants, are almost low resource, because we just don't have much available data for them.
So let me take a little step back in time, to the late nineties and early two thousands. One of the questions that you get all the time from funding agencies is: how much data do you need? We tried to answer this, but I don't think anybody really knows; the honest answer is that it depends where you want to be and what you want to do. The funding agencies were, at least at the time, always complaining that data collection is costly: why are you always asking to fund data? It's a recurrent question.
So these are curves we did back in two thousand, showing, with supervised training on broadcast news data in English, how the word error rate evolves as a function of the amount of training data. (Do I have a pointer? The red one? No? Anyway.) You start with the really high number on the left, which comes from bootstrapping the system with just ten minutes of audio data, together with a well-trained language model. The next point is one and a half hours, where you see that the word error rate is about thirty-three percent; then, as we add more data, it goes down, and once we get to fifty or a hundred hours it sort of starts to plateau, so we're getting diminishing returns from additional data. So one thing we can say: once we get to about a hundred hours of data, basically you don't want to spend a lot of money on additional data, because you're just not getting much return.
Once again, this is on broadcast news data, and we had a reasonably well-trained language model, so we're seeing this asymptotic behavior of the error rate. Something we've observed in the community at large is that when you start a new task, you get rapid progress; it's really fun, because the error rates are dropping by twenty or thirty percent and everyone is happy. But once you get to some reasonable level, we're getting about six percent per year: we did some counting, and if you look over, say, ten or fifteen years of progress, it seems the average improvement is about six percent per year.
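To put that six percent in perspective, here is a quick bit of arithmetic (assuming, as seems intended here, a relative reduction that compounds from year to year):

    # Hedged arithmetic: 6% relative improvement per year, compounded.
    wer = 30.0                                 # illustrative starting WER (%)
    after_10_years = wer * (1 - 0.06) ** 10    # ~16.2%, roughly half
    after_15_years = wer * (1 - 0.06) ** 15    # ~11.9%, roughly forty percent

So a decade of steady progress at that rate only cuts the error rate roughly in half.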
That pace is slow, and if we don't want that, additional data should cost less, and we need to learn how to use it better. This is sort of the message we were trying to get across back in two thousand, and I think it's still quite relevant today.
You can think about different types, or levels, of supervision. Way back, people were saying we should use phonetic, or phone-level, transcriptions for training our phone models. That's logical: it gives you more information, so it should be better than using words. And people did that. Our experience, when we did some tests on this using TIMIT-type data, Switchboard, and BREF, a read-speech corpus in French, is that humans liked the manual segmentations and phonetic transcriptions better, but the systems liked the automatic ones better. So basically, if you used the word-level transcription with a dictionary that covers a reasonable number of pronunciation variants, the systems were better than when trained on the manual phonetic transcriptions. Maybe that would not be true nowadays; I don't know, we haven't redone it.
But that satisfied us enough to say, okay, we can go ahead with this approach where we do the standard alignment of word-level transcriptions with the audio. Then, going to the next level, you can have a large amount of carefully annotated data, where large is around a hundred hours or more. You can have some carefully annotated data plus a lot of unlabeled data with approximate transcriptions; I'm going to give some results on this. Or you can have no annotated data at all but be able to find some sort of related texts or related information, or have some small amount of annotated data, and then use this to bootstrap the systems; that will be sort of semi-supervised. This is what we heard about a little bit yesterday, and it's what people have been doing: you basically transcribe raw data with an existing system, you say this is ground truth, and you do your standard training to build the models. This works. There are lots of variants that have been published: you can filter, you can use confidence measures, you can use consensus networks, you can do ROVER, you can use lattices; lots of different variants.
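As a rough illustration of that basic loop, here is a minimal sketch; recognize() and train_acoustic_model() are hypothetical stand-ins for a real decoder and trainer, and the confidence threshold is illustrative:

    # Minimal self-training sketch: decode raw audio, keep confident output,
    # treat it as ground truth, retrain, and repeat.
    def self_training(seed_model, raw_audio, iterations=4, threshold=0.7):
        model = seed_model
        for _ in range(iterations):
            training_set = []
            for utt in raw_audio:
                hyp = recognize(model, utt)  # hypothetical decoder call
                # Keep utterances whose words all look reasonably confident.
                if hyp.words and min(w.confidence for w in hyp.words) >= threshold:
                    training_set.append((utt, [w.text for w in hyp.words]))
            model = train_acoustic_model(training_set)  # hypothetical trainer
        return model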
I listed some of the early work; in my recollection it was people involved in the EARS program and people involved in projects in Europe. If I forgot people, I'm sorry, I don't mean to, but these are the early adopters of this type of activity that come to mind.
If we just go back to supervised training, and I think most people in this room know this, so I'm not going to stand on it for a long time: you need to normalize the transcriptions, but you do that for the language model anyway, so that's not so bad. You need to do things like creating a word list, you need to come up with phonemic transcriptions, and, in the old days, when we spotted errors in the transcripts we actually spent time correcting them, because we only had thirty or fifty hours and we thought it would gain us something. I think young people today wouldn't even think about doing this, but we spent a lot of time on it. And then you do your standard training.
So this shows results using what we called lightly supervised training. With a fully trained system, the word error rate with manual transcriptions was eighteen percent. We used closed captions to train the language model; I'm showing here several different variants we did. So what we had was a sort of approximate transcription. In these numbers we started with, I think, ten hours of original transcribed data, and we then automatically transcribed varying amounts of raw, unlabeled data. We can use that data unfiltered, which is close to unsupervised training; or we can do what we'd call lightly supervised: keep only the portions where there is not too much of an error rate difference between the transcript we generate and what the caption was. We did this at a sort of phrase or segment level: we kept for training only the segments where the word error rate of the automatic transcription, measured against the captions, was less than some threshold X; I don't remember exactly what it was, probably less than twenty or thirty percent. You can see that we get pretty close: within ten percent absolute of training on the manual transcriptions. In fact, what we mostly do now is just not bother filtering; it's easy, you just train on everything, and it seems to give about the same type of results.
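For concreteness, the selection step looks something like the following sketch, where recognize() and word_error_rate() are hypothetical placeholders for the decoder and a standard WER scorer:

    # Lightly supervised selection: keep segments where the automatic
    # transcript agrees well enough with the approximate closed caption.
    def select_segments(segments, model, max_wer=0.3):
        kept = []
        for seg in segments:
            hyp = recognize(model, seg.audio)   # hypothetical decoder
            # Score the hypothesis against the caption as if it were a reference.
            if word_error_rate(seg.caption, hyp.text) <= max_wer:
                kept.append((seg.audio, hyp.text))   # train on the ASR output
        return kept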
A measure that was introduced by BBN is called the word error rate recovery: basically, you look at how much of the improvement you get from supervised training is also obtained with unsupervised training, starting from the same initial point. What we get here is about eighty-five to ninety percent: we're recovering most of what we could have gotten had we done supervised training.
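The measure itself is simple; here is a small sketch, using the numbers from the ten-minute experiment later in the talk as an illustration:

    def wer_recovery(wer_seed, wer_unsup, wer_sup):
        """Fraction of the supervised gain achieved by unsupervised training."""
        return (wer_seed - wer_unsup) / (wer_seed - wer_sup)

    # e.g. seed at 65%, unsupervised at 37.4%, supervised at 30%:
    # (65 - 37.4) / (65 - 30) = 0.79, i.e. about 79% of the gain recovered.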
One problem with this work is that there was some knowledge in the system: we did have prior knowledge in the dictionary, and we did have a pretty good language model that was close to the data; it wasn't exactly the same data, but it was close. So we discussed this, I think it was at an EARS meeting, or maybe a conference; mostly I was discussing it with Rich Schwartz, and we said, well, let's take it to an extreme: let's see if we can use one hour of training data, or ten minutes of training data. We were crazy enough to do this at the time, and it was a lot of computation, because every time you evaluate with a different language model, or with different amounts of data, you have to decode the data multiple times. These days it would be very easy to run one of these experiments, but at the time it took time. So here we see that if we start with a ten-minute bootstrap system, we get a word error rate of sixty-five percent. We actually didn't think this would work when we did it, but it was okay: we just took some raw data, three to four hours of it, and trained on that. In fact it was an improvement to throw away the original ten minutes; it was just more complicated to merge the models than to drop them. We took the three to four hours of automatic transcripts, kept growing the set, and stopped at about a hundred and forty hours, where we got 37.4 percent. If you use the same language model with the full training data, supervised, you only get to about thirty percent. So we're getting a pretty good part of the way to where we need to be, and we were happy with that.
At about this time a student came to LIMSI, and while we didn't really apply this method in his work, since we didn't have enough audio data, we did try to look at questions that we'd been asked for a long time: how much data do you need to train models? What improvement in performance can you expect when you have limited resources? What's more important, audio data or text data? And how can you improve the language models when you have very little? This was around two thousand four, if I remember correctly, and we had available what I guess we'd consider a reasonably small amount of data: thirty-seven hours of audio, which was reasonably good but not a lot, about five million words of texts, and not much else.
One of the first things he did was to look at the influence of the audio transcripts versus additional texts on the out-of-vocabulary rate. On the left we have the OOV rate. The curves correspond to two hours, ten hours, and thirty-five hours of transcripts, which give about seven K, twenty K, and fifty K words respectively, and they show how the OOV rate goes down as you add more data. Along the bottom x-axis we're adding in different amounts of text data, up to the roughly five million words of text sources. If you start with just two hours of audio transcripts and you add ten K words of text, you don't really lower your OOV rate much; with a hundred K you gain a little bit more, et cetera, et cetera. If you have ten hours of transcripts, there isn't much of an effect. So you see that the effect of adding the text data is less than that of adding the audio transcripts. There's a caveat: we probably have some sort of mismatch, since the audio data is the same type of data we're trying to recognize, while the text data is related but not really the same.
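For clarity, the OOV rate here is just the fraction of running words not covered by the vocabulary; a minimal sketch:

    def oov_rate(vocabulary, running_words):
        # vocabulary: set of known word forms; running_words: list of tokens.
        unknown = sum(1 for w in running_words if w not in vocabulary)
        return unknown / len(running_words)

    # The experiment varies the vocabulary: words seen in 2/10/35 hours of
    # transcripts, optionally extended with frequent words from added texts.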
Here's another curve where we're trying to look at the amount of audio data versus text data for the language model; it's a little bit complicated. The top curve uses just two hours of audio data in the acoustic model; the green one, once again, is ten hours, and the red is thirty-five. You can see that even if you add in more text data, you're not really improving the word error rate much. So we asked: is the limitation coming from the acoustic data, or is it coming from the transcripts? We know the texts are less close to the task. The purple and the blue curves use the language model transcripts from the ten hours and from the thirty-five hours, and we can see that if you only have two hours of audio data, it's just not going to do very well; you need more. Once you get to ten hours, it seems the additional improvement you get is a little smaller. This is interesting because it's close to what's currently being used in the Babel project, where we're actually working with ten hours for some of the conditions.
Let me spend a few minutes on some other work he did on the language modeling side, on word decomposition. Amharic has a very rich morphology and lots of word forms, so you get high out-of-vocabulary rates. A problem for language modeling is that rare forms are not very well modeled, so it's interesting to use word decompounding. When you look at the literature you see mixed results across languages: you get nice gains for some, and for others you don't always get a gain; you get the OOV reduction, but you don't necessarily get it in word error rate. One idea we had in this work was to try to avoid, when you generate the units, creating units that are easily confusable. That's what I'm going to give a couple of ideas about. To do this we built matched conditions: we retrained the language models and the acoustic models for all the conditions.
So this is showing: we used the Morfessor algorithm, which was relatively recent at the time; basic Morfessor decomposes the words. Then there's another method, referred to as Harris, where basically you look at the number of distinct letters that can follow a given string; that gives you an idea of the perplexity at that point. If a lot of different letters can follow, you're likely at the beginning of a new unit; if not, you're likely still within the same morph.
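Here is a sketch of that successor-variety idea, assuming a plain word list as input; the threshold is illustrative:

    from collections import defaultdict

    def successor_variety(words):
        # For each prefix, collect the distinct letters that can follow it.
        followers = defaultdict(set)
        for word in words:
            for i in range(1, len(word)):
                followers[word[:i]].add(word[i])
        return {prefix: len(nxt) for prefix, nxt in followers.items()}

    def split_points(word, variety, threshold=4):
        # A position where many different letters can follow the prefix is
        # likely a morph boundary.
        return [i for i in range(1, len(word))
                if variety.get(word[:i], 0) >= threshold]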
We also tried using distinctive feature properties to bring some speech information into the decomposition, and looked at phonemic confusion constraints that were generated using phone alignments. Basically, if a split would produce two sequences that are easily confusable phonemically, differing only in phones that the alignments show are commonly confused, the split is not allowed; if the resulting units are not easily confusable, the split is okay. These constraints are relatively language-independent, but of course you do need to know the phonemes of the language, or at least to have some phoneme set for it.
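And here is a sketch of how such a constraint can veto a split; the confusable-pair table and phonemize() are hypothetical stand-ins for the phone-alignment machinery used in the real work:

    CONFUSABLE = {("b", "m"), ("p", "b")}    # illustrative phone pairs only

    def confusable(phones_a, phones_b):
        # Two equal-length phone strings differing only in confusable pairs.
        return (len(phones_a) == len(phones_b) and
                all(a == b or (a, b) in CONFUSABLE or (b, a) in CONFUSABLE
                    for a, b in zip(phones_a, phones_b)))

    def split_allowed(parts, lexicon_phones):
        # Reject splits whose pieces would be near-homophones of real entries.
        for part in parts:
            part_phones = phonemize(part)    # hypothetical G2P call
            if any(confusable(part_phones, entry) for entry in lexicon_phones):
                return False
        return True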
This looks at what happens, in terms of the number of tokens you get after the splitting, for the different configurations, and at their lengths. Length here was represented in phonemes, with everything mapped to phonemes, which is why you see values of two, four, six, eight, et cetera. The main point: the baseline, plain words, is the black curve. Once you start cutting, the units get shorter, as you'd expect; that's the goal. And if you use the confusion constraints, the green and purple curves, the distributions are in general a little less shifted to the left, so we're creating slightly fewer very short units, which was what we were trying to do.
Then here's a table that we probably don't want to go into too much, but if you look at the baseline system, we had 22.6 percent; all the numbers are relatively close. If you split without constraints, the error rate in general gets worse; those are the black rows. Using the distinctive features didn't really help either. The only configurations that were slightly better than the baseline were the ones using the phoneme confusion constraint. So it's really important to avoid adding confusion and homophones to your system; that's sort of the take-home message for this.
The other thing is that we got a fifty percent reduction in OOV rates. That was good, except that we were also introducing errors and confusions on the little affixes, and that was compensating for the OOVs we recovered. We did some studies looking at the previously-OOV words: about half of them were correctly recognized using this method, but that gain was offset by errors newly introduced elsewhere.
Just one more slide to sort of wrap up this work; it was more logical to put it here in the talk. We've used unsupervised decomposition, once again usually based on Morfessor or some modifications of it, for Finnish, Hungarian, German, and Russian. For the first three languages we got reasonable gains, between one and three percent, and we could reduce our vocabulary sizes from seven hundred thousand or two million words down to around three hundred thousand, which is a little more usable for the system and probably easier to train reliable estimates for. For some of them we did acoustic model retraining; we did not for German. For Finnish we tried both with and without acoustic model retraining, and both times got about a three percent difference using the morphologically decomposed system, whether or not we retrained the acoustic model. Interestingly for us, Morfessor worked well for Finnish, I think in part because the authors are Finnish, so the output was maybe designed with it in mind. We also tried this on Russian CTS, conversational telephone speech for people who might not know the acronym, where we only had, at the time, about ten hours of training data. We got a reduction in the OOV rate and were able to use a smaller vocabulary, but we didn't get a gain in word error rate. Once again, though, that was very preliminary work.
So now I'm going to shift gears. Where am I on time, fourteen minutes gone? Okay, I'll go a little faster.
Let me speak for a few minutes about Finnish, one of the first languages for which we had no transcribed audio at all, only untranscribed audio. We found some online data with approximate transcripts, coming from a source originally intended for non-native learners of Finnish. There was no transcribed development data either. So how were we going to do this? For many companies it's easy to hire someone to transcribe some data; for us it's not: it takes time to find the person, and for government research labs the hiring is complicated. So can we get ahead by doing something simpler? What we did was use these approximate transcriptions not just for the unsupervised training but also for development. And once again, as I said before, we used morphological decomposition for this. Here's a curve showing the estimated word error rate as we increase the amount of unsupervised training data: two hours, five hours, and then it sort of stabilizes once we get to around ten to fifteen hours. We're getting a gain; the measure is approximate, but it's going in the right direction. About two months later we had somebody come in and transcribe data for us. It was a two or three hour set, not a lot, and it still took a while, first to get the person and then for them to do it. And you can see that the error rate measured on the human transcripts follows exactly the same curve. The true error rates are higher, so we were underestimating, because, as was done for selecting the unsupervised training data, we measured on regions where there was a good match between the approximate transcriptions and what the system produced. But that didn't matter, because it allowed us to develop without having to wait for transcribed data to become available.
So the message there is that unsupervised acoustic model training worked reasonably well using these approximate transcripts. It has since also worked for the language models: we can improve our language models using this sort of approximate transcription. We then added some cross-lingual MLP features to the system, trying both French and English, and got about a ten percent improvement; and, as I said before, we used the morphological decomposition.
So now I'm going to talk about another language that's also considered somewhat low resource: Latvian. This was work done with a colleague at LIMSI who is Russian, so Latvian was sort of an interesting language for him; basically we were starting from almost nothing. The usual agencies don't distribute corpora for it, but you can find text and audio on the net, so it was something we could reasonably do. It's a Baltic language with not so many speakers, about one and a half million. It's a complicated language morphologically, with a lot of cases, and I forget half of it, but the pronunciation is reasonably straightforward.
Here's sort of an overview of the language models. We found a fair amount of data: 1.6 million words of in-domain data and 142 million words of newspaper text, where in-domain means it comes from the sites of radio and TV stations, and the rest is just newspapers. We used about a five-hundred-thousand-word vocabulary, just keeping words that occurred more than three times. The text processing was more or less standard. However, and this is really important, if you don't do the text processing carefully, you have problems getting the unsupervised training to converge; that seems to be our experience. It was pretty much standard language models; we threw in some neural network language models at the end, so given the talk we heard earlier, that line may be of interest.
This figure shows the word error rate as a function of the training iteration; the circles give you roughly the size of the acoustic models. We were roughly doubling, at each step, the amount of audio data used in an unsupervised manner. At the level marked here we added in the MLP features from Russian. Our initial seed models came from a mix of three languages, English, French, and Russian, and the audio data went from about sixty hours at the early stage to about seven or eight hundred hours, raw, at the last stage; you only end up using about half of it when you build the models. And of course something that's important is to increase the number of contexts and the number of states you model at the same time: it doesn't suffice to just add more data and keep the model topology fixed; you don't get much of a gain from that.
Afterwards we did some additional parameter tuning, two-pass decoding, and used a four-gram LM, and you can see the gains over the original. Here 'ci' is case-insensitive and 'cd' is case-sensitive, not context-independent and context-dependent: we're looking at the word error rate taking case into account, because for text people are going to read, you really want the case to be correct, and even for search engines it's sometimes important, because you want to know whether something is a proper name or not. For people interested in neural network language models: they gave about a one and a half to two percent gain. That's on dev data, and when we validated we got pretty much the same results. So we were happy with that: it's completely unsupervised, and we developed the system in less than a month. At the end of this we tried roughly the same thing for Hungarian. We used seed data from five languages; we had less audio data, so we only went to about three hundred hours; and we originally used an MLP trained on English, then used the transcripts at that level to train a Hungarian MLP on the unsupervised transcriptions, which got us another 2.8 percent or so.
Just to give an overview, here are some results from the program, which some of you are aware of, I'm sure, and some of you less so. The systems to the left are trained on supervised data, where the amount of supervised data varies from fifty to a hundred fifty or two hundred hours depending upon the language. The ones on the right are all trained unsupervised. The lower line is the average error rate across the test data, about three hours per language. You see the rates going up as you move right: the unsupervised systems on the right are in general a little bit higher, while the ones on the left not so much; they're pretty good. Bulgarian and its neighbors are a little bit higher here, and then there's Luxembourgish, which I'll come back to in a few minutes. If you look at the lowest word error rates, the segments we had from TV and radio news, they're pretty low; in fact some of the unsupervised ones are even below the supervised ones. And finally, the worst-case word error rate is still pretty high, so we've still got a fair amount of work to do. These data are mixed news and conversations, and some languages are more interactive than others, things like that. I'm going to skip the next slide, which has too much stuff on it; it showed the amounts of data we used, so if people are interested, come find me later.
Now I want to say two or three words about dictionaries. One of the costly things in this whole process is the dictionary, and so more recently there's been a growing interest in using graphemic-type units rather than phonetic units in systems. The first work that I found was Kanthak and Ney; maybe people are aware of earlier work doing this that I'm not. This avoids the production of the pronunciation dictionary: basically, the G2P problem becomes a text normalization problem. You can have numbers and things like that, so you have to convert dates and times and all those types of things into words, and once you've done that, the letters themselves are your units.
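Once the normalization is done, the dictionary itself becomes almost trivial; a minimal sketch, assuming the words have already been normalized:

    def graphemic_lexicon(normalized_words):
        # Each word is "pronounced" as its sequence of letters, so the whole
        # G2P problem reduces to the text normalization done beforehand.
        return {w: list(w.lower()) for w in normalized_words}

    # e.g. graphemic_lexicon(["okay"]) -> {"okay": ["o", "k", "a", "y"]}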
We did this at LIMSI for Turkish, Tagalog, and Pashto within the Babel program, and, like other previous studies, we got roughly comparable results in general; but for some languages we actually do better with the graphemic systems than with the phonemic systems. In fact, I should mention that back in the GALE days there was already work using graphemic systems. Here are some results we have on Pashto in the Babel program: we had a two-pass system using the BBN voice activity detection and bottleneck-type features, and we used both graphemic and phonemic systems. You can see that they're about the same; the phonemic is about one point higher than the graphemic. But if we do a two-pass system, where we cross-adapt between the phonemic and the graphemic, we actually get a reasonable gain from it. We believe that one of the problems in Pashto was actually the pronunciation generation: when the pronunciations are bad, or there's a lot of variability you don't capture, that's where the graphemic systems can actually outperform the phonemic ones.
So let me now speak about Luxembourgish, just because it's fun. This is work done with Martine Adda-Decker, who is from Luxembourg, for those of you who know her. It's a little country with not too many people, but it's a really multilingual environment: when people go to school, their first language of instruction is German, and then, I believe, French and English come in at later stages; but they speak their local language, Luxembourgish, among themselves. And apparently, even though it's a tiny country, and maybe I'm exaggerating a little bit, you even have multiple dialects in different regions. The first studies we did were segmentation experiments to see which languages are favored by Luxembourgish data. We had basically no transcribed data at the time, so we transcribed ten or fifteen minutes of the data ourselves, and then we made some approximate phone mappings. Some Luxembourgish phonemes map pretty well onto French, English, and German; for others, say a vowel that doesn't really exist in English but can be seeded from German or French, we used the closest English phone. That gives a mapping where we have the same number of phonemes, sort of the same phoneme inventory, for each language.
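As a sketch of the kind of mapping table this produces (the phone symbols are made up for illustration, not the actual Luxembourgish inventory):

    # For each Luxembourgish phone, the closest phone in each seed language,
    # so the three seed inventories line up one-to-one.
    PHONE_MAP = {
        "a":  {"de": "a",  "fr": "a",  "en": "a"},
        "oy": {"de": "OY", "fr": "wa", "en": "oy"},  # no close French match
    }

    def seed_inventory(language):
        # The per-language model set used in the parallel alignment.
        return {lb_phone: m[language] for lb_phone, m in PHONE_MAP.items()}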
Then we said, okay, we build the models and put them in parallel, so we have a superset of models, and we try to align the Luxembourgish data with them, using the small amount we had transcribed, and see which languages are preferred. I don't know that much about the language myself, but you can have French words inserted in the middle, so apart from the languages being related, they're also mixed. So we allowed the language to change at word boundaries: within a word you had to use the phones from a given seed model, but the language could change at word boundaries. We found, as you might expect since Luxembourgish is a Germanic language, that the segmentation in general preferred German; second was English, which is closest; but there's about ten percent that went to French. The preference for English typically showed up on diphthong-like sounds; we don't really know exactly why.
Based on that, and a couple of years later, once we'd gotten some transcribed broadcast news data in Luxembourgish, we took our seed models, which are context-independent and tiny, so they're not going to perform well, and just decoded the two or three hours of data. You can see that the word error rates are high, as you'd expect, in the right range for the amount of data and for context-independent models, but the German models were preferred. We also built pooled models, estimated on the data from the languages together, and those did a little better than the German ones. However, before we knew this, because we didn't have that data when we started, we had used the pooled models to do the automatic transcription. Once again we applied our standard techniques, and you can see that we go from about thirty-five to about twenty-nine percent word error rate by doubling the data, increasing the number of contexts, adding MLP features, et cetera; and we were able to model more contexts.
But it's still kind of high compared to some of the other languages, so Martine looked at a classification of the errors, and you can see that basically there are a lot of confusions between homophones. Some of the data is pretty interactive; it's not the same broadcast news data we have for other languages, and there are human production errors: people making false starts, repetitions, mispronouncing something, or speaking at a distance from the microphone. And a large percentage, we estimated somewhere between fifteen and twenty percent, are writing variants, because Luxembourgish is mostly a spoken language: are these really errors or not? Here's an example of some of the writing variants of the word for Saturday; I'm not going to try to pronounce them, and probably not correctly if I did. All of these are allowable written forms: depending on the regional variant you can say it with one initial consonant or another, you can change the vowel, and all of these are accepted in the written form; we find them in the texts, and in what people say. So even though I don't know if people really consider this a low-resource language, there's not much data, almost none available; the language is used for speaking but not really for writing. How am I doing for time? I think it's okay.
I'm going to speak for one minute about Korean, where we're trying to do the same thing, but once again we don't have transcribed dev data. We were doing a study looking at the size of the language model to use for decoding in the unsupervised training: word language models from roughly a hundred twenty thousand words up to two million words, or a two-thousand-unit character language model, just for the decoding. We also looked at using phone units versus half-syllable acoustic units. The only data we had was from the LDC, a roughly ten-hour data set. If we train a standard model on it, holding out the last two files for dev, because we didn't have anything else, we get a word error rate of about thirty percent and a character error rate of about twenty percent on that data. That's probably optimistic, because the dev data comes from the same recordings as the training data, just the last two files, so it's very close to it. So here we're increasing the amount of data we use from the web, and looking at the influence on the word error rate and the character error rate when different-size language models are used for the unsupervised transcription. You can see that for the two-hundred-thousand-word and the two-million-word models it's about the same; sorry, these results are all decoded with the same final language model, so it's only what we use for the unsupervised training that's changing. The results are basically the same as we go through the same data, but the character language model, which lets us skip the word segmentation step, is doing slightly better in terms of character error rate than the others. We don't know if that's real; we need to look into it some more, since this is really recent work. And for people who think it's easy to get transcribers: we've spent a month now looking for someone in France who has the right to work and can transcribe Korean for us. We finally found someone, and they can't start working until February. So yes, it sounds like an easy thing to do, but depending on your constraints for hiring, it's not so easy. We're going to follow up on this more; we hope to have some clearer results in the end.
Two words about acoustic model interpolation, because we spoke earlier about having this heterogeneous data: how do we combine data from different sources? Rich made the statement that you want to use all of it, you don't want to throw data away, but you need data weighting. One way of doing it is just to weight some of the data more and some of it less. A PhD student at LIMSI has been working on acoustic model interpolation, and had a paper at Interspeech, I think, looking at whether, instead of pooling everything, you can train on different subsets and then interpolate the models. He used this on European Portuguese: the baseline with pooling gave 31.7 percent, and the interpolation gave about the same result, but it's easier to deal with, because you can train on smaller data sets and then interpolate; it's the same idea as what's been done for language modeling for years now.
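At the parameter level the idea is directly analogous to language model interpolation; here is a sketch for GMM mixture weights over a shared set of Gaussians (the two-source setup and the weights are illustrative, not the recipe from the paper):

    import numpy as np

    def interpolate_mixture_weights(weight_vectors, lambdas):
        # weight_vectors: per-source mixture weights over shared Gaussians;
        # lambdas: interpolation weights, tuned on held-out data, summing to 1.
        combined = sum(lam * np.asarray(w)
                       for lam, w in zip(lambdas, weight_vectors))
        return combined / combined.sum()    # renormalize for safety

    # e.g. favour the better-matched source:
    # interpolate_mixture_weights([w_broadcast, w_telephone], [0.7, 0.3])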
We also looked at what happens when you interpolate across different variants of English; this is some work that was published back in two thousand ten. Basically, we gained a little bit for some of the variants and didn't degrade on any of them with respect to the pooled model, whereas with MAP adaptation we actually degraded a little on one or two of the variants; I don't remember which ones.
So let me finish up. I guess the take-home message is that unsupervised acoustic model training has been successful. It's been applied a lot to broadcast-news-type data, and more recently, in the Babel project, to a wider range of data. I think it's really exciting that we can do this. We still have to find the data, but it's really nice that this type of approach works. The error rates are still kind of high; but even though they're high, we're going in the right direction in general, and I'm sure Rich or other people can say some more about this during the meeting. Something interesting, and this will make some people from yesterday happy, is that the MLPs seem more robust to the fact that the transcriptions are imperfect: they take less of a hit than the HMMs, which is a sort of interesting observation.
So we can use this untranscribed, automatically transcribed data to produce references for training the MLPs, which is really nice. Our hope is that this type of approach will allow us to extend to different types of tasks more easily, since you don't have to spend the time of collecting the data and then transcribing what you collect; though we still have to collect it. I didn't speak about multilingual acoustic modeling, which is something that, in general, we chose not to do; for bootstrapping we just take models from other languages. Is it better to use multilingual models, can we do better? I think it depends on what you have on hand. It would be nice, but everything we've tried in Babel has gotten worse, so we're a little bit disappointed with what we've been doing there.
Then, of course, there's something I didn't talk about: what do you do for languages that have no standard written form? I touched on it with Luxembourgish, for example. I don't really know; we're trying to do some work on it, and others are too. There's even a paper of ours from around two thousand five, I think, on trying to automatically discover lexical units. But one of the main problems is knowing whether the units are meaningful. I've caught myself saying, a bunch of times, that our systems are going to learn that people say "like" and "you know" and run them into the neighboring word; the word-like unit you get is meaningful in some cases and not meaningful in other cases, so how do you deal with that, and how do you know what's useful?
But I think it's really exciting; it's been fun stuff. I hope those of you who have worked on unsupervised training will continue, and those of you who haven't will give it a try. So thank you for giving me the opportunity to speak. These are all the people I've worked with closely on this work, and there are probably other people I've forgotten; sorry. So thank you.
Thanks, Lori. We have some time for questions.
Question: You showed a gradual improvement as more data comes in. Do you have any idea of what is being improved; I mean, which words are getting better and which ones stay bad?

Answer: Probably not; no, we haven't looked at that, and it's an interesting thing to look at. Something we would like to do, and haven't done yet, is to not just incrementally increase the amount of data, which is what we normally do, but to actually change the data sets, just using different random portions. I think we'd cover things better, because once your models like something, they're going to continue liking it. Looking at the words would be interesting.
Question: If you talk to a machine learning person who works on semi-supervised learning, they get really nervous when you say self-training or semi-supervision, because you're starting with something that isn't working that well, and it can actually go unstable in the opposite direction. So there's this sensitivity: if you're starting with a baseline recognizer trained on a small amount that works reasonably well, you can improve; if you're starting with something that works really badly, it can get worse. And I noticed that a lot of the results in this talk were on broadcast news, where you're starting with a reasonably performing system. For all these languages, how much transcribed data did you start with in the language?

Answer: We had nothing; we started from zero, zero transcribed data in the language. For all the languages here to the right we have zero in-language data, and we started with seed models, context-independent phone models taken from another language and reduced to the bare minimum, so you're at roughly sixty to eighty percent word error rate when you start on your data. So we're starting really high. But our language models, even though they're trained on newspaper text, newswire text, and things like that, are pretty representative of the task; there can be very strong constraints in there. Which is why I find the Babel work really exciting, though we haven't done the unsupervised part ourselves there: it's really exciting because they don't have this strong constraint coming from a language model. All you have is a small amount of transcriptions, around ten hours, so there the information coming into the system is coming from the text, and that, I personally believe, is why it works.
One thing to add: if you don't normalize correctly, and we had situations where people said, I don't know how to pronounce numbers, so I'm just going to keep them as digits, convergence is a lot harder; it doesn't work very well. If you say you're going to throw away the material with numbers, which is what some of the people doing the language modeling did, it also doesn't work so well. So you really need something that represents the language pretty well, it seems, from my experience. We also had problems for some languages when you take texts that are online: you get texts in other languages mixed in, and you actually have to filter out the texts that are not in the language you're targeting.
Did you want to come up during the questions? Okay. We have time for one last question.
Question: Depending on the application, at the end of the day you may want to have readable text, for example for translation or for broadcast news, and at that point things like getting the names right are probably more important than a percent of word error rate. What about case and punctuation?

Answer: All our systems produce case and punctuation, but we're not measuring the word error rate on the punctuation; all the systems produce punctuated, case-sensitive output.
The named entities are not specifically detected, but hopefully, if the proper nouns were uppercase in the language model training data, they'll come out right. That's actually why I had that slide with the case-sensitive and case-insensitive word error rates: there's about a two percent difference between them. The punctuation is a lot harder to evaluate. In some work we did with a colleague who has since left, we tried to evaluate the punctuation against references, and it's very difficult: if you give the task to humans, they don't agree on how to punctuate things, maybe especially not for speech. For full stops the inter-annotator agreement is closer to eighty percent, and if you go to commas it drops considerably. So it's very hard. But what you really want is something that's acceptable; you don't really care about matching a single ground truth. I think it's sort of like translation: you don't really care exactly what it is as long as it's reasonable, reasonably correct punctuation where multiple forms are possible, just as you can translate something in multiple ways, and if you produce one of them, that's correct. I think punctuation falls in the same category: whether or not you use a comma after something is not very important as long as it can be understood correctly. So I think it's a really hard problem to evaluate; in fact, evaluating it well is even harder than producing something that seems reasonable.
Question: Do you do the punctuation as a separate component? Answer: Yes, the punctuation is done in a post-processing step. Other sites have done it too; BBN has done punctuation, and other sites as well, over the past ten years or so. But you're right. Okay.