"'kay" uh first a couple of quick disclaimers uh this is uh work mostly done by a look the us and that to much we love who give the previous talk uh a with a lot of help from a the fund and from no and i just have the pleasure of sort of pretending that that have something to do that and going to talk uh uh not actually couldn't come because of some uh i is a problems not on the check side but going back to the U S so he sends apologies and uh he even made the slide so apologies to then then V of the slides a two nice i don't think he's trying to hide any alright right so uh you had to much tell you how the and neural net for gives is great performance one of the issues with the model like that is that you know it has essentially at least theoretically infinite memory and and the other it does depend on the past five seven eight word so you really can't do lattice was coding with the model like this so the main idea about this paper is can we do something but the neural net put language model so that we can rescored scored is but and if you want the idea and a nutshell this whole variational approximation is a scary term i don't know how one came up with that it's actually very simple idea imagine for a moment that the really was a true language model that generate all the sentences that you are nice speak right how do we actually but but such a mark what we do is we take a lot of actual text sentences so it's a little bit like saying we sample from this true underlying distribution they get a bunch of sentence and then we approximate that and the line model with the markov chain like be do it second third for of them sort are the markov chain and that's it and them approximation of the true underlying model that you and i believe looks are not heads are better right i so that have is the same you to much as uh a neural net language model and pretend that that's a to model of language generate lots and lots of the data from that model instead of having human beings i to text and simply estimate a n-gram model from that that's essentially a long and short the paper right so i'll tell you how it works oh yeah it is a house statistical i okay i will trying to do this this way no that's okay it's not okay so a be get annotated speech you use the generative training procedure to create acoustic models we get some text because you didn't it to training to do a language model the combine them figure out some scale factor we get new speech we it to the decoder which has all these three models produce transcribed utterance and essentially via are implementing this formula P of a given W to the part mu P of W in return find the them that just standard when the last now the language model is typically as i said approximated using an n-gram so we take P of W light it as W being a whole sentence did this P of W I given W you want to write minus one the pretty approximated by some uh a like in the last couple of words in the history that gives rise to and the models and typically uh we use small and then small number of and n-grams in order to make the decoding feasible and the you and we get work quality so means instead create a so search space like a lattice this and we score the lattice using a bigger than n-gram again standard to practise everybody get knows there's nothing uh so this is asking can we go beyond n-grams gonna use something like to much as neural net to do lattice rescoring like at or we could perhaps talk about using more complex acoustic model and all these a feasible but uh 
All of these are feasible with lattices, provided your models are tractable in the sense of being local: phones, even when they have a five-phone context, are local; n-grams, even for n equal to five, are local. But if you truly have a long-span model, that's not possible — you can't do it on word lattices — so you tend to do it on n-best lists. N-best lists have the advantage that you can deploy a really long-span model, but they do have a bias, and this paper is about how to get past that. So the real question is: can we do with lattices what we currently do with n-best lists? Let's see how we get there.

So, what's good about word lattices? They provide a large search space, and they are not biased toward the new model that will be used to rescore them, because they have a lot of options in them; but they do require the new model to be local. N-best lists don't require the new model to be local — the model can be really long-span — but they offer a limited search space, limited by N, and more importantly the top N are chosen according to the old model, so the choice of the top-N hypotheses is biased.

In some other work, which Anoop is presenting in another paper at this conference — that session is going on right now, so you might be able to catch it after this one — he shows how to essentially expand the search space, how to search more than just the N best hypotheses in the lattice. That is like saying: let's do lattice-like things with n-best lists, and somehow make N very large without actually having a huge list — let's fix the decoding with some magic. This paper goes in the other direction: it says, let me somehow approximate the long-span model and make it look local. Yes, the neural net model has a long history, but can I get a local approximation of that model? That's what is at stake here.

So here is the outline of what's coming up. The first order of business: which long-span model are we going to approximate? I already told you — it's the recurrent neural net model that Tomáš just presented, so you all know about it and I won't spend time on it.

So what is the approximation going to be? Think of it this way. This is the space of all language models; this set is the set of all recurrent neural networks, and this other set is the set of all n-gram models; and here is the model you would like. Typically you can only decode and rescore lattices with the n-gram models — so there's a check mark against that set; it's tractable, and the blue set is not. So what we should really be using is a tractable model which is as close as possible to the model we actually like. But the tractable model you normally use for decoding may not be that model: from the same training data you estimate a recurrent neural net model P and an n-gram model M, and M may not be close to Q*, the n-gram that is closest to your long-span language model, whatever it might be. So what we do in this paper is ask: what happens if we use Q* instead — what if we approximate the model P itself with an n-gram model? That's how you want to look at it. Here is how the approximation works: you look for an n-gram model Q, among the set of all n-gram models, which is closest to the long-span model P in the sense of the Kullback-Leibler divergence — familiar enough to everybody here.
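Written out (my notation, reconstructed from the talk rather than copied from the slides), the objective and the sampled approximation that the rest of the talk relies on are:

    Q^{*} = \arg\min_{Q \in \mathcal{Q}} D(P \,\|\, Q)
          = \arg\max_{Q \in \mathcal{Q}} \sum_{x} P(x) \log Q(x)
          \approx \arg\max_{Q \in \mathcal{Q}} \frac{1}{N} \sum_{i=1}^{N} \log Q(x_i),
    \qquad x_1, \dots, x_N \sim P

where \mathcal{Q} is the family of n-gram models, P is the long-span (neural net) model, x ranges over whole sentences, and the x_i are sentences sampled from P. Maximizing the sampled sum over the n-gram family is just (smoothed) maximum-likelihood estimation of an n-gram on the synthetic corpus, which is exactly the recipe described next.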
So what do you actually do? The KL divergence between P and Q is basically just a sum over all x of P(x) times the log of P(x) over Q(x), where x ranges over all possible sentences — remember, these are sentence-level models. If you forget about the P-log-P term, what you really want is the Q that maximizes the sum, over all sentences in the universe, of P(x) log Q(x), where P(x) is the neural net probability of the sentence and Q(x) is the n-gram probability of the sentence.

Of course you can't compute P(x) for every sentence in the universe. But what you can do is exactly what we do with normal human language models: we approximate them by "sampling" data from the underlying model — namely, getting people to write down text — and estimating an n-gram model from that text. So we will synthesize sentences using this neural net language model, and we will simply estimate an n-gram model from the synthesized data.

The recipe is very simple. You take your fancy long-span language model — the prerequisite is that this fancy long-span language model is a generative model, meaning you must be able to simulate sentences from it, and the recurrent neural net is ideal for that — you synthesize sentences, and once you have a corpus big enough that you're comfortable estimating whatever n-gram you want to estimate, you go ahead and estimate it as if somebody had given you tons of text. Sounds crazy? Let's see what it does.
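In code terms the recipe is roughly the following. This is only a minimal sketch of the idea, not the implementation used in the paper: the ToyLongSpanLM stand-in, the helper names, and the choice of plain trigram counts are mine, and in practice the sampled text would be fed to a standard toolkit to build a smoothed (e.g., Kneser-Ney) n-gram model.

    import random
    from collections import Counter

    # Toy stand-in for the long-span generative model P(next word | entire history).
    # In the paper this is the recurrent neural net LM; here it is just a placeholder
    # so the sketch runs end to end.
    class ToyLongSpanLM:
        VOCAB = ["</s>", "the", "cat", "sat", "on", "mat"]

        def next_word_distribution(self, history):
            # A real RNNLM conditions on the whole history; the toy model ignores it.
            return {w: 1.0 / len(self.VOCAB) for w in self.VOCAB}

    def sample_sentence(model, max_len=30):
        history, sentence = ["<s>"], []
        while len(sentence) < max_len:
            dist = model.next_word_distribution(history)
            words, probs = zip(*dist.items())
            w = random.choices(words, weights=probs, k=1)[0]
            if w == "</s>":
                break
            sentence.append(w)
            history.append(w)
        return sentence

    def synthesize_corpus(model, num_sentences):
        # Step 1 of the recipe: simulate lots of text from the long-span model.
        return [sample_sentence(model) for _ in range(num_sentences)]

    def ngram_counts(corpus, n=3):
        # Step 2: treat the simulated text as ordinary training data for an n-gram.
        counts = Counter()
        for sent in corpus:
            padded = ["<s>"] * (n - 1) + sent + ["</s>"]
            for i in range(n - 1, len(padded)):
                counts[tuple(padded[i - n + 1:i + 1])] += 1
        return counts

    corpus = synthesize_corpus(ToyLongSpanLM(), num_sentences=1000)
    counts = ngram_counts(corpus, n=3)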
So I'm going to show you three sets of experiments: first a baby one, then a medium one, and then the really muscular one.

This is the baby experiment. We start off with the Penn Treebank corpus: about a million words for training, about twenty thousand words for evaluation, and a vocabulary made of the top ten thousand words — the standard setup. Just to tell you how standard it is: Xu and Jelinek ran their random forest language model on it, Chelba presented a structured language model on it, Roark reported a language model on it as well, so did Filimonov and Harper, and so have cache-style models and others — all sorts of people have numbers on exactly this corpus. So we didn't even have to rerun their experiments; we simply took their setup and copied their numbers. It's that standard.

So what do we get? We first estimated the neural net — actually Tomáš did — and then we simply ran it in simulation mode: instead of having a million words of training text, we generated two hundred and thirty million words of simulated text. Two hundred and thirty million is how much we could create in a day or so; then we stopped. Then we estimated n-gram models from the simulated text; those are the variational approximations of the neural net, and we can build either trigram or five-gram models that way. Here is how they do.

A good trigram model with standard Kneser-Ney smoothing has a perplexity of 141. If you approximate the neural network by the synthetic method — and this model is estimated from the two hundred and thirty million words of generated data, which is a lot of data — its perplexity is 152, which is comparable to the 141. If you interpolate the two, you get 124. So maybe there's reason to be hopeful.

Who do we compare with? A random forest language model, which gets to look at the two previous words, though not necessarily the adjacent ones, gives 132; compare that with our 124. So far so good. You can also compare with the Chelba, Roark, or Filimonov-Harper style language models: they look at previous words plus syntactic heads and so on, so think of them as five-gram-like models, because they look at four preceding words — although not the four consecutive preceding words; they are chosen based on syntax. To make that comparison we can simulate a five-gram as well. A five-gram Kneser-Ney model based on the same one million words of training text has a perplexity of about 140, and the variational five-gram also has a perplexity of about 140, but when you interpolate the two you get 120 — again competitive with all those other models. And finally, for the cross-sentence models — models that look at the previous sentences — we implemented a very simple cache language model on top: cache all the words in the previous couple of sentences, add that in, and you get a perplexity of 111. Mind you, that is still not as good as 102, which is what Tomáš gets with the exact model. That tells you the exact neural net language model is still better than the approximation we are creating — after all, we are approximating a long-span model with an n-gram — but the variational approximation is already pretty good, and quite different from the n-gram you would estimate directly from the one million words of text. So things are working.

Those were nice perplexities; the next experiment is about word error rates. We looked at the MIT Lectures data. For those of you who don't know the corpus: there are a few tens of hours of acoustic training data — I think something like twenty — and a couple of hours of evaluation data; these are professors giving lectures, a couple of speakers. We have transcripts — about a hundred and fifty thousand words of in-domain text from the speech — and we have a lot of data from broadcast news, which is out of domain: news, not lectures like the test data.

So we said, let's see what we can do with this. We estimated a recurrent neural net from the in-domain data, simulated twenty times more data — that's three million words of text — and estimated an n-gram from it, to be compared with the basic Kneser-Ney model. This is a fair comparison of what the simulated language model buys you. And you don't want to throw away the broadcast news models, because they do capture a lot of good English that helps on the in-domain data as well. There are a lot of possible baselines and you can choose yours; if you use the Kneser-Ney model estimated from the MIT data, interpolated with broadcast news, you do about as well as you can, and you get word error rates like 24.7% on one lecture and 22.4% on the other — reasonable numbers. The acoustic models are big, fairly well-trained models, with all the usual goodies in there. And if you rescore the top hundred hypotheses using the full neural network — remember, that's a full-sentence model — you get some reduction: 24.1 on the first set and twenty-two point something on the other.
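The mechanics of the N-best rescoring step just described look roughly like this. A minimal sketch, not the paper's system: the function names, the dummy sentence scorer, and the scale and penalty values are illustrative, and the exact scaling convention (acoustic scale versus language-model scale) varies from system to system.

    import math

    def rescore_nbest(nbest, sentence_lm_logprob, lm_scale=12.0, word_penalty=0.0):
        """Pick the best hypothesis from an N-best list under a new language model.

        nbest: list of (words, acoustic_logprob) pairs from the first pass.
        sentence_lm_logprob: returns log P(W) for a whole sentence, e.g. a
            full-sentence RNNLM score (a placeholder is used below).
        """
        best, best_score = None, -math.inf
        for words, ac_logprob in nbest:
            score = (ac_logprob
                     + lm_scale * sentence_lm_logprob(words)
                     + word_penalty * len(words))
            if score > best_score:
                best, best_score = words, score
        return best

    # Example with a dummy scorer; a real setup would call the RNNLM here.
    dummy_lm = lambda words: -2.0 * len(words)
    nbest = [(["the", "cat", "sat"], -120.5), (["the", "cat", "sad"], -119.8)]
    print(rescore_nbest(nbest, dummy_lm))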
If you go much deeper into the n-best list, you get a bit more of an improvement — almost a percent, 0.8 or 0.9 — which is pretty good, but the oracle is also around fifteen or so, so maybe we could do better. Anyway, that's what you get from n-best rescoring, and as I said, one of the problems is that the n-best list presented to you is good according to the original four-gram model.

What you can do instead is replace the original four-gram model with this phony four-gram model estimated from simulated text — you have much more text now, three million words instead of a hundred and fifty thousand. As you can see on the first line of the second block, if you simply use the variational approximation of the four-gram in place of the original four-gram and decode with it, you already get a 0.2 to 0.4 percent reduction in word error rate. What's more interesting is that if you then simply rescore the hundred best — which is much less work — you get almost all of the gain. So this is starting to look appealing: not only do you get better one-best output by using the n-gram approximation of the neural net language model, you also produce better n-best lists, because when you rescore them with the full neural net language model you get bigger reductions at a much lower N than you would with the original n-gram.

That was the medium-size experiment; now for the large one. This has to do with English conversational telephone speech and some meeting speech from the NIST Dev'07 evaluation data. Again there are about five million words transcribed — that's the basic in-domain training data — and then we can either build a Kneser-Ney model from it, or create a neural net model, synthesize another order of magnitude more — about forty million words of text — and use that text to build the language model. Again, the original language model is shown in blue and the simulated-data language model in red, and the notation "2-gram, 5-gram" indicates that we produced lattices using a bigram model and then rescored them using a five-gram model — that's how the recognition was set up, with the neural net rescoring on top of that.

Again there are a bunch of results, and again you can decide what the baseline is; I like to think of the Kneser-Ney five-gram lattice rescoring as the baseline. With that, you get about twenty-eight percent word error rate on the CTS data and a good 32.4 on the meetings data. If you rescore these using the neural net, you get down to 27.1 with hundred-best rescoring, or 26.5 with a few-thousand-best. That's the kind of gain you get if you decode using the standard n-gram and rescore using the neural net. If instead you throw out the standard n-gram and replace it with the new n-gram, which is an approximation of the neural net, you get a nice reduction just by using a different n-gram: the twenty-eight point something becomes 27.2, and 32.4 becomes 31.7, which is nice. We don't get as much of an additional gain in rescoring — maybe the approximation has done too good a job of mimicking the neural net on these lists — but there is a gain to be had, at least in the first-pass decoding.
So, to conclude: we've basically convinced ourselves — and hopefully convinced you — that if you have a very long-span model that you cannot fit into your traditional decoding, then rather than just leaving it to the end for rescoring, you might as well simulate some text from it and build an n-gram model on that simulated text, because that n-gram is already better than the original n-gram, and it may save you some work during rescoring. We were able to get the improvements at a significantly lower N, and that of course is mainly coming from the neural net itself.

Before I finish, let me show you one more slide, which is interesting — it's the one Tomáš showed at the end of his talk. This is a much larger training set, because one of the reviewers asked, fine, but does this scale to large data — what happens when you have a lot of original training data? So here we have four hundred million words to train the initial n-gram, and once we train a neural net on that — which takes however long it takes, but that's Tomáš's problem — we simulate about a billion words of text and build an n-gram model out of that. As you can see, the original decoding with the four-gram gives you 13.1 percent word error rate, and if you simply decode using the approximated four-gram based on the neural net, you already get 12.8. That tells you that just replacing your old four-gram model with the new one, based on approximating the better language model, is already a good start. And the last number, which Tomáš already showed you, is 12.1, if you rescore the lattices — converted to n-best lists — with the full model. So rather than going back through everything, let me just say: that's it, and thank you. Questions?

[Question about how good the long-span model has to be.] Yes, your long-span model has to be good. I can't give you a quantitative answer in terms of correlation, but think of it this way: it looks like the standard bias-variance trade-off. First of all, this will work only if the neural net is much better than the n-gram model you are trying to replace through simulation — and it is. You are approximating not actual text with the four-gram, but some imagined text with the four-gram. So the imagined text has to be a better representation of actual text than the four-gram approximation of actual text is. There are two models competing here: actual text, which is a good representation of your language, and the four-gram, which is an okay representation of the language. Your model has to be better than the four-gram; once it is, you have taken care of the bias, and the simulation reduces the variance.

[Question: how big is the one-billion-word language model once it has been turned into an n-gram model?] Oh, it gets pruned quite a bit. I don't have the number at my fingertips, but yes, there is standard Stolcke-style pruning and so on applied. Tomáš, do you remember how big that language model was? Five million and fifty million — five million n-grams for decoding and fifty million for rescoring, I think.

[Question: how would you run the recurrent language model in lattice rescoring?] Sorry, I can't hear you very well because of the echo in here — I get the reflections as well as the question.
[The question, repeated: do you have a direct result using lattice rescoring with the recurrent language model?] A direct result of lattice rescoring using the neural net model? No, we don't. But let me point at that table: the first row is the four-gram Kneser-Ney model built on broadcast news — four hundred million words of broadcast news — and the second one is that Kneser-Ney model interpolated with the four-gram approximation of the neural net. And I would just add — this may be partly an implementation issue — the expanded model has essentially no back-off structure, every context is a full four-gram, which is what makes the decoding heavier than you would want. And rescoring lattices directly with the neural net is not really possible, because the recurrent state keeps information about the entire past: when two paths merge in the lattice, you cannot merge them under the model.

Okay, one more question. [Question: how much text do you have to synthesize?] The simulation didn't really work for just a factor of two or a factor of five; it started working when you generate at least an order of magnitude more than the original training data. And there is a statistic here which is interesting. When we simulated the two hundred and thirty million words, for example, we looked at n-gram coverage: how many of the new n-grams are hallucinations and how many are good? Of course there's no way to tell for sure, but here is one way: we said, if an n-gram that we hypothesized — simulated, or hallucinated — shows up in the Google five-gram corpus, then we'll call it a good simulation, and if it doesn't, we'll call it a bad one. If you look at the one million words of Wall Street Journal text, eighty-five percent of its five-grams are in the Google five-grams, so real text has a "goodness" of eighty-five percent; the two hundred and thirty million words we simulated have a goodness of about seventy-two percent. So we are mostly simulating good n-grams — but you do need to simulate an order of magnitude more text.

Okay, let's thank the speaker.
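The coverage statistic described in that last answer can be computed with something like the following sketch; the file names and the idea of holding the reference five-grams in a plain set are assumptions made for illustration.

    def five_gram_coverage(sentences, reference_5grams):
        """Fraction of 5-grams in `sentences` that appear in a reference set
        (the talk uses the Google 5-gram corpus as the reference)."""
        total = found = 0
        for sent in sentences:
            words = sent.split()
            for i in range(len(words) - 4):
                total += 1
                if tuple(words[i:i + 5]) in reference_5grams:
                    found += 1
        return found / total if total else 0.0

    # Usage (illustrative): compare real text against simulated text.
    # real_cov = five_gram_coverage(open("wsj_1M.txt"), reference_5grams)
    # sim_cov  = five_gram_coverage(open("simulated_230M.txt"), reference_5grams)
    # In the talk these came out around 0.85 for real WSJ text and about 0.72
    # for the simulated text.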