0:00:16Okay. First, a couple of quick disclaimers.
0:00:19This is work mostly done by Anoop Deoras and by Tomáš Mikolov, who gave the previous talk, with a lot of help from Stefan and the folks from Brno, and I just have the pleasure of sort of pretending that I had something to do with it and giving the talk.
0:00:36Anoop actually couldn't come because of some visa problems, not on the Czech side but on the way back to the US, so he sends his apologies.
0:00:44He even made the slides, so the credit goes to him if they look too nice; I don't think he's trying to hide anything.
0:00:52Alright.
0:00:53so
0:00:54You heard Tomáš tell you how the recurrent neural net language model gives great performance.
0:00:58One of the issues with a model like that is that it has essentially, at least theoretically, infinite memory, and in practice it really does depend on the past five, seven, eight words, so you can't do lattice rescoring with a model like this.
0:01:10So the main idea of this paper is: can we do something with the neural net language model so that we can rescore lattices with it?
0:01:17And if you want the idea in a nutshell:
0:01:19this whole "variational approximation" is a scary term, I don't know how anyone came up with that; it's actually a very simple idea.
0:01:25Imagine for a moment that there really was a true language model that generated all the sentences that you and I speak.
0:01:31right
0:01:31How do we actually build such a model? What we do is we take a lot of actual text sentences.
0:01:37So it's a little bit like saying we sample from this true underlying distribution, get a bunch of sentences, and then approximate that underlying model with a Markov chain, the way we do with second, third, fourth order Markov chains,
0:01:49and that is an n-gram approximation of the true underlying model that you and I presumably carry around in our heads, right?
0:01:56So the trick here is the same: you take Tomáš's neural net language model and pretend that that's the true model of language,
0:02:02generate lots and lots of text data from that model, instead of having human beings write the text,
0:02:08and simply estimate an n-gram model from that.
0:02:10That's essentially the long and short of the paper.
0:02:12right
0:02:13so i'll tell you how it works
0:02:29So we get annotated speech and use the generative training procedure to create acoustic models; we get some text and use it to train a language model; we combine them and figure out some scale factor.
0:02:41We get new speech, we feed it to the decoder, which has all of these models, and it produces the transcribed utterance.
0:02:48Essentially we are implementing this formula, P of A given W raised to the power mu, times P of W, and we find the W that maximises it. That's just the standard setup; nothing new here.
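Written out, the rule being described is the usual MAP decoding criterion; a minimal sketch is below. Note that whether the scale factor sits on the acoustic model or on the language model (and whether it is mu or one over mu) varies between systems, so its placement here is an assumption.

```latex
% A: the acoustics, W: a word sequence, \mu: the scale factor
% that the talk says has to be "figured out".
\hat{W} \;=\; \arg\max_{W}\; P(A \mid W)^{\mu}\, P(W)
```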
0:02:57Now, the language model is typically, as I said, approximated using an n-gram.
0:03:01So we take P of W, with W being a whole sentence, write it as the product of P of w_i given w_1 through w_(i-1), and then approximate each term by looking only at the last couple of words of the history; that gives rise to n-gram models.
0:03:14And typically we use a small n, a small n-gram order, in order to make the decoding feasible, and then, if we want better quality, we instead create a search space like a lattice and rescore the lattice using a bigger n-gram.
0:03:30Again, standard practice, everybody knows this; there's nothing new.
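For reference, the factorization just described (the exact chain rule, followed by the n-gram truncation of the history) is:

```latex
P(W) \;=\; \prod_{i=1}^{|W|} P\!\left(w_i \mid w_1,\dots,w_{i-1}\right)
      \;\approx\; \prod_{i=1}^{|W|} P\!\left(w_i \mid w_{i-n+1},\dots,w_{i-1}\right)
```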
0:03:34uh
0:03:36So this slide is asking: can we go beyond n-grams, can we use something like Tomáš's neural net to do lattice rescoring?
0:03:42Or we could perhaps talk about using more complex acoustic models. All of these are feasible with lattices, provided your models are tractable in the sense of being local:
0:03:48phones, even when they have a five-phone context, are still local, and n-grams, even for n equal to five, are still local.
0:03:59But if you truly have a long-span model, that's not possible; you can't do it with word lattices, so you tend to do it with n-best lists.
0:04:07right
0:04:07So n-best lists have the advantage that you can deploy really long-span models, but they do have a bias, and this paper is about how to get past that.
0:04:16So the real question is: can we do with lattices what we currently do with n-best lists? Let's see how we get there.
0:04:23What's good about word lattices is that they provide a large search space, and they are not biased towards the new model that will be used to rescore them, because they really have a lot of options in them; but they do require the new model to be local.
0:04:36N-best lists don't require the new model to be local (the new model can be really long span), but they offer a limited search space, limited by N, and more importantly the top N are chosen according to the old model, so the choice of the top-N hypotheses is biased.
0:04:50so
0:04:51In some other work, Anoop, who worked on this, has a paper at this ICASSP, a poster that's going on right now which you might be able to catch after the session,
0:05:02and it shows how to essentially expand the search space, how to search more than just the n-best in the lattice.
0:05:08right
0:05:09And that's like saying: let's do lattice-like things with n-best lists, except let's somehow try to make the N very large without really having a large list.
0:05:16That's saying: let's fix the decoding and rescoring with some magic so that you can search more.
0:05:20This paper goes in the other direction. It says: let me somehow approximate the long-span model and make it local.
0:05:28So yes, the neural net model has a long history, but can I somehow get a local approximation of that model? And that's the trick here.
0:05:35Alright, so let's see.
0:05:37This is sort of the outline of what's coming up. The first order of business: which long-span model will we approximate? I already told you:
0:05:45it's the recurrent neural net model that Tomáš just presented, so you all know about it; I won't waste time on it.
0:05:53So what is the approximation going to be? Think of it this way:
0:05:57this is the space of all language models, within it there's the set of all recurrent neural networks, and let's say this is the set of all n-gram models, and let's say this is the model you'd like.
0:06:09Well, typically you can decode and rescore lattices with n-gram models, so that set is tractable (I can't quite tell what colour that is against this background, but anyway); that set is tractable and the blue one is not.
0:06:23So what we should really be using is a tractable model which is as close as possible to the model we'd really like to use.
0:06:31right
0:06:32But the n-gram model that you actually use for rescoring, estimated from the training data, may not be that model.
0:06:39From the same training data you estimate a recurrent neural net model P, and you estimate an n-gram model M, and M may not be close to Q-star, the n-gram which is closest to your long-span language model, whatever that might be.
0:06:53So what we do in this paper is to say: what happens if, instead of M, we use Q-star? What if we approximate this blue model P with the nearest n-gram model?
0:07:05So that's how you want to look at this. So what is... ah, it skipped ahead; this is what happens when you try to use someone else's slides.
0:07:17Okay. So here's how the approximation is going to work: you're going to look for an n-gram model Q, among the set of all n-grams, script Q, which is closest to this long-span model in the sense of KL divergence, familiar enough to everybody, I think.
0:07:33Alright.
0:07:34so
0:07:36What do you do? Well, you essentially say: the KL divergence between P and Q is basically just a sum over all X of P of X times log P of X over Q of X, where X ranges over all possible sentences in the world, because these are sentence-level models.
0:07:50And if you drop the P log P term, which doesn't depend on Q, what you really want is the Q that maximises the sum, over all sentences in the universe, of P of X times log Q of X, where P of X is the neural net probability of the sentence and Q of X is the n-gram probability of the sentence.
0:08:07right
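In symbols, the objective just described is the following (a sketch; script Q below stands for the family of n-gram models of the chosen order):

```latex
D(P \,\|\, Q) \;=\; \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}
\qquad\Longrightarrow\qquad
Q^{*} \;=\; \arg\max_{Q \in \mathcal{Q}}\; \sum_{x} P(x)\,\log Q(x),
```

since the term involving only P(x) log P(x) does not depend on Q.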
0:08:07Of course you can't get P of X for every sentence in the universe.
0:08:12But what you can do is just what we do with normal human language models: we approximate them by "synthesizing" data from the model, namely getting people to write down text, and estimating an n-gram model from that text.
0:08:24So what we'll do is synthesize sentences using this neural net language model, and we'll simply estimate an n-gram model from the synthesized data.
0:08:40So the recipe is very simple. You take your fancy long-span language model (there is one prerequisite: it needs to be a generative model, meaning you need to be able to simulate sentences with it, and the recurrent net is ideal for that),
0:08:56you synthesize sentences, and then, once you have a corpus huge enough that you're comfortable estimating whatever n-gram you want to estimate, you go ahead and estimate it as if somebody gave you tons of text.
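A minimal sketch of that recipe is below, assuming a hypothetical sample_sentence function that draws one sentence from the long-span model (for instance by sampling word by word from the recurrent net until an end-of-sentence token). The plain relative-frequency estimator is only there to keep the sketch self-contained; in practice the synthetic text would go through a standard toolkit with Kneser-Ney smoothing, exactly as if a human had written it.

```python
from collections import Counter

def synthesize_corpus(sample_sentence, num_sentences):
    """Draw sentences (lists of words) from the generative long-span model."""
    return [sample_sentence() for _ in range(num_sentences)]

def estimate_ngram(corpus, order=3):
    """Maximum-likelihood n-gram estimate from the synthetic corpus."""
    ngram_counts = Counter()
    history_counts = Counter()
    for sentence in corpus:
        padded = ["<s>"] * (order - 1) + sentence + ["</s>"]
        for i in range(order - 1, len(padded)):
            history = tuple(padded[i - order + 1:i])
            ngram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    # q(w | h) = count(h, w) / count(h)
    return {ng: c / history_counts[ng[:-1]] for ng, c in ngram_counts.items()}

# Hypothetical usage, with rnnlm standing in for the trained recurrent net:
#   corpus = synthesize_corpus(rnnlm.sample_sentence, 10_000_000)
#   q_star = estimate_ngram(corpus, order=5)
```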
0:09:07Sounds crazy, yeah? So let's see how it does.
0:09:12I'm going to give you three sets of experiments: first a baby one, then a medium one, and then a really muscular one.
0:09:18So this is the baby experiment. We start off with the Penn Treebank corpus, which has about a million words for training and about twenty thousand words for evaluation, and the vocabulary is made of the top ten thousand words; it's a standard setup.
0:09:35And just to tell you how standard it is: Peng Xu and Jelinek ran their random forest language models on it, Chelba presented the structured language model on it, Roark evaluated a language model on it as well, Filimonov and Mary Harper evaluated their model on it, and Momtazi and Klakow evaluated their across-sentence, cache-like language model on it; all sorts of people have evaluated on exactly this corpus.
0:09:57So in fact we didn't have to redo their experiments; we could simply take their setup and copy their numbers. It's pretty standard.
0:10:04So what do we get on that? What we did is estimate the neural net (actually Tomáš trained it),
0:10:09and then we simply ran it in simulation mode: instead of having a million words of training text, we generated two hundred and thirty million words of training text. Why two hundred and thirty million? That's how much we could create in a day or so, and then we stopped.
0:10:22And then we simply estimate an n-gram model from the simulated text.
0:10:26So the n-gram models estimated from the simulated text, which is what's labelled as the variational approximation of the neural net: we can generate either trigram or five-gram models that way, and here are the numbers on those.
0:10:40So a good trigram model with standard Kneser-Ney smoothing has a perplexity of one forty-eight.
0:10:46If you approximate the neural network by the synthetic method (now this trigram is estimated from the two hundred and thirty million words of generated data, so it's a lot of data), its perplexity is one fifty-two, which is sort of comparable to the one forty-eight.
0:10:58And if you interpolate the two, you get one twenty-four, so maybe there's reason to be hopeful. Who do we compare with?
0:11:03Peng Xu's random forest language model also had sort of a trigram structure (it looks at the two previous words, in its own way), so we can compare with that: that's one thirty-two versus this one twenty-four. Okay, so far so good.
0:11:16You can also compare it with the Chelba, Roark, or Filimonov and Harper style language models; they look at previous words and syntactic heads and so on, so think of them roughly as five-gram models, because they look at four preceding words, although those are not the four consecutive preceding words; they're chosen based on syntax.
0:11:33So if you want to do that, we can compare by simulating a five-gram instead: a five-gram Kneser-Ney model based on the same one million words of training text has a perplexity of about one forty,
0:11:44and this five-gram approximation of the neural net also has a perplexity of about one forty, but when you interpolate you get one twenty, so again it's competitive with all the other models.
0:11:53And then finally, the Momtazi and Klakow model is an across-sentence model (it looks at previous sentences), so to compare with that we simply added a very simple cache language model: cache all the words in the previous couple of sentences and add that to the interpolated model,
0:12:10and you get a perplexity of a hundred and eleven.
0:12:13Mind you, that's still not as good as the one-oh-two that Tomáš gets, so that tells you that the exact neural net language model is still better than the approximation we're creating by approximating the long-span net with a five-gram.
0:12:25But the approximation already is pretty good, and quite different from the n-gram estimated from just the one million words of real text.
0:12:31So things seem to be working.
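For the record, the "interpolation" referred to throughout these perplexity results is the usual linear mixture of the two models; a minimal sketch, where lambda is an interpolation weight tuned on held-out data (its value is not given in the talk):

```latex
P_{\text{interp}}(w \mid h) \;=\; \lambda\, P_{\text{KN}}(w \mid h) \;+\; (1-\lambda)\, \hat{Q}(w \mid h)
```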
0:12:33Okay, so those are nice perplexities. The next experiment: we were looking at the MIT lectures data.
0:12:40For those of you who don't know the corpus, there are a few tens of hours of acoustic training data, I think something like twenty, and then a couple of hours of evaluation data; these are professors giving lectures, a handful of speakers.
0:12:54So we have transcripts for about a hundred and fifty thousand words of speech, and we have a lot of data from broadcast news, which is out of domain: that's news, and these are lectures on whatever topic it might be.
0:13:05So we basically said, let's see what we can do with it: we estimated a recurrent net from the in-domain data, simulated twenty times more data, that's three million words of text, and estimated an n-gram from that.
0:13:18And we compared to the baseline Kneser-Ney model; the comparison is asking, what can you do with the simulated language model?
0:13:26You don't want to throw away the broadcast news model, because it has a lot of good English in it that's useful for the lectures as well.
0:13:33There are a lot of possible baselines, and you can choose your own. If you just use the Kneser-Ney model estimated from the MIT data, interpolated with broadcast news, you do about as well as you can,
0:13:44and you get word error rates like twenty-four point seven on one lecture and twenty-two point four on the other lecture; reasonable numbers.
0:13:52Then, the big acoustic models: these are fairly well trained acoustic models, discriminatively trained; I don't know if they have every trick in them, but it seems that they do, so all the goodies are in there.
0:14:08And if you rescore the top hundred using the full neural network (remember, that's a full-sentence model), you get some reduction: twenty-four point one for the first set and twenty-two point something for the other.
0:14:19If you go much deeper in the n-best list you do get a bit more of an improvement, close to one percent: point eight, point nine.
0:14:28And that's pretty good, though the oracle of the n-best list is also around fifteen or so, so maybe we could do better; but sure.
0:14:34Anyway, that's what you get by doing n-best rescoring, and as I said, one of the problems is that the n-best presented to you is good according to the original four-gram model.
0:14:43So what you can do is replace the original four-gram model with this "phony" four-gram model estimated from simulated text: much more text, three million words instead of a hundred and fifty thousand.
0:14:55And as you can see on the first line of the second block, if you simply use the variational approximation as the four-gram and decode with that, you already have a point four, point two percent reduction in word error rate.
0:15:10What's more interesting is that if you now simply rescore the hundred best, which is much less work, you get almost all the gain. So this is starting to look even better.
0:15:19What this is saying is that not only do you get better one-best output if you use the n-gram approximation of the neural net language model, you also produce better lattices:
0:15:29when you then rescore with the full neural net language model, you get the reductions at a much lower N than you would have needed with the original model.
0:15:43So that was the medium-sized experiment, if you will; now for the large experiment. This has to do with English conversational telephone speech and some meeting speech from the NIST '07 evaluation data.
0:15:58Again, there are about five million words transcribed, so that's our basic in-domain training data, and then we can either build a Kneser-Ney model from it, or we can create a neural net model, synthesize another huge amount of text from it, an order of magnitude more, and then use that text to build the language model.
0:16:16So again, the original language model is in blue, and the fake, simulated-data language model is in red;
0:16:25and the notation "bigram, five-gram" indicates that we produced lattices using a bigram model but then rescored them using a five-gram model, with the rescoring done on the lattices.
0:16:40And on that, here again there are a bunch of results, and again you can decide what the baseline is; I like to think of the Kneser-Ney five-gram lattice rescoring as the baseline.
0:16:51On the CTS data you get around twenty-eight percent word error rate, and on the meetings data a good thirty-two point four. If you rescore these using the neural net, you get down to twenty-seven point one with hundred-best rescoring, or twenty-six point five with a thousand-best. That's the sort of thing you can get if you decode using a standard n-gram and rescore using the neural net.
0:17:13But instead of the standard n-gram, you can now replace it with this new n-gram, which is an approximation of the neural net, and see what you get.
0:17:22You get a nice reduction just by using a different n-gram model: twenty-eight point four becomes twenty-seven point two, and thirty-two point four becomes thirty-one point seven, which is nice.
0:17:32We don't get as much of an additional gain in the rescoring, so maybe we haven't done as good a job there with the lattices, but there is a gain to be had, at least in the first-pass decoding.
0:17:43So, to conclude: this basically convinced us, and hopefully has convinced you,
0:17:50that if you have a very long-span model that you cannot fit into your traditional decoding, then rather than just leaving it to the end for rescoring, you might as well simulate some text from it and build an n-gram model on the simulated text,
0:18:01because that already is better than just using the old n-gram, and it might save you some work during rescoring.
0:18:09We were able to get the improvements with a significantly lower N in the n-best list, and that of course is mainly coming from the RNN.
0:18:15Before I conclude, let me show you one more slide, which is interesting; it's the one Tomáš showed at the end of his talk.
0:18:21This is a much larger training set, because one of the reviewers, by the way, said: this is all nice, but does it scale to large data? What happens when you have a lot of original training data?
0:18:30So the setup there is that we have four hundred million words to train the initial n-gram, and then we train a neural net on that (which takes forever, but that's Tomáš's problem),
0:18:40we then simulate about a billion words of text and build an n-gram model out of that.
0:18:46And as you can see, the original decoding with the four-gram gives you fourteen point one word error rate, and if you simply decode using the approximated four-gram based on the neural net, you already get thirteen point something.
0:18:58That tells you that just replacing your old four-gram model with the new one, based on approximating a better language model, is already a good start; and then the last number, which Tomáš showed you, is twelve point one, if you rescore the lattices, or some n-best list out of them, with the full model.
0:19:16Let me go back and say: that's it. Thank you.
0:19:25Questions?
0:19:39Yes, your long-span model has to be good.
0:20:14Okay, I can't give you a quantitative answer in terms of correlation, but let me repeat the question: doesn't this seem funny?
0:20:22Yes, it does; this looks like the standard bias-variance trade-off.
0:20:27If you have lots of initial text to train the neural net... well, first of all, this will only work if the neural net is much better than the n-gram models you're trying to replace with the simulation, and it is.
0:20:37You are now approximating, with the four-gram, not actual text but some imagined text.
0:20:45So the imagined text has to be a better representation of actual text than the four-gram approximation of the actual text is.
0:20:57There are two models competing here: there's actual text, which is a good representation of your language, and the four-gram, which is an okay representation of the language. So your model has to be better than the four-gram,
0:21:08and once you have that, you've taken care of the bias, and the sheer amount of simulated data takes care of the variance.
0:21:14Okay. I think there was another question.
0:21:24I just wonder how big your one-billion-word language model is after you've made it an n-gram language model.
0:21:31Oh, it's pruned; it really went down quite a bit. No, I don't have the number at the tips of my fingers, but yes, there's standard Stolcke-like entropy pruning and all sorts of things.
0:21:50So it is... I don't know; Tomáš, do you remember how big that language model is?
0:22:11Five million and fifty million. Yeah, five million n-grams for decoding and fifty million for rescoring, I think.
0:22:23I just have a question about the... you don't have, say... how would you run the neural net language model in lattice rescoring?
0:22:32Sorry, I can't hear you because of the doubling; I hear the PA as well as the reflections.
0:22:36So do you have a direct result using lattice rescoring with the neural net language model?
0:22:42A direct result of lattice rescoring using the neural net model itself? No, we don't.
0:22:47But look at that table: the first one, the KN... no, that's broadcast news; so the first is basically a Kneser-Ney model based on the broadcast news, four hundred million words of broadcast news,
0:22:59and the second one is the Kneser-Ney model from the broadcast news interpolated with the four-gram approximation of the neural net.
0:23:07Maybe I would just add to this: it was probably just the implementation, because the n-gram expansion used has no backoff structure, so every single context is fully expanded, and that would make the decoding too costly.
0:23:22And then, if you want to rescore lattices with the neural net, you can't, because the recurrent part is keeping information about the entire past, so when two paths merge in the lattice you cannot merge them like you can with an n-gram.
0:23:35yes
0:23:36okay
0:23:40Time for one more question.
0:23:50How much do we have to synthesize? I think the simulations didn't work for just, you know, a factor of two or a factor of five;
0:23:59they started working when you had at least an order of magnitude more than the original data, yeah.
0:24:05Actually, there is a statistic which is interesting here: when we simulated the two hundred and thirty million words, for example, we looked at n-gram coverage, that is, how many of the new n-grams are hallucinations and how many are genuine.
0:24:17Of course there's no way to tell, except here is one way: we said, if an n-gram that we hypothesized, simulated, or hallucinated shows up in the Google five-gram corpus, then we'll say it was a good simulation, and if it doesn't, we'll say it was a bad simulation.
0:24:31So if you look at the one million words of Wall Street Journal, eighty-five percent of their n-grams are in the Google five-grams, so real text has an n-gram "goodness" of eighty-five percent,
0:24:42and the two hundred and thirty million words we simulated have a goodness of about seventy-two percent.
0:24:47So we are mostly simulating good n-grams, but you have to simulate an order of magnitude more.
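A minimal sketch of that "goodness" statistic, assuming the synthetic n-grams and a reference collection (such as the Google 5-grams mentioned above) have already been loaded as tuples; the loading and n-gram extraction steps are left out and are not something the talk specifies.

```python
def ngram_goodness(hypothesized_ngrams, reference_ngrams):
    """Fraction of n-grams from the synthetic text that also occur in a large
    reference collection; the talk quotes roughly 0.85 for real WSJ text and
    roughly 0.72 for the simulated text."""
    hypothesized = list(hypothesized_ngrams)
    if not hypothesized:
        return 0.0
    hits = sum(1 for ng in hypothesized if ng in reference_ngrams)
    return hits / len(hypothesized)
```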
0:24:52Okay, let's thank the speaker.
0:24:55okay