"'kay" uh
First, a couple of quick disclaimers. This is work mostly done by Anoop Deoras and by Tomáš Mikolov, who gave the previous talk, with a lot of help from the rest of their colleagues. I just have the pleasure of pretending I had something to do with it and giving the talk. Anoop couldn't come because of some visa problems, not on the Czech side but on the way back to the US, so he sends his apologies. He even made the slides, so if the slides are too nice, apologies for that; I don't think he's trying to hide anything. Alright.
So. You just heard Tomáš tell you how the recurrent neural net language model gives great performance. One of the issues with a model like that is that it has, essentially, at least theoretically, infinite memory; it really does depend on the past five, seven, eight words. So you really can't do lattice rescoring with a model like this. The main idea of this paper is: can we do something with the neural net language model so that we can decode and rescore with it?
Here is the idea in a nutshell. This whole "variational approximation" is a scary term, I don't know how anyone came up with it, but it's actually a very simple idea. Imagine for a moment that there really was a true language model that generated all the sentences that you and I speak. How do we actually build such a model? What we do is take a lot of actual text, sentences. So it's a little bit like saying we sample from this true underlying distribution, we get a bunch of sentences, and then we approximate that underlying model with a Markov chain, as we do with second, third, fourth order Markov chains. And that is our approximation of the true underlying model that you and I carry around in our heads, if you will.
So what we will do is treat Tomáš's recurrent neural net language model the same way: pretend that it is the true model of language, generate lots and lots of data from that model instead of having human beings write the text, and simply estimate an n-gram model from that. That is essentially the long and short of the paper. Right, so let me tell you how it works.
So: we get annotated speech and use the generative training procedure to create acoustic models. We get some text and use it to train a language model. We combine them, figure out some scale factor. Then we get new speech, feed it to the decoder, which has all these models, and it produces the transcribed utterance. Essentially we are implementing this formula: P of A given W, raised to some power mu, times P of W; find the W that maximizes it. Just the standard setup, nothing new there.
Now the language model is typically, as I said, approximated using an n-gram. We take P of W, with W being a whole sentence, write it as the product of P of w_i given w_1 through w_{i-1}, and then approximate each term by something that only looks at the last couple of words of the history. That is what gives rise to n-gram models.
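For reference, the standard setup I have just described can be written out as follows; this is textbook notation rather than anything specific to this paper, with mu being the scale factor I mentioned a moment ago:

```latex
\hat{W} \;=\; \arg\max_{W}\; P(A \mid W)^{\mu}\, P(W),
\qquad
P(W) \;=\; \prod_{i} P(w_i \mid w_1,\dots,w_{i-1})
      \;\approx\; \prod_{i} P(w_i \mid w_{i-n+1},\dots,w_{i-1}) .
```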
And typically we use a small n, a small-order n-gram, in order to make the decoding feasible, and we keep word lattices, meaning the decoder produces a compact search space, a lattice, and then we rescore the lattice using a bigger n-gram. Again, standard practice; everybody knows this, there's nothing new here.
So this talk is asking: can we go beyond n-grams and use something like Tomáš's neural net to do lattice rescoring? Or we could perhaps talk about using a more complex acoustic model. All of these are feasible with lattices, provided your models are tractable in the sense of being local. Phones, even when they have a five-phone context, are still local; n-grams, even a 5-gram, are still local. But if you truly have a long-span model, that's not possible; you can't do it with word lattices. So you tend to do it on n-best lists. Right.
N-best lists have the advantage that you can deploy a really long-span model, but they do have a bias, and this paper is about how to get past that. So the real question is: can we do with lattices what we currently do with n-best lists? Let's see how we get there. What's good about word lattices? They provide a large search space, and they are not biased by the new model that will be used to rescore them, because they really keep a lot of options. But they do require the new models to be local. N-best lists don't require the new models to be local; the new models can be really long span. But they offer a limited search space, limited by N, and, more importantly, the top N are chosen according to the old model. So the choice of the top-N hypotheses is biased.
So, in other words: there is another paper, presented in a session going on right now, which you might be able to catch after this session, that shows how to essentially expand the search space, how to search more than just the n-best list or the lattice. That is like saying: let's do lattice-like things with n-best lists, but somehow make the N very large without it really being large; that is, do the decoding and rescoring with some magic that lets you search more. This paper goes in the other direction. It says: let me somehow approximate the long-span model and make it local. So yes, the neural net model has a long history, but can I somehow get a local approximation of that model? That is what's at stake here.
Alright, so let's see. This is the outline of the talk that's coming up. First order of business: which long-span model are we approximating? I already told you, it is the recurrent neural net model that Tomáš just presented, so you all know about it and I won't spend time on it. So what is the approximation going to be? Think of it this way. This is the space of all language models. This is the set of all recurrent neural networks, and let's say this is the set of all n-gram models, and let's say this is the model you'd like. Well, typically you can decode or rescore lattices with n-gram models, so that set is tractable; the blue one is not.
So what we should really be using is a tractable model which is as close as possible to the model we would really like to use. But the n-gram model that you actually use for rescoring, estimated from the training data, may not be that model. From the same training data you estimate your neural net model P, and you estimate an n-gram model M, and M may not be close to Q-star, the n-gram which is closest to your long-span language model, whatever that might be. So what we do in this paper is to say: what happens if we use Q-star instead? What if we approximate this better model, P, with an n-gram model? That is how you want to look at this.
So how is the approximation actually going to work? You are going to look for an n-gram model Q, among the set of all n-gram models, script Q, which is closest to this long-span model in the sense of KL divergence. Fair enough? Everybody with me?
Alright. So what do we do? We essentially say: the KL divergence between P and Q is basically just a sum over all X of P of X times the log of P of X over Q of X, where X ranges over all possible sentences in the world, because these are sentence-level models. And if you drop the P log P part, which does not depend on Q, what you really want is the Q that maximizes the sum, over all sentences in the universe, of P of X times log Q of X, where P of X is the neural net probability of the sentence and Q of X is the n-gram probability of the sentence. Right.
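In symbols, the approximation picks, from the family of n-gram models script Q, the model Q closest to the long-span model P in KL divergence; dropping the P log P term, which does not depend on Q, turns it into a cross-entropy maximization:

```latex
Q^{*} \;=\; \arg\min_{Q \in \mathcal{Q}} D(P \,\|\, Q)
      \;=\; \arg\min_{Q \in \mathcal{Q}} \sum_{X} P(X)\,\log\frac{P(X)}{Q(X)}
      \;=\; \arg\max_{Q \in \mathcal{Q}} \sum_{X} P(X)\,\log Q(X) ,
```

where the sum runs over all sentences X, P(X) is the neural net probability of the sentence, and Q(X) is the n-gram probability.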
Of course you cannot get P of X for every sentence in the universe. But what you can do is just what we do with normal human language models: we approximate them by sampling from the underlying model, namely by getting people to write down text, and then estimating an n-gram model from that text.
So here, what we do is: we synthesize sentences using this neural net language model, and we simply estimate an n-gram model from the synthesized data. The recipe is very simple. You take your fancy long-span language model; the one prerequisite is that this fancy long-span language model needs to be a generative model, meaning you need to be able to simulate sentences with it, and the RNN is ideal for that. You synthesize sentences, and then, once you have a big enough corpus that you are comfortable estimating whatever n-gram you want to estimate, you go ahead and estimate it as if somebody had given you tons of text.
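To make the recipe concrete, here is a minimal sketch of the two steps. The sample_sentence routine below is only a toy stand-in for the generative long-span model (the RNN in our case), and the unsmoothed trigram counts stand in for what, in practice, you would get by handing the synthetic corpus to a Kneser-Ney toolkit such as SRILM:

```python
import random
from collections import defaultdict

# Toy stand-in for the long-span generative model (the RNN LM in the talk).
# The only property we rely on is that it can emit whole sentences.
VOCAB = ["the", "cat", "dog", "sat", "ran"]

def sample_sentence(max_len=20):
    """Sample one sentence; a real RNN LM would condition on the full history."""
    sent = []
    while len(sent) < max_len:
        w = random.choice(VOCAB + ["</s>"])
        if w == "</s>":
            break
        sent.append(w)
    return sent

# Step 1: synthesize a large corpus from the long-span model.
synthetic_corpus = [sample_sentence() for _ in range(100_000)]

# Step 2: estimate an n-gram model from the synthetic corpus, exactly as if
# it were human-written text (plain MLE trigram counts shown here).
tri_counts, bi_counts = defaultdict(int), defaultdict(int)
for sent in synthetic_corpus:
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    for i in range(2, len(toks)):
        tri_counts[(toks[i - 2], toks[i - 1], toks[i])] += 1
        bi_counts[(toks[i - 2], toks[i - 1])] += 1

def trigram_prob(w1, w2, w3):
    """MLE estimate Q(w3 | w1, w2) from the synthetic counts."""
    if bi_counts[(w1, w2)] == 0:
        return 0.0
    return tri_counts[(w1, w2, w3)] / bi_counts[(w1, w2)]

print(trigram_prob("<s>", "<s>", "the"))
```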
Sounds crazy, yeah? So let's see what it does. I am going to show you three sets of experiments: first a baby one, then a medium one, and then the really muscular one.
This is the baby experiment. We start off with the Penn Treebank corpus. We have about a million words for training and about twenty thousand words for evaluation, and the vocabulary is the top ten thousand words; it is the standard setup. And just to tell you how standard it is: Xu and Jelinek ran their random forest language models on it, Chelba presented a structured language model on it, Filimonov and Harper's syntactic language model has been run on it, and so have across-sentence and cache-like language models; all sorts of people have published results on exactly this corpus. So in fact we didn't even have to rebuild those experiments; we simply took their setup and copied the numbers. It's pretty standard.
So what do we get on that? Well, first we estimate this neural net, or actually Tomáš did, and then we simply put it into simulation mode: instead of having a million words of training text, we generated two hundred and thirty million words of training text. Why two hundred and thirty million? That is how much we could create in a day or so, and then we stopped. We then simply estimate an n-gram model from the simulated text. The models estimated from the simulated text are the variational approximations of the neural net; we can generate either trigram or five-gram models from it, and we run with those.
So: a good trigram model with standard Kneser-Ney smoothing has a perplexity of about one forty on this data. If you approximate the neural network by the synthetic method, and remember this model is estimated from the two hundred and thirty million words of generated data, which is a lot of data, its perplexity is one fifty-two, which is sort of comparable to the one forty. And if you interpolate the two, you get one twenty-four, so maybe there is reason to be hopeful.
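By "interpolate" I just mean plain linear interpolation of the two models, with the weight tuned on held-out data; a minimal sketch, where p_data and p_approx are assumed to be word-level probability functions and the weight of one half is only a placeholder:

```python
import math

def interpolate(p_data, p_approx, lam=0.5):
    """Linear interpolation of the data-estimated n-gram (p_data) and the
    variational approximation (p_approx); lam is tuned on held-out text."""
    return lambda word, history: lam * p_data(word, history) + (1.0 - lam) * p_approx(word, history)

def perplexity(p, sentences):
    """Per-word perplexity of a model p(word, history) over a list of token lists."""
    log_prob, n_words = 0.0, 0
    for sent in sentences:
        history = []
        for w in sent + ["</s>"]:
            log_prob += math.log(p(w, tuple(history)))
            history.append(w)
            n_words += 1
    return math.exp(-log_prob / n_words)
```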
Who do we compare with? Xu's random forest language model sort of did its work with trigrams: it looks at the two previous words in various ways. We can compare with that: it got one thirty-two, against our one twenty-four. Okay, so far so good. You can also compare with the Chelba-style or Filimonov-and-Harper-style language models; they look at previous words and syntactic heads and so on, so think of them as five-gram models, because they look at four preceding words, although these are not the four consecutive preceding words; they are chosen based on syntax. If you want to do that comparison, we can compare by simulating a five-gram model. A five-gram Kneser-Ney model based on the same one million words of training text has a perplexity of about one forty; the five-gram approximation of the neural net also has a perplexity of about one forty, but when you interpolate the two you get one twenty. So again, competitive with all the other models.
And then finally, the Klakow-style across-sentence model looks at the previous sentences. To compare with that, we simply implemented a very simple cache language model: cache all the words in the previous couple of sentences, and interpolate that with your n-gram model. With that you get a perplexity of one eleven.
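The cache model is about as simple as it sounds; here is a minimal sketch of the idea, where the base model p_base and the interpolation weight are placeholders rather than the exact configuration we used:

```python
from collections import Counter, deque

class SimpleCacheLM:
    """Unigram cache over the last few sentences, interpolated with a base model."""

    def __init__(self, p_base, cache_sentences=2, lam=0.1):
        self.p_base = p_base                      # base word-level model: p_base(word, history)
        self.cache = deque(maxlen=cache_sentences)
        self.lam = lam                            # weight given to the cache component

    def prob(self, word, history):
        counts = Counter(w for sent in self.cache for w in sent)
        total = sum(counts.values())
        p_cache = counts[word] / total if total else 0.0
        return self.lam * p_cache + (1.0 - self.lam) * self.p_base(word, history)

    def end_sentence(self, sentence_tokens):
        """Push a finished sentence into the cache so it influences the next ones."""
        self.cache.append(list(sentence_tokens))
```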
Mind you, this is still not as good as one oh two, which is what Tomáš reported, so that tells you the exact neural net language model is still better than the approximation we are creating; after all, we are approximating a long-span model with a five-gram. But the approximation already is pretty good, and quite different from the n-gram estimated from just the one million words of real text. So the idea seems to be working.
Okay, so those are nice perplexities. On to the next experiment. Here we were looking at the MIT Lectures data. For those of you who don't know this corpus: there are a few tens of hours of acoustic training data, I think something like twenty, and then a couple of hours of evaluation data; these are professors giving lectures, a couple of speakers. We have transcripts for about a hundred and fifty thousand words of the speech, and we have a lot of data from Broadcast News, which is out of domain: news, whereas these are lectures, as you might guess. So we basically said, let's see what we can do with this. We estimated an RNN language model from the in-domain data, simulated twenty times more data, that is three million words of text, and estimated an n-gram from that.
We compare that to the basic Kneser-Ney model; the comparison is, what can the simulated language model do for you? Of course you don't want to throw away the Broadcast News model, because it does have a lot of good English in it that is useful for the in-domain data as well. So there are a lot of baselines you could choose. If you just use the Kneser-Ney model estimated from the MIT data, interpolated with Broadcast News, you do about as well as you can, and you get error rates like twenty-four point seven on one lecturer and twenty-two point four on the other lecturer. Reasonable numbers.
The acoustic models are fairly well trained, with discriminative training and so on; all the goodies are in there. If you rescore the top hundred hypotheses using the full neural network, because remember that is a full-sentence model, you get some reduction: twenty-four point one on the first set and a comparable gain on the other. If you go much deeper into the n-best list, you get a bit more of an improvement, almost one percent, point eight, point nine, which is pretty good, though maybe we could do better. But sure.
Anyway, that is what you get by doing n-best rescoring, and as I said, one of the problems is that the n-best list presented to you is good according to the original four-gram model. So what you can do instead is replace the original four-gram model with this phony four-gram model estimated from simulated text, which is much more text: three million words instead of a hundred and fifty thousand words. And as you can see on the first line of the second block, if you simply use the variational approximation four-gram instead of the Kneser-Ney four-gram, and decode with that, you already get a point four and a point two percent reduction in word error rate.
What is more interesting is that now, if you simply rescore the hundred best, which is much less work, you get almost all of the gain. So this is starting to look good. What we are saying is that not only do you get better single-best output if you use the n-gram approximation of the neural net language model for decoding, you also produce better lattices, because when you then rescore them with the full neural net language model, you get the reductions at a much lower N, with many fewer hypotheses to score, than you would if you had used the original model.
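The rescoring step itself is the usual re-ranking; a minimal sketch, assuming each first-pass hypothesis carries its acoustic log score and lm_logprob gives the full-sentence log probability under the long-span model (the scale and penalty values are placeholders):

```python
def rescore_nbest(nbest, lm_logprob, lm_scale=12.0, word_penalty=0.0):
    """Re-rank an n-best list with a (possibly long-span) language model.

    nbest: list of (words, acoustic_logprob) pairs from the first pass.
    lm_logprob: function mapping a word list to its LM log probability;
                because it sees the whole sentence, it can be long-span.
    """
    def total_score(hyp):
        words, acoustic = hyp
        return acoustic + lm_scale * lm_logprob(words) + word_penalty * len(words)

    return max(nbest, key=total_score)[0]
```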
So that was the medium-sized experiment. Now for the large experiment. This has to do with English conversational telephone speech and some meeting speech from the NIST evaluation data, the '07 set. Again there are about five million words transcribed, and that is our basic in-domain training data. Then we can either build a Kneser-Ney model from it, or we can create a neural net model, synthesize another order of magnitude of text from it, tens of millions of words, and then use that text to build the language model.
So again, the original language model is shown in blue, and the fake, simulated-data language model is shown in red. The notation "bigram, five-gram" indicates that we produced lattices using a bigram model but then rescored them using a five-gram model; and where the rescoring was done with the neural net, that is marked as well.
On that slide, again, there are a bunch of results, and again you can decide what the baseline is; I like to think of the Kneser-Ney five-gram lattice rescoring as the baseline. With that you get a twenty-eight point four percent word error rate on the CTS data, and on the meetings data thirty-two point four. If you rescore these using the neural net, you get down to twenty-seven point one with hundred-best rescoring, or twenty-six point five with thousand-best rescoring. That is the sort of thing you get if you decode using a standard n-gram and rescore using the neural net.
But instead of decoding with the standard n-gram, if you throw it out and replace it with this new n-gram, which is an approximation of the neural net, here is what you get. You get a nice reduction just by using a different n-gram model in the decoder: twenty-eight point four becomes twenty-seven point two, and thirty-two point four becomes thirty-one point seven, which is nice. We don't get as much of an additional gain from the rescoring, so maybe the approximation has done too good a job of mimicking the neural net, but there is a gain to be had, at least in the first-pass decoding.
So, to conclude: this has basically convinced us, and hopefully has convinced you, that if you have a very long-span model that you cannot fit into your traditional decoding, then rather than just leaving it to the end for rescoring, you might as well simulate some text from it and build an n-gram model on the simulated text, because that is already better than just using the old n-gram, and it might save you some work during rescoring. We were able to significantly lower the word error rate, and that of course is mainly coming from the RNN language model.
Before I conclude, let me show you one more slide, which is interesting; it is the one Tomáš showed at the end of his talk. This is a much larger training set, because one of the reviewers, by the way, asked: this is all nice, but does it scale to large data? What happens when you have a lot of original training data? So this is a setup where we have four hundred million words to train the initial n-gram, and then, once we train a neural net on that, which takes forever, but that is Tomáš's problem, we simulate about a billion words of text and build an n-gram model out of that.
And as you can see, the original decoding with the four-gram gives you a fourteen point one error rate, and if you simply decode using the approximated four-gram based on the neural net, you already improve on that. That tells you that just replacing your old four-gram model with the new one, based on approximating a better language model, is already a good start. And the last number, which Tomáš also showed you, is twelve point one, if you rescore the lattices, or some n-best list derived from them, with the full model.
Let me go back and say: that's it, and thank you.

Questions?
Yes, your long-span model has to be good. Yes. Okay.
Okay, I can't give you a number on that in terms of correlation, but if you think about it... sorry, should I repeat the question? Yes. This looks like the standard bias-variance trade-off. If you have lots of initial text to train the neural net... first of all, this will work only if the neural net is much better than the n-gram model you are trying to replace with the simulation, and it is. What you are approximating with the four-gram is not actual text but some imagined text. So the imagined text has to be a better representation of actual text than the four-gram approximation of the actual text is. There are two models competing here: there is actual text, which is a good representation of your language, and there is the four-gram, which is an okay representation of your language. So your model has to be better than the four-gram. Once you have that, you have taken care of the bias, and the simulation then reduces the variance.
Okay, another question.
I just wonder how big your one-billion-word language model is after you make it an n-gram language model.
Oh, it gets pruned down quite a bit, by the way. I don't remember the number off the top of my head, but yes, there is standard Stolcke-style entropy pruning and all sorts of things. So it is... I don't know, Tomáš, do you remember how big that language model is?
Five million and fifty million.
Yeah, five million n-grams for decoding and fifty million for rescoring, I think.
I just have a question about... you don't have, say, a result where you run the recurrent language model in actual lattice rescoring, right?
Sorry, I can't hear you very well because of the echo; I hear the reflections as well.
Do you have a direct result using lattice rescoring with the neural net language model?
A direct result of lattice rescoring using the neural net model itself? No, we don't. But to explain that table: it is Broadcast News, so the first line is basically the Kneser-Ney model estimated from the four hundred million words of Broadcast News, and the second one is that Kneser-Ney model interpolated with the four-gram approximation of the neural net.
It may also be partly an implementation issue, because the nature of the expanded model is that it has no back-off structure, so every single context is distinct, and that makes the decoding harder. And if you want to rescore lattices with the neural net itself, you really cannot, because the recurrent part keeps information about the entire past, so when two paths merge in the lattice you cannot merge their histories.
Okay, time for one more question.
How much do you have to synthesize? Does the simulation work for just, you know, a factor of two or a factor of five?
It started working when we used at least an order of magnitude more than the original data.
Actually, there is a statistic which is interesting. When we simulated the two hundred and thirty million words, for example, we looked at n-gram coverage: how many of the hallucinated n-grams in the new data are genuine. Of course there is no way to tell for sure, but here is one way: we said, if an n-gram that we hypothesized, that we simulated or hallucinated, shows up in the Google 5-gram corpus, then we will call it a good simulation, and if it doesn't, we will call it a bad simulation. If you look at the real one million words of Wall Street Journal text, eighty-five percent of its n-grams are in the Google 5-grams. So real text has a "goodness" of eighty-five percent by this measure, and the two hundred and thirty million words we simulated have a goodness of about seventy-two percent. So we are mostly simulating good n-grams, but yes, you do have to simulate an order of magnitude more data.
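The "goodness" statistic is easy to state in code; a sketch, where reference_ngrams stands in for the set of n-grams extracted from the Google 5-gram corpus or any other large reference collection:

```python
def ngram_coverage(sentences, reference_ngrams, n=5):
    """Fraction of n-grams in `sentences` that also occur in `reference_ngrams`.

    sentences: iterable of token lists (e.g. the simulated corpus).
    reference_ngrams: a set of n-tuples taken from a large reference corpus.
    """
    seen = matched = 0
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            gram = tuple(sent[i:i + n])
            seen += 1
            matched += gram in reference_ngrams
    return matched / seen if seen else 0.0
```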
Okay, let's thank the speaker.