"'kay" uh

first a couple of quick disclaimers

This is work mostly done by Anoop Deoras and Tomas Mikolov, who gave the previous talk,

with a lot of help from Stefan and others,

and I just have the pleasure of sort of pretending that I had something to do with it and giving the talk.

Anoop actually couldn't come because of some visa problems, not on the Czech side but on the way back to the US, so he sends his apologies.

And he even made the slides, so if the slides are too nice the credit is his; I don't think he's trying to hide anything.

Alright.

So, you heard Tomas tell you how the recurrent neural net language model gives great performance.

One of the issues with a model like that is that it has, at least theoretically, infinite memory, and in practice it really does depend on the past five, seven, eight words,

so you really can't do lattice rescoring with a model like this.

So the main idea of this paper is: can we do something with the neural net language model so that we can rescore lattices with it?

And if you want the idea in a nutshell: this whole "variational approximation" is a scary term, I don't know how anyone came up with that; it's actually a very simple idea.

Imagine for a moment that there really was a true language model that generates all the sentences that you and I speak.

Right?

How do we actually approximate such a model? What we do is we take a lot of actual text sentences.

So it's a little bit like saying we sample from this true underlying distribution and get a bunch of sentences,

and then we approximate that underlying model with a Markov chain, like we do with second, third, fourth order Markov chains,

and that is an n-gram approximation of the true underlying model

that you and I believe lives in our heads, or wherever. Right?

So the idea here is the same: you take Tomas's neural net language model

and pretend that that's the true model of language,

generate lots and lots of text data from that model, instead of having human beings write the text,

and simply estimate an n-gram model from that.

That's essentially the long and short of the paper.

Right.

So I'll tell you how it works.

Let me try to do it this way... no, that's okay.

Okay, so this is the standard statistical setup.

We get annotated speech, we use the usual training procedure to create acoustic models; we get some text, we use it to train a language model; we combine them and figure out some scale factor.

We get new speech, we feed it to the decoder, which has all these models, and it produces a transcribed utterance.

Essentially we are implementing this formula, P of A given W to the power one over mu, times P of W, and we find the W that maximizes it; that's just the standard setup, nothing new.
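Written out, that decision rule is roughly the following, with mu the usual language model scale factor (exactly how the exponent is placed is a detail of the slide I'm paraphrasing):

\hat{W} = \arg\max_{W} \; P(A \mid W)^{1/\mu} \, P(W)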

Now, the language model is typically, as I said, approximated using an n-gram.

So we take P of W, with W being a whole sentence, and write it as the product over i of P of w_i given w_1 through w_{i-1},

and then approximate the history by just the last couple of words; that's what gives rise to n-gram models.
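In symbols, for a sentence W = w_1 ... w_L:

P(W) = \prod_{i=1}^{L} P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; \prod_{i=1}^{L} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})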

And typically we use a small n and a small number of n-grams in order to make the decoding feasible,

and then, because we worry about quality, we instead create a search space like a lattice, and we rescore the lattice using a bigger n-gram.

Again, standard practice; everybody knows this, there's nothing new here.

So this is asking: can we go beyond n-grams, can we use something like Tomas's neural net to do lattice rescoring?

Or we could perhaps talk about using more complex acoustic models.

And all of these are feasible with lattices, provided your models are tractable in the sense of being local:

quinphones, even though they have a five-phone context, are still local, and n-grams, even for n equals five, are still local.

But if you truly have a long-span model, that's not possible; you can't do it with word lattices,

so you tend to do it with n-best lists.

Right.

So n-best lists have the advantage that you can deploy really long-span models, but they do have a bias,

and this paper is about how to get past that.

So the real question is: can we do with lattices what we currently do with n-best lists?

So let's see how we get there.

So, what's good about word lattices? They provide a large search space,

and they are not biased by the new model that will be used to rescore them, because they really have a lot of options in them.

But they do require the new models to be local.

N-best lists don't require the new models to be local; the new models can be really long-span.

But they do offer a limited search space, limited by n, and more importantly the top n are chosen according to the old model.

So the choice of the top-n hypotheses is biased.

So, in other words:

Anoop has another paper at this ICASSP, and there's a poster going on right now which you might be able to catch after the session,

which shows how to essentially expand the search space, how to search more than just the n-best in the lattice.

Right.

And that's like saying: let's do lattice-like stuff with n-best lists, except let's somehow try to make the n very large without really having a large n.

So that's saying: let's fix the decoding and rescoring side with some clever search.

This paper goes in the other direction.

It says: let me somehow approximate the long-span model and make it local.

So yes, the neural net model has a long history, but can I somehow get a local approximation of that model?

And that's what's at stake here.

Alright, so let's see.

So this is sort of the outline of the talk that's coming up.

The first order of business: what long-span model do we approximate? I already told you,

it's the recurrent neural net model that Tomas just presented,

so you all know about it; I won't waste time.

So what is this approximation going to be?

Well, I think of it this way.

Let's say this is the space of all language models,

let's say that this is the set of all recurrent neural networks,

and let's just say this is the set of all n-gram models.

And let's say this is the model you'd like.

Well, typically you can decode and rescore lattices with n-gram models, so those are tractable, the ones in that color there, whatever that color is.

Okay, anyway, so that set is tractable; the blue one is not.

So what we should really be doing is using a tractable model which is as close as possible to the model we'd really like to use.

Right?

But the n-gram model that we actually use for rescoring, estimated from the training data, may not be that model:

from the same training data you estimate a recurrent neural net model P,

and you estimate an n-gram model M,

and M may not be close to Q-star, the n-gram which is closest to your long-span language model, whatever that might be.

So what we do in this paper is to say: instead of M, what happens if we use Q-star?

What if we approximate the model P itself with an n-gram model?

So that's how you want to look at this.

So what is this... actually, I'll skip that; this is what happens when you try to use someone else's slides.

Okay.

So here is how the approximation is going to work.

You're going to look for an n-gram model Q,

among the set of all n-grams, script Q,

which is closest to this long-span model in the sense of KL divergence.

Fair enough? Everybody gets that.

Alright.

So what do you do? Well, you essentially say:

the KL divergence between P and Q is basically just a sum over all X of P of X log P of X over Q of X, where X ranges over all possible sentences in the world,

because these are sentence-level models.

And if you forget about the P log P term,

what you really want is the Q that maximizes

the sum over all sentences in the universe of P of X log Q of X, where P of X is the neural net probability of the sentence

and Q of X is the n-gram probability of the sentence.

Right.
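In symbols, that criterion is (with script Q the family of n-gram models, and the sum running over all sentences x):

D(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}, \qquad Q^{*} = \arg\min_{Q \in \mathcal{Q}} D(P \,\|\, Q) = \arg\max_{Q \in \mathcal{Q}} \sum_{x} P(x)\,\log Q(x).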

Of course you can't compute P of X for every sentence in the universe,

but what you can do is, just like we do with normal human language models, approximate it by sampling: with human language the "samples" are the text people write down, and we estimate an n-gram model from that text.

So what we do is: we'll synthesize sentences using this neural net language model, and we'll simply estimate an n-gram model from the synthesized data.

So the recipe is very simple. You get your fancy long-span language model; that is a prerequisite.

This fancy long-span language model needs to be a generative model, meaning you need to be able to simulate sentences with it,

and the RNN is ideal for that.

You synthesize sentences,

and then, once you have a corpus huge enough that you're comfortable estimating whatever n-gram you want to estimate, you go ahead and estimate it as if somebody gave you tons of text.
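Just to make the recipe concrete, here is a minimal sketch in Python. The rnnlm object and its next_word_distribution method are stand-ins for whatever long-span generative model you have (the actual system used Tomas's recurrent neural net language model), and the resulting counts would then go into a standard smoothed n-gram estimator:

import random
from collections import defaultdict

def sample_sentence(rnnlm, max_len=50):
    """Draw one sentence from the long-span model, word by word."""
    history, sentence = ["<s>"], []
    while len(sentence) < max_len:
        dist = rnnlm.next_word_distribution(history)  # assumed API: word -> probability
        words, probs = zip(*dist.items())
        w = random.choices(words, weights=probs, k=1)[0]
        if w == "</s>":
            break
        sentence.append(w)
        history.append(w)
    return sentence

def count_ngrams(corpus, n=3):
    """Collect n-gram counts from the synthetic corpus; these counts are then
    fed to an ordinary n-gram estimator (e.g. with Kneser-Ney smoothing)."""
    counts = defaultdict(int)
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# synthetic = [sample_sentence(rnnlm) for _ in range(10_000_000)]
# trigram_counts = count_ngrams(synthetic, n=3)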

So it sounds crazy, yeah?

So let's see what it does.

I'm going to give you three sets of experiments: first a baby one, then a medium one, and then the really muscular one.

So this is the baby experiment: we start off with the Penn Treebank corpus.

It has about a million words for training and about twenty thousand words for evaluation,

and the vocabulary is made of the top ten thousand words; it's the standard setup.

And just to tell you how standard it is:

Peng Xu and Jelinek ran their random forest language models on it, Chelba presented the structured language model, Roark had a language model on it as well,

then Filimonov and Mary Harper have a model on it,

and there are cache-like language models and all sorts of other models that people have evaluated on exactly this corpus.

So in fact we didn't have to redo those experiments; we simply take their setup and copy their numbers. It's pretty standard.

So what do we get on that?

First we estimated this neural net (actually Tomas did),

and then we simply ran it in simulation mode: instead of having a million words of training text,

we generated two hundred and thirty million words of training data.

Two hundred and thirty million, that's how much we could create in a day or so, so we stopped there.

And then we simply estimate an n-gram model from the simulated text.

So the n-gram models generated from the simulated text are the variational approximation of the neural net;

we can generate either trigram or five-gram models from it, and here are the numbers.

So a good trigram model with standard Kneser-Ney smoothing has a perplexity of one forty-eight.

If you approximate the neural network by this synthetic method (now, this model is estimated from the two hundred and thirty million words of simulated data, so it's a lot of data), its perplexity is only one fifty-two, which is sort of comparable to that.

And if you interpolate the two, you get one twenty-four, so maybe there's reason to be hopeful.
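The interpolation here is just the usual linear mixture of the two models, with a weight lambda tuned on held-out data (the actual weight isn't something I can read off the slide):

Q_{\mathrm{interp}}(w \mid h) \;=\; \lambda\, Q_{\mathrm{var}}(w \mid h) \;+\; (1-\lambda)\, Q_{\mathrm{KN}}(w \mid h)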

Who do we compare with?

So Peng Xu's random forest language model is sort of a trigram: it looks at the two previous words in the sentence,

and we can compare with that; that's one thirty-two versus this one twenty-four. Okay, so far so good.

You can also compare it with the Chelba or Roark or Filimonov-Harper kinds of language models: they look at previous words and syntactic heads and so on, so think of them as five-gram models, because they look at four preceding words,

although these are not the four consecutive preceding words; they are chosen based on syntax.

So if you want to do that comparison, we can do it by simulating a five-gram model.

A five-gram Kneser-Ney model based on the same one million words of training text has a perplexity of about one forty,

and this approximation of the neural net also has a perplexity of about one forty, but when you interpolate you get one twenty,

and so it's again competitive with all the other models.

And then finally, there are cross-sentence models in the literature that look at the previous sentences; to compare with those, we simply implemented a very simple cache language model:

cache all the words in the previous couple of sentences, and interpolate that unigram cache with the n-gram model.
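The cache model is the standard one: a unigram distribution over the recently seen words, interpolated with the static model. Roughly (the interpolation weight and the cache window are whatever was tuned; they are not on the slide):

P_{\mathrm{cache}}(w \mid h) \;=\; \lambda\, P(w \mid h) \;+\; (1-\lambda)\, \frac{c_{\mathrm{recent}}(w)}{\sum_{v} c_{\mathrm{recent}}(v)}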

And you get a perplexity of a hundred and eleven.

Mind you, this is still not as good as the one oh two that Tomas gets, so that tells you that the exact neural net language model is still better than the approximation we are creating; after all, we are approximating a long-span model with a five-gram.

But the approximation is already pretty good, and quite different from the n-gram you would estimate from just the one million words of text.

So the thing seems to be working.

Okay, so those are nice perplexities.

The next experiment: we were looking at the MIT Lectures data.

For those of you who don't know this corpus, there are a few tens of hours of acoustic training data, I think something like twenty, and then a couple of hours of evaluation data; these are professors giving lectures, and there are a couple of speakers.

So we have transcripts, about a hundred and fifty thousand words of in-domain text,

and we have a lot of data from Broadcast News, which is out of domain: it's news, and these are lectures at MIT.

And so we basically said, let's see what we can do with it.

We estimated an RNN from the in-domain data,

simulated twenty times more data, that's three million words of text, and estimated an n-gram from that,

and compared with the baseline Kneser-Ney model; this is a comparison saying, what can the simulated language model do for you?

And you don't want to throw away the Broadcast News models, because they do have a lot of good English in them that helps for the lectures as well, and a lot of coverage.

As for baselines, you can choose your own.

If you just use the Kneser-Ney model estimated from the MIT data, interpolated with Broadcast News, you do about as well as you can,

and you get word error rates like twenty-four point seven on one lecture and twenty-two point four on the other lecture.

Those are reasonable numbers.

The acoustic models are fairly well trained; they have discriminative training, MMI, and I don't know if they have fMPE, but it seems that they do.

So yes, all the goodies are in there.

And if you rescore the top hundred hypotheses using the full neural network (remember, that's a full-sentence model),

you get some reduction: twenty-four point one for the first lecture, and similarly on the other.

If you go much deeper in the n-best list, you do get a bit more of an improvement, close to one percent: point eight, point nine.

And the oracle error rate of these lists is around fifteen or so, so maybe we could do better. But sure.

Anyway, that's what you get by doing n-best rescoring, and as I said, one of the problems is that the n-best list presented to you is good according to the original four-gram model.

So what you can do is replace the original four-gram model with this phony four-gram model estimated from simulated text: much more text, three million words instead of a hundred and fifty thousand words.

And you can see, on the first line of the second block, that if you simply use the variational approximation, the four-gram estimated from the RNN instead of the original four-gram, and decode with that,

you already get a point four, point two percent reduction in word error rate.

What's more interesting is that now, if you simply rescore the hundred-best, which is much less work,

you can get almost all of the gain.

So this is starting to look promising.

So what this is saying is that not only do you get better one-best output if you use the n-gram approximation of the neural net language model;

you also produce better n-best lists, because when you then rescore them with the full neural net language model,

you get the reductions at a much lower n: you only have to rescore a hundred hypotheses instead of the thousand you would need if you had the original model.

So that was the medium-size experiment.

Then there is the large experiment.

This has to do with English conversational telephone speech and some meeting speech from the NIST RT-07 evaluation data.

And again, there are about five million words transcribed; that's our basic in-domain training data.

And then we can either build a Kneser-Ney model, or we can create a neural net model, synthesize another order of magnitude more text from it, and use that text to build the language model.

So again, the original language model is in blue, and the fake, simulated-data language model is in red.

And again, the notation two-gram, five-gram indicates that we produced lattices using a bigram model but then rescored them using a five-gram model,

and then there is rescoring with the neural net on top of that.

And on that, here again there are a bunch of results, and again you can decide what the baseline is; I like to think of the Kneser-Ney five-gram lattice rescoring as the baseline.

On the CTS data you get a twenty-eight point four percent word error rate, and on the meetings data a good thirty-two point four.

If you rescore these using the neural net, you get down to twenty-seven point one if you do hundred-best rescoring, or twenty-six point five if you do thousand-best.

And that is the kind of thing you can get if you decode using a standard n-gram and rescore using the neural net.

But instead of the standard n-gram, if you now replace it with this new n-gram, which is an approximation of the neural net, here is what we get.

You get a nice reduction just by using a different n-gram model: twenty-eight point four becomes twenty-seven point two, and thirty-two point four becomes thirty-one point seven, which is nice.

We don't get as much of an additional gain from rescoring, so maybe we have done too good a job of making the lattices, but there is a gain to be had, at least in the first-pass decoding.

So, to conclude:

this basically convinced us, and hopefully has convinced you,

that if you have a very long-span model that you cannot fit into your traditional decoding,

then rather than just leaving it to the end for rescoring, you might as well simulate some text from it and build an n-gram model of the simulated text,

because that already is better than just using the old n-gram,

and it might save you some work during rescoring:

we were able to get the improvements with a significantly lower n in the n-best lists, and the gains are of course mainly coming from the RNN.

Before I conclude, let me show you one more slide, which is interesting; it's the one that Tomas showed at the end of his talk.

This is a much larger training set, because one of the reviewers, by the way, said: this is all nice, but does it scale to large data? What happens when you have a lot of original training data?

So here we have four hundred million words to train the initial n-gram,

and then we train a neural net with that, which takes forever and ever, but that's Tomas's problem.

We then simulate a billion words of text and build an n-gram model out of that.

And as you can see, the original decoding with the four-gram gives you a fourteen point one word error rate,

and if you simply re-decode using the approximated four-gram based on the neural net, you already get down to thirteen point something.

That tells you that just replacing your old four-gram model with the new one, based on approximating a better language model, is already a good start.

And then the last number, which Tomas showed you, is twelve point one, if you rescore the lattices, via n-best lists, with the full RNN model.

Let me go back and say: that's it.

And thank you.

Questions?

Okay.

Yes, your long-span model has to be good.

Yeah.

Yes.

Okay.

Oh.

Yeah.

Okay.

Okay, I cannot give you a quantitative answer in terms of correlation, but think of it the following way. Sorry, should I repeat the question? Yes.

Yeah, this looks like the standard bias-variance trade-off.

If you have lots of initial text to train the neural net... well, first of all, this will work only if the neural net is much better than the n-gram model you are trying to replace with the simulation. Yeah, and it is, yes.

You are approximating, not actual text with the four-gram, but some imagined text with the four-gram.

So the imagined text is not going to be as good a representation of the language as actual text;

but it has to be a better representation of actual text than the four-gram approximation of the actual text is.

So there are two models competing here: there is actual text, which is a good representation of your language,

and the four-gram, which is an okay representation of the language.

So your model has to be better than the four-gram.

And once you have that, you can take care of the bias, and then the simulation reduces the variance.

Okay.

I think there is another question.

I just wonder how big your one-billion-word language model is, after you made it an n-gram language model?

Oh, it was pruned; it really went down quite a bit. I am trying to think what the number is.

No, I do not think I have the number at my fingertips, but yes, there is standard Stolcke-style entropy pruning and all sorts of things.

So it is... Tomas, do you remember how big that language model is?

Five million and fifty million.

Yeah, five million for decoding and fifty million for rescoring, I think.

I just have a question about the... um, so, did you run your neural net language model in lattice rescoring?

Sorry, I cannot hear you because of the audio doubling in here; I hear the feed as well as the reflections.

So, do you have a direct result using lattice rescoring with the neural net language model?

A direct result of lattice rescoring using the neural net model itself?

We do not.

But I will tell you, in that table: the first one is the Kneser-Ney model used in decoding.

Well, that is Broadcast News, so the first is basically a Kneser-Ney model based on the Broadcast News, four hundred million words of Broadcast News,

and the second one is the Kneser-Ney model from the Broadcast News interpolated with the four-gram approximation of the neural net.

And as for why we did not do direct lattice rescoring: it may be slightly biased, but it was probably just the implementation, because the nature of the expansion needed, having no backoff structure, means that every single context is distinct, and that is what would make the decoding slow.

And if you want to use the neural net itself, you cannot rescore lattices with the neural net, because the recurrent part is keeping information about the entire past, so when two paths merge in the lattice you cannot merge them. Okay.

Yes.

Okay.

Time for one more question.

How much do we have to synthesize? I think the simulations did not work for just, you know, a factor of two or a factor of five;

they started working when you simulate at least an order of magnitude more than the original data, yeah.

Because, actually, there is a statistic which is interesting: namely, when we simulated the two hundred and thirty million words, for example,

we looked at the n-gram coverage: of the n-grams in the simulated data, how many are hallucinations and how many are genuinely good new n-grams?

Of course there is no way to tell, except here is one way:

we said, if an n-gram that we hypothesized, or hallucinated, in simulation shows up in the Google five-gram corpus,

then we will say it was a good simulation,

and if it does not, we will say it was a bad simulation.
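As a rough sketch of that check (here the reference n-grams, say those extracted from the Google five-gram corpus, are abstracted into a plain Python set, which is an assumption about tooling rather than what was actually used):

def ngram_goodness(sentences, reference_ngrams, n=5):
    """Fraction of n-grams in the simulated sentences that also occur in the reference set."""
    total, found = 0, 0
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            total += 1
            if tuple(sent[i:i + n]) in reference_ngrams:
                found += 1
    return found / total if total else 0.0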

So if you look at the one million words of Wall Street Journal text, eighty-five percent of its n-grams are in the Google five-grams.

So real text has a goodness of eighty-five percent,

and the two hundred and thirty million words we simulated have a goodness of about seventy-two percent.

So we are mostly simulating good n-grams,

but you do need to simulate an order of magnitude more.

Okay, let's thank the speaker.

Okay.