
So this is the outline of the talk. I'll begin with an introduction and the motivation for why Model M is relevant, then move on to the two main components which make the computation of Model M extremely computationally expensive and memory-hard, and then go on to some of the ideas we had for distributed training and how much gain we got in memory and speed by using these ideas.

So, just to give a very brief overview: everyone knows what exponential n-gram models are, and their power and ability to capture really complex dependencies. These were first introduced back in '96, and subsequently again in 2002.

So the definition goes as follows: given a set of target symbols Y and an input x, you can formulate the exponential language model in the following way. You see a numerator term, which is the summation over all features which are active, and a denominator term, which we refer to as the normalization sum computation, which involves an additional summation over all the target outputs. So x is the history and y is the target.

And now you can imagine, at a large scale, how extremely computationally intensive this is going to be.
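As a small sketch of the formulation just described (the feature weights, symbol names, and toy trigram feature set here are illustrative assumptions, not the talk's actual setup):

```python
import math

def exp_lm_prob(history, target, weights, vocab):
    """P(y|x) for an exponential n-gram model: the numerator sums the
    weights of features active for (x, y); the denominator (the
    normalization sum Z) repeats that sum over every target symbol."""
    def score(x, y):
        # Active features for a trigram model: unigram, bigram, trigram.
        feats = [(y,), (x[-1], y), (x[-2], x[-1], y)]
        return sum(weights.get(f, 0.0) for f in feats)
    z = sum(math.exp(score(history, y)) for y in vocab)  # normalization sum
    return math.exp(score(history, target)) / z

# Toy example: two-word history, tiny vocabulary, made-up weights.
weights = {("b",): 0.5, ("a", "b"): 1.0}
vocab = ["a", "b"]
p = exp_lm_prob(("x", "a"), "b", weights, vocab)
```

Note how the denominator loops over the entire target vocabulary for every history, which is exactly the cost the talk is about.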

So for the typical exponential n-gram model where n equals three, you can think of three histories: the null history, the unigram history, and the bigram history.

Parameter estimation in an exponential model basically has many objective functions, which are variations of regularizations on the basic objective function. The one that we use we formulated in the following form, with an L1 and an L2 regularization component, for some choice of alpha and sigma which are determined empirically.
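The exact objective on the slide isn't fully recoverable from the audio; a plausible rendering, standard for exponential models with both regularizers (with α and σ as the empirically chosen constants), is:

```latex
\mathcal{O}(\Lambda) \;=\; -\sum_{(x,y)\in D} \log p_{\Lambda}(y \mid x)
\;+\; \alpha \sum_{i} |\lambda_i|
\;+\; \frac{1}{2\sigma^2} \sum_{i} \lambda_i^2
```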

For training these parameters there is a whole class of iterative update algorithms. Basically it's a comparison of the expected feature counts to the observed counts, and you update the parameters accordingly. So here you see the expression for the expected feature count, and you see a summation term over your training data D, which can be represented as a sequence of history and target pairs.
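A minimal sketch of one such update (a plain gradient-style step on observed-minus-expected counts, not the exact generalized iterative scaling update the talk uses; the toy trigram feature layout and learning rate are assumptions):

```python
import math
from collections import defaultdict

def feats(x, y):
    # Unigram, bigram, trigram features active for event (x, y).
    return [(y,), (x[-1], y), (x[-2], x[-1], y)]

def update_step(data, weights, vocab, lr=0.5):
    """Compare observed feature counts on the training data D with the
    expected counts under the current model, then nudge each weight."""
    observed, expected = defaultdict(float), defaultdict(float)
    for x, y in data:                    # D = sequence of (history, target)
        for f in feats(x, y):
            observed[f] += 1.0
        z = sum(math.exp(sum(weights.get(g, 0.0) for g in feats(x, yp)))
                for yp in vocab)
        for yp in vocab:
            p = math.exp(sum(weights.get(g, 0.0) for g in feats(x, yp))) / z
            for f in feats(x, yp):
                expected[f] += p
    for f in set(observed) | set(expected):
        weights[f] = weights.get(f, 0.0) + lr * (observed[f] - expected[f])
    return weights
```

Iterating this moves the model's expected counts toward the observed ones, which is the shared skeleton of the whole family of iterative update algorithms.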

If you look at the reference I have there, from Malouf in 2002, you'll find an exhaustive comparison of various iterative update algorithms and their advantages and disadvantages and so on. What we use internally is a generalized iterative scaling algorithm we've implemented.

So you heard about Model M from our previous speaker; it was introduced in 2009 by Chen. It basically has the following form: think of it as a combination of two exponential n-gram models.

What we've seen is that it consistently outperforms any other language model and gives us some of the best results; we've been seeing five percent relative gains across a wide range of domains and tasks, and on large corpora as well.

Now let's talk about Model M in this particular formulation, along the lines of the exponential n-gram model with n equal to three.

The key thing to note here is that you have a class model and a word model, both of which can be estimated independently. The class model has a history dependence on the previous classes as well as the words, while the word model has this little term c_j sitting here, which is the predicted class of the current word that you're trying to predict.

So the two models can be trained independently of each other.
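In the usual Model M notation, the decomposition just described reads (a sketch following the standard presentation rather than the exact slide):

```latex
p(w_i \mid w_{i-2} w_{i-1}) \;=\;
p(c_i \mid c_{i-2} c_{i-1},\, w_{i-2} w_{i-1}) \;\cdot\;
p(w_i \mid w_{i-2} w_{i-1},\, c_i)
```

where c_j denotes the class of word w_j; the first factor is the class model and the second the word model.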

And we know that Model M, as has been shown, gives significant performance improvements, but we also know it comes at a steep computational cost.

Typically something like thirty hours of computation is required for a typical training dataset of a hundred million words, and it consumes memory of around ten gigabytes.

So this is something that we want to reduce if we want to scale to billions of words, as is common in many tasks such as voice search, or any kind of large-scale broadcast news task.

There are two parts we worked on in our approach to improving the parameter estimation and making the computation fast. The first one I'm going to talk about is the normalization sum, and then I'll come to the feature expectation computation.

Before we get to the parallelization, we'll talk about efficient computation of these two components.

First, consider the following case. You remember from the previous slides the formulation of the norm sum computation. Consider two event histories x1 and x2. If you were to compute a differential of those two histories, the expression would look something like this.

Here the alpha term is just a summation over all of the features that are actually active.

Now think of any infrequent word: the only case where a feature is going to be really active is the unigram feature case; for most cases you'll find alpha of x1 equals alpha of x2.

If you could efficiently order all your histories in a manner in which you're reduced to small differential computations, then the whole process becomes very efficient.

And so the key idea here is to sort the corpus lexicographically, such that consecutive histories differ only in a small number of words.

What this does for you is it reduces the number of target words over which you have to sum in the final normalization sum computation.
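The sorting idea can be sketched directly (a toy corpus here; in practice the histories come from the training n-grams):

```python
def sort_histories(histories):
    """Lexicographically sort event histories so that consecutive
    entries differ in as few word positions as possible."""
    return sorted(histories)

def num_changed_positions(h1, h2):
    # Positions at which two equal-length histories differ.
    return sum(a != b for a, b in zip(h1, h2))

histories = [("the", "cat"), ("a", "dog"), ("the", "dog"), ("a", "cat")]
ordered = sort_histories(histories)
# Consecutive sorted histories share leading words, so the differential
# norm-sum computation only touches the few positions that changed.
changes = [num_changed_positions(a, b) for a, b in zip(ordered, ordered[1:])]
```

Python's tuple comparison already gives the lexicographic order, which is the only machinery the trick needs.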

So what is this lexicographic sorting? All we're doing is making sure the same words in position one occur first, and then the same words in position two occur next. It's sort of like a nested history tree clustering.

So you have your bigram, unigram, and empty histories, and you can visualize the whole thing as a prefix tree, where you have the null history at the root node, and all of the sorted event histories are children of that node.

Each n-gram event is a new leaf, and you can tie them all up at the root node.

And when you do this, you're basically assigning unique IDs to these event histories: the lower-order ones get lower IDs and the higher-order ones get higher IDs.
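One way to picture this ID assignment (a hypothetical sketch, not the actual implementation): walk the history prefix tree breadth-first from the empty history at the root, so lower-order histories receive smaller IDs than higher-order ones.

```python
from collections import deque

def assign_history_ids(histories):
    """Breadth-first numbering of the history prefix tree: the null
    history (root) gets ID 0, unigram histories come next, then
    bigram histories, so lower-order histories have lower IDs."""
    # Collect every prefix of every history, including the empty one.
    nodes = {()}
    for h in histories:
        for i in range(1, len(h) + 1):
            nodes.add(h[:i])
    ids, queue = {}, deque([()])
    while queue:
        node = queue.popleft()
        ids[node] = len(ids)
        children = sorted(n for n in nodes
                          if len(n) == len(node) + 1 and n[:len(node)] == node)
        queue.extend(children)
    return ids

ids = assign_history_ids([("a", "cat"), ("a", "dog"), ("the", "dog")])
```

Sorting the children at each node keeps siblings in the same lexicographic order as the sorted corpus.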

So when you do the differential computation I talked about, you're only going to sum differentials over a small number of terms, so your computation cost comes down; later we'll talk about the memory as well.

This is basically what I just covered: the differential can be computed efficiently.

So once you have the event histories sorted, you have one more step, which is just to sum over all the events.

Now, you can do the same lexicographic sorting at the n-gram event level as well.

When you do that, the only two items you keep track of are the alpha value and the norm sum value, which is the summation of alpha over all the target symbols you're interested in.

When you move from one event history to the next, you only want to identify the feature histories that differ, that have actually changed. This comes down to our sorting again, because then you know that the lower-order histories have lower unique IDs, and you really only have to difference the higher-order terms.

And so this differential becomes zero for all the lower-order histories and exists only for the higher-order ones.
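A toy version of that incremental denominator (the caching of the unigram base and the per-history correction for the few active higher-order targets are my reading of the scheme; the `active_targets` map is an assumed precomputed structure):

```python
import math

def norm_sums(sorted_histories, weights, vocab, active_targets):
    """Z(x) for each history, computed incrementally. z0 caches the sum
    of unigram-only terms once; for each history we only add corrections
    for the few targets whose higher-order features actually fire
    (every other target contributes its unigram term unchanged)."""
    def alpha(feat):
        return math.exp(weights.get(feat, 0.0))
    z0 = sum(alpha((y,)) for y in vocab)        # shared unigram base
    zs = {}
    for x in sorted_histories:
        z = z0
        for y in active_targets.get(x, ()):     # only the changed terms
            uni = alpha((y,))
            full = uni * alpha((x[-1], y)) * alpha((x[-2], x[-1], y))
            z += full - uni                     # replace the base term
        zs[x] = z
    return zs

weights = {("b",): 0.3, ("a", "b"): 0.7, ("x", "a", "b"): 0.2}
zs = norm_sums([("x", "a"), ("y", "a")], weights, ["a", "b"],
               {("x", "a"): ["b"], ("y", "a"): ["b"]})
```

The full vocabulary is never rescanned per history; only the handful of targets whose higher-order features fire are touched.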

So it's just an efficient way of storing your histories and your target symbols such that the denominator term in an exponential model can be computed efficiently.

Now once you do this, it spills over to the expectation computation of the features as well, because you can write the numerator term as just this alpha divided by Z, and you want to compute Z very efficiently, as before.

So consider the numerator term in the expectation computation, where you have a sequence of histories d1, d2, and you write the whole expression out this way.

If you have one element in the event history which can be shared for the same target y, then this expression reduces to the alpha term coming out and just the second term remaining.

What this means is this becomes independent of y, and you can share it across multiple targets.

So think of the event histories as you ordered them in the norm sum computation, such that only a few of the alpha terms change; because of that you get some good gains in the computation, and it all spills over to the expectation computation.

There it just reduces to a multiplication, if you will, of the running total that you kept so far for the norm sum: you multiply it by the alpha terms, which you also computed only for those event histories that changed.

So that's what we point out here: it becomes a very efficient computation of the numerator term.

Once we have both the numerator and the denominator, the parameter updates in an exponential language model become very efficient, and well-known algorithms exist to do this kind of iterative scaling.

At this point I want to talk a little bit about the prior art here.

This sort of lexicographic sorting was introduced by Lafferty in '95, and it's one of the bases that we use for this efficient computation, both for the norm sum and the expected feature computations.

Wu and Khudanpur, in 2000 and 2002, used a similar prefix-tree traversal type algorithm. The difference between their work and our work is that they used a propagation, if you will, of norm sum terms from lower-order to higher-order n-grams, and of expectations from higher-order to lower-order n-grams. Computational-complexity-wise, the two algorithms, the one we talked about and the one they proposed, would be similar, except that our formulation is much simpler.

It relies purely on something as simple as sorting the corpus.

And we only compute the differentials; we don't need to store all of this in memory, we only need to keep the differentials in memory, so you don't have to store all of the n-gram features.

And given that this is a simpler formulation, you can think of sharing information across multiple sets of n-gram features, like the class prediction model: we have two exponential models in Model M.

So those are the sort of key differences from the prior art that's been proposed.

Now, how do you do this practically? So far we've talked about efficient computation; now we want to ask how you can parallelize the norm sum computation. You can do it in two ways.

One way, you take your target vocabulary and split it across machines; or, you take your actual corpus, and each machine computes over your whole target vocabulary but only on a limited portion of the corpus.

So let's take the first case, where the target vocabulary is split. Then the norm sum can be computed by summing the normalization terms over each vocabulary partition.

What happens here is that the expectation computation also gets split perfectly, by splitting the target words across the machines.

What you would need to do is manage all these partial normalization terms at the end of one computation and broadcast them back, so each of the slave nodes can do the subsequent computation.
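A toy, single-process illustration of the vocabulary-split scheme (the partitions here are simulated lists; a real setup would run each partition on its own machine and broadcast the merged Z back to the slaves — the function names and toy weights are assumptions):

```python
import math

def score(weights, x, y):
    # Same toy trigram feature layout as in the earlier sketches.
    feats = [(y,), (x[-1], y), (x[-2], x[-1], y)]
    return sum(weights.get(f, 0.0) for f in feats)

def partial_norm_sum(weights, x, vocab_part):
    """What one worker computes: the normalization terms for just its
    slice of the target vocabulary."""
    return sum(math.exp(score(weights, x, y)) for y in vocab_part)

# Master: split the vocabulary across (simulated) machines, sum the
# partial norm sums, then 'broadcast' the merged Z for the next step.
weights = {("b",): 0.5, ("a", "b"): 1.0}
x = ("x", "a")
partitions = [["a"], ["b"]]
z = sum(partial_norm_sum(weights, x, part) for part in partitions)
z_full = partial_norm_sum(weights, x, ["a", "b"])  # single-machine reference
```

Because the norm sum is a plain sum over targets, the merged result is exactly the single-machine value, which is what makes this split correct.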

So you would need to merge the norm sums in this case. Now if you go to the second case, where you split the corpus, you're splitting parts of the corpus across different machines.

There, while you can do the norm sum computation on a single machine and no merge is required, for the expectation parts you would need to make sure they have the right normalization terms, so that way you have to have some sort of communication between slaves and masters.

So now you have a combination of the earlier efficient computation and the parallelization we talked about; together they seem to work very well in reducing the memory and the computational time.

In this table we show that if you were to do a direct implementation with no shortcuts, the computation time for a typical Hub4 corpus with 1.5 million words is going to be on the order of four thousand seconds.

Unigram caching is something that everybody does, so that significantly brings the time down, and we can reduce it even further with our proposed approach, which comes down to just seconds.

So this is a neat way to do the exponential model combination computation.

The results I'll show you now are on the English Hub4 system. The training set is about three hundred million words, with an 80K vocabulary.

The number of features that are alive for the class portion of the exponential model and for the word portion of the model are roughly 238 million and 276 million, respectively.

For comparison, we also have results on the Arabic system, which is 1.6 billion words.

And you see how many more histories are alive for the class and the word-based models.

Here we give you an idea of the memory required during training. You see all four models: the class and the word models, for English as well as Arabic.

The x-axis shows how much memory per node is consumed, and the y-axis basically shows what happens as the number of nodes increases, as you parallelize further.

So, as expected, you see a nice drop in the amount of memory consumed.

What this means is you can train something with billions of words in a day, as opposed to earlier, when it was taking weeks to do this.

Those were the memory numbers; if you look at training time, or computational time, it is also significantly reduced.

You go down from a single-machine figure of something like 250 hours to, when you parallelize with just twenty machines, about fifty hours. So that's a nice improvement we get in training time as well as in memory.

There is also related work on parallelization, and many variants are being pursued in the literature.

Originally, in '96, this was introduced as a theoretical result on how you can parallelize n-grams.

Distributed language model computation has since been pursued, for example by Emami et al. and by Brants et al., both from 2007.

And other authors have investigated how you can do maximum entropy based models in a distributed environment as well.

So this work is along the lines of these two, and we've shown how you can efficiently implement this computation.

Just some notes on computational cost.

We've used unnormalized iterative scaling here; you'll see the details in the 2009 paper.

But there are other choices, such as conjugate gradient, L-BFGS, and Rprop. We didn't compare exactly how our parameter estimation fares against these, but you can think of them as additional improvements that you could potentially get.

The timing results are also based on a strict convergence criterion. We used a very tight one; you could probably relax it a little, depending on the task, and get further speed improvements.

The computational cost is a function of both the number of unique features and the number of unique histories.

In a typical case we process about 1.2 million unique n-grams per second with five compute nodes.

This can vary drastically across tasks, whether it's a broadcast news task or a conversational task, and so on.

So again, the relative gain you get will depend on the task.

But basically, if you don't do this, scaling to a large corpus becomes almost infeasible.

So to conclude: we have an efficient training algorithm for exponential n-gram models, which we believe we can extend to other types of features within the Model M framework.

And one addition on top of this: if you parallelize it, either on a target vocabulary subset or on a data subset, then you get efficient use of memory as well as better computational cost.

And if you split across data, you can increase the number of machines a lot more than if you were to just split the target vocabulary.

and that's all

Questions? Are there questions from the audience?

[Audience member:] A simple question, not quite related to this work: did you use pruning in your model?

Yes, the results I presented here were without any pruning, but we do have numbers with pruning, feature pruning, and we find that we absolutely don't take much of a hit unless you prune the features down to something like fifty.

[Inaudible follow-up question.]

Well, that's another paper, but we have tried a variety of pruning methods: some of them simply based on a count threshold, some of them basically extending n-gram pruning methodology to the exponential n-gram framework of Model M, and they all seem to work in a similar way.

So we have another forthcoming paper on that.