0:00:15 | Can you hear me?

0:00:18 | [inaudible opening remarks]

0:00:40 | So this is the outline of the talk. I'll begin with an introduction and the motivation for why
0:00:45 | model M is relevant at IBM,

0:00:47 | and then move on to the two main components which make the computation of model M extremely computationally
0:00:55 | expensive and memory-hard,

0:00:57 | and then go on to some of the ideas we had for distributed training, and how much gain we got in
0:01:02 | memory and
0:01:03 | speed by using these ideas.

0:01:06 | So just to give a very brief overview: everyone knows what exponential n-gram models are, and their power
0:01:13 | and ability to capture really complex dependencies.

0:01:17 | These were first introduced back in '96 and subsequently again in 2002.

0:01:23 | So the definition goes as follows:
0:01:25 | given a set of target symbols Y and an input X,
0:01:29 | you can formulate the exponential language model in the following way,

0:01:34 | where you see a numerator term, which is the summation over all features which are active, and a denominator
0:01:40 | term, which we refer to as the normalization sum computation,
0:01:44 | which involves an additional summation over all the target outputs.
0:01:48 | So X is the history and Y is the target.

0:01:51 | And now you can imagine at a large scale how extremely computationally intensive this is going to be.
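To make the formulation concrete, here is a minimal Python sketch of the probability computation just described; the feature function, weight table, and target set are hypothetical stand-ins for illustration, not the actual implementation discussed in the talk.

```python
import math

def exp_model_prob(y, x, active_features, weights, targets):
    """p(y | x) for an exponential n-gram model: a numerator summing the
    weights of the active features, divided by the normalization sum Z(x),
    which requires an extra summation over every target symbol."""
    def score(target):
        # Sum the weights of all features active for (history x, target).
        return sum(weights.get(f, 0.0) for f in active_features(x, target))
    z = sum(math.exp(score(t)) for t in targets)  # the expensive norm sum
    return math.exp(score(y)) / z
```

The denominator loop over `targets` is what the talk calls the normalization sum; with a large vocabulary it dominates the cost.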

0:01:57 | So for the typical exponential n-gram model where n equals 3, you can think of three history trees: the null
0:02:03 | tree, the unigram tree, and the bigram tree.

0:02:06 | And parameter estimation in an exponential model basically has many objective functions, variations of regularization on the
0:02:15 | basic objective function,
0:02:16 | and the one that we use we formulated in the following form,

0:02:21 | with an L1 and L2 regularization component, for some choice of alpha and sigma which are
0:02:27 | determined empirically.
0:02:31 | For training these parameters there is a whole class of iterative update algorithms,

0:02:36 | and basically it's a comparison of the expected feature counts to the observed counts,
0:02:41 | and you update the parameters accordingly. So there you see the expression for the expected feature count,
0:02:47 | and you see a summation term over your training data D,
0:02:52 | which can be represented as a sequence of histories and target pairs.

0:02:57 | If you look at the reference that I have there, from Malouf in 2002,
0:03:01 | you'll find an exhaustive comparison of various iterative update algorithms and their advantages and disadvantages and so on.
0:03:08 | What we use internally is a generalized iterative scaling algorithm we've implemented.
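The comparison of expected to observed counts can be sketched as a simplified iterative-scaling-style update; the exact generalized/unnormalized variant used internally differs, and the damping factor here is an assumption for illustration.

```python
import math

def scaling_update(weights, observed, expected, damping=1.0):
    """One simplified iterative-scaling-style step: nudge each feature weight
    by the (damped) log-ratio of observed to expected feature count, so
    features whose expected count is too low get their weight increased."""
    updated = {}
    for f, w in weights.items():
        if observed.get(f, 0.0) > 0 and expected.get(f, 0.0) > 0:
            updated[f] = w + damping * math.log(observed[f] / expected[f])
        else:
            updated[f] = w  # leave untouched when counts are unavailable
    return updated
```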

0:03:15 | So you heard about model M from our previous speaker;
0:03:17 | it was introduced in 2009 by Chen.

0:03:21 | It basically has the following form; think of it as a combination of two exponential n-gram models.

0:03:28 | And what we've seen is it consistently outperforms any other language model and gives us one of the best
0:03:34 | results; we've been seeing 5% relative gains across
0:03:37 | a wide range of domains and tasks,
0:03:39 | and
0:03:40 | large corpora as well.

0:03:43 | Let's talk about model M in this particular formulation, where you have a trigram context
0:03:49 | along the lines of the exponential n-gram model, with n equal to 3.

0:03:53 | So the key thing to note here is you have a class model and a word model, both of which
0:03:57 | can be estimated independently.
0:03:59 | The class model has a history dependence on the previous classes
0:04:03 | as well as the words;

0:04:05 | the word model has this little term c_j sitting here,
0:04:09 | which is the predicted class of the word,
0:04:11 | of the current word w_j that you're trying to predict.
0:04:15 | So the two models can be trained independently of each other.
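The decomposition just described can be sketched as the product of two models; `class_model` and `word_model` are assumed callables `p(target | context)` standing in for the two independently trained exponential n-gram models.

```python
def model_m_prob(word, word_class, history_words, history_classes,
                 class_model, word_model):
    """Model M, sketched: a class model predicts c_j from the previous
    classes and words, and a word model predicts w_j from the previous
    words plus the predicted class c_j; the word probability is their
    product. Both components are exponential n-gram models."""
    p_class = class_model(word_class, tuple(history_classes) + tuple(history_words))
    p_word = word_model(word, tuple(history_words) + (word_class,))
    return p_class * p_word
```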

0:04:18 | And we know that model M, as we've shown, has significant
0:04:23 | performance improvements,
0:04:24 | but we know it comes at a steep computational cost.

0:04:27 | So typically something like 30 hours of computation is required for a typical training dataset of 100 million
0:04:35 | words,
0:04:36 | and it consumes memory of around 10 gigabytes.

0:04:39 | So this is something that we want to reduce if we want to scale to billions of words, as
0:04:43 | is common in many tasks such as voice search, or you could think of any kind of
0:04:48 | large-scale broadcast news task.

0:04:52 | So there are two parts which we worked on in this approach to improving the parallelization and fast
0:04:59 | computation.
0:05:00 | The first one I'm going to talk about is the normalization sum, and then I'll come to the feature
0:05:05 | computation.

0:05:06 | Before we go to the parallelization, we'll talk about efficient computation of these two components.

0:05:12 | First, consider the following case.
0:05:14 | You remember from the previous slide the formulation of the norm sum computation.
0:05:20 | Consider two event histories x1 and x2,
0:05:23 | and if you were to compute a differential of those two histories, the expression would look something like
0:05:28 | this.

0:05:29 | And here the alpha term is just a summation over all of the features that are actually
0:05:34 | active.
0:05:35 | Now think of any infrequent word:
0:05:38 | the only case where this feature is going to really be active is the unigram feature
0:05:43 | case; for most cases you'll find alpha(x1) equals alpha(x2).

0:05:47 | If you could efficiently order all your histories in a manner
0:05:52 | in which you're reduced to small differential computations, then the whole process becomes very efficient.

0:05:59 | And so the key idea here is to sort the corpus
0:06:03 | lexicographically,
0:06:05 | such that consecutive histories differ only in a small number of words.

0:06:10 | And what this does for you is it reduces the number of target words over which you have to sum in
0:06:16 | the final
0:06:17 | normalization sum computation.

0:06:21 | So what is this lexicographic sorting?
0:06:23 | All we're doing is making sure the
0:06:26 | same words in position one occur first, and then the same words in position two occur next,
0:06:32 | and it's sort of like a nested history tree clustering.
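The sorting step itself is trivial; what matters is that consecutive sorted histories share long prefixes, so only a few positions change from one to the next. A small sketch, assuming a history is represented as a tuple of words:

```python
def sort_histories(histories):
    """Lexicographic sort: histories agreeing in position one come first,
    ties are broken by position two, and so on, so consecutive histories
    differ in only a small number of words."""
    return sorted(histories)

def num_changed_positions(h1, h2):
    """How many positions differ between two consecutive histories; the
    differential computation only has to touch the features involving
    these changed positions."""
    shared = 0
    for a, b in zip(h1, h2):
        if a != b:
            break
        shared += 1
    return max(len(h1), len(h2)) - shared
```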

0:06:35 | So you have your bigram, unigram, and empty histories, and you can visualize the whole thing as a prefix
0:06:41 | tree,
0:06:42 | where you have the null history at the root node,
0:06:44 | and then all of the sorted event histories are children of that node.
0:06:50 | Each
0:06:51 | item is an event, and you can tie them all up at the root node.

0:06:55 | And
0:06:56 | when you do this, you're basically assigning, if you will, unique IDs to these event histories,
0:07:03 | where the lower-order ones get lower IDs and the higher-order ones get higher
0:07:07 | IDs.

0:07:08 | So when you do the differential computation that I talked about, you're only going to sum differentials
0:07:14 | over a small number of terms, so your computation cost comes down, and later on we'll talk about the
0:07:19 | memory as well.

0:07:21 | This is basically what I just covered: the differential can be computed efficiently.

0:07:26 | So once you have the event histories sorted, you have one more step, which is to sum over all
0:07:32 | the events.
0:07:33 | Now you can do the same lexicographic sorting at the n-gram event level as well.

0:07:39 | And when you do that, the only two items you keep track of are the alpha value
0:07:44 | and the norm sum value, which is the summation of alpha over all the target symbols you're interested in.

0:07:52 | And when you move from one event history to the next,
0:07:56 | you only want to identify the feature histories that differ,
0:07:59 | that have actually changed, and this comes down to our sorting again, because then you know that all the
0:08:05 | lower-order histories have lower unique IDs, and you really only need to
0:08:09 | difference the higher-order terms.

0:08:11 | And so this differential becomes zero for all the lower-order histories and exists only for the higher-order
0:08:17 | ones.
0:08:17 | So it's just an efficient way of storing your histories and your target symbols,
0:08:22 | such that the denominator term in an exponential model can be computed efficiently.
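The differential update over the sorted histories can be sketched as follows; `contexts(h)` (the feature contexts active for a history, e.g. its prefixes) and `weight(ctx, y)` (a feature weight, 0.0 if absent) are assumed helpers, not the actual API from the talk.

```python
import math

def norm_sums_differential(sorted_histories, contexts, weight, targets):
    """Norm sums Z(x) over lexicographically sorted histories, updated by
    differentials. Because consecutive sorted histories share most contexts,
    only the differing contexts enter each update, instead of recomputing
    every alpha from scratch."""
    zs = {}
    alphas = {y: 0.0 for y in targets}
    prev = None
    for h in sorted_histories:
        cur = set(contexts(h))
        old = set(contexts(prev)) - cur if prev is not None else set()
        new = cur - set(contexts(prev)) if prev is not None else cur
        for y in targets:
            # Add weights for newly active contexts, remove the stale ones.
            alphas[y] += sum(weight(c, y) for c in new)
            alphas[y] -= sum(weight(c, y) for c in old)
        zs[h] = sum(math.exp(alphas[y]) for y in targets)
        prev = h
    return zs
```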

0:08:27 | Now once you do this, it sort of spills over to the expectation computation of the features as well,
0:08:33 | because you can rewrite that numerator term as just this alpha divided by Z,
0:08:39 | and you've computed Z very efficiently so far.

0:08:43 | So consider the case of the numerator term for the expectation computation,
0:08:49 | where you have a sequence of histories d1, d2,
0:08:53 | and you write the whole expression out this way.

0:08:56 | If you have
0:08:57 | an element in the event history which can be shared for the same target y,
0:09:02 | then this expression reduces to the alpha term coming out and just the second term remaining.
0:09:08 | And what this means is this becomes independent of y,
0:09:12 | and you can share it across multiple targets.

0:09:17 | So think of the event histories
0:09:20 | that you handled earlier in the norm sum computation,
0:09:23 | where only a few of the alpha terms change; because of that you got some good
0:09:29 | savings in the computation, and it all spills over to the expectation computation,
0:09:34 | where it just reduces to a multiplication, if you will, of the running total that you kept so far of
0:09:39 | the norm sum with the alpha term, which you also computed only for
0:09:45 | those event histories that changed.

0:09:49 | So that's what we point out here: it becomes a very efficient computation of the numerator term.
0:09:55 | So once we have both the numerator and the denominator, the parameter updates in an exponential language model
0:10:01 | become very efficient,
0:10:03 | and there are well-known algorithms that exist to do this kind of iterative scaling.
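Putting numerator and denominator together, the expected feature counts fed to the update can be sketched as below; `prob` and `active_features` are assumed callables, and in the efficient scheme `prob` would reuse the alphas and norm sums already maintained during the sorted sweep rather than recomputing them.

```python
from collections import defaultdict

def expected_feature_counts(event_histories, targets, prob, active_features):
    """Expected feature counts: for each training history and each target y,
    add p(y | history) to every feature active on (history, y)."""
    expected = defaultdict(float)
    for h in event_histories:
        for y in targets:
            p = prob(y, h)
            for f in active_features(h, y):
                expected[f] += p
    return expected
```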

0:10:10 | At this point I want to talk a little bit about the prior art here.

0:10:13 | This sort of
0:10:15 | lexicographic sorting was introduced by Lafferty in '95,
0:10:18 | and it's one of the bases that we use for this efficient computation of both the norm sum and expected
0:10:25 | feature computations.

0:10:26 | And
0:10:28 | Wu and Khudanpur, in 2002 and 2000,
0:10:31 | used a similar prefix-tree traversal type algorithm,

0:10:36 | but the difference between their work and our work is that they used a propagation, if you will, of
0:10:42 | norm sum terms
0:10:43 | from lower-order to higher-order n-grams,
0:10:45 | and of the expectations from higher-order to lower-order n-grams. So computational-complexity-wise, these two algorithms, the one we
0:10:52 | talked about and the one they proposed, would be similar,
0:10:55 | except that our formulation is much simpler,
0:10:58 | and it relies purely on something as simple as sorting the corpus.

0:11:01 | And we only compute the differentials, and again we don't need to store all of this in memory; we
0:11:06 | only need to
0:11:08 | keep just the differentials in memory, so you don't have to
0:11:11 | store all of the n-gram features.

0:11:14 | And given that this is a simpler formulation,
0:11:16 | you can think of sharing information across multiple sets of n-gram features, like the class prediction model that
0:11:23 | we have; in model M we have two exponential models.
0:11:26 | So these are the sort of key differences from the
0:11:29 | prior art that's been proposed.

0:11:31 | Now, how do you do this practically? So far we talked about efficient computation;
0:11:37 | now we want to ask how you can parallelize.

0:11:40 | The norm sum computation can be parallelized in two ways.
0:11:44 | One way, you take your target vocabulary and split it across machines; the other way, you take your actual
0:11:50 | corpus, and
0:11:51 | each machine computes over all of your target vocabulary but only on a limited portion of the corpus.

0:11:57 | So let's take the first case, where the target vocabulary is split.
0:12:01 | Then the norm sum can be computed by summing the normalization terms
0:12:06 | over each vocabulary partition.

0:12:08 | So what happens here is the expectation computation also gets split perfectly by splitting the target words at each
0:12:15 | machine.
0:12:16 | What you would need to do is manage all these partial normalization terms
0:12:22 | at the end of one computation and broadcast them back,
0:12:25 | so each one of the slave nodes can do the subsequent computation.

0:12:31 | So you would need to merge the norm sums in this case. Now if you go to the case
0:12:34 | where you split the corpus,
0:12:36 | you're splitting parts of the corpus across different machines.
0:12:40 | So while you can do the norm sum computation on a single machine and no merge is required,
0:12:45 | for the expectation parts you would need to make sure that they have the right normalization terms, so either
0:12:50 | way you have to have some sort of communication between slaves and master.
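The vocabulary-split strategy can be sketched as below; the per-target exponentiated scores and the shard layout are assumptions for illustration. Each slave sums over its own vocabulary partition, and the master merges the partial norm sums before broadcasting the total back.

```python
def partial_norm_sum(exp_scores, vocab_shard):
    """What one slave node computes: the normalization terms summed over
    its own partition of the target vocabulary."""
    return sum(exp_scores[y] for y in vocab_shard)

def merged_norm_sum(exp_scores, shards):
    """What the master does: merge the partial norm sums from all nodes;
    the total is then broadcast back so each slave can compute its share
    of the expectations with the correct denominator."""
    return sum(partial_norm_sum(exp_scores, s) for s in shards)
```

In the corpus-split alternative, each node would instead compute the full sum over the whole vocabulary, but only for the histories in its portion of the data, so no norm-sum merge is needed.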

0:12:54 | But now you have a combination of the
0:12:56 | prior efficient computation and the parallelization we talked about,
0:13:00 | so together they seem to work very well in reducing the memory and the computational time.

0:13:07 | So in this table we show that if you were to do a direct implementation, where you take no
0:13:12 | shortcuts,
0:13:13 | the computation time in seconds for a typical Hub4 corpus with 1.5 million words is going
0:13:19 | to be of the order of 4000 seconds.

0:13:21 | Unigram caching is something that everybody does, so that significantly brings the time down, and we can reduce that
0:13:27 | even further with our proposed approach, which comes down to just
0:13:31 | seconds.

0:13:32 | So this is a neat way to do the exponential model computation.

0:13:38 | The results I'll show you now are on the English Hub4 system.
0:13:45 | The training part is about 300 million words, with an 80K vocabulary,

0:13:49 | and the numbers of features that are alive for the class portion of the exponential model
0:13:53 | and for the word portion of the model are roughly 238
0:13:57 | million and 276 million.

0:13:59 | If you take the comparison with the Arabic system, on which we also have results, that
0:14:04 | is 1.6 billion words,
0:14:19 | and you see how many more histories are alive for the class and the
0:14:24 | word-based models.

0:14:26 | So here we give you an idea of the memory required during training.
0:14:31 | You see all the four models, the class and the word models, for English as well as Arabic.

0:14:37 | The x-axis shows how much memory per node is consumed,
0:14:41 | and the y-axis is basically what happens as the number of nodes
0:14:45 | increases, as you parallelize further.

0:14:47 | So, as expected, you see a nice drop in the amount of memory consumed.

0:14:52 | And what this means is you can train something with billions of words
0:14:56 | in
0:14:56 | a day, as opposed to earlier where it was taking weeks to do this.

0:15:02 | These are just the memory numbers; if you talk about training time, the computational time is also significantly
0:15:08 | reduced.

0:15:09 | You go down from a single-machine case of something like 250 hours
0:15:13 | down to, when you parallelize it with just 20 machines, about 50 hours. So that's
0:15:18 | a nice improvement we get in training time as well as in memory.

0:15:24 | So there is also related work on parallelization, and there are many efforts being pursued
0:15:29 | in the literature.
0:15:30 | Originally, in '96, it was introduced as a theoretical result on how you can parallelize
0:15:36 | n-grams;

0:15:37 | distributed language model computation has since then been pursued, for example by
0:15:43 | Brants et al.
0:15:44 | from 2007,

0:15:46 | and further authors have investigated how you can do maximum-entropy-based models in a distributed environment as
0:15:52 | well.

0:15:53 | So this work is along the lines of these two, and we've shown how you can efficiently implement this
0:15:59 | computation.

0:16:01 | Just some notes on computational cost.
0:16:03 | We've used unnormalized iterative scaling here, and you'll see details in the 2009 paper,

0:16:09 | but there are other choices you could use, such as conjugate gradient, L-BFGS, and
0:16:14 | Rprop. We didn't compare exactly how our parameter estimation fares against these, but you could
0:16:20 | view any one of these as additional improvements that you could potentially get.

0:16:25 | Now, the timing results are also based on a strict convergence criterion;
0:16:30 | we used a very tight convergence criterion, and you could probably relax this a little depending on the task and
0:16:35 | get further improvements.

0:16:37 | The computational cost is a function of both the number of unique features and the number of unique histories.

0:16:42 | And in a typical case we process about 1.2 million unique n-grams per second with five compute
0:16:48 | nodes,
0:16:48 | but this can vary drastically across tasks, whether it's a broadcast news task or a conversational task and so on.

0:16:55 | So again, the relative gain you get will depend on the task,
0:16:58 | but basically, if you don't do this, the
0:17:01 | possibility of scaling to a large corpus becomes almost
0:17:05 | infeasible.

0:17:06 | So to conclude, we have an efficient training algorithm for exponential n-gram models, which we believe we can extend to
0:17:13 | other types of features thrown in within the model M framework.

0:17:16 | And in addition, on top of this, if you were to parallelize it, either over target vocabulary subsets
0:17:23 | or over data subsets,
0:17:25 | then it can lead to efficient use of memory as well as better computational cost.

0:17:30 | And if you were to just split across data, you can increase that a lot more than you can
0:17:34 | if you were to just
0:17:36 | split the target vocabulary.

0:17:38 | And that's all.

0:17:46 | Questions? Any questions from the audience?

0:17:58 | A simple question,
0:17:59 | [largely inaudible question about whether pruning was applied to the model]

0:18:11 | Yes, the results I presented here were without any pruning,

0:18:15 | but we do have numbers with
0:18:17 | pruning, feature pruning, and we find that we
0:18:22 | absolutely don't take much of a
0:18:24 | hit unless you go to pruning the features down to like the low fifties.

0:18:29 | [partly inaudible follow-up]

0:18:32 | Well, that's another paper, but we have tried a variety of pruning methods,
0:18:39 | some of them simply based on a count threshold,
0:18:45 | some of them basically extending n-gram pruning methodology to the exponential n-gram framework of model M, and they all seem to work
0:18:52 | in a similar way.

0:18:53 | So we have another forthcoming paper.

0:19:01 | [inaudible]