Can you hear me? All right, so I'll try to explain to you what we did. This is the outline of the talk: I'll begin with an introduction and a motivation for Model M as a language model, then move on to the two main components which make the computation of Model M extremely computationally expensive and memory hard, and then go on to some of the ideas we had for distributed training and how much gain we got in memory and speed by using these ideas.

Just to give a very brief overview, everyone knows what exponential n-gram models are, and their power and ability to capture really complex dependencies. They were first introduced back in '96 and again in 2002. The definition goes as follows: given a set of target symbols Y and inputs X, you can formulate the exponential language model so that you have a numerator term, which is a summation over all features that are active, and a denominator term, which we refer to as the normalization sum and which involves an additional summation over all the target outputs. Here x is the history and y is the target, and you can imagine how extremely computationally intensive this is going to be at a large scale. For the typical exponential n-gram model where n equals three, you can think of three history orders: the null (empty) history, the unigram history, and the bigram history.

Parameter estimation in an exponential model has many possible objective functions, which are variations of regularizations on the basic objective function; the one we use is formulated with an l1 and an l2 regularization component, for some choice of alpha and sigma that are determined empirically. To train the parameters there is a whole class of iterative update algorithms; basically you compare the expected feature counts to the observed counts and update the parameters accordingly. There you see the expression for the expected feature count, with a summation over the training data D, which can be represented as a sequence of history and target pairs. If you look at the reference I have there, from Malouf in 2002, you'll find an exhaustive comparison of various iterative update algorithms and their advantages and disadvantages. What we use internally is a generalized iterative scaling algorithm we've implemented.

So you heard about Model M from our previous speaker; it was introduced in 2009 by Chen. It basically has the following form: think of it as a combination of two exponential n-gram models. What we've seen is that it consistently outperforms other language models and gives us some of the best results; we've been seeing about five percent relative gains across a wide range of domains and tasks, and on large corpora as well. Let's look at this particular formulation, along the lines of the exponential n-gram model but with n equal to three. The key thing to note is that you have a class model and a word model, both of which can be estimated independently. The class model has a history dependence on the previous classes as well as the words, while the word model has this little term c_j sitting here, which is the predicted class of the current word w_j that you're trying to predict; a schematic of this factorization is given below.
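To make the formulas behind this concrete, here is a rough LaTeX rendering of the exponential language model and the Model M factorization just described; the notation is illustrative, and the trigram form shown is only a schematic of what is on the slide.

```latex
% Exponential n-gram model: x is the history, y the target word
\[
  P(y \mid x) \;=\; \frac{\exp\bigl(\sum_i \lambda_i f_i(x,y)\bigr)}{Z(x)},
  \qquad
  Z(x) \;=\; \sum_{y'} \exp\bigl(\sum_i \lambda_i f_i(x,y')\bigr)
\]

% Model M, trigram case (schematic): a class model times a word model,
% each of which is itself an exponential n-gram model
\[
  P(w_j \mid w_{j-2}, w_{j-1}) \;=\;
    P_{\mathrm{class}}\bigl(c_j \mid c_{j-2}, c_{j-1}, w_{j-2}, w_{j-1}\bigr)\,
    P_{\mathrm{word}}\bigl(w_j \mid w_{j-2}, w_{j-1}, c_j\bigr)
\]
```

Both factors on the right are exponential n-gram models over their own feature sets, which is why they can be estimated separately.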
The two models can be trained independently of each other, and we've shown that Model M gives significant performance improvements, but we also know that it comes at a steep computational cost: typically something like thirty hours of computation is required for a training set of a hundred million words, and it consumes on the order of ten gigabytes of memory. So this is something we want to reduce if we want to scale to billions of words, as is common in many tasks such as voice search or any kind of large-scale broadcast news task.

There are two parts we worked on in this whole approach to improving parameter estimation and making the computation fast: the first one I want to talk about is the normalization sum, and then I'll come to the feature expectation computation. Before we go to the parallelization, let's talk about efficient computation of these two components.

First, recall the formulation of the norm sum computation from the previous slide, and consider two event histories x1 and x2. If you were to compute the differential between those two histories, the expression would look something like this, where the alpha term is just the summation over all of the features that are actually active. Now think of an infrequent word: the only feature that is going to be active for it is the unigram feature, so in most cases you'll find that alpha for x1 equals alpha for x2. If you could efficiently order all your histories in a manner that reduces everything to small differential computations, the whole process becomes very efficient.

So the key idea here is to sort the corpus lexicographically, such that consecutive histories differ only in a small number of words. What this does for you is reduce the number of target words over which you have to sum in the final normalization sum computation. What is this lexicographic sorting? All we're doing is making sure that events with the same word in position one occur first, then events with the same word in position two come next, and so on; it's sort of a nested history tree clustering. You have your bigram, unigram, and empty histories, and you can visualize the whole thing as a prefix tree, with the null history at the root node and all of the sorted event histories as children of that node; each leaf is a new event, and they are all tied together at the root. When you do this, you're basically assigning unique IDs to the event histories: the lower-order histories get lower IDs and the higher-order ones get higher IDs. So when you do the differential computation I talked about, you only sum differentials over a small number of terms, so your computation cost comes down, and later we'll talk about the memory as well.

This is basically what I just covered: the differential can be computed efficiently. Once you have the event histories sorted, there is one more step, which is to sum over all the events. You can do the same lexicographic sorting at the event level as well, and when you do that, the only two items you keep track of are the alpha value and the norm sum value, which is the summation of alpha over all the target symbols you're interested in. When you move from one event history to the next, you only need to identify the feature histories that have actually changed; a small sketch of this sorted, differential bookkeeping is given below.
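As a rough illustration of the idea (a minimal sketch under assumed feature conventions, not the actual implementation; names such as lambdas, vocab, and norm_sums are made up for the example), the following Python snippet sorts the histories lexicographically and maintains the normalization sum by recomputing only the alpha terms whose higher-order features change between consecutive histories:

```python
import math
from collections import defaultdict

vocab = ["the", "cat", "sat", "on", "mat"]
lambdas = {                      # weight of each n-gram feature (tuple ends in the target)
    ("cat",): 0.5,               # unigram feature
    ("the", "cat"): 0.3,         # bigram feature: history suffix ("the",), target "cat"
    ("cat", "sat"): 0.2,
}

def suffixes(history):
    """Non-empty suffixes of a history, shortest first."""
    return [tuple(history[-k:]) for k in range(1, len(history) + 1)]

# For each history suffix, which targets have a non-zero higher-order feature.
targets_with_feature = defaultdict(set)
for feat in lambdas:
    if len(feat) > 1:
        targets_with_feature[feat[:-1]].add(feat[-1])

def log_alpha(history, y):
    """Sum of the weights of all features active for (history, y)."""
    return lambdas.get((y,), 0.0) + sum(lambdas.get(s + (y,), 0.0) for s in suffixes(history))

def norm_sums(histories):
    """Z(x) for every distinct history, visited in lexicographic order so that
    only the alpha terms whose higher-order features change are recomputed."""
    alphas = {y: math.exp(lambdas.get((y,), 0.0)) for y in vocab}  # unigram-only start
    z_total = sum(alphas.values())
    z, prev = {}, ()
    for h in sorted(set(histories)):          # lexicographic sort of the corpus histories
        changed = set()
        for s in suffixes(prev) + suffixes(h):
            changed |= targets_with_feature.get(s, set())
        for y in changed:                     # typically a small set of targets
            z_total -= alphas[y]
            alphas[y] = math.exp(log_alpha(h, y))
            z_total += alphas[y]
        z[h], prev = z_total, h
    return z

print(norm_sums([("the", "cat"), ("the", "mat"), ("a", "cat")]))
```

The point of the sorted order is that consecutive histories share most of their suffixes, so the changed set stays small and most of the norm sum simply carries over from one history to the next.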
This comes down to the sorting again: because the lower-order histories have the lower unique IDs, you really only have to take differentials on the higher-order terms, so the differential becomes zero for all the lower-order histories and exists only for the higher-order ones. It's simply an efficient way of storing your histories and your target symbols such that the denominator term in an exponential model can be computed efficiently.

Once you do this, it spills over to the expectation computation of the features as well, because the numerator term can be written as just alpha divided by Z, and you want to compute Z very efficiently. If you consider the numerator term in the expectation computation, where you have a sequence of histories d1, d2, and so on, and you write the whole expression out, then whenever there is an element of the event history that can be shared for the same target y, the expression reduces so that the alpha term factors out and only the second term remains; what this means is that it becomes independent of y and you can share it across multiple targets. So think of the event histories you already ordered for the norm sum computation, such that only a few of the alpha terms change: because of that you get a good saving in the computation, and it all spills over to the expectation computation, which reduces to a multiplication of the running total you kept for the norm sum with the alpha terms, which again you compute only for those histories that changed. That's what we point out here: it becomes a very efficient computation of the numerator term (a small sketch of this follows after the discussion of prior art below). Once we have both the numerator and the denominator, the parameter updates in an exponential language model become very efficient, and well-known iterative scaling algorithms exist to do this.

At this point I want to talk a little bit about the prior art. This sort of lexicographic sorting was introduced by Lafferty in '95, and it's one of the bases we use for the efficient computation of both the norm sums and the expected feature counts. Wu and Khudanpur, in 2000 and 2002, used a similar prefix tree traversal type of algorithm, but the difference between their work and ours is that they used a propagation, if you will, of norm sum terms from lower-order to higher-order n-grams and of expectations from higher-order to lower-order n-grams. Computational-complexity-wise, the two algorithms would be similar, except that our formulation is much simpler: it relies purely on something as simple as sorting the corpus, and we only compute the differentials. We also don't need to store all of this in memory; we only need to keep the differentials, so you don't have to store all of the n-gram features. And given that this is a simpler formulation, you can think of sharing information across multiple sets of n-gram features, like the class prediction model; remember we have two exponential models in Model M. So those are the key differences from the prior art that's been proposed.
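Continuing the sketch from before (again only an illustration, reusing the assumed helpers norm_sums, log_alpha, suffixes, lambdas, and vocab from the previous snippet), the expected feature counts can reuse the same sorted pass, since each probability is just alpha divided by the norm sum:

```python
import math
from collections import Counter, defaultdict

def expected_counts(history_sequence):
    """Expected feature counts over the training data D (a sequence of histories).
    P(y | x) = alpha(x, y) / Z(x); the sorted traversal that produced Z(x) also
    supplies the numerator terms."""
    counts = Counter(history_sequence)           # how often each history occurs in D
    z = norm_sums(history_sequence)              # denominators, one sorted pass
    expect = defaultdict(float)
    for h, n in sorted(counts.items()):
        for y in vocab:                          # shown exhaustively for clarity
            p = n * math.exp(log_alpha(h, y)) / z[h]
            expect[(y,)] += p                    # the unigram feature fires for every y
            for s in suffixes(h):
                if lambdas.get(s + (y,), 0.0) != 0.0:
                    expect[s + (y,)] += p        # higher-order features active for h
    return expect

print(expected_counts([("the", "cat"), ("the", "cat"), ("a", "cat")]))
```

In the real scheme described in the talk, the inner loop would not visit the whole vocabulary: only the targets whose alpha terms changed between consecutive sorted histories need fresh work, and the shared part is just the running total multiplied through.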
Now, how do you do this practically? So far we've talked about efficient computation; now let's look at how the norm sum computation can be parallelized. You can do it in two ways. One way, you take your target vocabulary and split it across machines. The other way, you split your corpus: each machine computes over the full target vocabulary, but only on a limited portion of the corpus.

Let's take the first case, where the target vocabulary is split. Then the norm sum can be computed by summing the partial normalization terms over each vocabulary partition. What happens here is that the expectation computation also gets split perfectly by splitting the target words across the machines. What you need to do is manage all these partial normalization terms at the end of one computation and broadcast them back, so each of the slave nodes can do the subsequent computation; so you need to merge the norm sums in this case (a small sketch of this scheme is given at the end of this section). Now if you go to the case where you split the corpus across different machines, the norm sum computation can be done on a single machine and no merge is required, but for the expectation part you need to make sure the slaves have the right normalization terms, so in that case you have some communication between the slaves and the master.

So now you have a combination of the prior art, the efficient computation, and the parallelization we talked about, and together they work very well in reducing the memory and the computational time. In this table we show that if you were to do a direct implementation with no shortcuts, the computation time for a typical Hub4 corpus with 1.5 million words is going to be on the order of four thousand seconds. Unigram caching is something that everybody does, and that brings the time down significantly, and we can reduce it even further with our proposed approach, which comes down to just seconds. So this is a neat way to do the exponential model computation.

The results I'm showing you now are on the English Hub4 system. The training set is about 300 million words with an 80K vocabulary, and the numbers of features that are active for the class portion and the word portion of the exponential model are roughly 238 million and 276 million. If you compare with the Arabic system, which we also have results on, that one is 1.6 billion words, and you can see how many more histories are active for the class and the word models.

Here we give you an idea of the memory required during training. You see all four models, the class and word models for English as well as for Arabic, and the plot shows the memory per node as the number of nodes increases, that is, as you parallelize further. As expected, you see a nice drop in the amount of memory consumed, and what this means is that you can train on billions of words in a day, as opposed to earlier, when it was taking weeks. Those were just the memory numbers; the training time, or computational time, is also significantly reduced: you go down from a single-machine case of something like two hundred and fifty hours to, when you parallelize over just twenty machines, about fifty hours. So that's a nice improvement we get in training time as well as in memory.
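As a very rough sketch of the vocabulary-split strategy (assumed names and a deliberately tiny feature set, not the actual implementation), each worker computes partial norm sums over its slice of the target vocabulary, and the master merges them before broadcasting the totals back:

```python
import math
from concurrent.futures import ProcessPoolExecutor

def partial_norm_sums(args):
    """Partial Z_k(x) over one vocabulary slice; only unigram and full-history
    features are shown here for brevity."""
    histories, vocab_slice, lambdas = args
    partial = {}
    for h in histories:
        z = 0.0
        for y in vocab_slice:
            z += math.exp(lambdas.get((y,), 0.0) + lambdas.get(h + (y,), 0.0))
        partial[h] = z
    return partial

def norm_sums_vocab_split(histories, vocab, lambdas, n_workers=4):
    slices = [vocab[i::n_workers] for i in range(n_workers)]       # split the target words
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(partial_norm_sums,
                            [(histories, s, lambdas) for s in slices])
    merged = {h: 0.0 for h in histories}
    for part in partials:                       # master merges the partial Z_k(x) ...
        for h, z in part.items():
            merged[h] += z
    return merged                               # ... and broadcasts Z(x) back to the workers

if __name__ == "__main__":
    vocab = ["the", "cat", "sat", "on", "mat"]
    lambdas = {("cat",): 0.5, ("the", "cat"): 0.3}
    print(norm_sums_vocab_split([("the",), ("sat",)], vocab, lambdas))
```

The corpus-split variant is the mirror image: each worker would handle the full vocabulary but only its portion of the histories, which is why it can be scaled out further, as noted in the conclusion.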
There is also related work on parallelization, and many such efforts have been pursued in the literature. Originally, back in '96, there was a theoretical treatment of how you can parallelize this kind of n-gram computation, and distributed language model computation has been pursued since then; Brants and colleagues in 2007, for example, and others have investigated distributed n-gram models, and there has been work on maximum entropy based models in a distributed environment as well. So this work is along the lines of those, and we've shown how you can implement this computation efficiently.

Just some notes on computational cost. We've used our iterative scaling implementation here, and you can see the details in the 2009 paper, but there are other choices such as conjugate gradient, L-BFGS, and Rprop; we didn't compare exactly how our parameter estimation fares against those, but you could see one of them as an additional improvement you could potentially get. The timing results are also based on a strict convergence criterion; we used a very tight one, and you could probably relax it a little, depending on the task, and get further improvement. The computational cost is a function of both the number of unique features and the number of unique histories; in a typical case we process about 1.2 million unique n-grams per second with five compute nodes, but this can vary drastically depending on whether it's a broadcast news task or a conversational task and so on. So again, the relative gain you get will depend on the task, but basically, if you don't do this, scaling to a large corpus becomes almost infeasible.

To conclude, we have an efficient training algorithm for exponential n-gram models, which we believe we can extend to other types of features beyond the Model M framework. On top of this, if you parallelize it, either over subsets of the target vocabulary or over subsets of the data, you get efficient use of memory as well as better computational cost; and if you split across the data, you can scale out a lot more than you can if you only split the target vocabulary. That's all; I'll take questions.

Question: A simple question, not quite related to this work: did you use any pruning for the model? Answer: Yes, the results I presented here were without any pruning, but we do have numbers with feature pruning, and we find that we take very little hit unless you prune the features very aggressively. That's another paper, but we have tried a variety of pruning methods, some of them simply based on a count threshold, and some of them extending n-gram pruning methodology to the exponential n-gram framework of Model M, and they all seem to work in a similar way. So we have another forthcoming paper for that portion.