0:00:15 | Can you hear me?

0:00:18 | [inaudible opening remarks]

0:00:40 | So this is the outline of the talk. I'll begin with an introduction and the motivation for why
0:00:45 | model M is relevant at IBM,

0:00:47 | and then move on to the two main components which make the computation of model M extremely computationally
0:00:55 | expensive and memory-hard,

0:00:57 | and then go on to some of the ideas we had for distributed training, and how much gain we got in
0:01:02 | memory and
0:01:03 | speed by using these ideas.

0:01:06 | So just to give a very brief overview: everyone knows what exponential n-gram models are, and their power
0:01:13 | and ability to capture really complex dependencies.

0:01:17 | These were first introduced back in '96 and subsequently again in 2002.

0:01:23 | So the definition goes as follows:
0:01:25 | given a set of target symbols Y and an input X,
0:01:29 | you can formulate the exponential language model in the following way,

0:01:34 | where you see a numerator term, which is the summation over all features which are active, and a denominator
0:01:40 | term, which we refer to as the normalization sum computation,
0:01:44 | which involves an additional summation over all the target outputs.
0:01:48 | So X is the history and Y is the target.

0:01:51 | And now you can imagine at a large scale how extremely computationally intensive this is going to be.
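To make the formulation concrete, here is a minimal Python sketch of the probability computation just described; the feature function, weight table, and target set are hypothetical stand-ins for illustration, not the actual implementation discussed in the talk.

```python
import math

def exp_model_prob(y, x, active_features, weights, targets):
    """p(y | x) for an exponential n-gram model: a numerator summing the
    weights of the active features, divided by the normalization sum Z(x),
    which requires an extra summation over every target symbol."""
    def score(target):
        # Sum the weights of all features active for (history x, target).
        return sum(weights.get(f, 0.0) for f in active_features(x, target))
    z = sum(math.exp(score(t)) for t in targets)  # the expensive norm sum
    return math.exp(score(y)) / z
```

The denominator loop over `targets` is what the talk calls the normalization sum; with a large vocabulary it dominates the cost.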

0:01:57 | So for the typical exponential n-gram model where n equals 3, you can think of three history trees: the null
0:02:03 | tree, the unigram tree, and the bigram tree.

0:02:06 | And parameter estimation in an exponential model basically has many objective functions, variations of regularization on the
0:02:15 | basic objective function,
0:02:16 | and the one that we use we formulated in the following form,

0:02:21 | with an L1 and L2 regularization component, for some choice of alpha and sigma which are
0:02:27 | determined empirically.
0:02:31 | For training these parameters there is a whole class of iterative update algorithms,

0:02:36 | and basically it's a comparison of the expected feature counts to the observed counts,
0:02:41 | and you update the parameters accordingly. So there you see the expression for the expected feature count,
0:02:47 | and you see a summation term over your training data D,
0:02:52 | which can be represented as a sequence of histories and target pairs.

0:02:57 | If you look at the reference that I have there, from Malouf in 2002,
0:03:01 | you'll find an exhaustive comparison of various iterative update algorithms and their advantages and disadvantages and so on.
0:03:08 | What we use internally is a generalized iterative scaling algorithm we've implemented.
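The comparison of expected to observed counts can be sketched as a simplified iterative-scaling-style update; the exact generalized/unnormalized variant used internally differs, and the damping factor here is an assumption for illustration.

```python
import math

def scaling_update(weights, observed, expected, damping=1.0):
    """One simplified iterative-scaling-style step: nudge each feature weight
    by the (damped) log-ratio of observed to expected feature count, so
    features whose expected count is too low get their weight increased."""
    updated = {}
    for f, w in weights.items():
        if observed.get(f, 0.0) > 0 and expected.get(f, 0.0) > 0:
            updated[f] = w + damping * math.log(observed[f] / expected[f])
        else:
            updated[f] = w  # leave untouched when counts are unavailable
    return updated
```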

0:03:15 | So you heard about model M from our previous speaker;
0:03:17 | it was introduced in 2009 by Chen.

0:03:21 | It basically has the following form; think of it as a combination of two exponential n-gram models.

0:03:28 | And what we've seen is it consistently outperforms any other language model and gives us one of the best
0:03:34 | results; we've been seeing 5% relative gains across
0:03:37 | a wide range of domains and tasks,
0:03:39 | and
0:03:40 | large corpora as well.

0:03:43 | Let's talk about model M in this particular formulation, where you have a trigram context
0:03:49 | along the lines of the exponential n-gram model, with n equal to 3.

0:03:53 | So the key thing to note here is you have a class model and a word model, both of which
0:03:57 | can be estimated independently.
0:03:59 | The class model has a history dependence on the previous classes
0:04:03 | as well as the words;

0:04:05 | the word model has this little term c_j sitting here,
0:04:09 | which is the predicted class of the word,
0:04:11 | of the current word w_j that you're trying to predict.
0:04:15 | So the two models can be trained independently of each other.
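The decomposition just described can be sketched as the product of two models; `class_model` and `word_model` are assumed callables `p(target | context)` standing in for the two independently trained exponential n-gram models.

```python
def model_m_prob(word, word_class, history_words, history_classes,
                 class_model, word_model):
    """Model M, sketched: a class model predicts c_j from the previous
    classes and words, and a word model predicts w_j from the previous
    words plus the predicted class c_j; the word probability is their
    product. Both components are exponential n-gram models."""
    p_class = class_model(word_class, tuple(history_classes) + tuple(history_words))
    p_word = word_model(word, tuple(history_words) + (word_class,))
    return p_class * p_word
```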

0:04:18 | And we know that model M, as we've shown, has significant
0:04:23 | performance improvements,
0:04:24 | but we know it comes at a steep computational cost.

0:04:27 | So typically something like 30 hours of computation is required for a typical training dataset of 100 million
0:04:35 | words,
0:04:36 | and it consumes memory of around 10 gigabytes.

0:04:39 | So this is something that we want to reduce if we want to scale to billions of words, as
0:04:43 | is common in many tasks such as voice search, or you could think of any kind of
0:04:48 | large-scale broadcast news task.

0:04:52 | So there are two parts which we worked on in this approach to improving the parallelization and fast
0:04:59 | computation.
0:05:00 | The first one I'm going to talk about is the normalization sum, and then I'll come to the feature
0:05:05 | computation.

0:05:06 | Before we go to the parallelization, we'll talk about efficient computation of these two components.

0:05:12 | First, consider the following case.
0:05:14 | You remember from the previous slide the formulation of the norm sum computation.
0:05:20 | Consider two event histories x1 and x2,
0:05:23 | and if you were to compute a differential of those two histories, the expression would look something like
0:05:28 | this.

0:05:29 | And here the alpha term is just a summation over all of the features that are actually
0:05:34 | active.
0:05:35 | Now think of any infrequent word:
0:05:38 | the only case where this feature is going to really be active is the unigram feature
0:05:43 | case; for most cases you'll find alpha(x1) equals alpha(x2).

0:05:47 | If you could efficiently order all your histories in a manner
0:05:52 | in which you're reduced to small differential computations, then the whole process becomes very efficient.

0:05:59 | And so the key idea here is to sort the corpus
0:06:03 | lexicographically,
0:06:05 | such that consecutive histories differ only in a small number of words.

0:06:10 | And what this does for you is it reduces the number of target words over which you have to sum in
0:06:16 | the final
0:06:17 | normalization sum computation.

0:06:21 | So what is this lexicographic sorting?
0:06:23 | All we're doing is making sure the
0:06:26 | same words in position one occur first, and then the same words in position two occur next,
0:06:32 | and it's sort of like a nested history tree clustering.
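The sorting step itself is trivial; what matters is that consecutive sorted histories share long prefixes, so only a few positions change from one to the next. A small sketch, assuming a history is represented as a tuple of words:

```python
def sort_histories(histories):
    """Lexicographic sort: histories agreeing in position one come first,
    ties are broken by position two, and so on, so consecutive histories
    differ in only a small number of words."""
    return sorted(histories)

def num_changed_positions(h1, h2):
    """How many positions differ between two consecutive histories; the
    differential computation only has to touch the features involving
    these changed positions."""
    shared = 0
    for a, b in zip(h1, h2):
        if a != b:
            break
        shared += 1
    return max(len(h1), len(h2)) - shared
```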

0:06:35 | So you have your bigram, unigram, and empty histories, and you can visualize the whole thing as a prefix
0:06:41 | tree,
0:06:42 | where you have the null history at the root node,
0:06:44 | and then all of the sorted event histories are children of that node.
0:06:50 | Each
0:06:51 | item is an event, and you can tie them all up at the root node.

0:06:55 | And
0:06:56 | when you do this, you're basically assigning, if you will, unique IDs to these event histories,
0:07:03 | where the lower-order ones get lower IDs and the higher-order ones get higher
0:07:07 | IDs.

0:07:08 | So when you do the differential computation that I talked about, you're only going to sum differentials
0:07:14 | over a small number of terms, so your computation cost comes down, and later on we'll talk about the
0:07:19 | memory as well.

0:07:21 | This is basically what I just covered: the differential can be computed efficiently.

0:07:26 | So once you have the event histories sorted, you have one more step, which is to sum over all
0:07:32 | the events.
0:07:33 | Now you can do the same lexicographic sorting at the n-gram event level as well.

0:07:39 | And when you do that, the only two items you keep track of are the alpha value
0:07:44 | and the norm sum value, which is the summation of alpha over all the target symbols you're interested in.

0:07:52 | And when you move from one event history to the next,
0:07:56 | you only want to identify the feature histories that differ,
0:07:59 | that have actually changed, and this comes down to our sorting again, because then you know that all the
0:08:05 | lower-order histories have lower unique IDs, and you really only need to
0:08:09 | difference the higher-order terms.

0:08:11 | And so this differential becomes zero for all the lower-order histories and exists only for the higher-order
0:08:17 | ones.
0:08:17 | So it's just an efficient way of storing your histories and your target symbols,
0:08:22 | such that the denominator term in an exponential model can be computed efficiently.
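The differential update over the sorted histories can be sketched as follows; `contexts(h)` (the feature contexts active for a history, e.g. its prefixes) and `weight(ctx, y)` (a feature weight, 0.0 if absent) are assumed helpers, not the actual API from the talk.

```python
import math

def norm_sums_differential(sorted_histories, contexts, weight, targets):
    """Norm sums Z(x) over lexicographically sorted histories, updated by
    differentials. Because consecutive sorted histories share most contexts,
    only the differing contexts enter each update, instead of recomputing
    every alpha from scratch."""
    zs = {}
    alphas = {y: 0.0 for y in targets}
    prev = None
    for h in sorted_histories:
        cur = set(contexts(h))
        old = set(contexts(prev)) - cur if prev is not None else set()
        new = cur - set(contexts(prev)) if prev is not None else cur
        for y in targets:
            # Add weights for newly active contexts, remove the stale ones.
            alphas[y] += sum(weight(c, y) for c in new)
            alphas[y] -= sum(weight(c, y) for c in old)
        zs[h] = sum(math.exp(alphas[y]) for y in targets)
        prev = h
    return zs
```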

0:08:27 | Now once you do this, it sort of spills over to the expectation computation of the features as well,
0:08:33 | because you can rewrite that numerator term as just this alpha divided by Z,
0:08:39 | and you've computed Z very efficiently so far.

0:08:43 | So consider the case of the numerator term for the expectation computation,
0:08:49 | where you have a sequence of histories d1, d2,
0:08:53 | and you write the whole expression out this way.

0:08:56 | If you have
0:08:57 | an element in the event history which can be shared for the same target y,
0:09:02 | then this expression reduces to the alpha term coming out and just the second term remaining.
0:09:08 | And what this means is this becomes independent of y,
0:09:12 | and you can share it across multiple targets.

0:09:17 | So think of the event histories
0:09:20 | that you handled earlier in the norm sum computation,
0:09:23 | where only a few of the alpha terms change; because of that you got some good
0:09:29 | savings in the computation, and it all spills over to the expectation computation,
0:09:34 | where it just reduces to a multiplication, if you will, of the running total that you kept so far of
0:09:39 | the norm sum with the alpha term, which you also computed only for
0:09:45 | those event histories that changed.

0:09:49 | So that's what we point out here: it becomes a very efficient computation of the numerator term.
0:09:55 | So once we have both the numerator and the denominator, the parameter updates in an exponential language model
0:10:01 | become very efficient,
0:10:03 | and there are well-known algorithms that exist to do this kind of iterative scaling.
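Putting numerator and denominator together, the expected feature counts fed to the update can be sketched as below; `prob` and `active_features` are assumed callables, and in the efficient scheme `prob` would reuse the alphas and norm sums already maintained during the sorted sweep rather than recomputing them.

```python
from collections import defaultdict

def expected_feature_counts(event_histories, targets, prob, active_features):
    """Expected feature counts: for each training history and each target y,
    add p(y | history) to every feature active on (history, y)."""
    expected = defaultdict(float)
    for h in event_histories:
        for y in targets:
            p = prob(y, h)
            for f in active_features(h, y):
                expected[f] += p
    return expected
```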

0:10:10 | At this point I want to talk a little bit about the prior art here.

0:10:13 | This sort of
0:10:15 | lexicographic sorting was introduced by Lafferty in '95,
0:10:18 | and it's one of the bases that we use for this efficient computation of both the norm sum and expected
0:10:25 | feature computations.

0:10:26 | And
0:10:28 | Wu and Khudanpur, in 2002 and 2000,
0:10:31 | used a similar prefix-tree traversal type algorithm,

0:10:36 | but the difference between their work and our work is that they used a propagation, if you will, of
0:10:42 | norm sum terms
0:10:43 | from lower-order to higher-order n-grams,
0:10:45 | and of the expectations from higher-order to lower-order n-grams. So computational-complexity-wise, these two algorithms, the one we
0:10:52 | talked about and the one they proposed, would be similar,
0:10:55 | except that our formulation is much simpler,
0:10:58 | and it relies purely on something as simple as sorting the corpus.

0:11:01 | And we only compute the differentials, and again we don't need to store all of this in memory; we
0:11:06 | only need to
0:11:08 | keep just the differentials in memory, so you don't have to
0:11:11 | store all of the n-gram features.

0:11:14 | And given that this is a simpler formulation,
0:11:16 | you can think of sharing information across multiple sets of n-gram features, like the class prediction model that
0:11:23 | we have; in model M we have two exponential models.
0:11:26 | So these are the sort of key differences from the
0:11:29 | prior art that's been proposed.

0:11:31 | Now, how do you do this practically? So far we talked about efficient computation;
0:11:37 | now we want to ask how you can parallelize.

0:11:40 | The norm sum computation can be parallelized in two ways.
0:11:44 | One way, you take your target vocabulary and split it across machines; the other way, you take your actual
0:11:50 | corpus, and
0:11:51 | each machine computes over all of your target vocabulary but only on a limited portion of the corpus.

0:11:57 | So let's take the first case, where the target vocabulary is split.
0:12:01 | Then the norm sum can be computed by summing the normalization terms
0:12:06 | over each vocabulary partition.

0:12:08 | So what happens here is the expectation computation also gets split perfectly by splitting the target words at each
0:12:15 | machine.
0:12:16 | What you would need to do is manage all these partial normalization terms
0:12:22 | at the end of one computation and broadcast them back,
0:12:25 | so each one of the slave nodes can do the subsequent computation.

0:12:31 | So you would need to merge the norm sums in this case. Now if you go to the case
0:12:34 | where you split the corpus,
0:12:36 | you're splitting parts of the corpus across different machines.
0:12:40 | So while you can do the norm sum computation on a single machine and no merge is required,
0:12:45 | for the expectation parts you would need to make sure that they have the right normalization terms, so either
0:12:50 | way you have to have some sort of communication between slaves and master.
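The vocabulary-split strategy can be sketched as below; the per-target exponentiated scores and the shard layout are assumptions for illustration. Each slave sums over its own vocabulary partition, and the master merges the partial norm sums before broadcasting the total back.

```python
def partial_norm_sum(exp_scores, vocab_shard):
    """What one slave node computes: the normalization terms summed over
    its own partition of the target vocabulary."""
    return sum(exp_scores[y] for y in vocab_shard)

def merged_norm_sum(exp_scores, shards):
    """What the master does: merge the partial norm sums from all nodes;
    the total is then broadcast back so each slave can compute its share
    of the expectations with the correct denominator."""
    return sum(partial_norm_sum(exp_scores, s) for s in shards)
```

In the corpus-split alternative, each node would instead compute the full sum over the whole vocabulary, but only for the histories in its portion of the data, so no norm-sum merge is needed.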

0:12:54 | But now you have a combination of the
0:12:56 | prior efficient computation and the parallelization we talked about,
0:13:00 | so together they seem to work very well in reducing the memory and the computational time.

0:13:07 | So in this table we show that if you were to do a direct implementation, where you take no
0:13:12 | shortcuts,
0:13:13 | the computation time in seconds for a typical Hub4 corpus with 1.5 million words is going
0:13:19 | to be of the order of 4000 seconds.

0:13:21 | Unigram caching is something that everybody does, so that significantly brings the time down, and we can reduce that
0:13:27 | even further with our proposed approach, which comes down to just
0:13:31 | seconds.

0:13:32 | So this is a neat way to do the exponential model computation.

0:13:38 | The results I'll show you now are on the English Hub4 system.
0:13:45 | The training part is about 300 million words, with an 80K vocabulary,

0:13:49 | and the numbers of features that are alive for the class portion of the exponential model
0:13:53 | and for the word portion of the model are roughly 238
0:13:57 | million and 276 million.

0:13:59 | If you take the comparison with the Arabic system, on which we also have results, that
0:14:04 | is 1.6 billion words,
0:14:19 | and you see how many more histories are alive for the class and the
0:14:24 | word-based models.

0:14:26 | So here we give you an idea of the memory required during training.
0:14:31 | You see all the four models, the class and the word models, for English as well as Arabic.

0:14:37 | The x-axis shows how much memory per node is consumed,
0:14:41 | and the y-axis is basically what happens as the number of nodes
0:14:45 | increases, as you parallelize further.

0:14:47 | So, as expected, you see a nice drop in the amount of memory consumed.

0:14:52 | And what this means is you can train something with billions of words
0:14:56 | in
0:14:56 | a day, as opposed to earlier where it was taking weeks to do this.

0:15:02 | These are just the memory numbers; if you talk about training time, the computational time is also significantly
0:15:08 | reduced.

0:15:09 | You go down from a single-machine case of something like 250 hours
0:15:13 | down to, when you parallelize it with just 20 machines, about 50 hours. So that's
0:15:18 | a nice improvement we get in training time as well as in memory.

0:15:24 | So there is also related work on parallelization, and there are many efforts being pursued
0:15:29 | in the literature.
0:15:30 | Originally, in '96, it was introduced as a theoretical result on how you can parallelize
0:15:36 | n-grams;

0:15:37 | distributed language model computation has since then been pursued, for example by
0:15:43 | Brants et al.
0:15:44 | from 2007,

0:15:46 | and further authors have investigated how you can do maximum-entropy-based models in a distributed environment as
0:15:52 | well.

0:15:53 | So this work is along the lines of these two, and we've shown how you can efficiently implement this
0:15:59 | computation.

0:16:01 | Just some notes on computational cost.
0:16:03 | We've used unnormalized iterative scaling here, and you'll see details in the 2009 paper,

0:16:09 | but there are other choices you could use, such as conjugate gradient, L-BFGS, and
0:16:14 | Rprop. We didn't compare exactly how our parameter estimation fares against these, but you could
0:16:20 | view any one of these as additional improvements that you could potentially get.

0:16:25 | Now, the timing results are also based on a strict convergence criterion;
0:16:30 | we used a very tight convergence criterion, and you could probably relax this a little depending on the task and
0:16:35 | get further improvements.

0:16:37 | The computational cost is a function of both the number of unique features and the number of unique histories.

0:16:42 | And in a typical case we process about 1.2 million unique n-grams per second with five compute
0:16:48 | nodes,
0:16:48 | but this can vary drastically across tasks, whether it's a broadcast news task or a conversational task and so on.

0:16:55 | So again, the relative gain you get will depend on the task,
0:16:58 | but basically, if you don't do this, the
0:17:01 | possibility of scaling to a large corpus becomes almost
0:17:05 | infeasible.

0:17:06 | So to conclude, we have an efficient training algorithm for exponential n-gram models, which we believe we can extend to
0:17:13 | other types of features thrown in within the model M framework.

0:17:16 | And in addition, on top of this, if you were to parallelize it, either over target vocabulary subsets
0:17:23 | or over data subsets,
0:17:25 | then it can lead to efficient use of memory as well as better computational cost.

0:17:30 | And if you were to just split across data, you can increase that a lot more than you can
0:17:34 | if you were to just
0:17:36 | split the target vocabulary.

0:17:38 | And that's all.

0:17:46 | Questions? Any questions from the audience?

0:17:58 | A simple question,
0:17:59 | [largely inaudible question about whether pruning was applied to the model]

0:18:11 | Yes, the results I presented here were without any pruning,

0:18:15 | but we do have numbers with
0:18:17 | pruning, feature pruning, and we find that we
0:18:22 | absolutely don't take much of a
0:18:24 | hit unless you go to pruning the features down to like the low fifties.

0:18:29 | [partly inaudible follow-up]

0:18:32 | Well, that's another paper, but we have tried a variety of pruning methods,
0:18:39 | some of them simply based on a count threshold,
0:18:45 | some of them basically extending n-gram pruning methodology to the exponential n-gram framework of model M, and they all seem to work
0:18:52 | in a similar way.

0:18:53 | So we have another forthcoming paper.

0:19:01 | [inaudible]