Okay, so this work is an extension of our previous paper from the last Interspeech, which was also about the recurrent neural network language model; here we show some more details on how to train this model efficiently, and a comparison against the standard neural network language models.

As some introduction: basically, neural network language models work better than the standard backoff models because they can automatically share parameters between similar words, so they perform some kind of soft clustering in a low-dimensional space; in some sense they are similar to class-based models. The good thing about these neural network language models is that they are quite simple to implement: we do not need to deal with, for example, smoothing, and even the training algorithm is usually just the standard backpropagation algorithm, which is very well known and described. What we have shown recently is that the recurrent architecture beats the feedforward architecture, which is quite nice, because the recurrent architecture is actually much more powerful: it allows the model to remember some kind of information in the hidden layer, so we do not have to build an n-gram model over some limited history; the model actually learns the history from the data. We will see that later in the pictures.

In this presentation I will describe backpropagation through time, which is a very old algorithm for training recurrent neural networks, and then a speed-up technique that is actually very similar to the one from the previous presentation, just our technique is, I would say, much simpler. Then there are results on combining many neural network models and how this affects perplexity, and also some comparison with other techniques. I will also show some results that are not in the original paper, because we obtained them only after the paper was written; these are on some large ASR setups with a lot more data than in the simple examples I show here.

The model looks like this: basically we have an input layer and an output layer that have the same dimensionality as the vocabulary, that is w and y in the figure, and between these two layers there is one hidden layer that has a much lower dimensionality, let's say one hundred or two hundred neurons. If we did not consider the recurrent parameters, the recurrent weights that connect the hidden layer to itself, then the network would be just a standard bigram neural network language model; it is these parameters that give the model the power to remember some history and use it efficiently.

In the previous paper we were using just the normal backpropagation for training such a network, but here I will show that with backpropagation through time we can actually get better results, which should be even more visible on character-based language models, where the usual architecture does not really work well, not only for the recurrent networks but also for the feedforward ones.
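Before getting to the training algorithm, here is a minimal sketch of the architecture just described, in numpy. The class and parameter names (SimpleRecurrentLM, hidden_size, and so on) are my own illustration, not taken from the paper or its toolkit; the sketch only shows the forward step of the simple recurrent network: a one-hot input word, a small sigmoid hidden layer fed by the input and by its own previous state, and a softmax output over the whole vocabulary.

```python
import numpy as np

class SimpleRecurrentLM:
    """Illustrative Elman-style recurrent LM; names and details are assumptions,
    not the paper's implementation."""

    def __init__(self, vocab_size, hidden_size=100, seed=0):
        rng = np.random.default_rng(seed)
        self.V, self.H = vocab_size, hidden_size
        # Input->hidden weights: one column per vocabulary word.
        self.U = rng.uniform(-0.1, 0.1, (hidden_size, vocab_size))
        # Recurrent hidden->hidden weights: these give the model its memory.
        self.W = rng.uniform(-0.1, 0.1, (hidden_size, hidden_size))
        # Hidden->output weights.
        self.R = rng.uniform(-0.1, 0.1, (vocab_size, hidden_size))
        self.h = np.zeros(hidden_size)            # hidden state s(t-1)

    def reset(self):
        self.h = np.zeros(self.H)

    def forward(self, word_id):
        """Consume one word id and return P(next word) over the vocabulary."""
        # s(t) = sigmoid(U w(t) + W s(t-1)); w(t) is one-hot, so U w(t) is just
        # the column of U that belongs to this word.
        self.h = 1.0 / (1.0 + np.exp(-(self.U[:, word_id] + self.W @ self.h)))
        z = self.R @ self.h
        z -= z.max()                              # numerical stability
        y = np.exp(z)
        return y / y.sum()                        # softmax over the vocabulary
```

Dropping the recurrent matrix W reduces this to the bigram network mentioned above; it is the W term that lets the hidden state carry information about the whole history.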
So how does backpropagation through time actually work? It works by unfolding the recurrent part of the network in time, so that we obtain a deep feedforward network which is some kind of approximation of the recurrent network, and we train this using the standard backpropagation, just that we have many more hidden layers. It looks basically like this: imagine the original bigram network, and now we know that there are recurrent connections in the hidden layer, that the hidden layer is connected to itself, just that these connections are delayed in time. So when we train the network, we compute the error vector in the output layer and propagate it back using backpropagation, and we unfold the network in time: we can go one step back in time and see that the activation values in the hidden layer depend on the state of the input layer and on the state of the hidden layer in the previous time step, and so on. Basically we can unfold this network for, say, two steps in time and obtain a feedforward approximation of the recurrent neural network. That is the idea of how the algorithm works; it is a little bit tricky to implement correctly, but otherwise I would say it is quite straightforward.
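To make the unfolding concrete, here is one possible way to write a truncated backpropagation-through-time update for the sketch above. The bptt_steps parameter, the plain per-word SGD update, and the cross-entropy loss are my assumptions for illustration; the point is only to show how the error computed at the output is pushed into the hidden layer and then propagated back through a few remembered time steps.

```python
import numpy as np

def bptt_train_step(model, history, target_id, bptt_steps=4, lr=0.1):
    """One truncated-BPTT update of a SimpleRecurrentLM: predict target_id from
    the word ids in `history`, unfolding at most bptt_steps steps back in time.
    A sketch only; a real implementation works on a running stream of words and
    keeps the hidden state instead of replaying the history."""
    model.reset()
    inputs, states = [], [np.zeros(model.H)]      # states[t] = hidden before step t
    for w in history:
        inputs.append(w)
        model.forward(w)
        states.append(model.h.copy())

    z = model.R @ states[-1]
    y = np.exp(z - z.max()); y /= y.sum()         # predicted distribution

    # Output-layer error for softmax + cross-entropy: e_y = y - 1{target}.
    e_y = y.copy(); e_y[target_id] -= 1.0
    dR = np.outer(e_y, states[-1])

    # Push the error into the hidden layer, then unfold backwards in time.
    e_h = (model.R.T @ e_y) * states[-1] * (1.0 - states[-1])
    dU, dW = np.zeros_like(model.U), np.zeros_like(model.W)
    for k in range(min(bptt_steps, len(inputs))):
        t = len(inputs) - 1 - k                   # time index of this unfolded copy
        dU[:, inputs[t]] += e_h                   # gradient for the input word's column
        dW += np.outer(e_h, states[t])            # states[t] is s(t-1) feeding step t
        # Carry the error one more step back through the recurrent weights.
        e_h = (model.W.T @ e_h) * states[t] * (1.0 - states[t])

    for param, grad in ((model.R, dR), (model.U, dU), (model.W, dW)):
        param -= lr * grad                        # plain SGD, in place
    return -np.log(y[target_id])                  # cross-entropy before the update
```

With bptt_steps set to one this degenerates to the ordinary backpropagation used in the previous paper; larger values correspond to the deeper unfolded networks described above.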
The other extension that we describe in our paper is a factorization of the output layer, which is basically something very similar to class-based language models, like what Joshua Goodman did in his paper ten years ago; just in our case we do not extend the approach by using trees and so on, we keep the approach simpler, and actually we make it even simpler by not even computing any classes, but using a factorization that is based just on the frequency of the words. Basically we do frequency binning of the vocabulary to obtain these, let's say, classes. Otherwise the approach is very similar to what was in the previous presentation: we first compute the probability distribution over the class layer, which can be very small, let's say just one hundred output units, and then we compute the probability distribution only for the words that belong to that class; otherwise the model stays the same. So we do not need to compute the probability distribution over the whole output layer, which can be, say, ten thousand words; we compute the probabilities just for a much smaller set. This can provide a speed-up of, in some cases, even more than a hundred times if the output layer is very large. This technique is very nice because we do not need to introduce any shortlists or any trees, and it is actually quite surprising that something as simple as this works, but we will see in the results that it does.

The basic setup, which is described more closely in the paper, is the Penn Treebank portion of the Wall Street Journal corpus, and we use the same settings as the other researchers so that we can directly compare the results; we have extended this in our ongoing work, but I will keep it simple here.

This figure shows the importance of the backpropagation through time training; these are the results on this corpus. I should maybe start with the baseline, which is the green line, a modified Kneser-Ney five-gram. The blue curve is when we train four models with different numbers of steps of the backpropagation through time algorithm; what is actually plotted in the graph is the average of these four models. We can see that the more steps we go back in time, the better the final model is; the evaluation of the model stays the same, it is not affected by the training. When we actually combine these models, when we use linear interpolation of the models, we can see that the results get better, but the effect of using the better training algorithm stays. So we still obtain quite a significant improvement here: it is about ten percent in perplexity, and it tends to be even more if we use more training data; this is just about one million words of training data.

Here we show that if we combine more than, let's say, four models, we can still observe some improvement, even after the whole combination of models is interpolated with the backoff model. To combine the neural networks we use just linear interpolation with equal weights for each model, but the weight of the backoff model is tuned on the validation data; this is why the curve, the red one, is slightly noisy. You can basically see that we can obtain some very small additional improvements by going beyond four models; these networks differ just in the random initialization of the weights.

Here is the comparison to other techniques that I was already introducing. The baseline is the five-gram perplexity, the one hundred forty-one in the first row. A random forest language model that is interpolated with this baseline achieves a perplexity reduction of somewhat less than ten percent, and structured language models actually work better than the random forest on this setup. We can see that all the neural network language models work even better than that; the standard feedforward neural network is about ten points of perplexity better than the structured language models. The previously best technique on this setup was the syntactic neural network language model from Ahmad Emami, which uses even more features that are linguistically motivated. And we can see that if we train the recurrent network just using the standard backpropagation, we obtain better results on this setup than with the usual feedforward network, and when we train it by backpropagation through time we obtain a large improvement in the end; all these results are already after combination with the backoff model. Then, when we train several different models, we obtain again quite a significant improvement. Actually, we have some ongoing work, and by combining a lot of different techniques we are now able to get the perplexity on this setup even lower.

The factorization of the output layer that I described before provides quite a significant speed-up, which is quite useful, and we can see here that the cost in perplexity, which appears because we make some assumptions that are not completely true and the approach is very simple, is also small; the results do not degrade very much even if we go to, let's say, hundreds of classes. If we go to fewer classes, the results actually get better again, because the model with the number of classes equal to one, where the size of the single class is the whole vocabulary, is equal to the original model. The optimal value for obtaining the maximum speed-up is about the square root of the size of the vocabulary, but of course we can make a compromise and go for somewhat more classes to obtain a less efficient network that has better accuracy.
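A rough sketch of this frequency-binning factorization is below. The helper names (assign_classes_by_frequency, factorized_word_prob) and the exact bin-boundary rule are my own, not the paper's; the idea is just that words are assigned to classes by slicing the sorted unigram distribution into bins of roughly equal probability mass, and the word probability is then P(class | hidden state) times P(word | class, hidden state) computed over that single class only.

```python
import numpy as np

def assign_classes_by_frequency(unigram_counts, num_classes=100):
    """Frequency binning: walk through words sorted by count and start a new
    class whenever roughly 1/num_classes of the probability mass is used up.
    unigram_counts maps integer word id -> count.  (A sketch; boundary handling
    in the real toolkit may differ.)"""
    total = float(sum(unigram_counts.values()))
    word_class = {}
    members = [[] for _ in range(num_classes)]
    mass, cls = 0.0, 0
    for w, c in sorted(unigram_counts.items(), key=lambda kv: -kv[1]):
        word_class[w] = cls
        members[cls].append(w)
        mass += c / total
        if mass > (cls + 1.0) / num_classes and cls < num_classes - 1:
            cls += 1
    return word_class, members

def factorized_word_prob(hidden, R_class, R_word, word_id, word_class, members):
    """P(word | history) = P(class | hidden) * P(word | class, hidden).
    R_class: (num_classes x H) class output weights, R_word: (V x H) word
    output weights; both hypothetical names for this illustration."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    cls = word_class[word_id]
    p_class = softmax(R_class @ hidden)            # small softmax over the classes
    in_class = members[cls]                        # only the words of this class
    p_in_class = softmax(R_word[in_class] @ hidden)
    return p_class[cls] * p_in_class[in_class.index(word_id)]
```

With a vocabulary of ten thousand words and one hundred classes, each prediction touches about one hundred class outputs plus the words of one class instead of all ten thousand outputs, which is where the large speed-ups quoted above come from; the square-root rule from the talk is just the class count that balances the two softmaxes when the classes are of similar size.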
What we did not have in the paper is what happens if we actually add more data, because the previous experiments used just one million words of training data. Here we show a graph on the English Gigaword corpus, where we used up to thirty-six million words, and you can see that for these recurrent networks the difference against the backoff models is actually increasing with more data, which is the opposite of what we see for most of the other advanced language modeling techniques: those work only for small amounts of data, and when we increase the amount of training data, all the improvements tend to vanish. That is not the case here.

Next, we did a lot of small modifications to further improve the accuracy and the speed, and one of these is dynamic evaluation, which can be used for adaptation of the models. It is an extension, or rather a simplification, of our previous approach that we described in our last Interspeech paper. It basically works by continuing to train the network even during the testing phase; in this case we just retrain the network on the one-best output during recognition.
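Here is a tiny sketch of this dynamic evaluation idea, reusing the illustrative model and training step from the earlier code blocks; the learning rate, the context length, and the reset policy are assumptions, not values from the paper. While scoring the test text, or the recognizer's one-best hypothesis, the model keeps updating its weights on each word it has just predicted.

```python
import numpy as np

def dynamic_evaluation(model, word_ids, bptt_steps=4, lr=0.1):
    """Score a word sequence while adapting the model on the fly, using
    bptt_train_step() from the earlier sketch.  In a recognizer this would be
    run on the one-best hypothesis rather than on reference text."""
    total_loss, history = 0.0, []
    for w in word_ids:
        if history:
            # Predict w from the history, add its cross-entropy to the score,
            # and immediately update the weights on this same word.
            total_loss += bptt_train_step(model, history, w,
                                          bptt_steps=bptt_steps, lr=lr)
        history.append(w)
        history = history[-32:]        # keep the replayed context short; a real
                                       # implementation keeps the hidden state instead
    n = max(len(word_ids) - 1, 1)
    return np.exp(total_loss / n)      # perplexity under the dynamically updated model
```

Each word is scored before the update that uses it, so the reported perplexity stays honest; the adaptation only helps the predictions that come afterwards.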
Then we also have a paper where we show a combination and comparison of recurrent neural networks with many other advanced language modeling techniques, which leads to more than fifty percent reduction of perplexity against the standard backoff n-gram language models, and on some data even larger than this Penn Treebank corpus we are able to get more than a fifty percent reduction of perplexity as well.

We also have some ASR experiments and results. On an easy setup that uses some very basic acoustic models, we are able to obtain almost a twenty percent reduction of the word error rate. On a much harder and larger setup, which is the same as the one that was used last year at the JHU summer workshop, we obtain almost a ten percent reduction of the word error rate against the baseline four-gram model; I think we even include the results for this model on that system, where the reduction is from thirteen point one to twelve point five, which means the recurrent network gives about a five percent relative word error rate reduction on this setup against that model.

Also, all these experiments can be repeated fairly easily: we made the toolkit for this setting available, and the link should also be in the paper. So I would say that yes, all these experiments can be repeated, it just takes a lot of time. Thanks for your attention.

Any questions? Yes, just a second, let me get back to that table. Which numbers do you mean? This one is the combination of the big model with the baseline model. Without the combination, I am not sure if it is in the paper, but basically the weight of the recurrent network in the combination on this setup is usually about zero point seven or zero point eight, so it alone would be a bit better than the baseline; I think it was around one hundred twenty or something. Any other questions?
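Since the question is about the interpolation numbers, here is a small sketch of the linear interpolation used throughout the talk: the recurrent models are averaged with equal weights and the mixture is combined with the backoff model using a single weight chosen on the validation data. The function names and the grid search are my own illustration of that procedure, not the toolkit's code.

```python
import numpy as np

def interpolate(p_rnn_models, p_backoff, lam):
    """lam * average(RNN models) + (1 - lam) * backoff model, per word."""
    p_rnn = np.mean(p_rnn_models, axis=0)          # equal weights for the RNN models
    return lam * p_rnn + (1.0 - lam) * p_backoff

def tune_rnn_weight(valid_rnn_probs, valid_backoff_probs, grid=None):
    """Pick the weight of the recurrent part that minimizes validation perplexity.
    valid_rnn_probs: array (num_models, num_words) with P(w_t | history) per model;
    valid_backoff_probs: array (num_words,) from the n-gram model."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_lam, best_ppl = None, np.inf
    for lam in grid:
        p = interpolate(valid_rnn_probs, valid_backoff_probs, lam)
        ppl = np.exp(-np.mean(np.log(p)))
        if ppl < best_ppl:
            best_lam, best_ppl = lam, ppl
    return best_lam, best_ppl
```

On this setup the answer above quotes a tuned weight of about 0.7 or 0.8 for the recurrent part.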