0:00:13a matched call
0:00:14as me
0:00:15 So, okay, I will start. This work is an extension of our previous paper from last Interspeech, which was also about the recurrent neural network language model.
0:00:27 Here we show some more details of how to train this model efficiently, and a comparison against the standard neural network language models.
0:00:36 So first, some introduction.
0:00:40 Basically, neural network language models work, let's say, better than the standard backoff models because they can automatically share some parameters between similar words, so they perform some kind of soft clustering in a low-dimensional space.
0:00:58 In some sense they are similar to class-based models.
0:01:03 The good thing about these neural network language models is that they are quite simple to implement, and we do not need to deal with, for example, smoothing.
0:01:15 Even the training algorithm is usually just the standard backpropagation algorithm, which is very well known and described.
0:01:23 What we have shown recently was that the recurrent architecture is better than the feedforward architecture, which is nice, because the recurrent architecture is theoretically much more powerful: it allows the model to remember some kind of information in the hidden layer.
0:01:45 So we do not build an n-gram model with some limited history; the model actually learns the history from the data. We will see that later in the pictures.
0:01:57 In this presentation I will describe backpropagation through time, which is a very old algorithm for training recurrent neural networks, and then a speed-up technique that is actually very similar to the one in the previous presentation, just our technique is, I would say, much simpler.
0:02:18 Then I will show results on combining many neural network models and how this affects perplexity, and also some comparison with other techniques.
0:02:32 I will also show some results that are not in the original paper, because we obtained them only after the paper was written; these are on some large ASR task with a lot more data than what I show here in the simple examples.
0:02:52 The model looks like this: basically, we have some input layer and output layer that have the same dimensionality as the vocabulary, that is w(t) and y(t), and between the two layers there is one hidden layer that has a much lower dimensionality, let's say one hundred or two hundred neurons.
0:03:15 If we did not consider the recurrent parameters, the recurrent weights that connect the hidden layer to itself, then the network would be just a standard bigram neural network language model; but these parameters give the model the power to remember some history and use it efficiently.
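As a rough illustration of this architecture, here is a minimal sketch of the forward pass of such a recurrent (Elman-style) network language model. It assumes a one-hot input word, a sigmoid hidden layer that carries the state, and a softmax output over the vocabulary; the class name, sizes, and initialization are illustrative, not the actual toolkit code. Dropping the recurrent term W would reduce it to the plain bigram network mentioned above.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    class TinyRNNLM:
        """Minimal Elman-style RNN LM; the hidden state s carries the learned history."""
        def __init__(self, vocab_size, hidden_size=100, seed=0):
            rng = np.random.default_rng(seed)
            self.U = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input -> hidden
            self.W = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden (recurrent)
            self.V = rng.normal(0, 0.1, (vocab_size, hidden_size))   # hidden -> output
            self.s = np.zeros(hidden_size)                           # hidden state s(t-1)

        def step(self, word_id):
            # s(t) = sigmoid(U w(t) + W s(t-1)); without W this would be a plain bigram NN LM
            self.s = 1.0 / (1.0 + np.exp(-(self.U[:, word_id] + self.W @ self.s)))
            # y(t) = softmax(V s(t)): distribution over the next word
            return softmax(self.V @ self.s)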
0:03:44 In the previous paper we were using just the normal backpropagation for training such a network, but here I will show that with backpropagation through time we can actually get better results, which should be even more visible on character-based language models, where the usual architecture does not really work well, not only for the recurrent networks but also for the feedforward ones.
0:04:12 How does backpropagation through time work? It works by unfolding the recurrent part of the network in time, so that we obtain a deep feedforward network, which is some kind of approximation of the recurrent part of the network, and we train this by using the standard backpropagation, just that we have many more hidden layers.
0:04:38 It looks basically like this: you can imagine the original bigram network, and now we know that there are recurrent connections in the hidden layer, that it is connected to itself, just that these connections are delayed in time.
0:04:53 So when we train the network, we compute the error vector in the output layer and propagate it back using backpropagation, and we unfold the network in time.
0:05:07 Basically, we can go one step back in time and see that the activation values in the hidden layer depend on the state of the input layer and on the state of the hidden layer in the previous time step, and so on.
0:05:29 So we can unfold this network for a few steps in time and obtain a feedforward approximation of the recurrent neural network. This is how the algorithm works; it is a little bit tricky to implement correctly, but otherwise it is quite straightforward, I would say.
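A rough sketch of one truncated backpropagation-through-time update, continuing the TinyRNNLM sketch above (numpy and softmax are already imported there): the output error is pushed back through the last `tau` stored hidden states. The `history` bookkeeping (triples of input word, previous hidden state, new hidden state recorded during the forward steps) is an assumption of this sketch, and practical details such as gradient clipping are omitted.

    def bptt_step(model, history, target_id, lr=0.1, tau=4):
        """One truncated-BPTT update on a TinyRNNLM.
        history: list of (word_id, s_prev, s) recorded during the forward pass, oldest first.
        target_id: the word that actually followed."""
        word_id, s_prev, s = history[-1]
        y = softmax(model.V @ s)
        dy = y.copy()
        dy[target_id] -= 1.0                     # gradient of cross-entropy w.r.t. output logits
        ds = model.V.T @ dy                      # error arriving at the hidden layer
        model.V -= lr * np.outer(dy, s)
        # unfold the recurrence for up to tau steps back in time
        for word_id, s_prev, s in reversed(history[-tau:]):
            da = ds * s * (1.0 - s)              # back through the sigmoid
            ds = model.W.T @ da                  # error passed to the previous time step
            model.U[:, word_id] -= lr * da
            model.W -= lr * np.outer(da, s_prev)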
0:05:49 The other extension that we describe in our paper is a factorization of the output layer, which is basically something very similar to class-based language models, like what Joshua Goodman did in his paper ten years ago.
0:06:05 Just in our case we do not really extend this approach by using trees and so on; we keep the approach simpler, and actually we make it even simpler by not even computing any classes, but using a factorization that is based only on the frequency of the words: we do frequency binning of the vocabulary to obtain these, let's say, classes.
0:06:33 Otherwise the approach is very similar to the one in the previous presentation.
0:06:40 So we basically compute first the probability distribution over the class layer, which can be very small, let's say just one hundred output units, and then we compute the probability distribution only over the words that belong to this class; otherwise the model stays the same.
0:07:01 So we do not need to compute the probability distribution over the whole output layer, which can be, say, ten thousand words; we compute the output probabilities for much fewer units.
0:07:15 This can provide a speed-up, in some cases even more than a hundred times, if the output vocabulary is very large. So this technique is very nice: we do not need to introduce any short lists or any trees, and it is actually quite surprising that something as simple as this works, but we will see in the results that it does.
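A small sketch of this frequency-binning factorization, assuming unigram counts are available: words are sorted by frequency and cut into bins of roughly equal probability mass, and the word probability is then the product of the class probability and the within-class word probability. The function and argument names here are illustrative, not from the toolkit.

    def frequency_binning(word_counts, num_classes=100):
        """Map each word to a class by binning the unigram distribution
        into num_classes bins of roughly equal probability mass."""
        total = float(sum(word_counts.values()))
        word2class, mass, cls = {}, 0.0, 0
        for word, count in sorted(word_counts.items(), key=lambda x: -x[1]):
            word2class[word] = cls
            mass += count / total
            if mass > (cls + 1) / num_classes and cls < num_classes - 1:
                cls += 1                          # start the next frequency bin
        return word2class

    def word_probability(word, p_class, p_words_in_class, word2class):
        """P(word | history) = P(class(word) | history) * P(word | class(word), history).
        p_class: distribution over classes; p_words_in_class(c): distribution over words of class c."""
        c = word2class[word]
        return p_class[c] * p_words_in_class(c)[word]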
0:07:35 Our basic setup, which is described more closely in the paper, is the Penn Treebank portion of the Wall Street Journal corpus, and we use the same settings as the other researchers, so that we can directly compare the results.
0:07:55 We have extended this in our ongoing work, but I will keep it simple here.
0:08:01 This shows the importance of the backpropagation through time training; these are the results on this corpus.
0:08:10 I should maybe start with the baseline, which is the green line, a modified Kneser-Ney five-gram. The blue curve is for four models trained with different numbers of steps of the backpropagation through time algorithm; what is plotted is actually the average of these four models.
0:08:36 We can see that the more steps we go back in time, the better the final model is, while the evaluation of the model stays the same; it is not affected by the training.
0:08:47 When we actually combine these models using linear interpolation, the results are better, but the effect of using the better training algorithm stays. So we still obtain quite a significant improvement here, about ten percent in perplexity, and it would be even more if we used more training data; this is just about one million words of training data.
0:09:14 Here we show that if we combine more than, let's say, four models, we can still observe some improvement, even after the whole combination of models is interpolated with the backoff model.
0:09:29 To combine the neural networks we use just linear interpolation with equal weights for each model, but the weight of the backoff model is tuned on the validation data; this is why the red curve is slightly noisy.
0:09:44 You can basically see that we can obtain some very small additional improvements by going to more than four models, and these networks differ just in the random initialization of the weights.
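A sketch of how such a combination can be scored for one predicted word, under the scheme described above: equal weights for the neural network models, and a backoff-model weight tuned on validation data. The function and argument names are made up for illustration.

    def interpolated_probability(p_neural_models, p_backoff, backoff_weight):
        """Linear interpolation of next-word probabilities:
        the neural models share the remaining mass with equal weights."""
        neural_weight = (1.0 - backoff_weight) / len(p_neural_models)
        return backoff_weight * p_backoff + neural_weight * sum(p_neural_models)

    # e.g. probabilities of the same word under four RNNs and a backoff five-gram:
    # p = interpolated_probability([0.012, 0.010, 0.011, 0.009], 0.004, backoff_weight=0.3)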
0:10:03 Here is the comparison to other techniques that I was already introducing. The baseline five-gram has a perplexity of one hundred forty-one; that is the first row.
0:10:18 A random forest that is also interpolated with this baseline achieves a perplexity reduction of somewhat less than ten percent, and structured language models actually work better than random forests on this setup.
0:10:34 We can see that all the neural network language models work even better than that; the standard feedforward neural network is about ten points of perplexity better than the structured language models.
0:10:46 Then, the previously best technique on this setup was the syntactic neural network language model from Emami, which uses even more features that are linguistically motivated.
0:11:01 We can see that if we train the recurrent network just with the standard backpropagation, we can already obtain better results on this setup than with the usual feedforward neural network, and when we train it by backpropagation through time, we obtain a large improvement. All these results are already after combination with the backoff model.
0:11:29 Then, when we train several different models, we again obtain quite a significant improvement. Actually, we have some ongoing work in which we are able to get a perplexity on this setup that is lower than eighty, by combining a lot of different techniques.
0:11:50 The factorization of the output layer that I described before provides a significant speed-up, which is quite useful, and here we can also see the cost in perplexity.
0:12:09 Because we make some assumptions that are not completely true and the approach is very simple, there is some degradation, but the results do not degrade very much even if we go to, let's say, a hundred classes.
0:12:25 If we go to even fewer classes, the results get better again, because the model with the number of classes equal to one, or equal to the size of the vocabulary, is equal to the original model.
0:12:36 So the optimal value is about the square root of the size of the vocabulary; that is the optimal value to obtain the maximum speed-up, of course. We can make a compromise and go for somewhat more classes, to obtain a less efficient network that has better accuracy.
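To see why the square root is the optimum, note that the per-word cost of the factorized output layer is roughly proportional to H(C + V/C) for hidden size H, C classes and vocabulary size V, which is minimized at C close to the square root of V. A quick numerical check of that claim (this cost model is a simplification that ignores the hidden-layer and constant costs):

    import math

    def output_cost(vocab_size, num_classes, hidden_size=100):
        """Approximate per-word cost of the factorized output layer:
        hidden->classes plus hidden->(words in one class, about V/C of them)."""
        return hidden_size * (num_classes + vocab_size / num_classes)

    V = 10000
    best_c = min(range(1, V + 1), key=lambda c: output_cost(V, c))
    print(best_c, round(math.sqrt(V)))                   # 100 100
    print(output_cost(V, 1) / output_cost(V, best_c))    # ~50x speed-up of the output layer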
0:13:00 What we did not have in the paper is what happens if we actually add more data, because the previous experiments were done with just one million words of training data.
0:13:11 Here we show a graph on English Gigaword, where we used up to thirty-six million words, and you can see that for the recurrent networks the difference against the backoff models is actually increasing with more data.
0:13:27 This is the opposite of what we see for most of the other advanced language modeling techniques, which work only for small amounts of data; when we increase the amount of training data, all their improvements tend to vanish. That is not the case here.
0:13:48 Next, we did a lot of small modifications to further improve the accuracy and the speed. One of these is dynamic evaluation, which can be used for adaptation of the models; it is an extension, or rather a simplification, of our previous approach that we described in our last Interspeech paper.
0:14:09 It basically works like this: we keep training the network even during the testing phase, but in this case we just retrain the network on the one-best hypothesis during recognition.
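A sketch of that dynamic evaluation idea, reusing the TinyRNNLM and bptt_step helpers sketched earlier, so this is an illustration of the principle rather than the actual recognizer code: the model scores the recognized one-best word sequence and keeps updating its weights on those same words as it goes.

    import math

    def dynamic_evaluation(model, one_best_ids, lr=0.1, tau=4):
        """Score a recognized 1-best hypothesis while adapting the model on it.
        Returns the total log-probability under the continuously updated model."""
        logp, history = 0.0, []
        prev = one_best_ids[0]
        for word in one_best_ids[1:]:
            s_prev = model.s.copy()
            y = model.step(prev)                 # predict the next word from the current state
            logp += math.log(y[word])
            history.append((prev, s_prev, model.s.copy()))
            bptt_step(model, history[-tau:], word, lr=lr, tau=tau)   # keep training at test time
            prev = word
        return logp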
0:14:24 Then, in the paper we also show a combination and comparison of recurrent neural networks with many other advanced language modeling techniques, which leads to more than a fifty percent reduction of perplexity against some standard backoff n-gram language models, and on some data even larger than this Penn Treebank corpus we are able to get even more than a fifty percent reduction of perplexity.
0:14:57 We also have some ASR experiments and results. On an easy setup that uses just some very basic acoustic models, we are able to obtain almost a twenty percent reduction of the word error rate.
0:15:13 On a much harder and larger setup, which is the same as the one that was used last year at the JHU summer workshop, we obtained almost a ten percent reduction of the word error rate against the baseline four-gram model.
0:15:33 I can even include the results for Model M on this system, which gives a reduction from thirteen point one to twelve point five; this means that a single recurrent network gives about twice the word error rate reduction on this setup compared to Model M.
0:15:55 Also, all these experiments can be repeated, as we made the toolkit available along with this setting, and the link should also be in the paper.
0:16:09 So I would say yes, all these experiments can be repeated; it just takes a lot of time.
0:16:16 Thanks for your attention.
0:16:24 Time for questions.
0:16:40 Yeah... yeah.
0:16:47 Just a second; just the table?
0:16:59 Yeah, so which numbers do you mean?
0:17:04 This is the combination of the big model with the baseline model.
0:17:11 Without the combination, I am not sure if it is in the paper, but basically the weight of the recurrent network in the combination is usually, on this setup, about zero point seven or zero point eight, so it would be a bit better than the baseline; I think it was around one hundred something.
0:17:41 Any other questions?