Hello, I am going to present our joint work on the structured output layer neural network language model; this work was done at LIMSI in France. First I would like to introduce neural network language models briefly, then move on to the hierarchical models that actually motivated the structured output layer model, and finally present the core of this work, the structured output layer neural network language model.

So, neural network language models. When we talk about the usual n-gram language models, we know that they have been very successful since they were introduced several decades ago, but their drawbacks are also well known: one of them is data sparsity and the lack of generalization. One of the major reasons for this is that conventional n-gram models use a flat vocabulary: each word is associated with an index in the vocabulary, and in this way the models make no use of the hidden semantic relationships that exist between different words.

Neural network language models were introduced about ten years ago to estimate n-gram probabilities in a continuous space, and they have been successfully applied to speech recognition. Why should they work? Because similar words are expected to have similar feature vectors in the continuous space, and the probability function is a smooth function of the feature values, so a small change in the features yields a small change in probability: similar words, with similar features, get similar probabilities.

Here is a brief overview of how neural network language models are trained and used. First, we represent each word in the vocabulary as a 1-of-n vector, with all zeros except for a single one at the index of that word. Then we project this vector into the continuous space: we add a second, fully connected layer called the context layer or projection layer. If we work at the 4-gram level, for example, we feed the neural network the history, that is, the three previous words: we project the three previous words into the continuous space and concatenate the resulting vectors to obtain the projection, the context vector for the history. Once we have this vector for the history at the projection layer, we add a hidden layer with a non-linearity, a hyperbolic tangent activation function, which serves to create the feature vector for the word to be predicted in the continuous prediction space. Then we also need an output layer to estimate the probabilities of all words given the history; in this layer we use the softmax function. All the parameters of the neural network must be learned during training.

The key points are that these neural network language models use a projection into a continuous space, which reduces the sparsity issues, and that the projection and the prediction are learned jointly. In practice they give significant and systematic improvements in both speech recognition and machine translation tasks. Moreover, neural network language models are used to complement conventional n-gram language models, that is, we interpolate them with the n-gram models. So the conclusion would be that everybody should use them, but there is a small problem: even with small training sets the training time is very large, and with large training sets it is even larger. Why is it so long? Let us look at what has to be done for a single inference.
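To make the architecture just described concrete, here is a minimal sketch of the forward pass in Python with NumPy. The layer sizes follow the numbers quoted in this talk (200-unit word projections, a three-word history, 200 hidden units); the vocabulary sizes, variable names and random initialisation are purely illustrative, and bias terms are omitted for brevity.

```python
import numpy as np

V_in = 50000   # input vocabulary size (illustrative)
V_out = 20000  # output vocabulary size (illustrative)
d = 200        # projection dimension per word
n = 3          # history length (4-gram model)
H = 200        # hidden layer size

rng = np.random.default_rng(0)
R = rng.normal(scale=0.01, size=(V_in, d))     # projection (look-up) matrix
W_h = rng.normal(scale=0.01, size=(n * d, H))  # context -> hidden weights
W_o = rng.normal(scale=0.01, size=(H, V_out))  # hidden -> output weights

def forward(history):
    """history: list of n word indices; returns P(w | history) over V_out words."""
    # 1. Projection: multiplying a 1-of-n vector by R is just a row selection.
    context = np.concatenate([R[w] for w in history])  # shape (n*d,)
    # 2. Hidden layer with hyperbolic tangent non-linearity.
    hidden = np.tanh(context @ W_h)                    # shape (H,)
    # 3. Output layer: softmax over the whole output vocabulary.
    scores = hidden @ W_o                              # shape (V_out,)
    scores -= scores.max()                             # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

p = forward([12, 345, 6789])
print(p.shape, p.sum())  # -> (20000,) ~1.0
```

The last step already hints at the problem discussed next: the final matrix product and the normalisation both scale with the size of the output vocabulary.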
First, we have to project the history words into the continuous space; this is just a matrix row selection operation. Then imagine that we have two hundred nodes in the projection vector of each word and a history of three words, so six hundred values altogether, and two hundred nodes in the hidden layer: we have to perform a matrix multiplication over these values. Then we have to perform another matrix multiplication, whose cost depends on the size of the hidden layer, two hundred, and on the size of the output layer, which varies with the output vocabulary.

Looking at these complexity issues, we can see that the input vocabulary can be as large as we want, since it does not enter into the complexity at all. Increasing the n-gram order does not drastically increase the complexity either, as the increase is at most linear. The problem is the output layer: if the output vocabulary is large, the training and inference times become very large as well.

There are a number of usual tricks to speed up training and inference. One group of tricks deals with the training data: resampling the data so that a different portion is used in each epoch of neural network language model training, and using a batch training mode, that is, propagating together the n-grams that share the same history, so that less time is spent. The other kind of trick is reducing the output vocabulary, which gives what is called the short-list neural network language model: we use the neural network to predict only the K most frequent words, normally up to about twenty thousand, and we use the conventional n-gram language model to give the probabilities of all the rest. In this scheme we have to keep the conventional n-gram language model both as a back-off and for renormalization, because we have to renormalize the probabilities so that they sum to one for each history.
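To be concrete about that renormalization, here is a small sketch in Python of one common way of combining a short-list neural network model with a back-off n-gram model; the exact combination scheme used in the system described in this talk may differ, and the function names and arguments are illustrative only.

```python
def combined_prob(w, history, shortlist, p_nn, p_backoff):
    """Probability of word w given the history, combining two models.

    p_nn(w, history):      neural network probability, normalised over the
                           short-list only (softmax over the K frequent words)
    p_backoff(w, history): conventional back-off n-gram probability,
                           normalised over the full vocabulary
    """
    # Probability mass the back-off model assigns to short-list words
    # for this particular history.
    mass = sum(p_backoff(v, history) for v in shortlist)
    if w in shortlist:
        # Rescale the neural network distribution by that mass so the
        # combined distribution still sums to one over the full vocabulary.
        return p_nn(w, history) * mass
    # Out-of-short-list words keep their back-off probability.
    return p_backoff(w, history)
```

With such a scheme the expensive hidden-to-output product and softmax are computed only over the K short-list words instead of the whole vocabulary, which is where the speed-up comes from.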
Now I would like to talk about hierarchical models, which were introduced to tackle exactly this problem of dealing with large output vocabularies. One of the first ideas appeared in the context of maximum entropy models, which face the same problem; a previous talk in this session was about them. What was proposed about ten years ago was, instead of computing the conditional probability directly, to make use of word clustering, that is, to introduce classes into the computation. For example, if we have a vocabulary of ten thousand words and we cluster them into one hundred classes, each containing exactly one hundred words, then instead of normalizing over ten thousand words we normalize once over one hundred classes and once over one hundred words, so over two hundred items in total, and the computation is reduced by a factor of fifty. That was the idea.

This idea inspired the work on the hierarchical probabilistic neural network language model. There, the output vocabulary is clustered at the output layer of the neural network and words are predicted by their paths in a clustering tree; the clustering in that work was constrained by the WordNet semantic hierarchy. In this framework we do not predict a word directly at the output layer, but the next bit of its path in the clustering tree: at each node we predict a bit, zero or one, left or right in the binary tree, given the code of the node and the history, so one more parameter enters the calculation, the binary code of the node we have to get to. The experimental results were reported on the rather small Brown corpus with a ten-thousand-word vocabulary: a significant speed-up of about two orders of magnitude was shown, but with worse perplexity at the same time; the loss in perplexity was probably due to using the WordNet semantic hierarchy.

In the later work on the scalable hierarchical distributed language model, automatic clustering was used instead of WordNet. The model itself was a log-bilinear model, without a non-linearity, and a one-to-many word-to-class mapping turned out to be important, so a word could belong to more than one class. The results were reported on a large dataset with an eighteen-thousand-word vocabulary: perplexity improvements over an n-gram model were shown, a speed-up of course, and performance similar to the non-hierarchical log-bilinear model.

Now I am going to talk about the major part of this work, the structured output layer neural network language model. What are its main ideas? First, compared with the hierarchical models I have just been talking about, the trees we use to cluster the output vocabulary are not binary anymore; because of this we actually use multiple output layers in the neural network, with a softmax in each, and I will talk about this in more detail in a moment. Second, we do not perform clustering for the frequent words, so we keep some ideas from the short-list neural networks: we keep a short-list without clustering and we cluster only the infrequent words. Third, we use what we think is an efficient clustering scheme: we cluster using the word vectors in the projection space. The task is to improve a state-of-the-art speech-to-text system that already makes use of short-list neural network language models and that is characterized by a large vocabulary and a baseline n-gram language model trained on billions of words.

How do we do the word clustering? First, we associate each frequent word with its own class, and then we cluster the infrequent words. As opposed to binary clustering trees, our clustering trees are shallow: in our experiments their depth is three or four. Here you can see the formula for the computation of the probability: in each leaf of the clustering tree we end up with a word as a single class, at each node we have a softmax function, and the probability of a word is the product of the probabilities chosen along its path. At the upper level we have the short-list words, which are not clustered, so each of these words is its own class, plus one node for the infrequent, out-of-short-list words; below that node comes the clustering, and at the lowest level we again end up with one word per class.

If we represent our model in a more convenient way, we can say that where a normal neural network has one output layer, in our scheme there is a first output layer that deals with the frequent words, plus further output layers that deal with the subclasses of the clustering tree, and each output layer has its own softmax function. The more classes there are in the clustering tree, the more output layers there are in our neural network.
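The probability computation just described can be sketched as follows in Python with NumPy; it is only an illustration of the product-of-softmaxes idea, with the tree stored as per-node weight matrices, and the data layout and names are assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # numerical stability
    e = np.exp(x)
    return e / e.sum()

def soul_word_prob(hidden, path, output_layers):
    """P(w | history) as a product of softmax probabilities along the word's
    path in the clustering tree.

    hidden:        hidden-layer activation computed from the history
    path:          list of (node_id, choice_index) pairs, from the top-level
                   output layer down to the leaf that contains the word;
                   a short-list word has a path of length one
    output_layers: dict node_id -> weight matrix of shape (H, n_children)
    """
    prob = 1.0
    for node_id, choice in path:
        # Each node of the tree has its own small output layer and softmax
        # over that node's children (classes, or single words at the leaves).
        dist = softmax(hidden @ output_layers[node_id])
        prob *= dist[choice]
    return prob
```

Because every softmax in the product ranges only over the children of one node, no normalization over the full vocabulary is ever needed.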
How do we train our structured output layer neural network language model? First, we train a standard neural network language model with a short-list output, that is, a short-list neural network language model, but we train it for only three epochs, whereas normally we would use fifteen to twenty epochs to train it fully. Then we reduce the dimension of the context space using principal component analysis; in our experiments the final dimension is ten. Then we perform a recursive K-means word clustering based on this distributed representation induced by the continuous space, except for the words in the short-list, because those we do not have to cluster. Finally, we train the whole model.
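As an illustration of this clustering step (dimensionality reduction to ten with PCA, then recursive K-means on the word representations taken from the projection space), here is a rough Python sketch using scikit-learn; the function name, the stopping rule and the tree representation are assumptions made for the example, not the authors' actual procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def build_clustering_tree(word_ids, vectors, branching, depth, pca_dim=10):
    """Recursively cluster the out-of-short-list words.

    word_ids:  list of word indices to be clustered
    vectors:   array of shape (len(word_ids), d) with their projection-space
               representations taken from the partially trained model
    branching: number of K-means clusters created at each internal node
    depth:     maximum depth of the clustering tree (three or four here)
    Returns a nested dict of clusters whose leaves are lists of word ids
    (in the full model each remaining word then forms its own class).
    """
    X = PCA(n_components=min(pca_dim, vectors.shape[1])).fit_transform(vectors)

    def recurse(indices, level):
        if level == depth or len(indices) <= branching:
            return [word_ids[i] for i in indices]  # leaf node
        labels = KMeans(n_clusters=branching, n_init=10).fit_predict(X[indices])
        return {c: recurse(indices[labels == c], level + 1)
                for c in range(branching)}

    return recurse(np.arange(len(word_ids)), 0)
```

A toy call such as build_clustering_tree(ids, vecs, branching=4, depth=3) produces a shallow, non-binary tree of the kind that defines the additional output layers.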
The results we report in this paper are on the Mandarin GALE task, so we use the LIMSI Mandarin speech-to-text system, which is characterized by a fifty-six-thousand-word vocabulary; this is a word-based system. What we do first is segment the Chinese training data into words using the maximum-length approach, and then we train our word-based language models on this data. The baseline n-gram language models are trained on 3.2 billion words; they are built from many subcomponent language models that are interpolated together, with the interpolation weights tuned on held-out data. Then we train the neural network language models, using at each epoch about twenty-five million words obtained by resampling, because at each epoch we sample different data.

In the table you can see the results on the Mandarin GALE task: first the baseline 4-gram language model alone, and then this baseline 4-gram interpolated with neural network language models of different types: an eight-thousand-word short-list 4-gram neural network language model, a twelve-thousand-word short-list neural network language model, and the structured output layer neural network language model. What we can see is that the SOUL neural network language model consistently outperforms the short-list-based neural network language models, not to mention the baseline 4-gram language model. We can also see that the improvement for 4-grams is about 0.1 to 0.2, and when we switch to the 6-gram scenario for the neural network language models, the gain we get from the SOUL model is a bit larger: between 0.2 and 0.3 over our best short-list neural network language model.

Why do we use short-lists of eight thousand and twelve thousand words? These are the short-list sizes we normally use in our experiments and our systems. Also, when we train our SOUL neural network language model we use a short-list of eight thousand words for the part that is not clustered, plus four thousand classes at the upper level, so in complexity this model is pretty much the same as the short-list model with twelve thousand words.

What are the conclusions? The SOUL neural network language model is in fact a combination of a neural network and a class-based language model, and it can deal with output vocabularies of arbitrary size: in this work the vocabulary was fifty-six thousand words, but we have recently run experiments with a vocabulary of three hundred thousand. Speech recognition improvements are achieved on a large-scale task over very challenging baselines. And we have also noted that structured output layer neural networks improve more for longer contexts. Thank you, and I am ready to take your questions.

Audience: [question not clearly audible, about the cost of the projection step]
Speaker: Here, the operation we do at this point is just a matrix row selection; we do not have to do any multiplication, so this point has nothing to do with the increase in complexity. If you look here, all we have to do is a matrix row selection. ... No, but that part is trained very fast. Perhaps we can discuss it later.
Session chair: Time for one more question.
Audience: Thanks, I have two quick questions. On your results, the number of output nodes ranges from a few thousand up to twenty thousand; in your experiments, did you try to go up to twenty thousand to see what happens? Your class-based configuration spans the entire vocabulary, but the eight-thousand and twelve-thousand short-lists do not, so could they be made larger, and what happens then?
Speaker: The maximum we tried was twelve thousand, because already with twelve thousand words in the output vocabulary the training time is too large.
Audience: OK, so it takes too long and you do not have that experiment. Also, one basic extreme case: if you do not split the clustering tree and put all the out-of-short-list words into a single class, how does that model fare against your multiple-class configuration?
Speaker: We prefer this configuration because, and that is not this story, that is another paper, we prefer to keep the short-list part stable: in other experiments we use much more data to train the out-of-short-list part. But that is another story. As for the extreme case with a single class, I do not know, I have never tried it.
Session chair: OK, let us thank the speaker.