0:00:17My name is
0:00:18Ilya Oparin, and I'm going to present our joint work on the
0:00:21structured output layer neural network language model.
0:00:25We are
0:00:26presenting work done at LIMSI-CNRS in France.
0:00:30First I'd like to briefly introduce neural network language models,
0:00:34then I'll move on to the hierarchical models that actually motivated the structured output layer model,
0:00:39and finally present the core of this research, the SOUL neural network language model.
0:00:46So, neural network language models.
0:00:48When we talk about the usual n-gram language models,
0:00:51we know that they have been very successful; they were introduced several decades ago.
0:00:55But the drawbacks of these models are also well known.
0:00:59One of
0:01:00these drawbacks is sparsity and the lack of generalization.
0:01:05One of the major reasons for this
0:01:07is that
0:01:08conventional n-gram models
0:01:10use
0:01:11a flat vocabulary:
0:01:12each word is
0:01:13associated with an index in the vocabulary,
0:01:16and this way
0:01:17the models
0:01:18do not make use of the hidden semantic relationships
0:01:21that exist between different words.
0:01:25So neural network language models
0:01:28were introduced
0:01:29to estimate n-gram probabilities in continuous space
0:01:32about ten years ago.
0:01:34These models were successfully
0:01:36applied to speech recognition.
0:01:38Why should they work?
0:01:40In neural network language models,
0:01:41similar words are expected to have
0:01:43similar feature vectors in continuous space,
0:01:47and the probability function is a smooth function of the feature values.
0:01:51So if we have
0:01:54similar features,
0:01:55they induce
0:01:56only a small change in probability for similar words with similar features.
0:02:02So, just a brief overview of how
0:02:05neural network language models
0:02:07are trained and used.
0:02:09First we represent each word in the vocabulary as
0:02:12a one-of-n vector, with all zeros except for
0:02:16one index that is one.
0:02:20We project
0:02:21this
0:02:22vector
0:02:24into continuous space.
0:02:26So we add a second layer that is fully connected,
0:02:29which is called the context layer or projection layer.
0:02:32If we work at the four-gram level, for example,
0:02:36we feed to the neural network
0:02:38the history, that is, the three previous words.
0:02:41So we project the three previous words into continuous space
0:02:46and we join
0:02:47the three vectors in continuous space to obtain
0:02:50the projection, the context vector,
0:02:52for the history.
0:02:56As we work with a
0:02:57neural network model,
0:02:58once we have the vector
0:03:01for the history, the history projection layer,
0:03:04we add the
0:03:06hidden layer
0:03:08with a nonlinearity, a hyperbolic tangent activation function,
0:03:14which serves to create
0:03:16a feature vector for the word to be predicted
0:03:18in continuous space,
0:03:20the prediction space.
0:03:22We also need the output layer to estimate probabilities
0:03:26for all words given the history.
0:03:29In this layer we use the softmax function.
0:03:34And all the parameters of the neural network must be learned during training.
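The forward pass just described (row selection in the projection matrix, concatenation, a tanh hidden layer, a softmax output layer) can be sketched as follows. This is a minimal illustration in plain Python, not the speakers' implementation: the names `R`, `W_h`, `W_o`, the toy sizes and the random initial weights are all assumptions made for the example.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    # Plain matrix-vector product; each row of `matrix` yields one output value.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def nnlm_forward(history, R, W_h, b_h, W_o, b_o):
    # Projection: a one-of-n input times R is just a row selection in R.
    context = [x for word in history for x in R[word]]
    # Hidden layer with hyperbolic tangent activation.
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_h, context), b_h)]
    # Output layer: one score per vocabulary word, normalized by softmax.
    scores = [a + b for a, b in zip(matvec(W_o, hidden), b_o)]
    return softmax(scores)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, m, H, n = 20, 4, 5, 4          # hypothetical toy sizes; a 4-gram uses 3 history words
R = random_matrix(V, m, rng)      # projection matrix, one row per word
W_h = random_matrix(H, (n - 1) * m, rng)
b_h = [0.0] * H
W_o = random_matrix(V, H, rng)
b_o = [0.0] * V

probs = nnlm_forward([3, 7, 1], R, W_h, b_h, W_o, b_o)
```

In training, all of `R`, `W_h`, `b_h`, `W_o` and `b_o` would be updated by backpropagation; here they stay at their random initial values.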
0:03:39So the key points of
0:03:42these neural network language models are that they use projection into continuous space,
0:03:46which reduces the sparsity issues,
0:03:49and that
0:03:50the
0:03:51projection and prediction are learned simultaneously.
0:03:55In practice we see
0:03:56significant and systematic improvements
0:03:59in speech recognition and machine translation tasks.
0:04:02What is more,
0:04:03neural network language models
0:04:05are used to complement
0:04:06conventional n-gram language models,
0:04:08so they are interpolated with them.
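Interpolating the two models is usually done linearly. The sketch below assumes a single fixed weight `lam`; the talk does not say how the weight is chosen here, and the toy distributions are invented for illustration.

```python
def interpolate(p_nn, p_ngram, lam=0.5):
    # Linear interpolation of two probabilities for the same word and history.
    return lam * p_nn + (1.0 - lam) * p_ngram

# Hypothetical distributions over a two-word vocabulary for one history.
nn_probs = {"a": 0.7, "b": 0.3}
ngram_probs = {"a": 0.4, "b": 0.6}
mixed = {w: interpolate(nn_probs[w], ngram_probs[w]) for w in nn_probs}
```

Because both inputs are proper distributions and the weights sum to one, the interpolated values also sum to one.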
0:04:12So the point is:
0:04:13everybody should use them.
0:04:16But there is a small problem:
0:04:17even with small training sets
0:04:20the training time
0:04:22is very large,
0:04:23and with large training sets
0:04:26it's even larger.
0:04:29Why is it so long?
0:04:31Well, let's look at just one inference.
0:04:34What do we have to do?
0:04:37We have to
0:04:39project the
0:04:40histories;
0:04:41to do this,
0:04:43this is just a matrix row selection.
0:04:48Imagine that we have
0:04:49two hundred nodes
0:04:50in the projection vector
0:04:53and a history of three, so altogether we have six hundred,
0:04:56and we have two hundred nodes in the hidden layer.
0:04:59So we have to perform a matrix
0:05:01multiplication
0:05:04with these values.
0:05:06And then we have to perform another matrix multiplication
0:05:10that depends on the size of the output layer,
0:05:12that is,
0:05:13two hundred,
0:05:15on the size
0:05:16of the output, sorry, of the hidden layer,
0:05:19and the size of the output vocabulary.
0:05:22If we look at the complexity issues,
0:05:26we can see that the input vocabulary can be as large as we want,
0:05:31as it doesn't add
0:05:34to the overall complexity.
0:05:36An increase in the n-gram order
0:05:38does not drastically increase the complexity, as
0:05:41it is at most linear,
0:05:43the increase.
0:05:44The problem is in the output layer, in the output vocabulary size:
0:05:48if the vocabulary is large, then the training and inference times
0:05:52will be large as well.
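The cost argument can be checked with simple arithmetic. The sketch below counts multiplications per inference for the sizes quoted in the talk (two hundred projection nodes per word, a history of three, two hundred hidden nodes); the output vocabulary of twenty thousand is an assumed illustrative value, and the function name is hypothetical.

```python
def nnlm_multiplies(proj_dim, history_len, hidden_dim, output_vocab):
    # Projection is a row selection, so it costs no multiplications.
    hidden_cost = proj_dim * history_len * hidden_dim   # context -> hidden
    output_cost = hidden_dim * output_vocab             # hidden -> output
    return hidden_cost, output_cost

hidden_cost, output_cost = nnlm_multiplies(200, 3, 200, 20000)
ratio = output_cost / hidden_cost
```

With these numbers the hidden layer needs 120,000 multiplications and the output layer 4,000,000, so the output layer dominates. Lengthening the history only grows the first term linearly.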
0:05:57So there are a number of
0:05:58usual tricks
0:06:00that are used to
0:06:01speed up training and inference.
0:06:04One part of these tricks deals with resampling of the training data
0:06:09and using different portions of the training data in each
0:06:12epoch of neural network language model training,
0:06:16and with using batch training mode,
0:06:18that is, propagating together
0:06:20n-grams sharing the same histories,
0:06:22so we spend less time this way.
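One way to picture the shared-history trick is to group training n-grams by their history, so that the projection and hidden layers are evaluated once per distinct history rather than once per n-gram. This is only a sketch of the grouping step under that assumption; the function name and the toy n-grams are hypothetical.

```python
from collections import defaultdict

def group_by_history(ngrams):
    # Map each history (all words but the last) to the list of its target words.
    groups = defaultdict(list)
    for ngram in ngrams:
        history, target = tuple(ngram[:-1]), ngram[-1]
        groups[history].append(target)
    return dict(groups)

ngrams = [
    ("the", "cat", "sat", "on"),
    ("the", "cat", "sat", "down"),
    ("a", "dog", "ran", "off"),
]
batches = group_by_history(ngrams)
```

Here three n-grams collapse into two distinct histories, so only two hidden-layer evaluations would be needed.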
0:06:25Another type of trick is
0:06:27reducing the output vocabulary,
0:06:29that is,
0:06:30the thing that is called shortlist
0:06:33neural network language models.
0:06:34We use the neural network
0:06:37to predict only the K most frequent words;
0:06:40normally it's
0:06:41up to
0:06:42twenty thousand words.
0:06:45And we use the conventional
0:06:47n-gram language model
0:06:50to give the probabilities for all the rest.
0:06:54So in this scheme
0:06:56we have to keep the conventional
0:06:59n-gram language model
0:07:00for back-off and for renormalization, because we have to renormalize the probabilities
0:07:06so that they sum up to one for each history.
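A common way to do this renormalization is to scale the neural network's shortlist distribution by the probability mass that the back-off n-gram model gives to the shortlist for the same history. The talk does not spell out the formula, so the sketch below is an assumption about the scheme, with invented toy numbers.

```python
def shortlist_probability(word, nn_probs, backoff_probs, shortlist):
    # Mass the back-off n-gram model assigns to shortlist words for this history.
    shortlist_mass = sum(backoff_probs[w] for w in shortlist)
    if word in shortlist:
        # The network's distribution over the shortlist is rescaled to that mass.
        return nn_probs[word] * shortlist_mass
    # Words outside the shortlist keep their back-off n-gram probability.
    return backoff_probs[word]

# Hypothetical distributions for one fixed history.
backoff_probs = {"the": 0.5, "cat": 0.3, "zebra": 0.15, "quokka": 0.05}
shortlist = {"the", "cat"}
nn_probs = {"the": 0.6, "cat": 0.4}   # the network only predicts shortlist words

total = sum(
    shortlist_probability(w, nn_probs, backoff_probs, shortlist)
    for w in backoff_probs
)
```

The combined probabilities sum to one over the full vocabulary, which is why the n-gram model must be kept around.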
0:07:11Now I would like to talk about hierarchical models
0:07:14that were actually
0:07:16introduced to
0:07:17tackle this problem of
0:07:19dealing with large output vocabularies.
0:07:23One of the first ideas was dealing with speeding up maximum entropy models,
0:07:29so it is
0:07:31the same problem;
0:07:32a previous talk in this session was about this.
0:07:36What was proposed about ten years ago was, instead of computing directly
0:07:40the conditional probability,
0:07:42to make use of clustering of words,
0:07:45so that
0:07:47we introduce classes into the computation.
0:07:50Then if we have, for example,
0:07:52a vocabulary of ten thousand words
0:07:56and we
0:07:56cluster them into one hundred classes
0:07:58such that
0:08:00each of these classes has exactly one hundred words inside,
0:08:03then instead of doing normalization over ten thousand words
0:08:07we have to do two normalizations,
0:08:09each over
0:08:10one hundred outputs,
0:08:12that is,
0:08:13we have to do
0:08:15normalization over only two hundred outputs.
0:08:18So we can reduce
0:08:20the computation by a factor of fifty.
0:08:23That was the idea.
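The factor of fifty follows directly from the numbers in the example: one normalization over the classes plus one over the words inside the chosen class. A small check, assuming equally sized classes as in the talk (the function name is hypothetical):

```python
def normalization_terms(vocab_size, num_classes):
    # Flat model: one normalization over the whole vocabulary.
    flat = vocab_size
    # Class-based model: normalize over classes, then over words in one class.
    words_per_class = vocab_size // num_classes
    factored = num_classes + words_per_class
    return flat, factored

flat, factored = normalization_terms(10000, 100)
speedup = flat / factored
```

For ten thousand words in one hundred classes this gives 200 terms instead of 10,000, a reduction by a factor of 50.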
0:08:24Then this idea,
0:08:27this idea inspired
0:08:29the work on hierarchical probabilistic neural network language models.
0:08:37The idea is to cluster the output vocabulary
0:08:41at the output layer of the neural network
0:08:43and predict
0:08:45the path in the clustering tree.
0:08:49The clustering in this work
0:08:51was constrained by the WordNet semantic hierarchy.
0:08:54And
0:08:56so
0:08:57in this framework we do not predict
0:08:59the word
0:09:00directly at the output layer,
0:09:01but the next bit in the hierarchy, in the clustering tree.
0:09:07So at each node
0:09:10we predict
0:09:12the bit, that is zero or one, left or right
0:09:14in the binary tree,
0:09:17given the code
0:09:19for this node
0:09:20and the history.
0:09:21So there is one parameter added
0:09:24in the calculation, that is the binary code
0:09:27of the node,
0:09:28the way we have to
0:09:31get to it.
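In such a binary tree the probability of a word is the product of the bit decisions along its path. The sketch below uses a fixed table of per-node scores in place of the neural network, which in the real model would compute each score from the node code and the history; the codes, scores and function names are all invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_probability(code, node_score):
    # Multiply the probabilities of the bit decisions along the path to the word.
    prob = 1.0
    prefix = ""                       # binary code of the current tree node
    for bit in code:
        p_one = sigmoid(node_score(prefix))
        prob *= p_one if bit == "1" else 1.0 - p_one
        prefix += bit
    return prob

# Hypothetical 2-bit codes for a 4-word vocabulary.
codes = {"cat": "00", "dog": "01", "car": "10", "bus": "11"}
# Hypothetical scores for the three internal nodes, for one fixed history.
scores = {"": 0.3, "0": -1.2, "1": 0.8}

probs = {w: word_probability(c, lambda prefix: scores[prefix]) for w, c in codes.items()}
```

Each word needs only as many decisions as its code has bits, and the leaf probabilities still sum to one.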
0:09:32The experimental results were shown on the quite small Brown corpus with a ten thousand word vocabulary.
0:09:40A significant speed-up was shown,
0:09:43two orders of magnitude,
0:09:45but with a loss in perplexity
0:09:48at the same time.
0:09:50Probably the loss in perplexity was due to using the WordNet
0:09:53semantic hierarchy.
0:09:55So in the work
0:09:57called scalable hierarchical distributed language model,
0:10:01automatic clustering was used instead.
0:10:05The model itself was
0:10:07implemented as a log-bilinear model
0:10:10without nonlinearity,
0:10:12and a one-to-many word-to-class mapping
0:10:14was important, so a word
0:10:16could belong to more than one class.
0:10:19The results were
0:10:20reported on a large dataset with an eighteen thousand
0:10:24word vocabulary.
0:10:26Perplexity improvements over an n-gram model were shown,
0:10:30a speed-up of course,
0:10:31and similar performance to a non-hierarchical log-bilinear model.
0:10:38Now I'm going to talk about the major part of this work, that is structured output
0:10:43layer neural network language models.
0:10:45So what is the main idea
0:10:47of the structured output layer neural network language model?
0:10:52If we
0:10:54compare it with the hierarchical models I have just been talking about,
0:10:58the trees used to cluster
0:10:59the output vocabulary are not binary anymore.
0:11:03So actually,
0:11:04because of this, we use
0:11:06multiple,
0:11:07multiple output layers
0:11:09in the neural network, with a softmax in each; I will talk
0:11:14in detail a bit later about this.
0:11:16Then we do not
0:11:18perform clustering for frequent words,
0:11:21so we
0:11:22still borrow some ideas from the shortlist
0:11:25neural networks:
0:11:26we keep the shortlist
0:11:28without clustering and we cluster only the infrequent words.
0:11:32And then we use what we think is an efficient clustering scheme:
0:11:36we use word vectors in projection space
0:11:39for clustering.
0:11:41The task
0:11:42is to improve a state-of-the-art speech-to-text system
0:11:46that already makes use of shortlist neural network language models,
0:11:49that is characterized by a large vocabulary and a baseline n-gram language model trained on billions of words.
0:11:57So, word clustering:
0:11:59how do we do it?
0:11:59First we associate each frequent word with a single class,
0:12:04and then we cluster
0:12:05the infrequent words.
0:12:09In this way,
0:12:12as we use
0:12:14in our research
0:12:16clustering trees that are not binary,
0:12:20as opposed to binary clustering trees our clustering trees are
0:12:23shallow:
0:12:24normally in our experiments the depth of the trees is three or four.
0:12:31Here you can see the
0:12:33formula for the computation of the probability. So actually, at each leaf of this clustering tree
0:12:43we end up with a
0:12:44word
0:12:45as a single class.
0:12:46So at each level we have a softmax function.
0:12:49At the upper level we have the
0:12:51shortlist words,
0:12:52where
0:12:56each word is its own node, its own class,
0:12:59and then the node
0:13:01for the infrequent, out-of-shortlist words.
0:13:03And then we do clustering,
0:13:05and we end up at the lower level
0:13:09with one word per class.
0:13:13So if we
0:13:14represent our model
0:13:16in this more convenient way,
0:13:18we can say that normally
0:13:20neural networks have one output layer.
0:13:24In our scheme we have
0:13:25one output layer that is
0:13:28the first layer,
0:13:29which deals with the frequent words,
0:13:32and then
0:13:33it has additional layers
0:13:35that deal
0:13:36with the subclasses
0:13:38in the clustering tree.
0:13:40And each output layer
0:13:42has a softmax.
0:13:45So for the
0:13:47classes in the clustering we have
0:13:50corresponding output layers
0:13:52in our neural network.
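The probability computation over several softmax output layers can be sketched as a product of softmax decisions along the path from the main layer down to the word. In the sketch below the layer scores are hard-coded for one history, whereas in the model they would all be computed from the shared hidden layer; the layer names, words and scores are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soul_probability(path, layer_scores):
    # Multiply the softmax probability of each decision on the path to the word.
    prob = 1.0
    for layer, index in path:
        prob *= softmax(layer_scores[layer])[index]
    return prob

# Main layer: two shortlist words plus one node for all out-of-shortlist words.
layer_scores = {
    "main": [2.0, 1.0, 0.5],
    "rare": [0.2, -0.3],          # two subclasses of the out-of-shortlist node
    "rare/0": [0.1, 0.4],         # words in the first subclass
    "rare/1": [0.0, 1.0, -1.0],   # words in the second subclass
}
paths = {
    "the": [("main", 0)],
    "of": [("main", 1)],
    "quokka": [("main", 2), ("rare", 0), ("rare/0", 0)],
    "zebra": [("main", 2), ("rare", 0), ("rare/0", 1)],
    "okapi": [("main", 2), ("rare", 1), ("rare/1", 0)],
    "tapir": [("main", 2), ("rare", 1), ("rare/1", 1)],
    "ibis": [("main", 2), ("rare", 1), ("rare/1", 2)],
}
probs = {w: soul_probability(p, layer_scores) for w, p in paths.items()}
```

Shortlist words need a single softmax, while an infrequent word needs one small softmax per tree level, and the whole vocabulary still sums to one.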
0:13:56The training:
0:13:58the way we train our structured output layer neural network language model.
0:14:03First we train a standard
0:14:05neural network language model with a shortlist
0:14:07as the output,
0:14:08so it's a shortlist neural network language model.
0:14:11But we train it for only
0:14:13three epochs;
0:14:14normally we use
0:14:16fifteen to twenty epochs to train it fully,
0:14:19so here it's really
0:14:21only three.
0:14:23Then what we do:
0:14:24we reduce the dimension of the context space
0:14:27using principal component analysis;
0:14:30in our experiments the final dimension is ten.
0:14:34Then we perform recursive k-means word clustering based on this distributed
0:14:40representation induced
0:14:42by the continuous space,
0:14:44except for the words in the shortlist, because we do not have to cluster them.
0:14:49And finally we train the whole model.
0:14:57are on uh mentoring gale task
0:15:00so we use links to mentoring speech system
0:15:03that is
0:15:04uh are rice by uh fifty six thousand vocabulary
0:15:08this is a for word work the so
0:15:10what we do first we do the segmentation of change data
0:15:14in words
0:15:15using the uh
0:15:17maximum length approach and then
0:15:20we train our word based language models on this
0:15:23and the baseline let's a language models in train on
0:15:26three point two billion words
0:15:28this just train it on many
0:15:30subcomponent lm static interpolated together
0:15:33with the interpolation weight
0:15:34Q in turn have to it
0:15:37then we train for neural network
0:15:39and i
0:15:43at each iteration about twenty five million words
0:15:46after resampling
0:15:48because at each iteration we sampled different different
0:15:52In the table you can see the results
0:15:54on the Mandarin GALE task,
0:15:58first with the baseline four-gram LM,
0:16:00and then
0:16:01when this baseline
0:16:03four-gram LM is interpolated
0:16:06with neural network language models of different types.
0:16:10We have the
0:16:11eight thousand word
0:16:13shortlist four-gram neural network language model, the twelve thousand
0:16:16word
0:16:17shortlist neural network language model,
0:16:19and the structured output layer neural network language model.
0:16:23What we can see is that
0:16:25the SOUL NNLM
0:16:27consistently outperforms the
0:16:30shortlist-based neural network language models,
0:16:32not to speak of the baseline,
0:16:36the baseline four-gram language model.
0:16:39What we can also see is that the improvement
0:16:42for four-grams
0:16:44is about zero point two, zero point one,
0:16:47and when we switch to the six-gram scenario with neural network language models,
0:16:52the gain we get from SOUL neural network language models
0:16:55is a bit larger, so we gain between zero point three
0:17:00and zero point two
0:17:01over our best
0:17:03shortlist neural network language model.
0:17:07Why do we use the shortlists of
0:17:10eight thousand and twelve thousand? These are normally the shortlists we use in our experiments and
0:17:15our systems.
0:17:16And also,
0:17:17when we train our SOUL neural network language model,
0:17:23we use
0:17:24a shortlist of eight thousand words
0:17:26to train,
0:17:28which won't be clustered,
0:17:30and we use
0:17:31four thousand classes at the
0:17:33upper level.
0:17:34So this model
0:17:36is,
0:17:37in complexity, pretty much the same
0:17:39as the shortlist
0:17:41model with twelve thousand
0:17:43words in the shortlist.
0:17:46What are the conclusions?
0:17:51The SOUL neural network language model
0:17:53in fact
0:17:54is actually a combination of a neural network
0:17:57and a class-based
0:17:58language model.
0:18:02It can deal with vocabularies of
0:18:05arbitrary size.
0:18:06In this research the vocabulary was
0:18:09fifty-six thousand words,
0:18:12but we have recently run experiments with a vocabulary of
0:18:16three hundred thousand.
0:18:19Speech recognition improvements are achieved on a large-scale task
0:18:23and over very challenging baselines.
0:18:26And what we have also noted is that
0:18:29structured output layer neural networks
0:18:32give larger gains for longer contexts.
0:18:37And that's it.
0:18:54The input layer?
0:19:01Yeah, but then, OK.
0:19:14Here, you mean?
0:19:16But the operation we do at this point
0:19:19is just a matrix row selection.
0:19:21We do not have to do any multiplication
0:19:25at this point,
0:19:27so it has nothing to do with the increase in complexity.
0:19:31So if you look here,
0:19:37what we have to do is just matrix row selection.
0:19:55Yeah, sure.
0:20:00No, but this part is trained very fast,
0:20:03so it's like
0:20:25Or can we discuss it later,
0:20:30in the questions?
0:20:31Time for one more.
0:20:38Yeah, thanks.
0:20:40I just have two quick questions. On your results, I was just wondering whether you
0:20:44have results over a range of numbers of output nodes, ranging from something small,
0:20:49a few thousand, to twenty thousand.
0:20:51In your experiments, did you try to go up to twenty thousand to see what happens?
0:20:56Because your class-based configuration is basically spanning across the entire vocabulary,
0:21:02but the ones with eight K and twelve K are not.
0:21:06So you could make it larger, and what happens then?
0:21:11The maximum we tried was twelve thousand, because
0:21:14already with twelve thousand in the output vocabulary, or more,
0:21:19like thirty,
0:21:20the training time is too large.
0:21:23OK, so it's too long; you don't have an experiment on that. OK.
0:21:26And also just one basic extreme case: if you do not
0:21:30split the clustering tree, and maybe just cluster all the other words, out of the shortlist,
0:21:35into one class,
0:21:36how does that model fare against your multiple-class configuration?
0:21:42We prefer to use this configuration because,
0:21:45actually, that's another story, that's another paper,
0:21:48but
0:21:49we prefer to keep the shortlist
0:21:52stable,
0:21:53because in other experiments what we do is use much more data to train
0:21:58the out-of-shortlist part.
0:22:01But this is
0:22:01another story.
0:22:11I don't know, I have never tried.
0:22:13OK.
0:22:15Let's thank the speaker. OK.