Speech transcript - STRUCTURED OUTPUT LAYER NEURAL NETWORK LANGUAGE MODEL

0:00:14	i
0:00:17	my name is
0:00:18	Ilya Oparin and i'm going to present our common work on uh
0:00:21	structured output layer neural network language model
0:00:25	and we are
0:00:26	presenting the work done at LIMSI in france
0:00:30	so first i'd like to briefly introduce neural network language models
0:00:34	then i'll move on to hierarchical models that actually motivated the structured output layer model
0:00:39	and finally i'll present the core of this presentation the SOUL neural network language model
0:00:46	so the neural network language models
0:00:48	when we talk about usual n-gram language models
0:00:51	we know that they are very successful they were introduced several decades ago
0:00:55	but the drawbacks of these models are also well known
0:00:59	one of
0:01:00	these drawbacks is data sparsity and the lack of generalization
0:01:04	so
0:01:05	one of the major reasons for this
0:01:07	is that
0:01:08	conventional n-gram models
0:01:10	they use
0:01:11	a flat vocabulary
0:01:12	so each word is
0:01:13	associated with an index in the vocabulary
0:01:16	and in this way
0:01:17	the models
0:01:18	do not make use of the hidden semantic relationships
0:01:21	that exist between different words
0:01:25	so the neural network language models
0:01:28	were introduced
0:01:29	to estimate n-gram probabilities in continuous space
0:01:32	about ten years ago
0:01:34	these models were successfully
0:01:36	uh applied to speech recognition
0:01:38	why should they work
0:01:40	in neural network language models
0:01:41	because similar words are expected to have
0:01:43	similar feature vectors in continuous space
0:01:47	and so the probability function is a smooth function of feature values
0:01:51	and so if we have
0:01:54	similar features
0:01:55	that induces
0:01:56	a small change in probability for similar words with similar features
0:02:02	so just a brief overview of how uh
0:02:05	neural network language models
0:02:07	are trained and used
0:02:09	first we represent each word in the vocabulary as
0:02:12	a one-of-n vector so with all zeros except for
0:02:16	one index that is one
0:02:19	then
0:02:20	we project
0:02:21	this uh
0:02:22	this vector
0:02:24	into continuous space
0:02:26	so we add a second layer that is fully connected
0:02:29	that is called the context layer or projection layer
0:02:32	and if we work on the four gram level for example
0:02:36	we feed to the neural network
0:02:38	the history that is the three previous words
0:02:41	so we project the three previous words into continuous space
0:02:45	and
0:02:46	we join
0:02:47	these vectors in continuous space to obtain
0:02:50	the projection the context vector
0:02:52	for the history
0:02:56	as we work with the uh
0:02:57	neural network model
0:02:58	so as we have the vector
0:03:01	for the history the layer for the history the projection layer
0:03:04	then we add the
0:03:06	hidden layer
0:03:08	with a nonlinearity so with a hyperbolic tangent activation function
0:03:14	that serves to create
0:03:16	a feature vector for the word to be predicted
0:03:18	in continuous space
0:03:20	the prediction space
0:03:21	then
0:03:22	we also need the output layer to estimate probabilities
0:03:26	for all words given the history
0:03:29	so in this layer we use the softmax function
0:03:34	and all the parameters of the neural network must be learned during training
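
[Editor's illustration] The forward pass described above can be sketched in a few lines of Python. This is a minimal sketch, not the speakers' implementation: the toy sizes, the random initialisation, and all names (`nnlm_forward`, `R`, `W_h`, `W_o`, `b_h`, `b_o`) are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def nnlm_forward(history, R, W_h, b_h, W_o, b_o):
    """Feed-forward n-gram NNLM: projection, tanh hidden layer, softmax output."""
    # Projection layer: a one-of-n input times R is just a row selection in R.
    context = []
    for word_index in history:
        context.extend(R[word_index])
    # Hidden layer with hyperbolic tangent nonlinearity.
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_h, context), b_h)]
    # Output layer: one score per vocabulary word, normalised by softmax.
    scores = [a + b for a, b in zip(matvec(W_o, hidden), b_o)]
    return softmax(scores)


# Toy sizes (assumed): 10-word vocabulary, 4-dim projection, 5 hidden units, 4-gram.
rng = random.Random(0)
V, m, H, n_hist = 10, 4, 5, 3


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


R = rand_matrix(V, m)               # projection matrix, one row per word
W_h = rand_matrix(H, n_hist * m)    # context -> hidden weights
b_h = [0.0] * H
W_o = rand_matrix(V, H)             # hidden -> output weights
b_o = [0.0] * V

probs = nnlm_forward([1, 7, 3], R, W_h, b_h, W_o, b_o)
```

Here `probs` holds one probability per vocabulary word for the three-word history `[1, 7, 3]`; in a real system all the matrices would be learned by back-propagation rather than drawn at random.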
0:03:39	so the key points of
0:03:42	these neural network language models are the use of projection into continuous space
0:03:46	that reduces the sparsity issues
0:03:49	so these
0:03:50	and the
0:03:51	projection and prediction are learned simultaneously
0:03:55	in practice
0:03:56	significant and systematic improvements
0:03:58	both
0:03:59	in speech recognition and machine translation tasks
0:04:02	what's more
0:04:03	neural network language models
0:04:05	are used to complement
0:04:06	conventional n-gram language models
0:04:08	so they are interpolated with them
0:04:12	so the point
0:04:13	everybody should use it
0:04:15	but
0:04:16	there is a small problem
0:04:17	even with small training sets
0:04:20	the training time
0:04:22	is very large
0:04:23	and when we deal with large training sets
0:04:26	it's even larger
0:04:28	why
0:04:29	is it so long
0:04:31	well let's look at just one inference
0:04:34	what do we have to do
0:04:36	first
0:04:37	we have to
0:04:39	project uh
0:04:40	project histories
0:04:41	and to do this
0:04:43	this is just a matrix row selection
0:04:45	operation
0:04:47	then
0:04:48	imagine that we have
0:04:49	two hundred nodes
0:04:50	in uh the uh projection vector
0:04:53	and we have the history of three so all together we have six hundred
0:04:56	and we have two hundred nodes in the hidden layer
0:04:59	so what we have to do we have to perform matrix
0:05:01	multiplication
0:05:04	with these values
0:05:06	and then what we have to do we have to perform another matrix multiplication
0:05:10	that depends on the size of the hidden layer
0:05:12	that is
0:05:13	two hundred
0:05:14	and
0:05:15	on the size
0:05:16	of the output layer
0:05:19	that is the size of the output vocabulary
0:05:22	so if we look at the complexity issues
0:05:26	we can see that the input vocabulary can be as large as we want
0:05:31	as it doesn't add
0:05:33	uh
0:05:34	to the overall complexity
0:05:36	then an increase in the n-gram order
0:05:38	does not drastically increase the complexity as it can be
0:05:41	at most linear
0:05:43	the increase
0:05:44	the problem is in the output layer in the output vocabulary size
0:05:48	if the vocabulary is large then the training and inference time
0:05:52	will grow a lot
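
[Editor's illustration] The cost argument above can be tallied with a small helper. This is a rough sketch that counts only the multiplications in the two matrix products and ignores biases and the softmax itself; the function name and the output vocabulary of twenty thousand words are assumptions, while the sizes of two hundred come from the talk.

```python
def nnlm_multiplications(order, proj_dim, hidden, out_vocab):
    """Count multiplications for one inference of a feed-forward n-gram NNLM."""
    # Projection is a row selection, so it costs no multiplications.
    context_dim = (order - 1) * proj_dim
    hidden_cost = context_dim * hidden      # context vector -> hidden layer
    output_cost = hidden * out_vocab        # hidden layer -> output layer
    return hidden_cost, output_cost


# Sizes from the talk: 200-dim projections, 200 hidden nodes, 4-gram model.
# The 20000-word output vocabulary is an assumed example value.
four_gram = nnlm_multiplications(order=4, proj_dim=200, hidden=200, out_vocab=20000)
six_gram = nnlm_multiplications(order=6, proj_dim=200, hidden=200, out_vocab=20000)
```

With these numbers the output layer dominates both configurations, and moving from four-grams to six-grams only grows the smaller, hidden-layer term linearly.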
0:05:57	so there are a number of
0:05:58	usual tricks
0:06:00	that are used to
0:06:01	speed up training and inference
0:06:04	one part of these tricks deals with the resampling of training data
0:06:09	and using different portions of training data in each
0:06:12	epoch of neural network language model training
0:06:16	and using batch training mode
0:06:18	that is propagating
0:06:20	n-grams sharing the same histories
0:06:22	so we spend less time this way
0:06:25	and another type of uh tricks is
0:06:27	reducing the output vocabulary
0:06:29	that is
0:06:30	the thing that is called the short list
0:06:33	neural network language models
0:06:34	so we use the neural network
0:06:37	to predict only the K most frequent words
0:06:40	normally it's
0:06:41	up to
0:06:42	uh uh twenty thousand words
0:06:45	and we use the conventional
0:06:47	n-gram language model
0:06:49	to
0:06:50	to predict to give the probabilities for all the rest
0:06:53	right
0:06:54	so in this scheme
0:06:56	we have to keep the conventional
0:06:59	n-gram language model
0:07:00	for back-off and for renormalisation because we have to renormalise the probabilities
0:07:06	so that they sum up to one for each one
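
[Editor's illustration] One common way to combine a short list neural model with a back-off n-gram model is to scale the neural probabilities by the n-gram probability mass of the short list for the current history. The sketch below assumes that formulation; the talk does not spell out the exact renormalisation used, and the function name and toy probabilities are hypothetical.

```python
def shortlist_probability(word, nn_probs, ngram_probs, shortlist):
    """Combine a short list NNLM with an n-gram LM for one fixed history.

    nn_probs:    NNLM distribution over short list words only (sums to one).
    ngram_probs: n-gram distribution over the full vocabulary (sums to one).
    """
    # Probability mass the n-gram model gives to the short list for this history.
    shortlist_mass = sum(ngram_probs[w] for w in shortlist)
    if word in shortlist:
        # Redistribute that mass according to the neural network.
        return nn_probs[word] * shortlist_mass
    # All remaining words keep their n-gram probability.
    return ngram_probs[word]


# Toy distributions for a single history (made-up numbers).
ngram_probs = {"the": 0.5, "cat": 0.3, "zebra": 0.15, "quokka": 0.05}
shortlist = {"the", "cat"}
nn_probs = {"the": 0.6, "cat": 0.4}

combined = {
    w: shortlist_probability(w, nn_probs, ngram_probs, shortlist)
    for w in ngram_probs
}
```

Because the neural distribution only redistributes the short list mass, `combined` still sums to one over the full vocabulary, which is the renormalisation requirement mentioned above.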
0:07:11	now i would like to talk about hierarchical models
0:07:14	that actually
0:07:16	were introduced to
0:07:17	tackle this problem of
0:07:19	dealing with large output vocabularies
0:07:23	one of the first ideas was dealing with speeding up maximum entropy
0:07:27	models
0:07:29	so that
0:07:30	just
0:07:30	is
0:07:31	the same problem
0:07:32	uh a previous talk in this session was about this
0:07:36	so
0:07:36	what was proposed about ten years ago was instead of computing directly
0:07:40	the conditional probability
0:07:42	to make use of clustering of words
0:07:45	so that
0:07:47	we introduce classes into computation
0:07:50	and then if we have for example
0:07:52	a vocabulary of ten thousand
0:07:55	words
0:07:56	and we have
0:07:56	we cluster them into one hundred classes
0:07:58	such that
0:08:00	each of these classes has exactly one hundred words inside
0:08:03	so instead of doing normalisation over ten thousand words
0:08:07	we have to do two normalisations
0:08:09	each over
0:08:10	one hundred outputs
0:08:12	so
0:08:12	can be
0:08:13	you have to do only
0:08:15	normalisation over two hundred outputs
0:08:18	so we can reduce
0:08:20	the computation by a factor of fifty
0:08:23	that was the idea
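
[Editor's illustration] The arithmetic behind the factor of fifty can be checked directly. The sketch assumes equally sized classes, as in the example from the talk, and the usual two-step factorisation P(w | h) = P(class | h) * P(w | class, h); the function name is hypothetical.

```python
def normalisation_terms(vocab_size, num_classes):
    """Compare the number of outputs to normalise over, flat versus class-based."""
    # Assumes the vocabulary splits evenly into classes.
    words_per_class = vocab_size // num_classes
    flat = vocab_size                          # one softmax over all words
    factored = num_classes + words_per_class   # one over classes, one inside a class
    return flat, factored


flat, factored = normalisation_terms(vocab_size=10000, num_classes=100)
speedup = flat / factored
```

For ten thousand words in one hundred classes this gives 200 outputs instead of 10000, which is the fifty-fold reduction quoted above.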
0:08:24	then this idea was uh
0:08:27	uh this idea inspired
0:08:29	i think the uh the work on hierarchical probabilistic neural network language models
0:08:34	so
0:08:37	the idea is to cluster the output vocabulary
0:08:41	at the output layer of the neural network
0:08:43	and predict
0:08:44	words
0:08:45	as paths in the clustering tree
0:08:49	the clustering in this work
0:08:51	was constrained by the wordnet semantic hierarchy
0:08:54	and the
0:08:56	so when
0:08:57	in this uh framework we predict
0:08:59	not the word
0:09:00	exactly in the output layer
0:09:01	but the next bit in the hierarchy in the tree in the clustering tree
0:09:07	so we uh at each node
0:09:10	we uh uh predict
0:09:12	the bit that is zero or one left or right
0:09:14	in the binary tree
0:09:16	given
0:09:17	the code
0:09:19	for this node
0:09:20	and the history
0:09:21	so there is one parameter added
0:09:24	in the calculation that is the binary code
0:09:27	of the node
0:09:28	the way we have to
0:09:31	get to it
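
[Editor's illustration] Predicting a word as a path of bits in a binary tree can be sketched as a product of binary decisions. In the hierarchical model each decision would come from the network given the history and the node code; here fixed scores stand in for the network output. The names, codes, and scores are all assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(code, node_scores):
    """Probability of a word given its binary code (path from the root).

    node_scores maps the code prefix identifying a node to the score (logit)
    of taking branch 1 at that node.
    """
    prob = 1.0
    for depth, bit in enumerate(code):
        p_one = sigmoid(node_scores[code[:depth]])
        prob *= p_one if bit == 1 else 1.0 - p_one
    return prob


# Toy tree over four words; each word is a leaf reached by two binary decisions.
codes = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 1)}
node_scores = {(): 0.3, (0,): -1.2, (1,): 0.8}

total = sum(word_probability(code, node_scores) for code in codes.values())
```

Each word needs only as many binary decisions as its code is long, roughly log2 of the vocabulary size in a balanced tree, and the leaf probabilities still sum to one.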
0:09:32	the experimental results were shown on the uh quite small brown corpus with a ten thousand word
0:09:39	vocabulary
0:09:40	a significant speed-up was shown
0:09:42	like
0:09:43	two orders of magnitude
0:09:45	but with a loss in perplexity
0:09:48	at the same time
0:09:50	probably the loss in perplexity was due to using the wordnet
0:09:53	semantic hierarchy
0:09:55	so in the work
0:09:57	uh called scalable hierarchical distributed language model
0:10:01	automatic clustering was used instead of
0:10:04	wordnet
0:10:05	the model itself was
0:10:07	implemented as a log-bilinear model
0:10:10	uh without nonlinearity
0:10:12	and a one-to-many word-to-class mapping
0:10:14	was important so a word
0:10:16	could belong to more than one class
0:10:19	the results were
0:10:20	reported on a large dataset with an uh eighteen thousand
0:10:24	word vocabulary
0:10:26	perplexity improvements over an n-gram model were shown
0:10:30	a speed-up of course
0:10:31	and similar performance to the non-hierarchical log-bilinear model
0:10:38	now i'm going to talk about the major part of uh uh of this work that is structured output
0:10:43	layer neural network language models
0:10:45	so what are the main ideas
0:10:47	of the structured output layer neural network language model
0:10:52	first
0:10:52	if we
0:10:54	compare it with the hierarchical models i have just been talking about
0:10:58	the trees we use to cluster
0:10:59	the output vocabulary are not binary anymore
0:11:03	so we actually
0:11:04	because of this we use
0:11:06	multiple
0:11:07	multiple output layers
0:11:09	in the neural network with a softmax in each so i will talk
0:11:14	in detail a bit later about this
0:11:16	then we do not
0:11:18	perform clustering for frequent words
0:11:21	so we
0:11:22	still use some ideas from the short list
0:11:25	neural networks
0:11:26	so we keep the short list
0:11:28	without clustering and we cluster only infrequent words
0:11:32	and then we use uh what we think is an efficient clustering scheme
0:11:36	so we use word vectors in projection space
0:11:39	for clustering
0:11:41	the task
0:11:42	is to improve a state-of-the-art LIMSI speech-to-text system
0:11:46	that already makes use of short list neural network language models
0:11:49	that is characterised by a large vocabulary and a baseline n-gram language model trained on billions of words
0:11:57	so word clustering
0:11:59	how do we do it
0:11:59	first we associate each frequent word with a single class
0:12:04	and then we cluster
0:12:05	the infrequent words
0:12:09	in this way
0:12:12	as we use
0:12:14	uh in our research
0:12:16	clustering trees that are not binary
0:12:19	the
0:12:20	as opposed to binary clustering trees our clustering trees are
0:12:23	more shallow
0:12:24	so normally in our experiments the depth of the trees is three or four
0:12:31	here uh you can see the um
0:12:33	the formula for computation of the probability so actually in each leaf of this clustering tree we end up
0:12:40	with the
0:12:41	uh
0:12:43	we end up with the uh
0:12:44	with the word
0:12:45	as a single class
0:12:46	so at each level we have a softmax function
0:12:49	at the upper level we have the
0:12:51	short list words
0:12:52	that are
0:12:53	not
0:12:55	clustered
0:12:56	so each word is its own node its own class
0:12:59	and then the node
0:13:00	for
0:13:01	infrequent out-of-short-list words
0:13:03	and then we do clustering
0:13:05	and we end up at the lower level
0:13:07	again
0:13:08	with
0:13:09	one word per class
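
[Editor's illustration] The slide with the formula is not part of the transcript. A factorisation consistent with the description above, written in notation assumed here (h is the history, D the depth of the tree, and c_d(w) the class or sub-class of word w at level d, with the leaf class containing only w), would be:

```latex
P(w \mid h) = P\bigl(c_1(w) \mid h\bigr) \prod_{d=2}^{D} P\bigl(c_d(w) \mid h,\, c_1(w), \ldots, c_{d-1}(w)\bigr)
```

For a short list word the path has length one, so the product is empty and the probability comes from the top-level softmax alone.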
0:13:13	so if we
0:13:14	represent our model
0:13:16	in this more convenient way
0:13:18	we can say that normally
0:13:20	neural networks have one output layer
0:13:24	in our scheme we have
0:13:25	one output layer that is
0:13:28	the first layer
0:13:29	that deals with frequent words
0:13:32	and then
0:13:33	it has other layers
0:13:35	that deal
0:13:36	with subclasses
0:13:38	in the clustering tree
0:13:40	and each output layer
0:13:42	has a softmax
0:13:43	function
0:13:45	so if we have
0:13:46	more
0:13:47	classes in the clustering we have
0:13:49	more
0:13:50	output layers
0:13:52	uh in our neural network
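
[Editor's illustration] The multiple softmax output layers can be sketched as follows. This is a toy sketch under stated assumptions: the scores are fixed numbers standing in for what each output layer would compute from the shared hidden layer, and the tree, word list, and names (`soul_probability`, `layer_scores`, `paths`) are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soul_probability(path, layer_scores):
    """Multiply the softmax probabilities of the choices along a word's path.

    path:         tuple of indices, one per tree level, leading to the word.
    layer_scores: maps a path prefix to the scores of the output layer
                  attached to that tree node.
    """
    prob = 1.0
    for depth, choice in enumerate(path):
        layer = softmax(layer_scores[path[:depth]])
        prob *= layer[choice]
    return prob


# Top layer: two short list words (their own classes) plus two classes of
# infrequent words; each class has its own softmax layer over its words.
layer_scores = {
    (): [2.0, 1.5, 0.2, -0.1],
    (2,): [0.4, 0.1, -0.3],
    (3,): [1.0, -1.0],
}
paths = {
    "the": (0,), "of": (1,),
    "zebra": (2, 0), "quokka": (2, 1), "okapi": (2, 2),
    "axolotl": (3, 0), "tapir": (3, 1),
}

total = sum(soul_probability(path, layer_scores) for path in paths.values())
```

Frequent words are scored by the first layer only, while an infrequent word needs one extra small softmax per tree level, and the probabilities over the whole vocabulary still sum to one.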
0:13:56	the training
0:13:58	the way uh we train our structured output layer neural network language model
0:14:03	so first we train a standard
0:14:05	neural network language model with a short
0:14:07	list as an output
0:14:08	so it's a short list neural network language model
0:14:11	but we train it only
0:14:13	with three epochs
0:14:14	so normally we use
0:14:16	fifteen to twenty epochs to train it fully
0:14:19	so now it's only
0:14:20	trained
0:14:21	with three epochs
0:14:23	then what we do
0:14:24	we reduce the dimension of the context space
0:14:27	using uh principal component analysis
0:14:30	and in our experiments the final dimension is ten
0:14:34	and then we perform a recursive k-means word clustering based on this distributed
0:14:40	representation induced
0:14:42	by the continuous space
0:14:44	except for the words in the short list because we do not have to cluster them
0:14:48	right
0:14:49	and finally we train the whole model
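
[Editor's illustration] The recursive k-means step can be sketched as below. This is not the authors' code: it assumes the word vectors have already been reduced (the talk uses PCA down to ten dimensions; the toy vectors here are random two-dimensional points), and the branching factor, leaf size, depth limit, and all function names are hypothetical choices.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iterations=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def recursive_kmeans(words, vectors, k, max_leaf, rng, depth=1, max_depth=3):
    """Build a shallow clustering tree: a leaf is a list of words,
    an internal node is a list of subtrees."""
    if len(words) <= max_leaf or depth >= max_depth or len(words) < k:
        return list(words)
    assignment = kmeans([vectors[w] for w in words], k, rng)
    groups = [[w for w, a in zip(words, assignment) if a == c] for c in range(k)]
    groups = [g for g in groups if g]
    if len(groups) < 2:  # clustering failed to split this node
        return list(words)
    return [
        recursive_kmeans(g, vectors, k, max_leaf, rng, depth + 1, max_depth)
        for g in groups
    ]


def leaves(tree):
    """Collect the leaf word lists of a clustering tree."""
    if isinstance(tree[0], str):
        return [tree]
    return [leaf for subtree in tree for leaf in leaves(subtree)]


# Toy data: 12 infrequent words with random 2-dim vectors (stand-ins for the
# PCA-reduced projection vectors); short list words would be left out.
rng = random.Random(0)
words = [f"w{i}" for i in range(12)]
vectors = {w: [rng.gauss(0, 1), rng.gauss(0, 1)] for w in words}

tree = recursive_kmeans(words, vectors, k=2, max_leaf=3, rng=rng)
clustered = sorted(w for leaf in leaves(tree) for w in leaf)
```

Every out-of-short-list word ends up in exactly one leaf of the tree, and the depth limit keeps the tree shallow, in line with the depth of three or four mentioned in the talk.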
0:14:55	the results we report in this paper
0:14:57	are on the uh mandarin GALE task
0:15:00	so we use the LIMSI mandarin speech system
0:15:03	that is
0:15:04	uh characterised by a uh fifty six thousand word vocabulary
0:15:08	this is a word vocabulary so
0:15:10	what we do first we do the segmentation of chinese data
0:15:14	into words
0:15:15	using the uh
0:15:17	maximum length approach and then
0:15:20	we train our word based language models on this
0:15:23	and the baseline n-gram language model is trained on
0:15:26	three point two billion words
0:15:28	it is trained as many
0:15:30	subcomponent LMs statically interpolated together
0:15:33	with the interpolation weights
0:15:34	Q in turn have to it
0:15:37	then we train for neural network
0:15:39	and i
0:15:42	using
0:15:43	at each iteration about twenty five million words
0:15:46	after resampling
0:15:48	because at each iteration we sample different data
0:15:52	in the table you can see results
0:15:54	on the mandarin GALE task
0:15:58	first with the baseline four-gram LM
0:16:00	and then
0:16:01	when this baseline
0:16:03	four-gram LM is interpolated
0:16:06	with neural network language models of different types
0:16:09	so
0:16:10	we have the
0:16:11	eight thousand word
0:16:13	uh four-gram neural network language model and twelve thousand
0:16:16	short list
0:16:17	words uh short list neural network language model
0:16:19	and the structured output layer neural network language model
0:16:23	so what we can see is that
0:16:25	the SOUL NNLM
0:16:27	consistently outperforms the
0:16:30	short list based neural network language models
0:16:32	not to mention the baseline
0:16:36	the baseline four-gram language model
0:16:38	and
0:16:39	what we can also see is that the improvement
0:16:42	for four-grams
0:16:44	is about zero point two zero point one
0:16:47	and when we switch to the six-gram scenario with uh uh neural network language models
0:16:52	the gain we get from SOUL neural network language models
0:16:55	is a bit better a bit larger so we gain uh between zero point three
0:17:00	zero point two
0:17:01	over our best
0:17:03	short list neural network language model
0:17:07	why do we use uh the short lists of uh
0:17:10	eight thousand and twelve thousand so these are normally the short lists we use in our uh experiments and
0:17:15	our systems
0:17:16	and also
0:17:17	when we train our SOUL neural network language model
0:17:21	uh
0:17:23	we use
0:17:24	the short list of eight thousand words
0:17:26	to train
0:17:27	the part
0:17:28	that won't be clustered
0:17:30	and we use
0:17:31	four thousand classes at the
0:17:33	upper level
0:17:34	so this model
0:17:36	is
0:17:37	in complexity pretty much the same
0:17:39	as the short list
0:17:41	uh model with twelve
0:17:42	thousand
0:17:43	words in the short list
0:17:46	what are the conclusions
0:17:49	the
0:17:51	SOUL neural network language model
0:17:53	if two
0:17:54	is a combination actually of a neural network
0:17:57	and a class based
0:17:58	language model
0:18:01	then
0:18:02	it can deal with vocabularies of
0:18:05	arbitrary sizes
0:18:06	so in this research the vocabulary was
0:18:09	fifty uh fifty six thousand words
0:18:12	but uh we have recently run uh experiments with a vocabulary of
0:18:16	three hundred thousand
0:18:18	then
0:18:19	speech recognition improvements are achieved on a large scale task
0:18:23	and over very challenging baselines
0:18:26	and then what we have also noted is that
0:18:29	structured output layer neural networks
0:18:31	improve
0:18:32	better
0:18:32	for longer contexts
0:18:37	and that's
0:18:45	questions
0:18:54	input there
0:18:56	i
0:18:58	yeah
0:19:01	yeah but then okay
0:19:03	okay
0:19:13	so
0:19:14	here you mean
0:19:16	but this is the operation we do at this point
0:19:19	is just a matrix row selection
0:19:21	we do not have uh do not have to do any multiplication
0:19:25	so at this point
0:19:27	and it has nothing to do with uh with the increase of complexity
0:19:31	so if you look here
0:19:37	so what we do what we have to do is just matrix row selection
0:19:55	yeah sure
0:19:56	a
0:19:57	yeah
0:19:59	i
0:20:00	no but this part is trained very fast
0:20:03	so it's like
0:20:25	or can we discuss it later because a a a a can of full uh and this then
0:20:30	in in the questions
0:20:31	time for one more
0:20:35	hmmm
0:20:38	yeah thanks
0:20:40	um i just have two quick questions so on your results i was just wondering whether you
0:20:44	have results over a range of the number of output nodes ranging from something small
0:20:49	a few thousand to twenty thousand
0:20:51	in your experiments did you try to go up to twenty thousand to see what happens
0:20:56	and "'cause" you you close and class based upon their on configuration bases spanning across internal vocabulary
0:21:02	but it to but the um the ones the eight king and twelve is not
0:21:06	so you you can it to be more and what does that happen and do you do you uh the
0:21:11	the maximum which we tried was twelve thousand because
0:21:14	already with twelve thousand in the output vocabulary or more
0:21:19	like thirty
0:21:20	then the training time is too large
0:21:23	okay so to long you don't have a a a a a a experiment one no way
0:21:26	and also so just one basic extreme case so if you do not
0:21:30	split the clustering tree and maybe just cluster all the other words um out of the short list
0:21:35	into one class
0:21:36	how does that model fare against your multiple class configuration
0:21:42	we prefer to use this configuration because
0:21:45	actually that's another story that's another paper
0:21:48	but uh i
0:21:49	we we prefer to keep the short list
0:21:52	part
0:21:52	stable
0:21:53	because in other experiments what we do we use much more data to learn to train
0:21:58	the out-of-short-list part
0:22:01	but this is
0:22:01	another story
0:22:02	so
0:22:05	uh_huh
0:22:08	yeah
0:22:11	i don't know i have never tried
0:22:13	okay case
0:22:15	let's thank the speaker okay

STRUCTURED OUTPUT LAYER NEURAL NETWORK LANGUAGE MODEL

Language Modeling

Speaker: Ilya Oparin, Authors: Hai Son Le, LIMSI CNRS / Uni. Paris-Sud, France; Ilya Oparin, LIMSI CNRS, France; Alexandre Allauzen, LIMSI CNRS / Uni. Paris-Sud, France; Jean-Luc Gauvain, LIMSI CNRS, France; Francois Yvon, LIMSI CNRS / Uni. Paris-Sud, France