0:00:13 Okay, so we next have a talk by Navdeep Jaitly, who is from Geoff Hinton's group, and we're very excited to have Navdeep be part of the session, in part because there's a set of methods that have been widely used over the last five to ten years for doing things with images and machine vision, such as deep belief networks, that haven't really been used in sound very much. Geoff's group has very recently, over the past couple of years, been starting to apply these methods to audio, and so Navdeep is going to be talking a little bit about this. These methods are really extensions of things that were initially developed back in the eighties, and have been revived with a set of new training methods in the last ten years. So, Navdeep.
0:01:04 I'd like to thank Josh and Malcolm for giving me the chance to present this work today. So, as Josh mentioned, there's been a lot of development recently in generative models for high-dimensional data, and here are some examples of samples generated from such models. Here are some examples of digits generated from the first deep belief network, which was published in 2006. I'd like to point out that these are actually samples from the model, rather than reconstructions of particular data cases, so the model is quite powerful: it has very high peaks at real data points.
0:01:50 Here are some samples from a recently published model, which was a gated MRF. This model was trained on natural image patches, and you can see it's really good at modeling both short- and long-range correlations in images.
0:02:06 And here's an example of a motion sequence: models have also been developed for motion sequences, and in this case the training data was joint angles from motion capture.
0:02:24 Okay, so it's been seen that features from these generative setups are also very good at discriminative tasks, and that makes sense on an intuitive level: whatever is good at generating specific types of patterns is probably good at recognizing those patterns. Josh actually showed an example of features that were good at generating sound textures, and those could be used for recognizing those textures as well.
0:02:52 These models have been used widely for vision tasks, but they haven't made it quite as much into sound yet. So for this work we wanted to see if we could use these models on raw speech signals, and see whether the features they learn in the generative step are useful in a discrimination task.
0:03:13 So our goal, specifically, is this: given raw speech signals, we want to build a generative model for subsequences of these signals that are 6.25 milliseconds long. We're using TIMIT, which is sampled at 16 kHz, so we have data vectors that are a hundred samples long. What we're actually modeling is a hundred-dimensional vector whose entries are the intensities of the raw sound samples.
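(A minimal sketch of this data preparation, assuming an utterance is already loaded as a 16 kHz floating-point waveform; the function and names are illustrative, not from the talk:)

```python
import numpy as np

WINDOW = 100  # 100 samples = 6.25 ms at TIMIT's 16 kHz sampling rate

def random_segments(waveform, n_segments, rng=np.random.default_rng(0)):
    """Draw random 100-sample subsequences: the 100-d visible vectors."""
    starts = rng.integers(0, len(waveform) - WINDOW, size=n_segments)
    return np.stack([waveform[s:s + WINDOW] for s in starts])

# waveform = ...                                # one TIMIT utterance
# data = random_segments(waveform, 1000)        # shape (1000, 100)
```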
0:03:45 Okay, so here's a quick outline of the talk. First I'll talk about RBMs, the restricted Boltzmann machines, which are the generative model we used for this paper. Then I'll show some results from the generative model, and I'll talk about the application of the features to phone recognition on TIMIT.
0:04:07 So for the first section, I'd like to talk about why we wanted to use the raw signals themselves. The first reason was that we didn't want to make any assumptions about the signal, such as stationarity of the signal within a single frame.
0:04:24 Secondly, we were motivated by speech synthesis: in that domain, being able to model raw speech signals would eventually allow us to generate realistic signals without having to satisfy phase constraints.
0:04:39 We also want to be able to discover patterns and their relative onset times with our model, and we think that should be helpful in discriminating between certain phones.
0:04:55 The last reason is because we now can. That sounds a little facetious, but it's probably the most important motivation for using raw signals. Traditional encodings such as MFCCs have been around for quite some time now, and within that same span of time computational resources have multiplied. At the same time, a lot of data is now available to train really powerful models, and machine learning has made a lot of progress in being able to pick out features and build really good models from data alone. And so that's why we wanted to try to do this straight on the raw signal.
0:05:41 Okay, here's a quick outline of restricted Boltzmann machines. A restricted Boltzmann machine, or RBM, is an undirected graphical model, and it has two layers of nodes. The bottom one, the visible layer, corresponds to the dimensions of the data that's observed, and the top layer holds the hidden units, or hidden variables; these are basically latent variables that try to explain the data. There's a set of interaction weights connecting these two layers, and the architecture is such that there's bipartite connectivity, which implies that given the visible nodes, all the hidden nodes are independent of each other, and the opposite is true as well: when the hidden variables are known, the visible ones are independent.
0:06:30 And since it's an undirected graphical model, there is an energy function associated with any given configuration of the visible and hidden states, and the energy of a given state governs its probability through the Boltzmann distribution. What I'm trying to show in this set of equations here is the exact equation for a Gaussian-binary RBM, which is the RBM we use for the scenario where we have real-valued signals and binary hidden units.
0:07:13 Let me see if I can get out of the slideshow... okay... sorry... ah, never mind. The equation's not actually that important anyway. The important point to note about the equation is that there's a term in there which looks at the interaction between the configuration of the hidden variables and the visible variables overall.
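(The equation itself didn't make it onto the screen; for reference, the Gaussian-binary RBM energy has the standard form

$$E(\mathbf{v},\mathbf{h}) \;=\; \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_j c_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, w_{ij} h_j, \qquad p(\mathbf{v},\mathbf{h}) \propto e^{-E(\mathbf{v},\mathbf{h})},$$

where $b$, $c$, and $\sigma$ are visible biases, hidden biases, and visible standard deviations, and the last term is the visible-hidden interaction being referred to.)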
0:07:47 Something really interesting about this model is that the priors are quite complicated, because they involve a sum over an exponential number of configurations, but the posteriors, on the other hand, are quite simple. Given visible data, the hidden variables are all independent of each other, and each turns on with a probability equal to the sigmoid of the input to that hidden node, where the input is essentially the dot product of the visible data with the set of weights connecting that hidden node to the data. In that sense this is a very powerful model, and it's different from other generative models, where the priors are independent but the posteriors are very hard to compute. So a property of this model is that it has very rich priors but very easy posteriors.
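(A minimal sketch of that posterior computation, with shapes matching the 100-visible, 120-hidden setup described later; the names are mine:)

```python
import numpy as np

def hidden_posteriors(v, W, c):
    """p(h_j = 1 | v): the sigmoid of each hidden unit's total input.

    v : (100,) visible vector; W : (100, 120) weights; c : (120,) hidden biases.
    Given v, the hidden units are conditionally independent of each other.
    """
    return 1.0 / (1.0 + np.exp(-(v @ W + c)))
```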
0:08:42 Okay, so maximum-likelihood estimation of these models is really complicated, because the gradient of the log probability is really hard to compute exactly. Fortunately, Geoff discovered about a decade ago that an algorithm called contrastive divergence could be used to train these models pretty well, and that's the algorithm we're using to learn the parameters.
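(A minimal sketch of one step of contrastive divergence, CD-1, assuming unit-variance Gaussian visible units and binary hidden units; the learning rate and names are illustrative:)

```python
import numpy as np

def cd1_update(v0, W, b, c, lr=1e-3, rng=np.random.default_rng(0)):
    """One CD-1 step for a Gaussian-binary RBM with unit-variance visibles.

    v0 : (batch, n_vis) data; W : (n_vis, n_hid); b, c : visible/hidden biases.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Positive phase: hidden posteriors given the data, plus a binary sample.
    h0 = sigmoid(v0 @ W + c)
    h0_sample = (rng.random(h0.shape) < h0).astype(v0.dtype)
    # Negative phase: one Gibbs step down to visible space and back up.
    v1 = h0_sample @ W.T + b              # mean of the Gaussian visibles
    h1 = sigmoid(v1 @ W + c)
    # Approximate gradient: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ h0 - v1.T @ h1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (h0 - h1).mean(axis=0)
```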
0:09:11 One last point about the model: binary hidden units are not very good for raw speech signals, and the reason is that speech signals can present the same pattern over many different orders of magnitude, but binary units can only turn on at one output intensity level. So for this paper we used an alternative type of unit, called the stepped sigmoid unit, which has the property that it can recreate an input pattern at almost any intensity. I won't talk much about those units, but there's more information about them in the paper that's referenced here.
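(If the referenced unit follows Nair and Hinton's stepped sigmoid construction, it is well approximated by a rectified linear unit with input-dependent noise; a sketch of sampling one, under that assumption:)

```python
import numpy as np

def sample_stepped_sigmoid(x, rng=np.random.default_rng(0)):
    """Approximate sample from a stepped sigmoid unit with total input x.

    The unit behaves like max(0, x + eps) with eps ~ N(0, sigmoid(x)), so
    its activation grows with its input: the same pattern can be represented
    at many intensity levels, unlike a single binary unit.
    """
    var = 1.0 / (1.0 + np.exp(-x))        # sigmoid(x) as the noise variance
    return np.maximum(0.0, x + rng.normal(0.0, np.sqrt(var)))
```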
0:09:58 Okay, here's the experimental setup. Like I said, we were looking at 6.25 milliseconds of speech, and that corresponds to a hundred samples. For each sample we have a variable in the visible data, so our RBM has a hundred visible nodes at the bottom, and we couple that with a hundred and twenty of these stepped sigmoid units in the RBM. The signals themselves were selected randomly from the TIMIT database and presented to the model, so that on average the model ended up seeing any given sub-segment about thirty times.
0:10:40 Here are some of the features that were learned by the model. On the left side we see the actual features, and just as a reminder, these are simply the weights connecting the visible data to the hidden units. So for each hidden unit we have a pattern, and that hidden unit turns on maximally when the data presents the particular pattern associated with it. You can see there are a lot of different types of patterns that get learned; let me go through a few of them very quickly. Here is a pattern that's tuned to pick out really low frequencies, maybe like F0, or pitch.
0:11:21 Here are some features that pick up patterns at slightly higher frequencies, and here are others at intermediate frequencies, and then some that are at really high frequencies. There are also some other really interesting ones: these patterns that seem to have composite frequency characteristics, with a low-frequency component and a high-frequency component, and we think these might be picking up fricatives.
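(A small sketch of how such filters can be inspected, assuming a trained weight matrix W of shape (100, 120); the plotting choices are mine:)

```python
import matplotlib.pyplot as plt

def plot_filters(W, n=16):
    """Plot the first n hidden units' incoming weights as 100-sample waveforms."""
    fig, axes = plt.subplots(n // 4, 4, figsize=(10, 6), sharex=True)
    for j, ax in enumerate(axes.ravel()):
        ax.plot(W[:, j])                  # column j: weights into hidden unit j
        ax.set_title(f"hidden unit {j}", fontsize=8)
    fig.suptitle("Learned features: weights from the 100 visibles to each hidden unit")
    fig.tight_layout()
    plt.show()
```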
0:12:02 Okay, so now that we've trained the model, we can reconstruct signals from the posterior activities of the hidden units themselves. We take ten frames of signal and project that signal onto the hidden units; I'm showing the activities of the hidden units here on a log scale, and I'm only showing twenty of the hundred and twenty units that we actually trained. You can then take these posterior activities of the hidden units and project them back to visible space to reconstruct a raw signal. This is similar in flavor to the previous talk, except we're using a parametric model to do it.
0:12:47 If you look at the reconstruction at a much larger scale — this is sixty-two point five milliseconds of raw signal — you can see, in the heat map, the high-dimensional pattern the signal presents.
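(A sketch of that round trip, reusing the Gaussian-binary form from above for simplicity; the stepped sigmoid units' posterior means would replace the sigmoid here:)

```python
import numpy as np

def reconstruct(v, W, b, c):
    """Encode a raw segment to hidden posterior means, then decode it back."""
    p = 1.0 / (1.0 + np.exp(-(v @ W + c)))   # posterior activities of hiddens
    v_hat = p @ W.T + b                      # mean reconstruction in visible space
    return p, v_hat
```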
0:13:10 Here are some samples from the model itself. There are sixteen samples, and five of them are quite similar to each other, but different from the other eleven. So I think we have a pretty good model, for at least small-scale signals.
0:13:34 So let me switch now to the application of these features to phone recognition. The setup that we have is this: we have one hundred and twenty-five milliseconds of raw speech, and we want to use the features that we learned to predict the phoneme labels, which we got from a forced alignment with a baseline model. So we use the features that we learned to encode the signal — I'll talk about how we did that in the next slide — and then we take the encoded features, put them into a neural network, and use backpropagation to learn the mapping to the phoneme labels.
0:14:17how we did the encoding
0:14:18so we uh
0:14:19a use the convolutional set here in the way works is
0:14:23we first that the first frame
0:14:26at the first sample of an utterance
0:14:28and we compute the posterior means
0:14:30we then to move it by one sample and we do this computation again
0:14:35we then do this for the entire um
0:14:37utterance
0:14:38 Now, remember the point about high-dimensional data: this is a little too high-dimensional, because we've essentially blown up the signal by a hundred and twenty times. So what we now do is sub-sample these hidden units, so that we sub-sample each feature over twenty-five milliseconds of signal, and the subsampling helps in smoothing out the signal as well.
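(A minimal sketch of this convolutional encoding plus subsampling; the mean over each window is my assumption for the pooling, and the posterior computation is the same sigmoid form as earlier:)

```python
import numpy as np

def encode_utterance(waveform, W, c, win=100, pool_ms=25, hop_ms=10, sr=16000):
    """Stride-1 hidden posteriors over an utterance, pooled into 10 ms frames."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Posterior means at every one-sample offset: shape (T - win + 1, n_hid).
    windows = np.lib.stride_tricks.sliding_window_view(waveform, win)
    H = sigmoid(windows @ W + c)
    # Sub-sample: pool 25 ms of activities, advancing by 10 ms each frame.
    pool, hop = pool_ms * sr // 1000, hop_ms * sr // 1000
    return np.stack([H[s:s + pool].mean(axis=0)
                     for s in range(0, len(H) - pool + 1, hop)])
```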
0:15:04 I should point out that convolutional setups have been quite useful in vision tasks, and I think our results suggest the same holds for this setup.
0:15:14 Okay, so we have a subsampled frame covering twenty-five milliseconds; we then advance it by ten milliseconds, and we do this for the entire utterance as well. So for any given piece of speech of one hundred and twenty-five milliseconds, we take the eleven frames that span that signal, and we concatenate all of that into one vector, and that's the encoding that's put into the neural net.

0:15:47 Two more issues: the features were first log-transformed, after we created the entire set of features, and we also added deltas and accelerations of the feature vectors to the encoding.
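(A sketch of assembling one network input: eleven log-transformed frames around the point being labeled, concatenated with deltas and accelerations; I use simple first and second differences for those, since the exact delta formula isn't stated:)

```python
import numpy as np

def network_input(frames, center, context=11, eps=1e-6):
    """Concatenate 11 log frames around `center` plus deltas and accelerations.

    frames : (n_frames, 120) pooled features at a 10 ms hop; eleven of them
    span 125 ms of speech. Assumes `center` has full context on both sides.
    """
    logf = np.log(frames + eps)              # log transform (eps is mine)
    delta = np.gradient(logf, axis=0)        # stand-in for the delta features
    accel = np.gradient(delta, axis=0)       # and for the accelerations
    half = context // 2
    w = slice(center - half, center + half + 1)
    return np.concatenate([logf[w].ravel(), delta[w].ravel(), accel[w].ravel()])
```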
0:16:02 Here's a little bit about the baseline. The alignment model was just an HMM trained on MFCCs. There were sixty-one phoneme classes, with three states for each class, and we used a bigram language model. A forced alignment to the test data and to the training data was used to generate the labels.
0:16:24 We used the standard method for decoding the posterior probabilities, similar to what's done in tandem-like approaches: it converts the posterior-probability predictions into generative likelihoods, which can then be decoded with a Viterbi decoder.
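(For reference, the standard conversion in such tandem/hybrid-style decoding, which the talk doesn't spell out, divides the network's posterior for each state by the state's prior to obtain a scaled likelihood for the Viterbi decoder:)

$$p(\mathbf{x}_t \mid s) \;\propto\; \frac{p(s \mid \mathbf{x}_t)}{p(s)}$$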
0:16:41 Here's a summary of the results for different configurations of our setup. For this set of experiments we used a two-hidden-layer neural network. What we found was that if we used more hidden units in the neural network, we got better results. We found that if we used shorter subsampling windows and shorter shift windows, we got better results as well. Also, adding the delta and acceleration parameters helped. And we found that with one hundred and twenty hidden units in the RBM we got the best results.
0:17:22 We combined all of these four lessons and trained one neural network with two layers, with four thousand units in each layer; we used delta and acceleration parameters, we used a subsampling window of ten milliseconds advanced by five milliseconds, and a hundred and twenty hidden units in the RBM. With that we got 22.8 percent on the test data for the phoneme error rate on TIMIT. And then, going to a further neural network with DBN pre-training, we were able to reduce it further, down to 21.8.
0:18:00 So, here are the conclusions. For speech signals, machine learning can discover meaningful features from the data alone, and I think further work looking for high-dimensional encodings is justified. For future work, we aim to build better generative models. And with that, the acknowledgements — and thank you all for coming.
0:18:35 Yeah, a question?
0:18:39 Hi, I enjoyed your talk. I have a question: one of the reasons why people in speech shy away from time-domain features is their high sensitivity to noise, as opposed to the signal just being raw or not. So how are you going to address that?
0:18:53 I think my answer to that is: we just need enough data, and eventually we'll be able to figure that out.
0:19:00 But that's the bad answer, that's the catch, because noise comes in all sorts of forms: you have pink noise, white noise, you have car noise. I mean, if it were just a matter of data, you know, we could have solved the recognition problem already, if it were just that.
0:19:15 So I think it's not just the data, it's the models that go with the data. If you use some of these powerful generative models and build further assumptions about the characteristics of noise into them, then hopefully you will learn to pick out noise and separate it from the real signal.
0:19:35 In the case of the features we learned: if you actually look at the types of features we learned, we learned to ignore certain high-frequency components. So if you look at the reconstruction of a signal here, you'll find that some of the aspects of the fricative are suppressed in the reconstruction. So it's learning to pick out more of the vocal-tract information than it is trying to capture noise. In a sense that's also speech, yes, but the point is that it's able to try to separate out what's noise from what's not.
0:20:13 [Follow-up question, largely inaudible.] ...some fifteen years ago there were, you know, papers using waveform-level modeling to build up the model, which actually had a lot of... so the thing I'd ask is: what if what you learn is specific to one slice of the data — how well does it carry over, you know, from different files, file to file?
0:21:01 Right, so we actually didn't try any noisy data for this setup, but I'm an advocate of multiple layers of representations, and the hope is that when you build deeper models, where the lower layers try to pick up signals and the higher layers look for more abstract patterns, the high-level features will try to suppress the noise and separate out the signals. But for now we don't really have any experiments to back up that claim.
0:21:33 One more, real quick.

0:21:35 Yeah, well — and this one is not a criticism — I feel like twenty years ago I saw work on using neural nets to recognize phonemes, and so I'm curious what has really changed. Because the other thing I think about with that is scaling issues: when I think about digit recognition, or any recognition, if all I have to do is make the sound sufficiently slower, or lower, or higher, and I can usually destroy a neural net's trained performance, then I'm wondering: is there some sort of advancement? Have you gotten around issues with scaling and transformations in the space?
0:22:13 Right. So I think what's different from twenty years ago is that these sorts of generative models have made a lot of progress, and it's been seen that you can use them to seed neural networks and get much better results than you could get from neural nets before. And the amount of data that's now available is much larger than twenty years ago. So that answers the first part, the question of what's different from the last twenty years.

0:22:39 In terms of scaling of the data: I think the kind of units we're using are sort of scale-invariant, at least in terms of intensity — that was the motivation for using them — but the time aspect is not at all covered, and we're actually looking at models that try to be invariant to that aspect of it.
0:23:00 I'd also like to mention that convolutional networks have been useful in vision-related tasks, and I think they have the potential for adjusting for scales, and perhaps we're going to try to attack it with those as well.
0:23:13 One last comment: what's your definition of raw? Would this approach work for, say, A-law data? That's what was used in... so, by your own definition, would it apply to that sort of data?
0:23:34 Ah, okay. I think for this paper the definition was the raw form that you could capture from the input device. We didn't want to make any assumptions; if you take spectral information, there's the assumption that the signal is stationary within a single frame, which I think is not a very correct assumption, and probably harms detection of certain types of phonemes. The second answer to that question is that it's just a matter of convenience: it depends on whatever the input was to our system — that's already our data.
0:24:16but the
0:24:18yeah so
0:24:19so that that's that's a first definition which was as close
0:24:23as you can get
0:24:24to the capture device
0:24:26okay we need one thing