0:00:13 okay, so we next have a talk by Navdeep Jaitly, who is from Geoff Hinton's group, and we're very excited to have Navdeep be part of the session,
0:00:27 in part because there's a set of methods that have been widely used over the last five to ten years for doing stuff with images and machine vision, such as deep belief networks,
0:00:42 but they haven't really been used in sound very much, and so Geoff's group has, very recently over the past couple of years, been starting to apply these methods to do stuff with audio, and Navdeep is going to be talking a little bit about this.
0:00:53 these methods are really extensions of things that were initially developed back in the eighties, and have really been revived with a set of new training methods in the last ten years.
0:01:02 so, Navdeep.
0:01:04 i'd like to thank Josh and Malcolm for giving me the chance to present this work today.
0:01:09 so, like Josh mentioned, there's been a lot of development recently in generative models for high-dimensional data, and here are some examples of samples generated from such models.
0:01:25 here are some examples of digits generated from the first deep belief network, which was published in two thousand six.
0:01:36 i'd like to point out that these are actually samples from the model, rather than reconstructions of particular data cases, so the model is quite powerful: it has very high peaks at real data points.
0:01:50 here are some samples from a recently published model, which was a gated MRF. this model was trained on natural image patches, and you can see it's really good at modeling both short-range and long-range correlations.
0:02:08 and here's an example of a motion sequence. models have also been developed for motion sequences; in this case the training data was joint angles from motion capture.
0:02:24 okay, so it's been seen that features from these generative setups are also very good at discriminative tasks, and that makes sense on an intuitive level: what's good at generating specific types of patterns is probably good at recognizing those patterns.
0:02:41 actually, we just saw an example of features that were good at generating sound textures, and those could be used for recognizing those textures as well.
0:02:52 these models have been used widely for vision tasks, but they haven't made it quite as much into sound yet. so for this work we wanted to see if we could use these models on raw speech signals, and see if the features they learn in the generative step were useful in a discrimination task.
0:03:13 so our goal, specifically, is this: given raw speech signals, we want to build a generative model for subsequences of these signals, six point two five milliseconds long.
0:03:26 we're using TIMIT, which is sampled at sixteen kHz, so we have data vectors which are a hundred samples long. the vector we're actually modeling is a hundred-dimensional vector whose entries are the intensities of the raw sound.
0:03:45 okay, so here's a quick outline of the talk. first i'll talk about RBMs, the restricted Boltzmann machines, which are the generative model we use for this paper. then i'll show some results from the generative model, and i'll talk about the application of the features to phone recognition on TIMIT.
0:04:07 so for this section i'd like to talk about why we wanted to use the raw signals themselves.
0:04:15 the first reason was that we didn't want to make any assumptions about the signal, such as stationarity of the signal within a single frame.
0:04:25 secondly, we were motivated by speech synthesis: in that domain, being able to model raw speech signals would eventually allow us to generate realistic signals without having to satisfy phase constraints.
0:04:40 we also want to be able to discover patterns, and their relative onset times, with our model, and we think that should probably be helpful in discriminating between certain phones.
0:04:55 the last reason is: because we now can. that sounds a little facetious, but it's probably the most important motivation for using raw signals.
0:05:02 traditional encodings such as MFCCs have been around for quite some time now, and in the same amount of time, computational resources have grown enormously.
0:05:17 at the same time, a lot of data is now available to train really powerful models, and machine learning has made a lot of progress in being able to pick out features and build really good models from data alone. and so that's why we wanted to try and do this straight on the raw signal.
0:05:41 okay, here's a quick outline of restricted Boltzmann machines.
0:05:46 a restricted Boltzmann machine, or RBM, is an undirected graphical model, and it has two layers of nodes. the bottom one, which is the visible layer, corresponds to the dimensions of the data that's observed, and the top layer contains the hidden nodes, or hidden variables; these are basically latent variables that try and explain the data.
0:06:10 there's a set of interaction weights connecting these two layers, and the architecture is such that there's bipartite connectivity, which implies that given the visible nodes, all the hidden nodes are independent of each other, and the opposite is true as well, when the hidden variables are known.
0:06:30 and since it's an undirected graphical model, there's an energy function associated with any given configuration of the visible and hidden states, and the energy of a given state governs its probability through the Boltzmann distribution.
0:06:55 what i was trying to show on this slide, in the set of equations here, was the exact equation for a Gaussian-binary RBM, which is the RBM we use for the scenario where we have real-valued signals and binary hidden nodes.
0:07:13 let me see if i can get out of the slideshow... never mind, the equation's not actually that meaningful anyway.
0:07:30 the important point to note about the equation is that there's a term in there which looks at the interaction between the configuration of the hidden variables and the visible ones.
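[For reference, a standard form of the energy function of a Gaussian-visible, binary-hidden RBM, the variant described here; this notation is reconstructed, not taken from the slide:]

```latex
E(\mathbf{v}, \mathbf{h}) \;=\; \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2}
\;-\; \sum_j c_j h_j
\;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, w_{ij}\, h_j
```

[The last sum is the visible-hidden interaction term mentioned above, and a state's probability follows the Boltzmann distribution, $p(\mathbf{v},\mathbf{h}) \propto e^{-E(\mathbf{v},\mathbf{h})}$.]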
0:07:47 something really interesting about this model is that the priors are quite complicated, because they involve a sum over an exponential number of configurations. the posteriors, on the other hand, are quite simple.
0:08:02 given visible data, the hidden variables are all independent of each other, and they turn on with a probability which is equal to the sigmoid of the input to that hidden node. the input is essentially the dot product of the visible data and the set of weights connecting that hidden node to the data.
0:08:23 in that sense this is a very powerful model, and it's different from other generative models, where the priors are independent but the posteriors are very hard to compute. so a plus of this model is that it has very rich priors but very easy posteriors.
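[The posterior computation just described can be sketched in a few lines of NumPy. A minimal sketch: the shapes, weight matrix `W`, and hidden biases `c` are illustrative, not values from the talk.]

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_posterior(v, W, c):
    """Given visible data v, the hidden units are independent; each turns on
    with probability equal to the sigmoid of its input, i.e. the dot product
    of v with that unit's weight vector, plus its bias."""
    return sigmoid(v @ W + c)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(100, 120))  # 100 visible samples, 120 hidden units
c = np.zeros(120)
v = rng.normal(size=100)                     # one 6.25 ms frame at 16 kHz
p = hidden_posterior(v, W, c)                # 120 independent Bernoulli probabilities
```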
0:08:42 okay, so maximum likelihood estimation of these models is really complicated, because the gradient of the log probability is really hard to compute exactly.
0:08:52 fortunately, Geoff discovered about a decade ago that an algorithm called contrastive divergence could be used to train these models pretty well, and that's the algorithm we're using to learn the parameters.
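[As a rough illustration of what contrastive divergence does, here is a CD-1 update for a Gaussian-visible, binary-hidden RBM. This is a sketch under simplifying assumptions: unit visible variance, a plain gradient step, and a single Gibbs step; the names and learning rate are mine, not from the talk.]

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=1e-3, rng=None):
    """One contrastive-divergence (CD-1) parameter update."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: exact hidden posteriors given the data.
    h0 = sigmoid(v0 @ W + c)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # Negative phase: one Gibbs step back to the visibles (Gaussian visibles
    # with unit variance; here we use the conditional mean) and up again.
    v1 = h0_sample @ W.T + b
    h1 = sigmoid(v1 @ W + c)
    # The difference between data-driven and model-driven statistics
    # approximates the log-likelihood gradient.
    W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
    b += lr * (v0 - v1)
    c += lr * (h0 - h1)
    return W, b, c

rng = np.random.default_rng(1)
W = rng.normal(scale=0.01, size=(100, 120))
b = np.zeros(100)
c = np.zeros(120)
v0 = rng.normal(size=100)        # one training frame
W, b, c = cd1_update(v0, W, b, c)
```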
0:09:11 one last point about the model we're using: binary hidden units are not very optimal for raw speech signals, and the reason for that is that speech signals can present the same pattern over many different orders of magnitude, but binary units can only turn on at one output intensity level.
0:09:35 so for this paper we used an alternative type of unit, called the stepped sigmoid unit, which has the property that it can produce an output intensity at almost any scale.
0:09:48 i won't talk much about those units, but there's more information about them in the paper that's referenced.
0:10:01 here's the experimental setup. like i said, we were looking at six point two five milliseconds of speech, and that corresponds to a hundred samples. for each sample we have a variable in the visible data, so our RBM has a hundred visible nodes at the bottom, and we couple that with a hundred and twenty of these stepped sigmoid units in the RBM.
0:10:24 the signal itself was selected randomly from the TIMIT database and presented to the model, so on average the model has seen any given sub-segment about thirty times.
0:10:40 here are some of the features that were learned by the model. on the left side we see the actual features. just as a reminder, these are just the weights connecting the visible data to the hidden units, so for each hidden unit we have a pattern, and that hidden unit turns on optimally when the data presents the particular pattern associated with it.
0:11:03 so you can see there's a lot of different types of patterns that are learned. let me go through a few of them very quickly.
0:11:10 here's a pattern that's able to pick out really low frequencies, maybe like F zero, or pitch.
0:11:21 here are some features that pick up patterns which are slightly higher frequency, and here are others that are at intermediate frequencies, and then some that are at really high frequencies.
0:11:42 there are some other really interesting ones, which are these patterns that seem to have composite frequency characteristics: there's a low-frequency component and a high-frequency component, and we think these might be picking up fricatives.
0:12:03 so now, with the model, we can reconstruct signals from the posterior activities of the hidden units themselves. if we take ten frames of signal and project that signal onto the hidden units... i'm showing the activities of the hidden units here on a log scale, and i'm only showing twenty of the hundred and twenty hidden units that we actually trained.
0:12:27 you can then take these posterior activities of the hidden units and project them back to visible space to reconstruct the raw signal.
0:12:38 this is similar in flavour to the previous talk, except we're using a parametric model to do this.
0:12:47 okay, if you look at the reconstruction at a much larger scale: this is six hundred and twenty-five milliseconds of raw signal, and you can see the patterns present in the signal show up as high-dimensional patterns in the heat map as well.
0:13:10 here are some samples from the model itself. there are sixteen samples, and five of them are quite similar to each other, but they're different from the other eleven.
0:13:28 so i think we have a pretty good model for, at least, small-scale signals.
0:13:34 so let me switch now to the application of these features to phone recognition.
0:13:40 the setup that we have is this: we have one hundred and twenty-five milliseconds of raw speech, and we want to use the features that we learned to predict the phoneme labels that we got from forced alignment.
0:13:56 so we use the features that we learned to encode the signal (i'll talk about how we did that in the next slide), and then we take the encoded features, put them into a neural network, and use backpropagation to learn the mapping to the phoneme label.
0:14:15 here's the setup for how we did the encoding. we used a convolutional setup, and the way it works is: we first start at the first sample of an utterance and compute the posterior means; we then move over by one sample and do this computation again; and we do this for the entire utterance.
0:14:42 but this is a little too high-dimensional, because we've essentially blown up the signal by a hundred and twenty times.
0:14:51 so what we now do is sub-sample these hidden units, so that we sub-sample each feature over twenty-five milliseconds of signal, and the subsampling helps in smoothing out the signal as well.
0:15:04 i have to point out that convolutional setups have proven to be quite useful in vision tasks, and i think our results suggest the same for this setup.
0:15:14 okay, so we have a subsampled frame for twenty-five milliseconds; we then advance it by ten milliseconds, and we do this for the entire utterance as well.
0:15:26 so for any given window of one hundred and twenty-five milliseconds, we take the eleven frames that span that signal, and we concatenate all of that into one vector, and that's the encoding that's put into the neural net.
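[Putting the last two slides together, the convolutional encoding can be sketched like this. An illustrative sketch only: mean pooling is assumed for the subsampling step, since the talk doesn't specify the exact operation, and all names and shapes are mine.]

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_utterance(signal, W, c, frame=100, win=400, hop=160):
    """Convolutional encoding sketch:
    - slide the 100-sample RBM window one sample at a time,
    - compute the hidden posteriors at every offset,
    - then subsample by mean-pooling each feature over a 25 ms window
      (400 samples at 16 kHz), advancing by 10 ms (160 samples)."""
    n_offsets = len(signal) - frame + 1
    acts = np.stack([sigmoid(signal[i:i + frame] @ W + c)
                     for i in range(n_offsets)])
    pooled = [acts[s:s + win].mean(axis=0)
              for s in range(0, n_offsets - win + 1, hop)]
    return np.stack(pooled)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(100, 120))
c = np.zeros(120)
sig = rng.normal(size=2000)          # 125 ms of "speech" at 16 kHz
enc = encode_utterance(sig, W, c)    # pooled frames x 120 features
```

[For classification, eleven consecutive pooled frames would then be concatenated into a single input vector, as described above.]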
0:15:47 a couple of issues: the features were first log-transformed, after we created the entire set of features, and we also added delta and acceleration coefficients of the feature vectors to the encoding.
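[The log transform plus delta and acceleration coefficients can be sketched as follows. A simple symmetric-difference delta is assumed; the talk doesn't give the exact regression formula.]

```python
import numpy as np

def add_deltas(feats):
    """Append delta (first difference) and acceleration (second difference)
    coefficients along the time axis, tripling the feature dimension."""
    delta = np.gradient(feats, axis=0)
    accel = np.gradient(delta, axis=0)
    return np.concatenate([feats, delta, accel], axis=1)

rng = np.random.default_rng(0)
pooled = rng.random((11, 120)) * 0.98 + 0.01   # hidden-unit activities in (0, 1)
encoded = add_deltas(np.log(pooled))           # log-transform, then add deltas
```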
0:16:02 here's a little bit about the baseline HMM. the alignment model was just an HMM trained on MFCCs. there were sixty-one phoneme classes, with three states for each class, and we used a bigram language model. we used a forced alignment to the test data, and the training data was used to generate the labels.
0:16:24 so we used the standard method for decoding the posterior probabilities, similar to what's done in tandem-like approaches: this converts the posterior probability predictions to generative probabilities, which can then be decoded with a Viterbi decoder.
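[The posterior-to-likelihood conversion used in tandem- and hybrid-style systems is typically a division by the class priors; a minimal sketch, with illustrative shapes and names:]

```python
import numpy as np

def scaled_log_likelihoods(log_post, log_priors):
    """Convert per-frame phone posteriors p(s|x) into scaled likelihoods,
    log p(x|s) = log p(s|x) - log p(s) + const, which a Viterbi decoder
    over the HMM states can then consume."""
    return log_post - log_priors

post = np.full((5, 61), 1.0 / 61)     # uniform posteriors over 61 phone classes
priors = np.full(61, 1.0 / 61)        # class priors from the training labels
scores = scaled_log_likelihoods(np.log(post), np.log(priors))
```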
0:16:41 here's a summary of the results for the different configurations of our setup. for this set of experiments we used a two-hidden-layer neural network.
0:16:55 what we found was: if we used more hidden units in the neural network, we got better results. we found that if we used shorter subsampling windows and shorter shifting windows, we got better results as well. also, adding the delta and acceleration parameters helped. and we found that with one hundred and twenty hidden units in the RBM we got the best results.
0:17:22 we combined all these four lessons and trained one neural network with two layers, with four thousand units in each layer; we used delta and acceleration parameters, we used a subsampling window of ten milliseconds with an advance of five milliseconds, and a hundred and twenty hidden units in the RBM.
0:17:40 with that we got twenty-two point eight percent on the test data, for the phoneme error rate on TIMIT.
0:17:49 and then with a further neural network, with DBN pre-training, we were able to further reduce it down to twenty-one point eight.
0:18:00 so here are the conclusions. for speech signals, machine learning can discover meaningful features from data alone, and i think further work in looking for high-dimensional encodings is justified. for future work, we aim to build better generative models.
0:18:21 and with that, thank you.
0:18:35 (chair) yeah, a question.
0:18:39 (question) hi, i enjoyed your talk. a quick question: one of the reasons why people in speech shy away from time-domain features is the high sensitivity to noise, as opposed to the question of just being raw or not. so how are you going to address that?
0:18:54 i think my answer to that is: we just need enough data, and eventually we'll be able to figure that out.
0:19:00 (question) but that's the thing, that's key, because additive noise comes in all sorts of forms: you have pink noise, you have white noise, you have all kinds of noise. i mean, if it were just a matter of having data, you know, we could have solved the recognition problem already.
0:19:15 so i think it's not just the data, it's the models that go with the data. if you take some of these powerful generative models and build into them further assumptions about the characteristics of noise, hopefully they will learn to pick out noise and separate that from the real signal.
0:19:35 in the case of the features we learnt: if you actually look at the types of features we learned, we learned to ignore sort of high-frequency components. so if you look at the reconstruction of a signal, you'll find that some aspects of the signal are suppressed in the reconstruction. so it's learning to pick out more of the vocal tract information than it's trying to capture noise. in a sense that's also speculative, yes, but the point is that it's able to try and separate out what's noise from what's not.
0:20:13 (question, largely inaudible) ...the system also weights some features... so you'd really be lucky, because you'll have a problem if there's a mismatch... about fifteen years ago there were papers using [inaudible] modeling for this... so have you tried it on noisy data, for instance adding noise from different files, to see?
0:21:01 right, so we actually didn't try any noisy data for this setup, but i'm an advocate of multiple layers. the hope is that when you build sort of deeper models, where the lower layers try to pick up signals and the higher layers try to look for more abstract patterns, the high-level features will try and suppress the noise and separate out the signal. but for now we don't really have any experiments to back up that claim.
0:21:33 (chair) one more question.
0:21:35 (question) i feel like twenty years ago i saw stuff on using neural nets to recognize phonemes, and so i'm curious what has really changed. the other thing i think about with that is scaling issues: when i think about digit recognition, or any recognition, if all i have to do is make the sound sufficiently slower, or lower, or higher, i can usually destroy a neural net, based on, you know, its training performance. so i'm wondering: is there some sort of advancement? have you gotten around issues with scaling and transformations in the space?
0:22:13 right. so i think what's different from twenty years ago is that these sorts of generative models have made a lot of progress, and it's been seen that you can use them to seed neural networks and get much better results than you could from neural nets before. and the amount of data that's now available is much larger than twenty years ago. so that's my answer to the question of what's different from the last twenty years.
0:22:39 in terms of scale in the data: i think the kind of units we're using are sort of scale-invariant, at least in terms of intensity, and that was the motivation for using them. but the time aspect is not at all covered, and we're actually looking at models to try and be invariant to that aspect of it.
0:23:00 i'd also like to mention that convolutional neural networks have been useful in vision-related tasks, and i think they have the potential for adjusting for scales, and we're going to try and make those work.
0:23:13 (chair) one last comment.
0:23:15 (question) what's your definition of raw? would this approach work for data that's already been encoded in some way?
0:23:23 sorry, what was used in...?
0:23:25 (question) your own definition.
0:23:36 ah, okay. i think for this paper the definition was: the raw form that you could capture from the instrument. we didn't want to make any assumptions; if you take spectral information, there's an assumption that the signal is stationary within a single frame, which i think is not a very correct assumption, and probably harms detection of certain types of phonemes.
0:24:04 the second answer to that question is: it's just a matter of convenience. it depends on whatever the input to our system was; that's our raw data.
0:24:18 yeah, so that's the first definition: as close as you can get to the capture device.