Okay, so next we have a talk by Navdeep Jaitly, who is from Geoff Hinton's group. We're very excited to have Navdeep be part of this session, in part because there is a set of methods, referred to as deep belief networks, that have been widely used over the last five to ten years for working with images and machine vision, but that haven't really been used in sound very much. Geoff's group has, just over the past couple of years, started to apply these methods to audio, and Navdeep is going to talk a little bit about this. These methods are really extensions of things that were initially developed back in the eighties and have been revived with a set of new training methods over the last ten years. So, Navdeep.

I'd like to thank Josh and Malcolm for giving me the chance to present this work today. As Josh mentioned, there has been a lot of development recently in generative models for high-dimensional data, and here are some examples of samples generated from such models. These are examples of digits generated from the first deep belief network, which was published in 2006. I'd like to point out that these are actually samples from the model, rather than reconstructions of particular data cases, so the model is quite powerful: it has very high peaks at real data points. Here are some samples from a recently published model, a gated MRF, which was trained on natural image patches; you can see this model is really good at modeling both short- and long-range correlations in images. Models have also been developed for motion sequences; in this case the training data was joint angles from motion capture.

It has been seen that features from these generative setups are also very good at discriminative tasks, and that makes sense on an intuitive level: what is good at generating specific types of patterns is probably good at recognizing those patterns. Josh showed an example earlier of features that were good at generating sound textures, and those could be used for recognizing those sounds. These models have been used widely for vision tasks, but they haven't made it quite as much into sound yet. So for this work we wanted to see whether we could use these models on raw speech signals, and whether the features learned in the generative step would be useful in a discrimination task.

Our goal, specifically, is this: given raw speech signals, we want to build a generative model for sub-sequences of these signals 6.25 milliseconds long. We're using TIMIT, which is sampled at 16 kHz, so we have data vectors that are one hundred samples long; we're modeling a hundred-dimensional vector whose entries are the intensities of the raw sound samples.
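To make that data setup concrete, here is a minimal sketch of how such 100-sample training vectors could be drawn from a raw waveform (illustrative only; the names and helper are hypothetical, not the authors' code):

```python
import numpy as np

SAMPLE_RATE = 16000                        # TIMIT sampling rate (Hz)
WINDOW = int(SAMPLE_RATE * 6.25 / 1000)    # 6.25 ms -> 100 samples

def make_training_vectors(waveform, n_vectors, seed=0):
    """Draw random 100-sample sub-sequences from a raw waveform,
    giving one 100-dimensional training vector per draw."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(waveform) - WINDOW + 1, size=n_vectors)
    return np.stack([waveform[s:s + WINDOW] for s in starts])
```

Here `waveform` would be a one-dimensional array of raw sample intensities from a TIMIT utterance.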
Here's a quick outline of the talk. First I'll talk about RBMs, the restricted Boltzmann machines, which are the generative model we use in this paper. Then I'll show some results from the generative model, and I'll talk about the application of the features to phone recognition on TIMIT. For this first section, I'd like to talk about why we wanted to use the raw signals themselves.

The first reason was that we didn't want to make any assumptions about the signal, such as stationarity of the signal within a single frame. Secondly, we were motivated by speech synthesis: in that domain, being able to model raw speech signals would eventually allow you to generate realistic signals without having to satisfy phase constraints. We also want to be able to discover patterns and their relative onset times with our model, and we think that should be helpful in discriminating between certain phones. The last reason is that we now can. That sounds a little facetious, but it's probably the most important motivation for using raw signals: traditional encodings such as MFCCs have been around for quite some time now, and in that same span of time computational resources have grown enormously. At the same time, a lot of data is now available to train really powerful models, and machine learning has made a lot of progress in picking out features and building really good models from data alone. That's why we wanted to try to do this straight on the raw signal.

Here's a quick overview of restricted Boltzmann machines. A restricted Boltzmann machine, or RBM, is an undirected graphical model. It has two layers of nodes: the bottom one, the visible layer, corresponds to the dimensions of the observed data, and the top layer holds the hidden units, or hidden variables, which are basically latent variables that try to explain the data. There is a set of interaction weights connecting these two layers, and the architecture is such that connectivity is bipartite, which implies that given the visible nodes, all the hidden nodes are independent of each other, and the opposite is also true when the hidden variables are known. Since it's an undirected graphical model, there is an energy function associated with any given configuration of the visible and hidden states, and the energy of a given state governs its probability through the Boltzmann distribution. What I'm showing in this set of equations is the exact formulation for a Gaussian-binary RBM, which is the RBM you use for the scenario where we have real-valued signals and binary hidden units. Let me see if I can get out of the slideshow... okay, sorry, never mind.

The equations are actually not as complicated as they look. The important point to note is that there is a term that looks at the interaction between the configuration of the hidden variables and the visible variables. Something really interesting about this model is that the priors are quite complicated, because they involve a sum over an exponential number of configurations, but the posteriors, on the other hand, are quite simple: given visible data, the hidden variables are all independent of each other, and each one turns on with a probability equal to the sigmoid of the input to that hidden node, where the input is essentially the dot product of the visible data and the set of weights connecting that hidden node to the data. In that sense this is a very powerful model, and it's different from other generative models, where the priors are independent but the posteriors are very hard to compute.
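For reference, the Gaussian-binary RBM energy and posterior being described have the following standard form in the RBM literature (a reconstruction, not a transcription of the slide):

```latex
E(\mathbf{v},\mathbf{h}) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2}
    - \sum_j c_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j,
\qquad
P(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z},
\qquad
P(h_j = 1 \mid \mathbf{v}) = \operatorname{sigmoid}\!\Big(c_j + \sum_i \frac{v_i}{\sigma_i} W_{ij}\Big).
```

The cross term $\sum_{i,j} (v_i/\sigma_i) W_{ij} h_j$ is the visible-hidden interaction the speaker points to, and the posterior is exactly the "sigmoid of the dot product of the visible data with the hidden unit's weights" described above.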
So a nice property of this model is that it has very rich priors but very easy posteriors. Now, maximum likelihood estimation of these models is really complicated, because the gradient of the log probability is really hard to compute exactly. Fortunately, Geoff discovered about a decade ago that an algorithm called contrastive divergence can be used to train these models pretty well, and that's the algorithm we're using to learn the parameters.

One last point about the model: binary hidden units are not very good for raw speech signals. The reason is that speech signals can present the same pattern over many different orders of magnitude, but binary units can only turn on at one output intensity level. So for this paper we used an alternative type of unit called the stepped sigmoid unit, which has the nice property that it can produce an output intensity at almost any scale. I won't talk much about those units here, but there is more information in the paper referenced on this slide.

Okay, here's the experimental setup. As I said, we were looking at 6.25 milliseconds of speech, which corresponds to one hundred samples. For each sample we have a variable in the visible data, so our RBM has one hundred visible nodes at the bottom, and we coupled those with one hundred twenty of the stepped sigmoid units. The signals themselves were selected randomly from the TIMIT database, and on average the model saw each sub-segment about thirty times.

Here are some of the features that were learned by the model. On the left side we see the actual features; as a reminder, these are just the weights connecting the visible data to the hidden units, so for each hidden unit we have a pattern, and that hidden unit turns on most strongly when the data presents the particular pattern associated with it. You can see that many different types of patterns are learned; let me go through a few of them very quickly. Here is a pattern that seems to pick out really low frequencies, maybe something like F0, or pitch. Here are some features that pick up patterns at slightly higher frequencies, others at intermediate frequencies, and then some at really high frequencies. And there are some other really interesting ones: patterns that seem to have composite frequency characteristics, with a low-frequency component and a high-frequency component, which we think might be picking up fricatives.

Okay, so once we've learned the model, we can reconstruct signals from the posterior activities of the hidden units themselves. If we take ten frames of signal and project that signal onto the hidden units — I'm showing the activities of the hidden units here on a log scale, and only twenty of the one hundred twenty hidden units we actually trained — we can then take these posterior activities and project them back to visible space to reconstruct a raw signal. This is similar in flavor to the previous talk, except we're using a parametric model to do it. If you look at the reconstruction at a much larger scale — this is six hundred twenty-five milliseconds of raw signal — you can see in the heat map the patterns present in the high-dimensional data.
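A minimal sketch of one contrastive-divergence (CD-1) parameter update, the training algorithm mentioned above, assuming unit-variance Gaussian visible units and binary hidden units for simplicity (the paper's stepped sigmoid hidden units would replace the Bernoulli sampling step; this is not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=1e-3, rng=np.random.default_rng(0)):
    """One CD-1 step. v0: (batch, 100) raw-signal vectors;
    W: (100, 120) weights; b: (100,) visible biases; c: (120,) hidden biases."""
    # Up-pass: independent sigmoid posteriors over hidden units, then sample
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)
    # Down-pass: reconstruct visibles at the means of the Gaussian units
    v1 = h0 @ W.T + b
    # Up-pass on the reconstruction
    p_h1 = sigmoid(v1 @ W + c)
    # Gradient approximation: data statistics minus reconstruction statistics
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c
```

The same up-pass, `sigmoid(v @ W + c)`, is what produces the posterior activities behind the reconstructions just described.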
Here are some samples from the model itself. There are sixteen samples here; five of them are quite similar to each other but different from the other eleven. So I think we have a pretty good model, at least for small-scale signals.

Let me switch now to the application of these features to phone recognition. The setup is that we have one hundred twenty-five milliseconds of raw speech, and we want to use the features we learned to predict the phoneme labels, which we obtained from forced alignments. We used the learned features to encode the signal — I'll talk about how we did that on the next slide — and then we took the encoded features, put them into a neural network, and used backpropagation to learn the mapping to the phoneme labels.

Here's how we did the encoding. We used a convolutional setup, and the way it works is this: we first take the window starting at the first sample of an utterance and compute the posterior means of the hidden units; we then move the window by one sample and do the computation again, and we repeat this for the entire utterance. That gives a high-dimensional representation — in fact a little too high-dimensional, since we've multiplied the size of the signal by a factor of one hundred twenty. So we then sub-sample these hidden activities, producing one feature vector for every twenty-five milliseconds of signal; the sub-sampling also helps smooth the features. I should point out that convolutional setups have proven quite useful in vision tasks, and I think our results suggest the same for this setup. So we have a sub-sampled frame covering twenty-five milliseconds; we then advance by ten milliseconds and repeat for the entire utterance. For any given stretch of one hundred twenty-five milliseconds of speech, we take the eleven frames that span that signal and concatenate them into one vector, and that is the encoding fed into the neural network. A few notes: the features were log-transformed after we created the entire set, and we also added delta and acceleration coefficients to the encoding.
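Here is a sketch of the convolutional encoding just described, under stated assumptions (mean-pooling is used for the 25 ms sub-sampling step, which is one plausible reading of "sub-sample"; the function and variable names are hypothetical):

```python
import numpy as np

def encode_utterance(wave, W, c, sr=16000):
    """Stride-1 hidden posteriors over 100-sample windows, pooled over
    25 ms spans advanced by 10 ms, log-transformed, with 11-frame context."""
    win, pool, hop = 100, int(0.025 * sr), int(0.010 * sr)
    # One 120-dimensional posterior vector per sample position
    frames = np.lib.stride_tricks.sliding_window_view(wave, win)
    post = 1.0 / (1.0 + np.exp(-(frames @ W + c)))        # (T - 99, 120)
    # Sub-sample: pool each 25 ms span of posteriors, advancing by 10 ms
    starts = range(0, post.shape[0] - pool + 1, hop)
    pooled = np.stack([post[s:s + pool].mean(axis=0) for s in starts])
    feats = np.log(pooled + 1e-8)                          # log transform
    # Concatenate 11 consecutive frames (~125 ms) per classification target
    return np.stack([feats[i:i + 11].ravel() for i in range(len(feats) - 10)])
```

The delta and acceleration coefficients mentioned above would then be appended to each frame; they are omitted here for brevity.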
Here's a little bit about the baseline. The baseline model was just an HMM trained on MFCCs. There were sixty-one phoneme classes with three states per class, and we used a bigram language model; forced alignment with this baseline was used to generate the phoneme labels. For decoding, we used the standard method of handling the posterior probabilities, similar to what is done in tandem-like approaches: the posterior probability predictions are converted into generative probabilities, which can then be decoded with Viterbi decoding.

Here's a summary of the results for different configurations of our setup. For this set of experiments we used a two-hidden-layer neural network. We found that using more hidden units in the neural network gave better results; that shorter sub-sampling windows and shorter window shifts gave better results; that adding the delta and acceleration parameters helped; and that one hundred twenty hidden units in the RBM gave the best results. We then combined all four of these lessons and trained one neural network with two hidden layers of four thousand units each, with delta and acceleration parameters, a sub-sampling window of ten milliseconds advanced by five milliseconds, and one hundred twenty hidden units in the RBM. With that we got a 22.8 percent phoneme error rate on the TIMIT test data, and with DBN pre-training of the neural network we were able to further reduce it to 21.8 percent.

So here are the conclusions: for raw speech signals, machine learning can discover meaningful features from data alone, and I think further work on high-dimensional encodings is justified. For future work, we aim to build better generative models. And with that, acknowledgements — and I'll take questions.

Q: Hi, I enjoyed your talk, and I have a question. One of the reasons people in speech shy away from time-domain features is their high sensitivity to noise, as opposed to spectral representations. How are you going to address that?

A: I think my answer is that we just need enough data, and eventually we will be able to figure that out.

Q: But that's the catch, because noise comes in all sorts of forms — pink noise, white noise, all kinds. If it were just a matter of having data, we would have solved the recognition problem already.

A: I think it's not just the data; it's the models that go with the data. If you use some of these powerful generative models and build in further assumptions about the characteristics of noise, then hopefully the model will learn to pick out the noise and separate it from the real signal. In the case of the features we learned, if you look at the types of features, we learned to ignore certain high-frequency components; if you look at the reconstruction of a signal here, you'll find that some aspects of the fricative are suppressed in the reconstruction. So it's learning to pick out more of the vocal-tract information rather than trying to capture the noise, and in a sense that speaks to the point: the model is able to try to separate out what is noise from what is not.

Q: [Largely inaudible question about how the learned features would hold up across different files and recording conditions, with a reference to waveform-modeling work from roughly fifteen years ago.]

A: We actually didn't try any noisy data for this setup, but I'm an advocate of multiple layers of representation, and the hope is that when you build deeper models, where lower layers try to pick up signals and higher layers look for more abstract patterns, the high-level features will learn to suppress the noise and separate out the signal. But for now we don't have any experiments to back that claim.

Q: One more question. I like this work, but it's not traditional — I feel like twenty years ago I saw work on using neural nets to recognize phonemes, so I'm curious what has really changed.
The other thing I think about with that is scaling issues. When I think about digit recognition, or any recognition, if all I have to do is make the input sufficiently slower or faster, or lower or higher, and I can usually destroy a neural-net-based recognizer, regardless of its training performance. So I'm wondering: has there been some advancement that gets around issues with scaling and transformations of the input space?

A: I think what's different from twenty years ago is that these generative models have made a lot of progress, and it has been seen that you can use them to seed neural networks and get much better results than you could from neural nets before. Also, the amount of data now available is much larger than it was twenty years ago. That's my answer to what's different from the last twenty years. In terms of scaling, I think the kind of units we're using are scale-invariant, at least in terms of intensity — that was the motivation for using them — but the time aspect is not covered, and we're actually looking at models that try to be invariant to that aspect as well. I'd also like to mention that convolutional networks have been useful in vision-related tasks, and I think they have the potential for adjusting for scale, and we're going to try to adapt them to this work.

Q: One last comment: what's your definition of "raw"? Would this approach work for data that has already been coded, for example? What is your own definition?

A: I think for this paper the definition was the rawest form that you could capture from the instrument. We didn't want to make any assumptions: if you take spectral information, there is an implicit assumption that the signal is stationary within a single frame, which I think is not a very correct assumption and probably harms detection of certain types of phonemes. The second answer to that question is that it's a matter of convenience: it depends on what the input to our system is — whatever arrives is the data. But yes, the first definition is the main one: as close as you can get to the capture device.

Okay, we need to stop there.