Hello everyone. I'm Oriol Vinyals, and I'm going to be talking about deep learning, and more concretely about using deep learning on tandem features and analysing how it performs for robust ASR, basically seeing how it deals with noise.

First, a bit of related work and background. I'm not going to go into many details on deep learning, but it is basically the idea of having many layers of computation; typically it is just a neural network with a better initialization than a random one. In 2006 Hinton introduced these RBMs, which apparently helped a lot in training these deep models, and since then many groups have been working on deep learning. You can see that in the number of publications at machine learning conferences and also at related venues such as computer vision conferences like CVPR. Some people have applied deep learning to speech, and that is quite recent, within the last couple of years. Estimating phone posteriors using neural networks, or deep neural networks, is not a new idea, and there are two main approaches: one that uses the phone posteriors directly as the acoustic model for the HMM, the hybrid approach, and the other, tandem, where we take the posteriors as features and then use an otherwise standard GMM-HMM system. Tandem is attractive because it means you can keep relying on an existing GMM-HMM system, and that is the kind of approach we are looking at in this work.

Just to explain briefly what tandem does, for those who don't know it: we get some sort of estimate of the frame-level posterior probabilities for the phones, shown at the top of the slide, and then there are several techniques and tricks that were found about ten years ago. We take the log of those posterior probabilities, then we whiten them so they better fit the GMMs' diagonal covariance assumption, we do mean and variance normalisation, and lastly we concatenate them with the MFCCs or some other spectral features and train and decode with this extended feature set. So it is pretty easy to understand, and easy to implement as well.

That brings me to the main points of this work. First, we want to see how phone posteriors coming from a deep neural network combine with spectral features, i.e. whether there is any gain when we add them to the MFCCs in this tandem fashion. Second, and this is probably more interesting, and I don't know whether it has been analysed yet: how does noise affect deep-neural-net-based systems? In particular, we want to figure out which parts of deep learning are helping in which situations. For that, as I said, we have some questions about deep learning. Why does having a deep structure matter? That is the first question. Then we can also ask ourselves: what about pre-training, this RBM training I was talking about, is it important or not? And lastly, we know that training neural networks gets tricky sometimes, especially when they are deep, so which optimization technique should we use? The paper focused on the first two points; the last point, about optimization, is something I have been working on that is not in the paper, but I will talk about it in this talk, and it relates to those same questions.
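As a rough illustration of the tandem recipe described above, here is a minimal NumPy sketch of the feature pipeline; the function name, the use of PCA for the whitening step and the small epsilon constants are my own assumptions for illustration, not the exact recipe from the talk.

    import numpy as np

    def tandem_features(phone_posteriors, mfcc, eps=1e-10):
        # phone_posteriors: (n_frames, n_phones) softmax outputs of the network
        # mfcc:             (n_frames, n_mfcc)   spectral features for the same frames
        # 1. Take the log of the posterior probabilities.
        x = np.log(phone_posteriors + eps)
        # 2. Whiten (PCA) so the features better match the GMMs'
        #    diagonal-covariance assumption.
        x = x - x.mean(axis=0)
        eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
        x = (x @ eigvec) / np.sqrt(np.maximum(eigval, 0.0) + eps)
        # 3. Per-dimension mean and variance normalisation.
        x = (x - x.mean(axis=0)) / (x.std(axis=0) + eps)
        # 4. Concatenate with the MFCCs to form the extended feature set
        #    that the GMM-HMM system is trained and decoded with.
        return np.concatenate([x, mfcc], axis=1)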
One way to look at neural networks: the good part of deep neural networks is that they are powerful models, very expressive, and they can represent very complicated nonlinear relations, which is good because we know our brain probably does something like that. They are also attractive because the gradient is easy to compute, and in fact now with GPU computing, since almost everything boils down to matrix operations, they can actually be trained pretty fast and very efficiently. There are some bad things too: it is a non-convex optimization problem, and there is the vanishing gradient problem, where the gradients tend towards zero, so these models are not easy to train, especially when the neural network is very deep. Also, the number of parameters can get very large; in fact our brain has orders of magnitude more neurons than the neural nets we can train nowadays, so overfitting is an issue people worry about, as in many other machine learning techniques. Another thing people don't like about neural networks is that they are kind of difficult to interpret, it is hard to see what is going on. There are some exceptions: people in computer vision and also in speech are analysing what the neurons are actually learning, and it is impressive. In computer vision, for example, you can see that the first layer is learning basically what V1 in our brain is doing, these Gabor-like filters for vision. So these models are actually becoming interpretable in some sense, and that helps deep learning.

Just to be concrete about the exact experiment that we ran, we train this kind of neural network. It is deep because, as you can see, it has three hidden layers. On the left we have the input, the 39 acoustic observations with nine frames of context, then we have the following layers with 500, 1000 and 1500 binary logistic units, and lastly the output layer is the one where we estimate the phone posteriors through a softmax. I should say that I came up with this architecture and I have not changed its parameters since, because I wanted to isolate the effect of depth and compare against the MLP; better numbers could probably be found just by trying different architectures.

Jumping into the experimental setup, we use Aurora2, so it is a fairly small task, around 1.4 million samples for training at a 10 millisecond frame rate. As we know, the testing conditions have added noise at different SNR levels, with noises such as train station, airport and so on. Note that we train our models on clean speech and then test them on the several noisy conditions, just to see whether the gain carries over to the noisy conditions, if at all. We then use the standard HMM models proposed in the Aurora2 setup and the same decoding scheme.
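To make the architecture just described concrete, here is a minimal NumPy sketch of the forward pass of a 351-500-1000-1500-softmax network (39 features times 9 frames of context at the input). The weight initialisation scale and the number of phone targets are placeholders I am assuming for illustration; with pre-training, the weights would instead come from the stacked RBMs rather than this random initialisation.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):
        a = a - a.max(axis=1, keepdims=True)      # numerical stability
        e = np.exp(a)
        return e / e.sum(axis=1, keepdims=True)

    def init_layer(n_in, n_out, rng, scale=0.01):
        # Small random weights; RBM pre-training would replace this step.
        return scale * rng.standard_normal((n_in, n_out)), np.zeros(n_out)

    def forward(x, params):
        # x: (n_frames, 351) = 39 acoustic features x 9 frames of context
        h = x
        for W, b in params[:-1]:
            h = sigmoid(h @ W + b)                # logistic hidden units
        W, b = params[-1]
        return softmax(h @ W + b)                 # frame-level phone posteriors

    rng = np.random.default_rng(0)
    n_phones = 23                                 # placeholder phone-target count
    sizes = [39 * 9, 500, 1000, 1500, n_phones]
    params = [init_layer(m, n, rng) for m, n in zip(sizes[:-1], sizes[1:])]
    posteriors = forward(rng.standard_normal((4, 39 * 9)), params)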
So, the first table of results. Let me just explain it. On the rows you can see the different noise conditions, starting from clean and adding more and more noise, and in the parentheses you see the relevant differences: if we rerun these experiments with different random seeds we observe around 0.2 to 0.4 absolute word error rate of variation, so that is roughly the significance level of these results. The first column of results is the standard MFCC model, and we can see that as we add noise it degrades. The next two columns are the tandem MLP baseline we borrowed from ICSI: the first of them uses just the MLP features, no MFCCs, and the second concatenates both. We can see that concatenating the MFCCs helps, because the numbers are basically all lower; it improves the word error rates across almost all the noise conditions. Now, with the deep belief network that I showed, we get better results on almost all conditions, and in particular on clean speech we get an improvement, which I guess is comparable with what other people have found on TIMIT and so on. The improvements are consistently better than the tandem MLP approach that was proposed several years ago, which is good news. Also notice that the MFCCs usually help, but when there is a lot of noise the DBN-features-only case actually beats the concatenation, so under heavy noise using just the DBN phone posteriors is better. So that answers the first question, how phone posteriors combine in the tandem fashion when we use deep learning versus plain MLP features, and also the question of how noise affects them: it seems that deep neural nets are also good under noise, not only on clean speech.

Now I am going to jump to some more recent results. I was able to run these because I have been working on a second-order optimization method proposed at ICML 2010, which kind of suggests that maybe pre-training, this RBM business, is not necessary if you use some sort of second-order optimization instead of plain backpropagation. Let me go step by step through the questions and the columns to look at. First, does optimization matter? The first two columns are the same as previously: the tandem MLP was trained using the standard techniques, stochastic gradient descent, stopping after seven hundred or so epochs once we no longer see an improvement in performance. The tandem MLP with the little star is one where we were actually able to train a bigger model, with basically as many parameters as the deep one I will show later. As we can see, at least in the low-noise region, the tandem MLP with the better optimization and more parameters actually performs better than the tandem MLP without it, but then in the higher-noise conditions it does worse. That is kind of disappointing, but maybe, because there are so many parameters, there is some sort of overfitting that the model does not deal well with.

That brings us to the next point: does depth matter? Now let's take the parameter budget of the single-hidden-layer tandem MLP, which is around three million parameters by the way, and use it in the deep architecture that I showed at the beginning, but with no pre-training.
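As a small sanity check on that parameter-matching argument, here is a sketch of how one might count the parameters of the two architectures and size the shallow MLP's single hidden layer to the same budget; the output size is the same placeholder as before, so the exact total here will not match the roughly three million quoted in the talk.

    def n_params(sizes):
        # Total number of weights and biases of a fully connected net
        # whose layer sizes are given in order.
        return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

    n_in, n_out = 39 * 9, 23                      # placeholder input/output sizes
    deep_total = n_params([n_in, 500, 1000, 1500, n_out])

    # A shallow MLP with one hidden layer of h units has roughly
    # h * (n_in + n_out + 1) parameters, so matching the budget gives:
    h = deep_total // (n_in + n_out + 1)
    print(deep_total, h, n_params([n_in, h, n_out]))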
So it is not the deep belief network that Hinton proposed, just a plain deep neural net with many layers, trained from random initialization. What we see is that the performance is essentially identical in the low-noise conditions, but in the high-noise situations the degradation we just saw disappears and we actually do a bit better. So my hypothesis here is that maybe the depth has some sort of effect on being able to cancel the noise, better than what we get with just a shallow network. Obviously this is just a hypothesis, but from the results we can probably say that depth helps.

Then, does pre-training matter? This row is basically from the first table: it is the same deep neural net but with the pre-training step, and we see that it improves upon the deep neural net that has not been pre-trained, and it improves across all the noise conditions. I think what this means is that pre-training basically acts as regularization. We know that pre-training helps a lot against overfitting; for the MNIST dataset in the Science paper there was huge overfitting and pre-training helped a lot. In this case it helps not only on the clean condition but also when the noise is quite low, so my interpretation is that pre-training pulls the weights towards some sort of generality rather than only towards the discriminative objective function.

To conclude this discussion about depth, pre-training and so on, I also looked at the frame-level phone error rate of these three networks, which I thought was interesting, for some randomly picked phone. We can see that on clean data the phone error rates seem similar, but when we add noise the DBNs seem to learn a more robust representation: when we add a large amount of noise a gap opens up between the deep neural net and the shallow neural net, both trained with the better optimization technique. So I believe that depth helps the model learn better representations of the data, as has also been found in computer vision.

So, basically, to conclude: I think it is now clear that deep learning also works in the tandem setting, not only in hybrid systems, which is good news for those who have a lot of engineering work built around GMM-HMM systems. Furthermore, the MFCCs still help, so for those working on hybrid systems, maybe they should incorporate more spectral information somehow, especially when there is not a lot of noise. Pre-training, as we know, helps against overfitting, but it also helps generalization in the case where we have a clean-to-noisy mismatch between training, which was done on clean speech, and testing. And finally, deeper models, given the same number of parameters, seem to be more robust in very noisy situations, which was also found in computer vision. Obviously these conclusions are for now based on a fairly small task, and for future work I think it would be interesting to go to a larger dataset, which we are actually working on, and also to compare the so-called hybrid deep-neural-net HMM systems with this deep tandem approach. Thank you very much.
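For reference, since the pre-training step comes up repeatedly above and in the questions that follow, here is a minimal NumPy sketch of a single contrastive-divergence (CD-1) update for one binary RBM, the kind of layer-wise step that such pre-training stacks up before the discriminative backprop fine-tuning. The learning rate and sampling details are generic assumptions, not the exact recipe used in this work.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd1_update(v0, W, b_vis, b_hid, lr, rng):
        # One CD-1 step for a binary RBM on a batch of visible vectors v0.
        ph0 = sigmoid(v0 @ W + b_hid)                 # hidden probabilities given data
        h0 = (rng.random(ph0.shape) < ph0) * 1.0      # sampled hidden states
        pv1 = sigmoid(h0 @ W.T + b_vis)               # "reconstruction" of the input
        ph1 = sigmoid(pv1 @ W + b_hid)                # hidden probabilities given it
        n = v0.shape[0]
        # Approximate gradient: data statistics minus reconstruction statistics.
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        b_vis += lr * (v0 - pv1).mean(axis=0)
        b_hid += lr * (ph0 - ph1).mean(axis=0)
        return W, b_vis, b_hid

After one RBM is trained this way, its hidden activations become the input data for the next RBM, and the resulting stack of weights is what initialises the deep network before backpropagation.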
Thank you. Do we have some questions? Yes, over there.

I have a question, or rather a comment, on the neural network part. Can you go back to the slide near the beginning where you listed what is good and what is bad about deep networks? Yes, this one. On the bad side you list two points: one problem is the vanishing gradient, and the other is overfitting. To me these two points seem somewhat contradictory, because if the gradients are going to zero you are barely updating the model, which means the model basically stays the same, and then it should not be able to overfit, right? So do these two problems happen in the same cases, or in different cases?

I think it depends on the case. On overfitting: in my experiments I actually did not observe a lot of it; I was just doing L2 regularization on the weights and I did not have an overfitting problem. But in other cases, for example if you read the Science paper from Hinton, there is a lot of it, because you only have something like twenty thousand samples and the models do overfit, you basically get to zero percent training error. In those cases the optimization method does not matter that much, and it is really about how you bias the weights, which is what you do using these RBMs.
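Since L2 regularization on the weights is mentioned in that answer, here is a tiny sketch of what it amounts to in the gradient step; the learning rate and penalty strength are illustrative values only, not the ones used in these experiments.

    # Gradient step with an L2 penalty (weight decay) on a weight matrix W:
    # the penalty 0.5 * l2 * ||W||^2 adds l2 * W to the back-propagated gradient.
    def sgd_step(W, grad_W, lr=0.1, l2=1e-4):
        return W - lr * (grad_W + l2 * W)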
Next question. You said that there is some pre-training happening, so does that mean you are using some data that is not used in the supervised training?

No, in this case the pre-training is unsupervised, so you could feed a lot of data into it, a lot more than what you then train on with the labels.

And for the deep network without pre-training, how is it constructed and trained, so that the comparison with the deep belief network makes sense?

The deep net without pre-training takes exactly the same objective function of the full neural network and does backpropagation starting from random weights, so it is a fair comparison. Okay, thanks.

One more question. When you concatenate the outputs of the MLP with the MFCCs, you presumably want the MLP to contribute information that is as different as possible from the MFCCs. But when you do the backpropagation you are basically forcing the MLP to focus on whatever discriminates between the classes you decided to use. Did you look at how it would work if you concatenated the features first and then did the training, so the network could learn something complementary?

So, actually, the deep neural net is trained before the concatenation. In a sense you just have the phone targets, you train the neural net first, and then you concatenate and train the GMM-HMM system on top.

Right, so you train your network first, and once it is trained you keep it fixed; you don't do any more backpropagation after you concatenate?

You could do that, but...

But even if you train it before that backpropagation, you would still have the same number of outputs afterwards, right? So you could still concatenate with the MFCCs either the outputs you have now or the ones you would get after backpropagating through the concatenation. Or am I missing something?

Sorry, I am not sure I follow the question. I think we are running into a problem with time, so perhaps we can take that discussion offline. Okay, thank you.