0:00:17 Hello everyone, I'm Oriol Vinyals. I'm going to be talking about deep learning, and more concretely about using deep learning for tandem features, and analysing how it performs for robust ASR: basically, seeing how it deals with noise.
0:00:37 So, a bit of related work and background. Deep learning: I'm not going to go into many details, but it's basically the idea of having many layers of computation, and typically that is just a neural network with a better initialization than just random. In 2006 Hinton introduced these RBMs, which apparently helped a lot in training these deep models. Since then many groups have been working on deep learning; you can see that by the amount of publications at machine learning conferences, and also at related conferences such as computer vision conferences like CVPR.
0:01:20 Some people have applied deep learning to speech, and it's quite recent, within the last couple of years.
0:01:26 But estimating phone posteriors using neural networks, deep neural networks, is not a new idea. Basically there are two main approaches: one that uses the phone posteriors to get the acoustic model for the HMM, the hybrid model; and the other, tandem, which means we take the posteriors as features and then just use them in an otherwise standard GMM-HMM system. Tandem is quite attractive because it can reuse an existing GMM-HMM system, so that is the kind of approach we are looking at in this work.
0:02:11 So, just to explain briefly how tandem works, for those who don't know it: we get some sort of estimate, per frame, of the posterior probabilities for the phones (shown at the top), and then there are several techniques and tricks that were applied and found ten years ago or so. We take the log of those posterior probabilities, and then we whiten them, so they better match the GMM diagonal covariance assumption; we do mean and variance normalisation; and then, lastly, we just concatenate them with the MFCCs or some other spectral features, and we train, or decode, with this extended feature set.
0:03:00 So, pretty easy to understand, and easy to implement as well.
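The tandem processing chain just described (take the log of the posteriors, whiten them, apply mean and variance normalisation, and concatenate with the MFCCs) can be sketched as follows. This is only a minimal per-utterance illustration under my own assumptions (the function name, the `eps` constant, and PCA-based whitening), not the exact recipe used in the experiments.

```python
import numpy as np

def tandem_features(posteriors, mfcc, eps=1e-10):
    """Turn per-frame phone posteriors into tandem features.

    posteriors: (T, P) frame-level phone posterior probabilities
    mfcc:       (T, D) spectral features for the same frames
    Returns:    (T, P + D) extended feature vectors
    """
    # 1. Log to undo the softmax compression
    x = np.log(posteriors + eps)

    # 2. Whiten (PCA) so the decorrelated features better match
    #    the GMM's diagonal-covariance assumption
    x = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    _, eigvec = np.linalg.eigh(cov)
    x = x @ eigvec

    # 3. Per-dimension mean and variance normalisation
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

    # 4. Concatenate with the MFCCs to get the extended feature set
    return np.hstack([x, mfcc])
```

In a real system the whitening transform and normalisation statistics would be estimated on the training set and then applied to test data; here everything is computed per call for brevity.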
0:03:07 So that brings me to the main points of this work. First, we want to see how phone posteriors coming from a deep neural network combine with spectral features: is there any gain when we add them to the MFCCs in this tandem fashion?
0:03:32 Also, and this is probably a bit more interesting, and I don't know whether it has been addressed yet: how does noise affect the deep neural net based systems? In particular, we want to analyse, or kind of rule out, which parts of the deep architecture are helping in which situations.
0:03:53 So for that, as I said, we have some questions regarding deep learning. For example: why does having a deep structure matter? That's the first question. Then we can also ask ourselves: what about pre-training, this RBM training I was talking about, is it important or not? And lastly, we know that training neural networks can get tricky sometimes, especially when they are deep, so does the optimization technique used matter?
0:04:23 The paper focused on the first two points; the third point, about optimization, is something I've been working on. It is not in the paper, but I will talk about it in this talk.
0:04:41 So, referring to those questions, here is one way to see neural networks. The good part of deep neural networks is that they are powerful models: they are very expressive and can represent very complicated nonlinear relations, which is good because we know our brain probably does that. They are also attractive because the gradient is easy to compute, and in fact now with GPU computing, since all it involves is some matrix operations, they can actually be trained pretty fast and very efficiently.
0:05:14 There are some bad things too. It is a non-convex optimization problem, and there is the vanishing gradients problem, where the gradients can be driven toward zero, so these models are not easy to train, especially when the neural network is very deep. Also, the number of parameters can grow very large; in fact, our brain has several orders of magnitude more than the neural nets we can train nowadays. So overfitting is an issue that people worry about, as in many other machine learning techniques.
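The vanishing-gradient point can be made concrete with a toy simulation: the logistic derivative is at most 0.25, so the backpropagated error signal tends to shrink at every layer. The sizes, depth, and weight scale below are arbitrary choices of mine, just to show the effect.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
dim, depth = 100, 10
Ws = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(depth)]

# Forward: keep the pre-activations of every logistic layer
h, pre = rng.standard_normal(dim), []
for W in Ws:
    a = h @ W
    pre.append(a)
    h = sigmoid(a)

# Backward: the error signal is multiplied by s'(a) <= 0.25 at every
# layer, so its norm shrinks as it travels toward the input
delta = np.ones(dim)
norms = []
for W, a in zip(reversed(Ws), reversed(pre)):
    s = sigmoid(a)
    delta = (delta * s * (1 - s)) @ W.T
    norms.append(np.linalg.norm(delta))

print(norms[0], norms[-1])  # the signal near the input is far smaller
```

With deeper stacks the shrinkage compounds, which is one motivation for better initialization schemes or second-order optimization methods.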
0:05:46 And something that people don't like about neural networks is that it is kind of difficult to interpret what is going on. There are some exceptions: some people working in computer vision, and also in speech, are analysing what the neurons are actually learning, and it's impressive. In computer vision, for example, you can see that the first layer is learning basically what V1 in our brain is doing, these Gabor-like filters for vision, and not much else. So in some sense these models are actually becoming interpretable, which speaks in favor of deep learning.
0:06:26 So, to give a concrete example of the experiments that we ran: we train this kind of neural network. It was deep because, as you can see, it has three hidden layers. On the left we have the input, with the 39 acoustic observations and nine frames of context. Then we have the following layers, with 500, 1000 and 1500 binary logistic units, and the last layer is the one from which we estimate the phone posteriors, through a softmax layer.
0:07:02 I need to say that I came up with this architecture early on and have kept using it; I haven't changed the parameters and so on, because I wanted to see the effects and compare with a fixed architecture, but better numbers could probably be found just by trying different architectures.
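The forward pass of the network just described (39 coefficients times 9 frames of context as input, hidden layers of 500, 1000 and 1500 logistic units, and a softmax output over the phone classes) can be sketched as follows; the random weights and the choice of 40 phone classes are placeholders of mine, not the trained model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))  # numerical stability
    return e / e.sum(axis=1, keepdims=True)

def forward(x, weights, biases):
    """Logistic hidden layers, softmax output (phone posteriors)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

# Layer sizes from the talk: 9 frames x 39 coefficients in,
# 500 -> 1000 -> 1500 logistic units, softmax over (say) 40 phones
sizes = [9 * 39, 500, 1000, 1500, 40]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((i, o)) * 0.01
           for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
```

In the tandem setup, the softmax outputs of this network are what get post-processed and concatenated with the MFCCs.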
0:07:24 So, jumping to the experimental setup: we use Aurora2, so it's fairly small, with around 1.4 million samples for training, at a 10 millisecond frame rate. As we know, the testing conditions have added noise at different SNR levels, with noises such as train station, airport, and so on. Note that we train our models on clean speech and then test them on the several noisy conditions, just to see how robust each system is to the noise, if at all. And then we just use the standard HMM model proposed in the Aurora setup, and the same decoding scheme.
0:08:11 So, the first table of results; let me just explain it. On the rows, as you can see, are the different noise conditions, starting from clean and adding more noise. In the parentheses you see the relevant differences: if we rerun these experiments, we observe around 0.2 to 0.4 differences in word error rate, so that is roughly the significance level of these results.
0:08:41 The first column of results is just the standard MFCC model; we can see that, as we add noise, it degrades. The next two columns are from prior work, I believe from ICSI: basically the tandem MLP. The first of these would be the MLP using just its own features, no MFCCs, and tandem MLP means concatenating both. We can see that concatenating the MFCCs helps, because all the numbers are basically lower; it improves the word error rates across all the noise conditions.
0:09:17 Now, with the deep belief network that I showed, we basically get better results on almost all conditions. In particular, on clean speech we get an improvement, which I guess is consistent with what other people have found on TIMIT and so on, and the improvements are consistently better than the tandem MLP approach that was proposed several years ago, which is good news. Also note that the MFCCs usually help, but when there is a lot of noise the tandem combination actually gets worse, and using just the DBN for the phone posteriors is better.
0:10:05 So that answers the first question, how phone posteriors combine in the tandem fashion when we use deep learning based ones instead of just MLP features, and then how noise affects them: it seems that deep neural nets are also good in noise, and not only on clean speech.
0:10:24 Now I'm going to jump to some more recent results. I was able to run these because I have been working on a second-order optimization method proposed at ICML 2010, which kind of suggests that maybe the pre-training with these RBMs is not necessary if you use some sort of second-order optimization in the backprop step. But let me go step by step through these questions and the columns to look at.
0:10:51 First: does optimization matter? What we need here is in the columns: the first two columns are the same as before; that tandem MLP was trained using standard techniques, that is, stochastic gradient descent, and it converges after 700 or so iterations, and we don't see an improvement in performance beyond that. In the last column, the tandem MLP with the little star, we were actually able to train a bigger model, basically with as many parameters as the deep model that I showed in the beginning. And as we can see, at least in the lower-noise region, the tandem MLP with the second-order optimization and more parameters actually outperforms the tandem MLP without it.
0:11:45 But then, in the higher noise conditions, it did not do as well. So that kind of points out that, maybe because there are so many parameters, there is some sort of overfitting, and the model does not deal well with that.
0:12:00 That brings us to the next point: does depth matter? So now let's take the parameters of the single-layer tandem MLP, which is around three million parameters by the way, and let's use them in the deep neural network that I discussed at the beginning, but with no pre-training. So it is not the deep belief network that Hinton proposed, it is just a standard neural net with many layers. And what we see here is that the performance is identical, but in the high noise situations the dip in performance that we saw is gone, and we actually get a bit better. So my hypothesis here is that maybe adding the deepness has some sort of effect on being able to cancel the noise, better than if you just have the shallow network.
0:12:46 Obviously this is just a hypothesis, but from the results we can probably see it. Next: does the pre-training matter? This is basically, from the first table, the same neural net but with the pre-training step, and we see that it improves upon the deep neural net that has not been pre-trained.
0:13:06 And it improves across all the noise conditions, so I think what this means is that the pre-training basically acts as a regularizer. We know that pre-training helps quite a lot against overfitting; for the MNIST dataset, in the Science paper, there was huge overfitting, and pre-training helped a lot. But in this case it helped not only in the clean condition, but even when the noise is quite low. I think pre-training biases the weights toward some sort of generative solution, and not only the discriminative objective function.
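For reference, the pre-training discussed here builds each layer as a restricted Boltzmann machine trained with contrastive divergence. A minimal CD-1 update for a binary RBM might look like the sketch below; the function name, learning rate, and sizes are illustrative, not the settings from these experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # Positive phase: hidden probabilities given the data, then a sample
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Approximate log-likelihood gradient: data minus reconstruction stats
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```

Layers trained this way are stacked one on top of another, and the whole stack is then fine-tuned with supervised backprop, which is the pre-training recipe the talk refers to.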
0:13:44 To conclude this discussion about the architecture and so on, I looked at the frame-level phone error rate of all three networks, for some phones I picked, and we can see that on clean speech the phone error rates seem similar.
0:14:03 But then, when we add the noise, the DBNs learn more robust representations: when we add a large amount of noise, the deep neural net outperforms both the shallow neural net and the shallow net trained with the better optimization technique. So I believe that maybe the deepness helps to learn basically better representations of the data, as has also been found in computer vision and so on.
0:14:32 So, basically, to conclude. I think it is now clear that deep learning works also in tandem systems, not only in hybrid systems, which is good news for those who have a lot of engineering work built around GMM and HMM systems. Furthermore, I think that those working on hybrid systems should maybe add more spectral information somehow, especially if there is not a lot of noise.
0:15:04 Pre-training, we know, helps with overfitting, but it also helps generalization in the case where we have this quite clear mismatch between training, which was done on clean speech, and testing.
0:15:31 And also, the deep models, given the same amount of parameters, seem to be more robust in very high noise situations, which was also found in computer vision. Obviously, these conclusions are for now based on a fairly small task, and for future work it would be interesting to go to a larger dataset, which we are actually working on, and also to compare between the so-called deep neural net HMM hybrid systems and this deep tandem approach.
0:15:52 Thank you very much. Are there any questions?
0:16:01 Chair: We have time for questions.
0:16:05 Audience: I have a question regarding, or a comment on, your neural network. Can you go back to the slide with all the issues? Which of these did you say was a good thing, and which a bad one? This one and that one. So the first one is a bad thing for the deep network, and here you list two points: one problem is the vanishing gradient, and the other is overfitting, and to me these two points seem contradictory. Because if you are in the flat region, you are getting small gradients, and the model basically stays the same; you cannot really change it. So how do you get overfitting? Does this happen in some cases?
0:16:59 Speaker: Yes, this does happen in some cases. In my experiments I actually did not observe a whole lot of overfitting; I was just using L2 regularization on the weights, and I did not see overfitting. But in other cases, like if you read the Science paper from Hinton, there is a lot of it: you have only about 20,000 samples and the models overfit, you basically get to zero percent training error. In those cases, obviously, the optimization method does not matter that much; it is more about how you bias your weights, which is done using these RBMs.
0:17:37 Chair: The next question; I don't know who was first.
0:17:42 Audience: But you said that there is some pre-training happening, so that means you must be using some extra data, which is not used for the supervised training?
0:17:47 Speaker: Right, no, not in this case: the pre-training is unsupervised, so in principle you could put in a lot of data, a lot more than what we used to train, but here it was the same training set.
0:17:59 Audience: So my question is whether that is fair to the plain neural network, because the pre-trained model gets something extra. What you could try, to make it an adequate comparison, is to construct a network of the same size and train it the same way, at the same level, without the pre-training.
0:18:40 Speaker: That is what we did: we construct an identical network and train it the way the deep net was trained, taking the whole objective function of the neural net and doing backprop from random weight initialization. That is a fair comparison.
0:18:53 Audience: Uh-huh, okay.
0:18:57 Chair: Another question?
0:19:02 Audience: When concatenating the outputs from the MLP to the MFCCs, you probably want to get the most different information you can from the MLP. So I wonder: when you do the back-propagation, aren't you basically forcing the MLP to focus on something which discriminates between the classes that you decided to use? Did you try to look at how it would work if you concatenated features coming from a network that was trained differently, not with this discriminative training, so that they contain something different?
0:19:40 Speaker: So, actually, the deep neural net is trained before the concatenation happens. So in a sense you just have the phone targets, you train the neural net first, and then you concatenate and train the HMM. So I am not sure I follow.
0:19:57 Audience: I think you are using the outputs of the net, right?
0:20:07 Speaker: Yeah. So you train your network first, and then once you have it trained you keep it fixed, and you don't do any more backprop afterwards. You could do that, but...
0:20:17 Audience: Sure, but if you stopped before the back-propagation, you would still have the same number of outputs as you will have after the back-propagation, right? So you could concatenate to the MFCCs either the outputs you have now, or the ones you will have after the back-propagation, right? Or am I missing something?
0:20:40 Speaker: Sorry, I am not sure I understand.
0:20:43 Chair: We are running over time, so maybe you can discuss that offline. Okay, thank you.