0:00:17good morning everybody
0:00:21I'm very happy to see you all this morning
0:00:38Professor Li Deng, who will give the keynote this morning
0:00:43it's not so easy to introduce him, because
0:00:47he is very well known in the community
0:00:49he is a fellow of many societies, such as
0:00:53ISCA, IEEE, and the Acoustical Society of America
0:00:57he has published several hundred papers over the last years
0:01:05and given many different talks
0:01:08Li Deng did his PhD at the University of Wisconsin
0:01:13He started his career at the University of Waterloo
0:01:31He will talk to us today about two very important topics
0:01:37very important to all of us
0:01:39one is how to move beyond the GMM
0:01:44that's not so bad, because I started my career with GMMs
0:01:49I need some new ideas to do
0:01:53something else
0:01:54the second topic will deal with the dynamics of speech
0:01:59we all know that dynamics are very important
0:02:13we will not take any more time before his talk; I prefer to listen to him
0:02:20thank you, Li
0:02:27thank you, and thanks to the organizers and Haizhou
0:02:31for inviting me to come here to give this talk
0:02:34it is the first time I've attended Odyssey
0:02:37I've read a lot about what this community has been doing
0:02:41As Jean has introduced
0:02:45now I think not only in speech recognition but also in speaker recognition
0:02:51there are few fundamental tools so far
0:02:56the two in common are the GMM and MFCC
0:03:03last year, I learned a lot of other things from this community
0:03:07it turns out that the main message of this talk is that
0:03:11both of these components can potentially be replaced, with much better results
0:03:18I will touch a little bit on MFCC; I don't like MFCC
0:03:23and I think Hynek hates MFCC too
0:03:25only recently, once we started doing deep learning,
0:03:29has there been evidence that all of these components may be replaced; certainly in speech recognition, people
0:03:36have seen that it is coming
0:03:39hopefully, after this talk, you may think about whether in speaker recognition these components can
0:03:45be replaced
0:03:46to get better performance
0:03:49the outline has three parts
0:03:54In the first part, I will give a quick tutorial
0:03:59I have accumulated several hours of tutorial material
0:04:01over the last few months, so it is a little challenging to compress it down
0:04:07to this short tutorial
0:04:11rather than talking about all the technical details
0:04:14I've decided to just tell the story
0:04:26I also notice that in the next session after this talk
0:04:30there are a few papers related to this:
0:04:35Restricted Boltzmann Machines, Deep Belief Networks,
0:04:39Deep Neural Networks in connection with HMMs
0:04:46by the end of this talk, you may be convinced that these may be replaced
0:04:49as well
0:04:49in the future we can consider much better speech recognition performance than what
0:04:56we have
0:05:00and also the Deep Convex Network, or Deep Stacking Network
0:05:16over the last 20 years, people have been working on segment models and hidden
0:05:22dynamic models
0:05:22and 12 years ago, I even had
0:05:25a project with Johns Hopkins University working on this
0:05:29and the results were not very promising
0:05:35now we are beginning to understand why the great ideas we proposed there
0:05:39did not work well at that time
0:05:41it is only after doing this that we realized how we can put them together
0:05:45and that is the final part
0:05:51the first part
0:05:56how many people here have ever attended one of my tutorials over the last year?
0:06:01OK, it's a small number of people
0:06:09this you have to know: deep learning, sometimes called hierarchical learning,
0:06:14essentially refers to a class of machine learning techniques
0:06:18largely developed since 2006
0:06:21and this is actually the key paper
0:06:26that introduced a fast learning algorithm for what is called the Deep Belief Network
0:06:36in the beginning, this was mainly applied to image recognition, information retrieval, and other applications
0:06:43and Microsoft was actually the first to collaborate with University of Toronto
0:06:51researchers to bring it to speech recognition
0:06:54and we showed very quickly that not only for small vocabulary
0:06:58does it do very well, but for large vocabulary it does even better
0:07:02this really happens
0:07:03you know, in the past, methods worked well for small-vocabulary recognition but often broke down on larger tasks;
0:07:09here, the bigger the task, the better the success; I will try
0:07:14to explain
0:07:14to you why that happens
0:07:17and the Boltzmann machine; we will hear about Boltzmann machines in the following talks, and I think Patrick
0:07:22has two papers on that
0:07:24and the restricted Boltzmann machine
0:07:27this is a little bit confusing: if you read the literature,
0:07:31very often the deep neural network and the deep belief network,
0:07:36which are defined over here, are totally different concepts;
0:07:40one is a component of the other
0:07:44for the sake of convenience, authors often get confused
0:07:49and call the deep neural network a DBN
0:07:52and DBN also refers to the dynamic Bayesian network,
0:07:55which is even more confusing
0:07:57one thing I do
0:07:59in tutorials, for people who attended my tutorial, is give a quiz to check that
0:08:06people know all this
0:08:18last week, we had a paper accepted for publication, one I wrote together with
0:08:24Geoffrey Hinton and 10 authors altogether
0:08:27working in this area
0:08:29we tried to clarify all this, so we have a unified terminology;
0:08:31when you read the literature, you know how to map one term to another
0:08:38and the deep auto-encoder, which I don't have time to go into here; and I will say something
0:08:42about some new developments
0:08:43which to me are more interesting because of the limitations of some of the others
0:08:50This is a hot topic; here I list the recent workshops and special issues
0:08:59and actually, in Interspeech 2012
0:09:03you will see tens of papers in this area, most in speech recognition
0:09:07in fact, one of the areas has two full sessions on
0:09:16this topic, just for recognition
0:09:19and some others have more; and a special issue of
0:09:26PAMI, mainly related to machine learning aspects and also computer vision applications;
0:09:33I tried to put a few speech papers there as well
0:09:36and a DARPA program
0:09:40since 2009; I think they stopped it last year
0:09:47and I think in December there is another workshop related to this topic; it is
0:09:54very popular
0:09:55I think it is because people see the good results coming, and I hope
0:10:00one message of this talk is to convince you that this is a good technology, so
0:10:06you may want to seriously consider adopting some of its essence
0:10:10let me tell some stories about this
0:10:14so this was the first time
0:10:17deep learning showed promise in speech recognition
0:10:20and activities have grown rapidly since then; that was around
0:10:24two and a half years ago
0:10:28or three and a half years ago, whatever
0:10:31at NIPS; NIPS is a machine learning conference and workshop
0:10:35held every year
0:10:46so I think one year before that
0:10:49I actually talked with Geoffrey Hinton,
0:10:52a professor at Toronto
0:10:56he showed me the Science paper; he actually had a poster there
0:11:00the paper was well written and the results were really promising
0:11:05in terms of information retrieval, for document retrieval
0:11:08so I looked at this, and after that we started talking about
0:11:12maybe working on speech
0:11:15he had worked on speech a long time ago
0:11:24so we decided to have this workshop; we had actually worked together before
0:11:30my colleague Dong Yu, myself, and Geoffrey decided to submit
0:11:37a proposal, which was accepted, presenting the whole deep learning story and preliminary work
0:11:43at that time most people worked on TIMIT, a small experiment
0:11:46and it turned out that this workshop generated a lot of excitement
0:11:53so we gave a 90-minute tutorial
0:11:58about 45 minutes each: I talked about speech, and Geoffrey talked about deep
0:12:02learning at that time, and we decided
0:12:05to get people interested in this
0:12:07the custom at NIPS is as follows:
0:12:12at the end of the final day of workshops
0:12:18each organizer presents a summary of their workshop
0:12:24and the instruction is that it is a short presentation; it should be funny,
0:12:30it should not be too serious
0:12:32every organizer is instructed to prepare a few slides summarizing
0:12:41their workshop in a way that conveys their impression to the people attending
0:12:47this is the slide we prepared:
0:13:05a speechless summary presentation of the workshop on speech
0:13:10because we didn't really want to talk too much, just go up there and show
0:13:15that slide
0:13:16no speech, just animations
0:13:20so we said: we met this year
0:13:24these are supposed to be the industry people
0:13:31and these are supposed to be the academic people
0:13:33so they are smart, and deeper
0:13:37and they ask, can you understand human speech?
0:13:41and they say, we can recognize phonemes
0:13:47and they say, that's a nice first step; what else do you want?
0:13:52and they said they want to recognize speech in noisy environments
0:13:58and then he said, maybe we can work together
0:14:01so we have all the concepts together
0:14:14that's the whole presentation
0:14:24we decided to do small vocabulary first
0:14:31and then quickly, I think in December of 2010,
0:14:36we moved to very large vocabulary
0:14:38to our surprise, the bigger the vocabulary, the better the success
0:14:43very unusual
0:14:44and I analyzed the errors in detail myself
0:14:47you know, we had been working on this for some 20 years
0:14:54one surprise, which convinced me to work in this area,
0:15:02was that every error pattern I saw from this recognizer was very different;
0:15:08it is not just better in absolute terms, the errors are very different, and that means it is good for
0:15:13me to work on this
0:15:14anyway, let me talk about the DBN
0:15:20one concept is the deep belief network; that is what Hinton published in
0:15:28two papers there
0:15:34nothing to do with speech; it's called the deep belief network, and it's pretty hard to read
0:15:38if you have not been in the field for a while
0:15:40and there is another DBN, the dynamic Bayesian network
0:15:44a few months ago, Geoffrey sent me an email saying: look at this
0:15:50acronym, DBN versus DBN
0:15:59he suggested that before you give any talk, you check;
0:16:03in speech recognition, people mostly mean the dynamic Bayesian network
0:16:09anyway, let me give a little bit of technical content; time is running out quickly
0:16:17the first concept is the restricted Boltzmann machine
0:16:21actually, I have 20 slides on this, so I will just take one of the 20
0:16:26so think of this as the visible layer
0:16:30the label can be one of the visible units
0:16:33if we do discriminative learning; the other units are the observations
0:16:39think of MFCCs for the observations, and for the label
0:16:43a speech label: a senone or another label
0:16:47so we put them together as the visible layer, and we have a hidden layer here
0:16:51and the difference between a Boltzmann machine and a neural network is that
0:16:57the standard neural network is one-directional, from the bottom up,
0:17:00while the Boltzmann machine is bidirectional: you can go up and down; and the connections between
0:17:04neighboring units within this layer and within that layer are cut off
0:17:08if you don't do that, it is very hard to learn
0:17:11so one thing in deep learning is that they start with the restricted Boltzmann machine
0:17:17if you have bidirectional connections
0:17:21and you do all the detailed math, writing down the energy function, you can write down
0:17:25the conditional probabilities of the hidden units given the visible units, and the other way around
0:17:29and if you set the energy up right, you can actually make the conditional probability of the visible units
0:17:34given the hidden units Gaussian
0:17:35which is something people like; from this conditional you can interpret the whole thing
0:17:41as a Gaussian mixture model
0:17:42so you may think it is just a Gaussian mixture model, so they are interchangeable
0:17:48the difference is that here you get an almost exponentially large number of mixture components
0:17:55rather than a finite number
0:17:56I think in speaker recognition, it's about 400 or 1000 mixtures or so
0:18:06and here, if you have 100 hidden units
0:18:11you get an almost unlimited number of components
0:18:13but they are tied together
0:18:15Geoffrey has done very detailed mathematics to show that this is a very powerful way of
0:18:22doing Gaussian modeling
0:18:23you actually get a product of experts rather than a mixture of experts
0:18:33to me, that is one of the key insights we got from him
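The "exponentially many tied components" point can be made concrete with a toy Gaussian-Bernoulli RBM. This is a hedged numerical sketch, not the speaker's code; the sizes and random weights are made up. With unit-variance Gaussian visible units, the conditional p(v | h) is Gaussian with mean b + Wh, so every binary hidden configuration picks out one mixture component:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Toy Gaussian-Bernoulli RBM: 2 Gaussian visible units, 3 binary hidden units.
# With unit-variance visibles, p(v | h) = N(v; b + W h, I), so each of the
# 2^3 hidden configurations selects one Gaussian component.
n_vis, n_hid = 2, 3
W = rng.standard_normal((n_vis, n_hid))  # visible-hidden weights
b = np.zeros(n_vis)                      # visible biases

# Enumerate all hidden configurations and their component means.
means = np.array([b + W @ np.array(h)
                  for h in product([0, 1], repeat=n_hid)])
print(means.shape)  # (8, 2): 8 mixture components from only 3 hidden units
```

With 100 hidden units the same construction gives 2^100 component means, yet all of them are generated from just the 100 columns of W; that is the tying, and why this is far richer than a conventional GMM with a few hundred free components.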
0:18:37so that is the RBM; think of this as an RBM
0:18:40think of this as the visible layer,
0:18:44the observations, and the hidden layer; we put them together and we have it
0:18:47it is very hard to do speech recognition with it directly
0:18:52this is a generative model; you can do speech recognition with it, but if you do,
0:18:57the result is not very good
0:18:59attacking a discrimination task with a generative model, you are limited because
0:19:07you don't directly optimize what you want
0:19:11however, you can use it as a building block
0:19:16to build the DBN (deep belief network)
0:19:18the way it is actually done in Toronto:
0:19:24if you think of this as a building block,
0:19:28you can do learning; after you learn this (I will just skip the details,
0:19:33it would take a whole hour to cover the learning, so assume you know
0:19:35how to do it)
0:19:36after you learn this, you can treat the hidden layer as features extracted from the input
0:19:40and stack another one up on top
0:19:43deep learning researchers argue that this layer becomes the feature for the next
0:19:52and you can go further; I think of it as a brain architecture
0:19:56think of the visual cortex, with its 6 layers
0:19:59you build up: whatever is learned over here becomes the hidden feature
0:20:03hopefully, if you learn it right, you extract the important information from the data that
0:20:08you have
0:20:08and then you learn features of features, stacking up
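The stacking just described can be sketched as greedy layer-wise training: train one RBM with contrastive divergence (CD-1), freeze it, push the data through it, and train the next RBM on those activations. A minimal illustration with made-up sizes and a toy dataset, not the actual Toronto recipe:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(v, n_hid, epochs=100, lr=0.1):
    """CD-1 training of one Bernoulli RBM; returns the up-pass weights."""
    W = 0.01 * rng.standard_normal((v.shape[1], n_hid))
    b, c = np.zeros(v.shape[1]), np.zeros(n_hid)
    for _ in range(epochs):
        ph0 = sigmoid(v @ W + c)                       # up
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                    # down (reconstruction)
        ph1 = sigmoid(pv1 @ W + c)                     # up again
        W += lr * (v.T @ ph0 - pv1.T @ ph1) / len(v)   # data minus model stats
        b += lr * (v - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c

# Greedy layer-wise stacking: train a layer, freeze it, and feed its
# hidden activations to the next RBM as "features of features".
data = rng.integers(0, 2, size=(50, 12)).astype(float)
features, stack = data, []
for n_hid in (8, 6, 4):
    W, c = train_rbm(features, n_hid)
    stack.append((W, c))
    features = sigmoid(features @ W + c)
```

Each loop iteration only ever sees the frozen output of the layer below, which is what makes the procedure "greedy": no layer is revisited once trained.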
0:20:12why are we stacking up? there is actually an interesting theoretical result
0:20:16which shows that if you unroll a single DBN
0:20:20(sorry, one layer of RBM)
0:20:23as a belief network, it is actually equivalent to one of infinite depth
0:20:28because this is related to how learning works:
0:20:33learning actually goes up and down, and every time you go up and down, it
0:20:37can be shown that
0:20:39you actually get one layer higher; now, the restriction here is that
0:20:46all the weights have to be tied, which is not very powerful
0:20:50but we can untie the weights by doing separate learning per layer
0:20:54and when we do that, it is a very powerful model
0:20:55anyway, the reason why this one goes down and this one goes up and
0:21:00down is that, well,
0:21:02I don't have time to go into it here, but believe me:
0:21:05if you stack this up, one layer up,
0:21:10you can show mathematically that this is equivalent to having
0:21:15just one RBM at the top and then a belief network going down
0:21:20and this is actually related to the Bayesian network;
0:21:23a belief network is similar to a Bayes network
0:21:26but if you look at this, it is very difficult to learn:
0:21:30for the connections going down over here there is something in machine learning called the
0:21:36explaining-away effect
0:21:37so inference becomes very hard, while generation is easy
0:21:41and then the next invention in this whole theory was
0:21:47to just reverse the direction
0:21:51and turn it into a neural network; it turns out the theory is not
0:21:56as strong in that respect
0:21:59but in practice it works really well
0:22:00actually, I am looking into some of the theory behind this
0:22:04so this is the full picture of the DBN
0:22:08the DBN consists of bidirectional connections here
0:22:11and then single-direction connections going down
0:22:13if you do this, you can actually use it as a generative model and
0:22:17do recognition with it
0:22:18unfortunately, the result is not good
0:22:21there were a lot of steps before people reached the current state
0:22:25and I am going to show you all the steps here
0:22:27so, number one: the RBM is useful; it gives you feature extraction
0:22:31and you stack up a few layers of RBMs
0:22:34and you get a DBN; and at the end you actually need to do some discriminative
0:22:39learning
0:22:40but generally, the generative capacity is just very good
0:22:46the first time I saw
0:22:48the generative capability from Geoffrey, I was amazed
0:22:53this is the example he gave me
0:22:59you train on these digits
0:23:05the database is called MNIST
0:23:12an image database everybody uses, like our TIDIGITS in speech
0:23:19you put them in and you learn
0:23:21according to this standard technique
0:23:24then, say you want to synthesize a 1: you clamp the label unit for 1,
0:23:29setting it to one and all the others to zero, and then you run the network
0:23:35and you actually get something really nice; the same if you put a 0 here
0:23:37this is different from the traditional generative process
0:23:42the reason they are different is the stochastic process;
0:23:46it does not just memorize
0:23:50some of the generated digits are corrupted
0:23:53but most of the time you get realistic ones
0:23:54once, in one of the tutorials I gave,
0:23:58I showed this result, and there were speech synthesis people in
0:24:03the audience
0:24:04they said, that is great, I will do speech synthesis now
0:24:07instead of one fixed output per sentence, you get variation, not the same one every time
0:24:10like humans, who write differently every time they write
0:24:14they immediately went back to write a draft proposal and asked me to help them
0:24:22this is very good: with the stochastic component there, the results look like what humans produce
0:24:29now, we want to use it for recognition; this is the architecture
0:24:39I was amazed; I had a lot of discussion with Patrick yesterday
0:24:42I just feel that, when you have a generative model, you really need to...
0:24:54you put the image here, move up, and this becomes the feature
0:24:58and all you do is turn on these label units one by one
0:25:00and run for a long time until convergence
0:25:04and you look at the probability for this
0:25:05to get the number, OK
0:25:06then turn on the other units, run again, and see which number scores highest
0:25:13I suggest that you don't do that; don't waste your time
0:25:16number one, it takes a long time to do recognition; number two, we don't know how
0:25:21to generalize it to sequences
0:25:23and he said the result was not very good, so we did not do it
0:25:27we abandoned the idea of generation, of doing everything generatively; that's how we do it
0:25:36and that's how the deep neural network was born
0:25:39all you do is treat all the connections as going one way
0:25:47that's why, at the end, my conclusion is that the theory of deep learning is
0:25:51very weak
0:25:52ideally the DBN goes down; it is a generative model
0:25:57in practice, you say that is not good, so just forget about it,
0:26:01eliminate the downward direction, and make all the weights go up
0:26:05we modify it; the easiest way is to just forget about this, you know,
0:26:09just change it to go up; making it go down again, people don't like
0:26:14in the beginning, I thought it was horrible; it is crazy to do it,
0:26:19it just breaks the theory used to build the DBN
0:26:22finally, what gives the best result, what we do, is really the same as
0:26:28what the multilayer perceptron has been doing, except it just
0:26:33has very many layers
0:26:35now, if you do that, you typically randomize
0:26:40all the weights and learn; and then you run into the standard argument
0:26:44from 20-some years ago, which says
0:26:46mathematics proves that the deeper you go, the more
0:26:51trouble you have at the lower levels, because the label is at the top level
0:26:54so you do back-propagation, taking the derivative of the error from here down to here,
0:26:59and the gradient is very small
0:27:02you know, the sigmoid derivative is sigmoid times (1 - sigmoid)
0:27:05so the lower you go, the more chance the gradient term vanishes
0:27:14people didn't even try back-propagation for deep networks; it seemed impossible to
0:27:20learn, so they gave up
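The vanishing-gradient argument can be checked numerically: the sigmoid derivative s(1 - s) is at most 0.25, so multiplying it in at every layer shrinks the gradient roughly geometrically with depth. A small sketch (a toy chain without weight matrices, just to show the scale of the effect):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Back-propagating through a chain of sigmoids multiplies the gradient by
# s * (1 - s) <= 0.25 at every layer, so the gradient reaching the lowest
# layers shrinks roughly geometrically with depth.
x = rng.standard_normal(50)
grad = np.ones(50)
grad_norms = []
for _ in range(10):
    s = sigmoid(x)
    grad = grad * s * (1.0 - s)   # chain rule through one sigmoid
    grad_norms.append(np.abs(grad).mean())
    x = s                         # activations feed the next layer

# grad_norms[0] vs grad_norms[-1]: several orders of magnitude apart
```

Real networks interleave weight matrices between the sigmoids, which can amplify or shrink the signal further, but the repeated 0.25 ceiling is the core of the classic argument against training very deep sigmoid nets from random initialization.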
0:27:21and then one of the very interesting things that came out of deep learning was
0:27:27to say:
0:27:28rather than using random numbers, it could be interesting to plug in the DBN weights there
0:27:32(that is something I don't entirely like)
0:27:33but look at the argument for why it is good; what we do is train
0:27:38this DBN
0:27:38over here
0:27:41the DBN weights: you just train it as a generative model
0:27:46and once it is trained, you fix these weights and copy the whole set of weights into
0:27:52this deep neural network, to initialize it
0:27:54after that you do back-propagation
0:27:58again, the gradients down here are very small, but that's OK
0:28:02you already have the DBN over here
0:28:03well, the RBM: it should be RBM, not DBN anymore
0:28:09so it is not too bad
0:28:15so you see exactly how to train this; the point is just that random initialization is
0:28:25not good
0:28:27while using the DBN weights over here is not too bad; and over here, you
0:28:32just run recognition, for MNIST
0:28:38and the error goes down to 1.2%; that is all Geoffrey Hinton's idea
0:28:47and he published a paper about this; at that time, it seemed
0:28:51very good
0:28:52but I am going to tell you: that MNIST result was 1.2% error, and with a few
0:28:58more generations of networks, as I will show you, we are able to get 0.7%
0:29:05and the same kind of philosophy carries over to speech recognition
0:29:12I will go quickly; in speech, all of you think about how to handle sequences
0:29:18it is very simple
0:29:21now we have the deep neural network
0:29:24what we do is normalize the output using a softmax
0:29:28to make it, similar to the talk yesterday, a kind of calibration
0:29:35so we get posterior probabilities; divide by the priors and you get scaled likelihoods, and just
0:29:40use the HMM on top of that
0:29:42that's why it is called the DNN-HMM
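The posterior-over-prior trick he describes is easy to write down. A minimal sketch with made-up numbers (in practice the priors come from counting states in the training alignments):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hybrid DNN-HMM scoring: the softmax output gives state posteriors
# P(state | frame); dividing by the state priors P(state) gives scaled
# likelihoods proportional to P(frame | state), which replace the GMM
# scores inside the HMM decoder.
logits = np.array([[2.0, 0.5, -1.0],   # one row of DNN outputs per frame
                   [0.1, 1.5,  0.3]])  # (values made up for illustration)
priors = np.array([0.5, 0.3, 0.2])     # state frequencies from alignments

posteriors = softmax(logits)
scaled_log_lik = np.log(posteriors) - np.log(priors)
```

By Bayes' rule, P(frame | state) = P(state | frame) P(frame) / P(state); the P(frame) term is constant across states for a given frame, so it cancels in the Viterbi comparison and only the posterior/prior ratio matters.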
0:29:49the first experiment we did was on TIMIT
0:29:53with just phonemes, easy:
0:29:55each state is one of three states of a phoneme; very good results, which I can show
0:30:00then we moved to large vocabulary; one thing that we do in our company,
0:30:05you know, Microsoft calls them senones:
0:30:14rather than a whole phone, we cut it into context-dependent phone states
0:30:18that is our infrastructure
0:30:20so we don't change any of this
0:30:22rather than using 40 phones, what happens if we use 9000 outputs?
0:30:25you know, the senones; long ago people could not do that: 9000 outputs here seemed crazy
0:30:30with hidden layers of a few thousand units you have some 15 million weights, which is very hard to train
0:30:45so we replace this by ... it can be very large
0:30:52this is very large, and input is also very large as well
0:31:01so we use a big window
0:31:03we have a big output, big input, very deep, so there are 3 components
0:31:09why big input-long window
0:31:11which could not be done in HMM
0:31:13do you know why? because
0:31:15I have a discussion with some experts it could not be done for speaker recognition,
0:31:22for speech recognition, the reason why it couldn't be done, because
0:31:26first of all you have to diagonalize HMM
0:31:32but its not big, if you do too big, Gaussian becomes sparseness problem
0:31:37covariance matrix
0:31:39for the end, all we do that make it simple as possible, just plug whole
0:31:43long window
0:31:44and then feed whole thing, we get million of parameters
0:31:48typically, this number is around 2000
0:31:502000 here, every layer, 4 million parameters here, another 4 million, another 4 million
0:31:55and just use GPU to train the model together
0:31:57here is not too bad
0:31:59so we use about 11 frames
0:32:04now it has even been extended to 30 frames
0:32:11but in the HMM era, we never imagined doing that
0:32:14we don't even decorrelate these; we just feed the raw
0:32:16values in over here
0:32:17in the beginning, I still used MFCCs with deltas and delta-deltas
0:32:23multiplied by 11 or 15 frames, whatever
0:32:26then we have a big input
0:32:28which is still small compared with the hidden layer size
0:32:31and we train the whole thing, and everything works really well
0:32:33and we don't need to worry about correlation modeling, because the correlation is automatically captured by
0:32:38the weights here
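The long-window input can be sketched as simple frame stacking: each frame is concatenated with its neighbours to form the DNN input vector. An illustrative sketch; the dimensions assume the common 39-dim MFCC+delta+delta-delta features with an 11-frame window, as in the talk:

```python
import numpy as np

def stack_frames(feats, context=5):
    """Concatenate each frame with +/-context neighbours (11 frames for
    context=5), repeating the first/last frame to pad the edges."""
    T = len(feats)
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].ravel()
                     for t in range(T)])

# 100 frames of 39-dim MFCC+deltas become 100 vectors of 429 dims (39 x 11):
# the long-window input that diagonal-covariance Gaussians could not handle.
mfcc = np.random.default_rng(0).standard_normal((100, 39))
windowed = stack_frames(mfcc, context=5)
print(windowed.shape)  # (100, 429)
```

The network sees the neighbouring frames' correlations directly through its first weight layer, which is why no explicit decorrelation or covariance modeling is needed.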
0:32:40the reason I bring this up is just to show you that this is not trivial:
0:32:55we went through the history and the literature, and we never saw anyone feed speech in like this until this
0:33:02first appeared
0:33:03now let me just give you a picture here; everybody knows the GMM
0:33:09HMM, GMM; the whole point is to show you that
0:33:15in the same kind of architecture, if you look at the HMM,
0:33:18you can see the GMM is very shallow:
0:33:21all you do for each state is output one score from the GMM
0:33:26over here, you can see many layers
0:33:28so you build features up layer by layer; this slide shows deep versus shallow
0:33:33here is the result. We wrote the paper together; it will appear in November
0:33:41and that result summarizes
0:33:45the research of four groups together over the last three years
0:33:49since 2009:
0:33:51the University of Toronto, Google, IBM,
0:33:55and our Microsoft Research, which was the first to do
0:33:58actually serious work on this for speech recognition
0:34:01the Google data and the IBM data
0:34:03all confirm the same kind of effectiveness
0:34:05here is the TIMIT result
0:34:10it is very nice; everybody thinks TIMIT is very small
0:34:14but if you don't start with it, you get scared away
0:34:18I will come back to this in the 2nd part of the talk: the monophone
0:34:24hidden trajectory model I did many years ago
0:34:26to get this number took 2 years
0:34:29I wrote the training algorithm, and my colleagues very kindly wrote the decoder for me; this
0:34:36is a very good number
0:34:38for TIMIT, and the decoding is very hard to do
0:34:48the first time we tried this DBN
0:34:50deep neural network
0:34:55I wrote this paper with ...; what we do is MMI training:
0:35:03you can do back-propagation through the MMI objective for the whole sequence
0:35:09so we got 22%, an improvement of almost 3%
0:35:18and then we looked at the errors: between this system and this one they are very different; especially for
0:35:22very short segments
0:35:23it is not really better, but on the very long ones it is much better
0:35:27I've never seen that before
0:35:30now compare this with the HMM work,
0:35:34which has been done for 20-some years
0:35:37the error started around 27%, about 4% higher
0:35:42and over 10 or 15 years, the error dropped about 3%
0:35:50so this and this are very similar in terms of error rate
0:35:58but the errors themselves are very different
0:36:06so the first experiment was voice search
0:36:10at that time, voice search was a very important task, and now voice search is
0:36:16everywhere:
0:36:18Siri has voice search, in Windows Phone we have it,
0:36:23even in Android phones
0:36:25a very important topic
0:36:27so we have the data, and we worked on this very large vocabulary task
0:36:33in the summer of 2010
0:36:35we were the first in our group; it gave us a boost, because the task is so different
0:36:44from TIMIT
0:36:47and we actually didn't even change the parameters at all;
0:36:49all the parameters, the learning rates, came
0:36:52from our previous work on TIMIT
0:36:54and we got the error down; that is the paper we wrote,
0:36:57which just appeared this year
0:37:04and then this is the result that we got
0:37:07if you actually want to see exactly how this is done
0:37:11most of it is provided
0:37:13in this paper; it reads like a recipe
0:37:15telling you how to train the system
0:37:17but you need a GPU for the implementation; without a GPU, it takes 3 months just
0:37:21for the experiments
0:37:22on large vocabulary, while with a GPU it is really quick
0:37:25most of it is standard: you do this, then this
0:37:32we tried to provide as much theory as possible
0:37:36so if you want to know how to do this in some application, take a
0:37:40look at this paper
0:37:40so this is the first time we saw
0:37:44the effect of increasing the depth of the DNN for large vocabulary
0:37:50with our systems, the accuracy goes up like this
0:37:58and the baseline, an HMM with discriminative MPE training,
0:38:05is around 65%; this is just the neural network
0:38:08even a single hidden layer does better than all of that
0:38:12and as you increase the depth, you gain more
0:38:15when you go further, there is some kind of overfitting; the data was limited: we had labeled
0:38:1924 hours
0:38:20of data at that time, so we said,
0:38:23do more; we tried 48 hours
0:38:25and the error dropped a lot
0:38:26so the more data you have, the better you get
0:38:29some of my colleagues asked why we don't use Switchboard
0:38:36I said, this is too big for me, we won't do it
0:38:38but then we actually did Switchboard
0:38:40and we got a huge gain
0:38:41even more gain than what I showed you here
0:38:43just because of more data
0:38:45so it is a typical pattern
0:38:46voice search is not really spontaneous speech, but Switchboard is spontaneous
0:38:52so this holds for spontaneous speech as well
0:38:55it seems that even with limited data the gains here are quite heavy
0:38:58and then you get 1 or 2 orders of magnitude more data
0:39:02and you have many more GPUs to run on, much better software;
0:39:05everything runs well
0:39:08it turns out to be the same kind of recipe,
0:39:10which we published over here
0:39:14let me show you some of the results
0:39:16this is the result; this is the table in our recent paper
0:39:24with the Toronto group
0:39:29so the standard GMM-based HMM
0:39:31with 300 hours of data
0:39:33has an error rate of about 23-some percent
0:39:38we very carefully
0:39:43tuned the parameters; this parameter (the number of layers) has been tuned
0:39:44and we got from here down to here
0:39:47and that actually attracted a lot of people's attention
0:39:49and then we realized
0:39:53we had 2000 hours, and the result from that is even better
0:39:56and at that time, that was the Microsoft result
0:40:01and then one recent paper published this result...
0:40:08of course, when you do that, people argue that you have 29 million parameters
0:40:14and people always, you know,
0:40:16nit-pick; people in the speech community
0:40:19obviously say, uh, you have more parameters, of course you're going to win;
0:40:21so what if you use the same number of parameters?
0:40:23we said, fine, we'll do that
0:40:24so we used sparseness coding
0:40:26to actually cut out weights
0:40:28and the number of non-zero parameters is 15 million
0:40:33with the smaller number of parameters,
0:40:34we get an even better result
0:40:36it's amazing... the capacity of the deep network is just tremendous
0:40:40you cut the parameters;
0:40:41in the beginning, we didn't;
0:40:42typically, you expect the result to be similar, right?
0:40:44you get rid of the lower-magnitude weights
0:40:46and you get a slight gain
0:40:47but that may not carry over once we get more data anyway
0:40:49so this is maybe
0:40:50within statistical variation, but still
0:40:53so, with a smaller number of parameters
0:40:55than the GMM-HMM, which was trained using discriminative training,
0:40:58we get about a 30-something percent error reduction
0:41:02more so than on TIMIT, and also more so than on
0:41:06our voice search task
0:41:10and then this is another paper, and then IBM came along
0:41:12and then Google came along, they say you know, it's better result, I think they
0:41:16want to do as well
0:41:18so you can see that thesis's Google result
0:41:19thesis's about 5000 hours, amazing right
0:41:22they just have better infrastructure
0:41:24mapping this all that, so they manage to do that on 5000, 6000 hours
0:41:29so this number just came up
0:41:32actually that number
0:41:33so actually this will be in the Interspeech papers, if you go to see them
0:41:38so one of the things Google does is that they don't put this baseline result
0:41:42they just give a number,
0:41:44so you just ask what number they have
0:41:47so... sorry.. sorry
0:41:50with more data they have this; with the same data they don't have the number, either
0:41:53they just don't bother to do it
0:41:54they all believe more data is better
0:41:56so with a lot more data they got this
0:41:58and then we, with about, how many, about three
0:42:01uh, with this much data
0:42:04take about 12%; it is better when we got more data
0:42:07they should put a number here, anyway
0:42:09so I'm, we're not nitpicking on this
0:42:12and this is the number I show, this is Microsoft's result, the number from here to here
0:42:16from here to here for different
0:42:18these are 2 different test sets
0:42:19and all these, all the people are here, you should know, this is very important
0:42:23for our review
0:42:24ah now, this is IBM result
0:42:27ah sorry, this is voice search result that I showed you early
0:42:29this is 20%
0:42:31it's not bad
0:42:32so because you have 20 hours of data, so
0:42:34it turns out the more data you have
0:42:36the more error reduction you have
0:42:38and for TIMIT, we get only about 3-4% absolute, about ten-something percent
0:42:43now, and this is the
0:42:46so this broadcast result is from IBM
0:42:50and I heard that in Interspeech, they have much better result than this
0:42:55so if you're interested, look at it
0:42:57my understanding is that
0:42:59from what I heard, is that their result is comparable to this
0:43:02some people say even better
0:43:04so if you want to know exactly what IBM is doing, they have even better results
0:43:08in terms of distributed learning
0:43:10compared to most other places
0:43:12but anyway so this kind of error reduction
0:43:15has been unheard of in the history of this area, about 25 years
0:43:20and the first time we got these results, we're just stunned
0:43:22and Google, this is also Google's result, and even Youtube speech which is much more
0:43:27spontaneous with all the noise
0:43:29they also manage to get something from here
0:43:31this time they're pretty honest to put this over here; with the same amount of
0:43:3514 hours of data they got more
0:43:37but in our case, with 2000 hours, we actually get more gain
0:43:40rapid gain ah yes
0:43:41so the more data you have
0:43:43and then of course, to get this, you have to tune the depth
0:43:45the more that you have, the deeper you can go
0:43:47and the bigger you may want to have
0:43:50and the more gain you have
0:43:51and this is the point I want to comment on:
0:43:52all this without having to change major things in the system architecture
0:43:58OK, so once
0:44:00one thing that we found
0:44:01so my colleagues Dong Yu and myself and ah and
0:44:06recently found was that
0:44:10so in most of the thing that we
0:44:13I believe in old days IBM and
0:44:15and Google and our early work
0:44:17we actually use DBN to initialize our model off-line
0:44:20we said can we get rid of that? that training is very tricky, not many
0:44:23people know how to do that
0:44:27for a certain recipe, you have to look at the pattern
0:44:30it's not an obvious thing how to do, because of the learning
0:44:33there's a keyword in the learning called contrastive divergence, you might hear that word
0:44:37in the later
0:44:38part of the talk today
0:44:42contrastive divergence: in theory,
0:44:44essentially the idea is you should, you know, iterate
0:44:47you should do Monte Carlo simulation,
0:44:49Gibbs sampling for an infinite number of steps
0:44:52but in practice, that's too long
0:44:55so you cut it to one
0:44:56and of course from that, you have to use variational learning,
0:45:00a variational bound, to get a better result
0:45:02it's a bit tricky
0:45:04that's why it's better to get rid of it
0:45:06so our colleagues actually have a patent filed just a few months ago
0:45:10on this, and also a paper from my colleague
0:45:13would actually use ... for the
0:45:17for switchboard task
0:45:18and they show that
0:45:19you actually can do comparable things to RBM learning
0:45:23so I would say now, for large vocabulary
0:45:26we don't even have to learn much about DBN
0:45:30so .. the theory so far is not clear
0:45:34exactly what kind of power you have
0:45:36but my sense is that
0:45:39if you have a lot of unlabeled data in the future
0:45:42it might help
0:45:44but we also did some preliminary experiments to show it's not the case any more
0:45:47so it's not clear how to do that
0:45:49so I think at this point we really have to get a better theory
0:45:52until you get a better theory, and also comparable results,
0:45:54you know, it's, ah,
0:45:56all these issues cannot be settled
0:45:58so the idea of discriminative pre-training is that
0:46:01you just train a standard, um,
0:46:07multi-layer perceptron, using, you know,
0:46:10this is easy to train. A shallow one you can train, but the result's not very good
0:46:14and then every time
0:46:15you do this, you fix this,
0:46:17you add a new layer, and you train it. You need to fix the lower layers
0:46:21from the previous shallower network
0:46:24and that's good, that's the spirit, it's very similar to
0:46:28OK .. the spirit is very similar to layer by layer learning
0:46:31now every time
0:46:32when we add up a new layer, we inject
0:46:35discriminant labeled information
0:46:37and that's very important; if you do that, nothing goes wrong
0:46:39so if you just use all random numbers, to go over here and do
0:46:42that, then nothing is going to work
0:46:44uh, but except
0:46:45there's some exception here, but I'm not going to say much about
0:46:48but once you do this
0:46:50layer by layer with
0:46:52the spirit is still similar to DBN right, layer by layer
0:46:55but you inject discriminative learning
0:46:57I believe it's very natural thing to do
0:46:59we talked about this right
0:47:00so we learn
0:47:02the generative learning in DBN
0:47:04you know, layer by layer, to be very careful
0:47:07you don't just do it too much
0:47:08and then if you inject some discriminant information
0:47:12it's bound to happen
0:47:13you get new information there, not just looking at the data itself
0:47:16and it turns out that if we do, we get
0:47:19we actually in some experiment we even get slightly better result than DBN training
0:47:23so it's not clear that the generative learning
0:47:26plays, is going to play a more important role
0:47:29as some people claimed
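The greedy discriminative pre-training procedure described above, training a shallow network with the labels, freezing it, then adding one layer at a time, can be sketched roughly like this. The layer sizes, learning rate, and ReLU units are illustrative assumptions, not the configuration actually used in the work discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_top_layer(feats, y, n_hidden, n_classes, lr=0.1, steps=100):
    """Backprop through one new hidden layer plus a softmax head,
    keeping all lower (already trained) layers frozen."""
    Wh = rng.standard_normal((feats.shape[1], n_hidden)) * 0.1
    Wo = np.zeros((n_hidden, n_classes))
    Y = np.eye(n_classes)[y]                       # one-hot labels
    for _ in range(steps):
        h = np.maximum(feats @ Wh, 0.0)            # ReLU hidden units
        logits = h @ Wo
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # softmax
        g = (p - Y) / len(y)                       # cross-entropy gradient
        Wo_new = Wo - lr * (h.T @ g)
        Wh -= lr * (feats.T @ ((g @ Wo.T) * (h > 0)))
        Wo = Wo_new
    return Wh

def greedy_discriminative_pretrain(X, y, layer_sizes, n_classes):
    """Add one hidden layer at a time; each new layer is trained with the
    label information (discriminatively), then frozen before the next."""
    feats, stack = X, []
    for size in layer_sizes:
        Wh = train_top_layer(feats, y, size, n_classes)
        stack.append(Wh)                           # freeze this layer
        feats = np.maximum(feats @ Wh, 0.0)        # its output feeds the next
    return stack

X = rng.standard_normal((50, 8))                   # 50 toy frames, 8 features
y = rng.integers(0, 3, size=50)                    # 3 toy classes
stack = greedy_discriminative_pretrain(X, y, [16, 16], n_classes=3)
```

The key point from the talk is visible in the structure: the spirit is still layer-by-layer, like DBN pre-training, but every new layer sees the labels rather than only the data.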
0:47:32OK so I'm done with
0:47:35the deep neural network, so I spend a few minutes to tell you a bit
0:47:39more about
0:47:39some other kind of architecture called the deep convex network
0:47:45which to me is kind of more interesting
0:47:47so I spend most time on this
0:47:49so actually we have a few papers published, it turned out that
0:47:54so the idea of this network is that
0:47:56while this is actually done for MNIST
0:47:58so when we use this architecture
0:48:00we actually get so much better result than DBN
0:48:04so we're very excited about this
0:48:05but the point is that the learning has to.. you know
0:48:08we have to simplify this network
0:48:10it turns out learning now
0:48:11the whole thing is actually convex optimization
0:48:14So I do not have time to go through all this
0:48:15but we have time for the parallel implementation
0:48:17which is almost impossible
0:48:19for deep neural network
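A minimal sketch of the stacking idea behind such a deep convex (stacking) network: each module has one nonlinear hidden layer, the linear output weights are solved in closed form (a convex least-squares problem, no backprop), and the next module's input is the raw input concatenated with the previous module's output. The random lower-layer weights and the sizes here are illustrative assumptions; the real network initializes and tunes them more carefully:

```python
import numpy as np

rng = np.random.default_rng(0)

def dcn_module(x, targets, n_hidden=32, reg=1e-3):
    """One stacking module: a sigmoid hidden layer with (here) random lower
    weights, then a linear output layer solved in closed form by ridge
    regression, which is a convex least-squares problem."""
    W = rng.standard_normal((x.shape[1], n_hidden)) * 0.5
    H = 1.0 / (1.0 + np.exp(-(x @ W)))             # hidden representation
    U = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ targets)
    return H @ U                                   # module's prediction

def deep_stacking_fit(x, targets, n_modules=3):
    """Stack modules: each new module sees the raw input concatenated with
    the previous module's output (input-output concatenation)."""
    inp, out = x, None
    for _ in range(n_modules):
        out = dcn_module(inp, targets)
        inp = np.concatenate([x, out], axis=1)
    return out

X = rng.standard_normal((50, 10))                  # 50 toy samples
T = np.eye(3)[rng.integers(0, 3, size=50)]         # one-hot targets, 3 classes
pred = deep_stacking_fit(X, T)
```

Because each module's output weights come from a single linear solve, modules can be fit on large batches and distributed, which is the parallelization advantage the talk contrasts with stochastic-gradient fine-tuning of a deep neural network.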
0:48:22and the reason for those of you who've been actually working on neural network, you
0:48:25notice that
0:48:26the learning for
0:48:27the discriminative learning phase,
0:48:30which is called the fine-tuning phase,
0:48:32is typically stochastic gradient descent
0:48:34you cannot distribute
0:48:35so this one cannot be distributed
0:48:36so I'm not going to
0:48:38I really want to use
0:48:39this architecture to try speech recognition tasks, so usually we have lots of discussion
0:48:42so maybe 1 year from now
0:48:44so if it's working well for your discriminative learning task
0:48:48I'm glad that
0:48:49this now is going to define the task
0:48:51for discrimination that I, I had
0:48:54discussions about, so
0:48:56that gives me the opportunity to try this
0:48:58I'd love to try it, I'd love to report the result
0:49:00even if it's negative, I'm happy to share it with you
0:49:02OK, so this is a good architecture
0:49:04and another architecture that we tried
0:49:06is that we split the hidden layers into 2 parts
0:49:09we do the cross-product, so that overcomes
0:49:12some of the DBN weaknesses,
0:49:14originally not being able to model correlation in the input
0:49:18and people just try a few tricks
0:49:20you know more than correlation
0:49:22it did not work well, almost impossible
0:49:24so this is very easy to implement
0:49:26and most of the learning here is convex optimization
0:49:28and we often get very good results compared to others
0:49:31there's another architecture called the tensor
0:49:34so the same kind of correlation
0:49:36modeling for tensor version
0:49:38also can be carried out into
0:49:40deep neural network
0:49:41so we actually, my colleague, we actually submitted a paper to Interspeech
0:49:44I think if you're interested in this one, should go there to take a look
0:49:47at it
0:49:47so the whole point is that
0:49:48now rather than doing the stacking using input output concatenation
0:49:52you can actually do the same thing for each hidden layer of the neural network
0:49:56so in this paper, we actually evaluate that on the switchboard
0:50:00and we get an additional 5% relative gain over the best we have got so
0:50:03far. So this is good stuff
0:50:05so the learning becomes trickier
0:50:06because when you do, so, the back propagation,
0:50:11you have to think about how to do this
0:50:13it adds some additional nuisance in terms of efficient computation
0:50:16but the result is good
0:50:18so now I'm going to the second part, I'm going to skip most of them
0:50:22OK skip most of them
0:50:25OK so this uh... I actually wrote a book on this
0:50:27so this is
0:50:29dynamic Bayesian network as deep one
0:50:31the reason why it's deep is there are many layers
0:50:34so you get the target
0:50:35you get articulation
0:50:36you get environment, all together this
0:50:38so we tried that
0:50:40and the implementation of this is very hard
0:50:43so I just go quickly and then to go to the bottom
0:50:47uh, so, uh, this is one of the paper that
0:50:50uh, I wrote uh, together with
0:50:53one of the experts, who actually
0:50:54this is my colleague who actually invented this variational Bayes
0:50:58and then ... to work with him
0:51:01to implement this variational Bayes
0:51:03into this kind of ...
0:51:06dynamic Bayesian network
0:51:07and the result is very good
0:51:09and the journal paper we published is wonderful
0:51:11so you can actually synthesize
0:51:13you can track all these formants in very precise manner
0:51:17and also some articulatory problems; it's very amazing, but once you do recognition
0:51:21the result is not very good
0:51:22so I'm going to tell you why, if we have time
0:51:25and then of course
0:51:26one of the problems
0:51:27so this 2006 we actually
0:51:31so we realized that kind of learning is very tricky
0:51:34essentially you approximate things, and you don't know what you approximate for
0:51:39that's one of the problems of deep Bayesian networks; it's very tricky
0:51:42but you can get some insights
0:51:43you work with all the experts in the [ ... ]
0:51:45at the end at the bottom line
0:51:47we really don't know how to interpret
0:51:49but you... but is just
0:51:51you don't know how much you lose right
0:51:52so we actually have the simplified version that I spend all time working on, and
0:51:56that gives me this result
0:51:58that's actually the paper
0:51:59so this is .. is about
0:52:01about 2-3 percent better than the best
0:52:04context dependent HMM
0:52:06I was happy at that time; we stopped at this
0:52:08once we do this
0:52:09and it's so much better than this
0:52:10so in other words, DBN
0:52:12related work, or at least on the TIMIT task,
0:52:14does so much better
0:52:16than the dynamic Bayesian network kind of work
0:52:19and then we're happy about this
0:52:21now of course I won't
0:52:22yes, so this is the history of dynamic model
0:52:25and a whole bunch of thing going on there
0:52:27and the key is how to embed
0:52:29such dynamic property into the DBN framework
0:52:33if you embed the property of
0:52:36the big chunk into the
0:52:37dynamic Bayesian network, it's not going to work, due to technical reasons
0:52:42but the other way around has a hope, that's one of the
0:52:46so that's what part 3 is going to tell you
0:52:49which I'm running out of time
0:52:50I'm actually going to show you
0:52:52first of all some of the lessons
0:52:54so this is the deep belief network or deep neural network
0:52:57and this, I used the * here, to refer that to as Dynamic Bayesian Network
0:53:02so one
0:53:05so all these hidden dynamic models are special cases of the Bayesian network
0:53:10you can see that, or otherwise I showed you earlier on
0:53:13there were a few key differences that we learned
0:53:15one is that for DBN
0:53:17it's a distributed representation
0:53:20so in our current system, for this system
0:53:23in our HMM/GMM system
0:53:25we have the concept that this particular model
0:53:28is related to /a/,
0:53:29this particular model is related to /e/, right
0:53:31you have this concept right, and of course you need training to mix them together
0:53:34but you still have the concept
0:53:35whereas in this neural network.. no .. each weight
0:53:39codes all class information
0:53:41I think it's very powerful concept here
0:53:43so you learn things and get distributed
0:53:45it's like neural system right
0:53:47you don't say this particular neuron contains visual information
0:53:50it can also code audio information together
0:53:53so this has better
0:53:55neuron basis compared with conventional techniques
0:53:58also, when we did this model
0:54:01we just got one single bit wrong
0:54:04at that time, we all said we want a parsimonious model representation
0:54:08that's just wrong
0:54:105 years ago, 10 years ago, may be OK right
0:54:12now in our current age
0:54:14just use massive parameters if you know how to learn them
0:54:17and also you know how to regularize them well
0:54:19and it just turns out that the DBN has a mechanism
0:54:21to automatically regularize things well
0:54:24and that is not proven yet, I don't have the theory to prove that
0:54:26but in our, you know, every time you stack up,
0:54:29you can intuitively understand that
0:54:31you don't overfit, right
0:54:32because if you did overfit, you'd have done this many years ago
0:54:36but if you do this, you know, keep going deep, you don't overfit, because
0:54:39whatever information that you get applied to
0:54:41the new parameters
0:54:43actually sort of takes into account
0:54:46the features from the lower parameters, so they don't count as lower
0:54:50model parameters any more, so automatically you have the mechanism to do this
0:54:53but in the dynamic Bayesian network, you don't have that property
0:54:55you need to stop; it doesn't have that property
0:54:57so this is very strong
0:54:59and another key difference
0:55:01is something I talked about earlier
0:55:03product vs mixture
0:55:06mixture means you sum up probability distributions
0:55:08and product means you take the product between them
0:55:11so when you take the product, you actually exponentially expand the power of representation
0:55:16So these are all the key differences between these two types of models.
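The mixture-versus-product distinction can be made concrete with two toy distributions over three classes: a mixture averages them, while a product of experts multiplies and renormalizes, which sharpens the result wherever the experts agree (the numbers here are arbitrary):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])        # expert 1's distribution over 3 classes
q = np.array([0.5, 0.1, 0.4])        # expert 2's distribution

mixture = 0.5 * p + 0.5 * q          # mixture: average of the distributions
product = p * q / np.sum(p * q)      # product of experts, renormalized

print(mixture)   # [0.55 0.2  0.25]
print(product)   # about [0.81 0.08 0.11]: much sharper on the agreed class
```

Both experts favor class 0, so the product concentrates far more mass there than the mixture does; this is the "exponential expansion of representational power" in miniature.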
0:55:19Another important thing is that for this learning we combine generative and discriminative.
0:55:26Although, given the final result we got, we still think that discriminative is more important than generative.
0:55:32But at least in the initialization, we use the generative model and DBN to initialize
0:55:38the whole system and discriminative learning to adjust the parameters.
0:55:42The generative model we did earlier is purely generative.
0:56:02Finally, longer windows or shorter windows.
0:56:07In the earlier case, I am still not very happy about longer window.
0:56:15Because every time you model dynamics, which I've actually talked about, there are about three methods.
0:56:21How to build dynamics into the model, they both have a very short history, not
0:56:29long history.
0:56:30No history of research actually focused on dynamics.
0:56:34There are so many limitations; you have to use a short window. With a long window, nothing
0:56:39worked; we've tried all these.
0:56:46So deep recurrent network is something that many people working on now.
0:56:52In our lab, in the summer, almost all the projects relate to this. Maybe
0:56:58not all, at least very large percentage.
0:57:01It has worked well for both acoustic model and language model. I would say that,
0:57:07recurrent network has been working well for acoustic modeling.
0:57:29In language modeling, there is a lot of good project in the recurrent network.
0:57:47The weakness of this approach is that it has only a generic temporal dependency.
0:57:59I have no idea what it is; there is no constraint on one state following another. This
0:58:06kind of temporal modeling power is not very big.
0:58:09The dynamics in DBN is much better.
0:58:15In term of interpretation, in term of generative capability, in term of physical speech production
0:58:19mechanism, it is just better. The key is how to combine them together.
0:58:23We don't like this, and we have shown that all this does not capture the
0:58:32essence of speech production dynamics.
0:58:35There is huge amount of information redundancy, think about you have a long window here
0:58:41and every time you shift ten milliseconds, 90% of the information is overlapping.
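The redundancy claim is simple arithmetic; assuming, purely for illustration, a 100 ms analysis window advanced by the usual 10 ms frame shift:

```python
# hypothetical 100 ms window with a 10 ms shift between successive frames
window_ms, shift_ms = 100, 10
overlap = (window_ms - shift_ms) / window_ms
print(overlap)  # 0.9, i.e. 90% of each window repeats the previous one
```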
0:58:59And some people may argue that it doesn't matter and they did experiment to show
0:59:03that it doesn't help at all.
0:59:04An important optimization technique is the Hessian-free method.
0:59:18I am not sure in language modeling, you may not do that actually, but in
0:59:22acoustic modeling, this is a very popular technique.
0:59:25And also, another point is that recursive neural network for parsing in NLP has been
0:59:31very successful.
0:59:32I think last year in ICML, they actually presented the result of recursive neural net
0:59:36which is not quite the same as this, but used the structure for the parsing,
0:59:40they actually got state-of-the-art result for the parsing.
0:59:43The conclusion of this slide is that it's an active and exciting research area to work on.
0:59:46So the summary is as follows. I provided historical accounts of two fairly separate research areas.
0:59:57One is based upon DBN, the other one is based on Dynamic Bayes Network in
1:00:05So I actually hopefully show you that speech research motivates the use of deep architectures
1:00:13from speech production and perception mechanisms.
1:00:16And HMM is a shallow architecture with GMM to link linguistic units to observations.
1:00:26Now I have shown you that, I didn't have time to talk about this, the
1:00:31point is this kind of model has had less success than expected.
1:00:34And now we are beginning to understand why that is a limitation over here, and
1:00:40actually I have shown some potential possibilities of overcoming that kind of limitations in the
1:00:47neural network framework.
1:00:48So one of the things is that we understand why the models of this kind that have
1:00:53been developed in the past have not been able to bring the dynamics
1:00:58into the deep network.
1:01:00It's because we didn't have the distributed representation, didn't have massive parameters, didn't have fast
1:01:06parallel computing and we didn't have product of experts.
1:01:09All these things are good for this, but the dynamics are actually good for that,
1:01:13and how to merge them together, I think, is a very important research direction to actually
1:01:17work on.
1:01:18You can actually make the deep network more scientific in terms of speech perception
1:01:23and recognition
1:01:24So the outlook, the future direction, is that so far we have DBN-DNN to
1:01:32replace HMM-GMM.
1:01:34I would expect that within three to five years, you may not be able to see the
1:01:40GMM, especially in recognition.
1:01:41I think in industry. If I am wrong, then shoot me.
1:01:49The dynamic properties modeled by this Dynamic Bayesian Network for speech have the potential to
1:02:01replace the HMM.
1:02:15And the Deep Recurrent Neural Networks, for which I have probably tried to argue that there
1:02:21is a need to go beyond the unconstrained temporal dependency while making it easier to learn.
1:02:27Adaptive learning is so far not so successful yet, we tried a few projects, it
1:02:33is harder to do it.
1:02:35Scalable learning is hard, for industry at least; for academics, don't worry about
1:02:42it. As long as NIST defines it into small tasks, you will be very happy to
1:02:47work on that. But for industry this is a big issue.
1:02:50Reinventing our infrastructure at the industrial scale. I don't think we have time to go through
1:02:59all the applications.
1:03:00Spoken language understanding has been one of the successful applications I've shown you.
1:03:08Information retrieval, language modeling, NLP, image recognition, but the speaker recognition is not yet.
1:03:24The final bottom line here is that the deep learning so far is weak in
1:03:30theory, I have I have convinced you about it with all the critics.
1:05:18In Bengio's case, he randomizes everything first. And then if you do that, of course,
1:05:24it is bad.
1:05:26So the key is that, if you initialize with something that does well, I think to
1:05:31me the generative model may be useful in that case. But the key of this learning
1:05:36is if you put a little bit of discrimination over here, it is probably better.
1:06:47So probably the best is you use the structure here and also this, and we
1:06:52know how to train that now. I think both width and depth is important.
1:07:09We tried that; we didn't fix the architecture, we just used the algorithm to cut out
1:07:15all the weights. We didn't lose anything; in fact, from the result I showed you,
1:07:20it still gains a little bit.
1:07:29Cross validation.
1:07:32There's no way; there is no theory on how to do that.
1:07:35But in particular case, some of the networks that I've shown you, I have theory
1:07:39to do that, I can control that.
1:07:44For some networks, you can do the theory. That means you can automatically determine it from
1:07:49data. But for this deep belief network, it is weak in theory.
1:08:31He is also doing deep graphical model.
1:08:48Two years ago, he gave a talk on how to learn the topology of a deep
1:08:54neural network, in term of width and depth.
1:08:57And he was using Indian Buffet Process.
1:09:03At the end, everything has to be done by Monte Carlo simulation, and for a five
1:09:10by five network, he said the simulation takes about several days.
1:09:15I think that approach is not scalable, unless people improve that aspect.
1:09:27That also motivates more of the academic research on machine learning to make that scale.
1:09:31I think the idea is good, but the technique is so slow to do anything
1:09:35about this.
1:09:50For deep neural network, stochastic gradient is still doing the best, it is good enough.
1:09:55But my understanding is, and we are actually playing around with this: if you want to add
1:10:01recurrence or some more complex architecture, stochastic gradient isn't strong enough.
1:10:05There is a very nice paper done by Hinton's group, by one of his PhD students,
1:10:12who actually used Hessian-free optimization to do DBN learning.
1:10:20They actually showed the result in just one single figure, very hard to interpret that
1:10:27one, in the paper in ICML 2010. It's doing better compared with using DBN
1:10:34to initialize the neural network.
1:10:36To me, it is very significant. We are still borrowing this; for more complex networks,
1:10:44a more complex second-order method probably will be necessary.
1:10:50And also the other advantage of Hessian-free, being second order, is that it can be
1:10:54parallelized for bigger batch training rather than minibatch training, and that makes a big difference.
1:11:06We tried that one, it doesn't work well for DBN, we need to have a
1:11:15lot of data. Probably the best for DBN network is stochastic gradient .
1:11:22If you are using the other networks, some of the later networks that we have talked about,
1:11:31they are naturally suited for batch training.
1:11:35In some more modern versions of the network, batch training is desirable. They are designed
1:11:47for those architectures; it is for parallelization.