0:00:15First of all, I want to thank the organisers for giving me
0:00:19this opportunity to share with you
0:00:22some of my personal views
0:00:25on this very hot topic here. So,
0:00:29I think the goal of this tutorial really is to
0:00:33help diversify the deep learning approach. Just like the theme of this conference,
0:00:40Interspeech For Diversifying
0:00:42the Language, okay.
0:00:45So I have a long list of people to thank. Oh, so I want.
0:00:49Yeah, thank you.
0:00:50So I have long ... long list of people here to thank.
0:00:53Especially Geoff Hinton. I worked with him for some period of time.
0:00:58And Dong Yu and whole bunch of Microsoft colleagues.
0:01:02Oh, who,
0:01:05contributed a lot to the material
0:01:08I'm going to go through.
0:01:10And also I would like to thank many of the colleagues sitting here who had
0:01:14a lot of discussions with me.
0:01:16And their opinions also shaped some of the content that I am going to go
0:01:20through with you over the next hour.
0:01:23Yeah, so the main message of this talk
0:01:26is that deep learning is not the same as deep neural network. I think in
0:01:30this community most people
0:01:31mistake deep learning for deep neural network.
0:01:36And most ...
0:01:38So deep learning is something that everybody here would know. I mean just look at
0:01:42... I think I counted close to 90 papers somewhere
0:01:44related to the deep learning or approaching at least. Kind of the number of papers
0:01:50increasing over last twelve years.
0:01:53So deep neural network is essentially neural network
0:01:56you can unfold that in space. You form a big network.
0:02:03Either way or both. You can unfold that over time. If you unfold
0:02:07that neural network over time it becomes a recurrent network, okay.
0:02:11But there's another very big branch of deep learning, which I would call Deep Generative
0:02:17Model. Like the neural network, you can also unfold it in space and in time.
0:02:22If it's unfolded in time, you would call it a dynamic model.
0:02:26Essentially the same concept. You unfold the network.
0:02:31Oh. You know
0:02:32in same direction in terms of time
0:02:36But in terms of space they are unfolded in the
0:02:39opposite direction. So I'm gonna elaborate this part. And for example
0:02:43our very commonly used model.
0:02:46You know a Gaussian Mixture Model, hidden Markov model, really has the
0:02:53neural network unfolding in time.
0:02:57But if you make that unfolding in space you get big Generative Model
0:03:00which hasn't been very popular in our community.
0:03:05You know I'm going to survey whole bunch of work related to this area, ah,
0:03:09you know through the my discussion with many people here.
0:03:14But anyway so the main message of this talk is eventually to
0:03:20hope and I think there's a promising direction that is already taking place in machine
0:03:26learning community
0:03:27I don't know how many of you actually went to International Conference on Machine Learning
0:03:30(ICML) this year, just a couple of months ago in Beijing.
0:03:33But there's huge amount of work in Deep Generative Model and some very interesting
0:03:36development, which I think I'd like to share with you at high level,
0:03:41so you can see that all this deep learning, although it just started in terms
0:03:46of application in our
0:03:47speech community, we should be very proud of that.
0:03:51Hmm, now,
0:03:52In number of machine learning communities there's huge amount of work going on
0:03:57in Deep Generative Model. So I hope I can share some of the recent
0:03:59developments with you
0:04:02to reinforce the message that
0:04:05a good combination of the two,
0:04:08which have
0:04:10complementary strengths and weaknesses, can actually be put together to further
0:04:15advance deep learning in our community here.
0:04:19Okay, so now. These are very big slides. I'm not going to go through all
0:04:23of details. I'm just going to highlight a few
0:04:25things in order to reinforce the message that
0:04:30generative model and
0:04:31neural network model can help each other.
0:04:34I'm just going to highlight a few key attributes of
0:04:38both approaches. They are very different approaches.
0:04:41I'm going to highlight that very briefly. First of all
0:04:45in terms of structure they are both graphical in nature as a network, okay.
0:04:48You think about this deep generative model, typically some of these
0:04:56we call that a Dynamic Bayesian network. You actually have joint probability between ?? label
0:05:01and the observation.
0:05:03And which is not the case for deep neural network,
0:05:06In the literature you see many other terms
0:05:10that relate to deep generative model like probabilistic graphical model,
0:05:14such as stochastic neurons,
0:05:17sometimes it's called the stochastic generative network as you see in literature. They all belong
0:05:21to this
0:05:22category. So if your mindset is over here, even though you see some neural words
0:05:28describing that
0:05:29you know you won't be able to read all this literature, so the mindset is
0:05:32very different when you study these two.
0:05:34So the strength
0:05:35of deep generative model is that,
0:05:39this is very important to me,
0:05:42how to interpret, okay.
0:05:44So everybody that I talked to, including at lunchtime when I talked to students,
0:05:49they complain. I say: have you heard about deep neural network? and everybody says yes,
0:05:52we do.
0:05:54To what extent have you started looking to that? and they said we don't want
0:05:57to do that because we cannot
0:05:58even interpret what's in the hidden layer, right.
0:06:01And that's true
0:06:02and that actually is very much by design. I mean if you
0:06:05read into this ?? science literature in terms of connectionist model
0:06:09really the whole design is that you need to have a representation here to be distributed.
0:06:13So each neuron can represent different concept
0:06:17and each
0:06:18concept can be represented by different neurons, so the very design
0:06:21it's not meant to be interpretable,
0:06:24And that actually creates some difficulty for many
0:06:27and this model is just opposite. It's very easy to interpret because the very nature
0:06:33of generative story.
0:06:34You can tell what the process is
0:06:36and then of course if you want to do
0:06:39a classification or some other application in machine learning
0:06:42you simply just have to ..
0:06:44and for that you simply use Bayes rule to invert that, that's exactly what in
0:06:48our community
0:06:49we have been doing for thirty years with the hidden Markov model. You get the prior, you
0:06:52get generative model and
0:06:53you multiply them and then you do it. Except at that time we didn't know
0:06:57how to make that
0:06:58deep for this type of model. And there are some piece of work that I'm
0:07:01going to survey.
0:07:02So that's one big part of the advantage of this model.
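The Bayes-rule inversion described here (prior times generative likelihood, then pick the best label) can be sketched as follows. This is only a minimal illustration: the two labels, the one-dimensional Gaussian likelihoods, and every name in the code are assumptions made for the example, not anything taken from the talk.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical generative model: a prior P(label) and a likelihood p(obs | label).
PRIORS = {"a": 0.5, "b": 0.5}
LIKELIHOODS = {"a": (0.0, 1.0), "b": (3.0, 1.0)}  # (mean, variance) per label

def posterior(obs):
    """Bayes rule: P(label | obs) is proportional to P(label) * p(obs | label)."""
    joint = {
        label: PRIORS[label] * gaussian_pdf(obs, *LIKELIHOODS[label])
        for label in PRIORS
    }
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}

def classify(obs):
    """Pick the label with the highest posterior probability."""
    post = posterior(obs)
    return max(post, key=post.get)
```

Calling `classify(0.2)` picks the label whose generative story explains the observation best; the model itself only ever runs top-down, from label to observation.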
0:07:05Of course everybody know that what I just mentioned there.
0:07:09In deep generative model actually the information flow is from top to down.
0:07:13You actually have .. what top simply means is that you know you get a
0:07:16label or you get a higher level concept
0:07:18and the lower level down simply means you generate the data to fit into that.
0:07:22Everybody know that in a neural network
0:07:25the information flow is from bottom to up, okay. So you fit the data and
0:07:29you compute whatever output and then
0:07:30you go either way you want.
0:07:31In this case
0:07:33the information come from top to down. You generate the information
0:07:37and then if you want to do classification, you know, any other machine
0:07:42learning applications, you know you can do Bayesian. Bayesian is very essential for this.
0:07:49But there's whole list of those. I don't have time to go through, but just
0:07:52you know those are highlights, these
0:07:54we have to say. So the main strength of deep neural network that actually gained
0:07:59over the previous years, really is mainly due to these strengths.
0:08:04It's easier to do a computation in terms of
0:08:10so this what I wrote is a regular compute, okay.
0:08:13So if you
0:08:14look into exact what kind of compute is involved here
0:08:17it's just millions and millions of times of multiplying
0:08:21a big matrix by a vector.
0:08:23You do that many times. ?? place very small model role
0:08:27it's very regular.
0:08:28And therefore GPU is really
0:08:31ideally suited for this kind of computation
0:08:33and that's not the case for this model.
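The "regular compute" mentioned here can be sketched as a forward pass that is nothing more than one matrix-by-vector product per layer. This is a toy sketch only: the layer sizes, the weights, and the sigmoid nonlinearity are assumptions chosen for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, features):
    """Bottom-up pass: each layer is one matrix-vector product plus a nonlinearity."""
    activation = features
    for weights, biases in layers:
        pre_activation = matvec(weights, activation)
        activation = [sigmoid(p + b) for p, b in zip(pre_activation, biases)]
    return activation

# Hypothetical two-layer network: 2 inputs -> 3 hidden units -> 1 output.
LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 1.0]], [0.0]),
]
```

Every layer repeats the same operation with different numbers, which is the regularity that makes this kind of computation a good fit for GPUs.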
0:08:36So if you compare between these two then you really will understand that if you
0:08:41can pull
0:08:42some of these advantages into this model
0:08:44and pull some of this advantage in this column into this one
0:08:48you have integrated model. And that's kind of the message I'm going to convey and
0:08:53I'm going to
0:08:54give you example to show how this can be done.
0:08:57Okay, so in terms of interpretability it's very much related to
0:09:04how to incorporate domain knowledge
0:09:06and network constraint into the model. And for deep neural network it's very hard.
0:09:12What people have done that, I have seen many people in this conference and also
0:09:16in a ??
0:09:17tried very hard it's not very natural.
0:09:20What is
0:09:22This is very easy
0:09:23I mean you can code your domain knowledge directly into the system. For example
0:09:29like distorted speech, noisy speech, you know
0:09:32it is a summation: in the spectral domain or
0:09:35the waveform domain, the noise plus
0:09:38the clean speech gives you the observation. That's so simple you just code that into
0:09:43one layer, as a summation, or
0:09:44you can code it in terms of Bayesian probability very easily.
0:09:47This is not that easy to do. People tried to do that, it's not just
0:09:51as easy.
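The additive-noise example can be written down as a one-layer generative model. The sketch below assumes independent Gaussian distributions for the clean speech and the noise in a domain where they simply add; both the Gaussian assumption and the function names are illustrative choices, not details given in the talk.

```python
def noisy_distribution(clean, noise):
    """Sum of two independent Gaussians, each given as (mean, variance)."""
    return (clean[0] + noise[0], clean[1] + noise[1])

def clean_posterior_mean(observed, clean, noise):
    """Posterior mean of the clean value given the noisy observation.

    Under the Gaussian assumption this is the prior mean plus a gain
    (clean variance / total variance) times the prediction error.
    """
    mean_y, var_y = noisy_distribution(clean, noise)
    gain = clean[1] / var_y
    return clean[0] + gain * (observed - mean_y)
```

The domain knowledge (observation = clean speech + noise) is stated once, as the generative step, and Bayes rule then gives the estimate of the clean speech.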
0:09:51So to encode
0:09:53domain knowledge and the constraints of the problem into
0:09:58your deep learning system, this has great advantage.
0:10:01So I'm actually, I mean this is just a random selection
0:10:03of things you know. There's very nice paper over here
0:10:06Acoustic Phonetics.
0:10:08All this knowledge at speech production
0:10:11and this kind of nonlinear
0:10:14and this is an example of this is noise robust. You put the phase information
0:10:19of the speech and noise you can come up with
0:10:22very nice conditional distribution. It's kind of complicated
0:10:24but this one can be put directly
0:10:26into generative model and this is some example of this. Whereas in deep neural networks
0:10:31it's very hard to do.
0:10:33So the question is that do we want to throw away all this knowledge in
0:10:36the deep learning
0:10:37and my answer is of course no. Most of people will say no, okay.
0:10:45And from people outside of the speech community the answer would be yes. I talked to
0:10:48some people in machine learning,
0:10:49anyway so since this is speech conference I really want to emphasise that.
0:10:54So the real
0:10:55solid reliable knowledge that we attained
0:10:58from speech science
0:10:59that has been reflected by local talks are
0:11:03such as yesterday's talk, talking about how some patterns have been shaped by you by
0:11:09?? and perceptionists. They were really playing a role in deep generative model.
0:11:14But very hard to do that in deep neural network.
0:11:17So with this main message in mind
0:11:20I'm going to go through three parts of the talk as I put them in
0:11:24my abstract here.
0:11:25So I need to go very briefly
0:11:27through all these three topics.
0:11:30Okay, so the first part is to give very brief history of how deep speech
0:11:36recognition started.
0:11:38So this is a very simple list. There are so many papers around. Before the
0:11:43rise of the deep learning around
0:11:452009 and 2010. There are lots of papers around. So I hope I actually have
0:11:50a reasonable
0:11:52sample of the work around here.
0:11:54So I don't have time to go through, especially for those of you who are
0:11:58in ?? open house
0:12:00There was in 1988, I think in 1988
0:12:03ASRU and at that time there's no U, it's just
0:12:05ASR. And there is some very nice paper around here and then quickly
0:12:10you know it was
0:12:11superseded by the hidden Markov model approach.
0:12:15So I'm not going to go through all these
0:12:17so except to point out that
0:12:20neural network
0:12:22has been very popular for a while.
0:12:24But towards this you know,
0:12:26plus ten years
0:12:28before the deep learning actually took over neural network approach
0:12:33essentially didn't really make
0:12:36such a strong impact compared with deep learning network that people have been seeing.
0:12:41So I just give you one example to just show you how unpopular
0:12:45the neural network was at that time.
0:12:48So this is about 2008 or 2006, about nine years ago.
0:12:53So this is the organization that I think
0:12:56is predecessor
0:12:57of ?? IOPPA.
0:12:58So they actually got several of us together, locked us up into hotel
0:13:03near Washington, DC.
0:13:05airport somewhere.
0:13:07Essentially the goal is to say that well the speech
0:13:09recognition is stuck there, so you come over here and help us brainstorm next generation
0:13:15of speech recognition and understanding technology
0:13:18and then we actually spent about four or five days in the hotel and at
0:13:22the end we wrote very thick report,
0:13:25twenty some pages of report.
0:13:26So there is some interesting discussion about history and the idea is that
0:13:31if government gives you unlimited resources and gives you fifteen years what is it you
0:13:35can do, right?
0:13:36So most of the people in our discussion,
0:13:39we all focused on neural network, essentially
0:13:41margin is here,
0:13:42Markov random field is here, conditional random field is here and graphical model here.
0:13:48So it
0:13:50that was just couple of years before that deep learning actually came out at that
0:13:54so neural network was actually one of the
0:13:57tools around.
0:13:58It hadn't really made a big impact.
0:14:01So on the other hand the graphical model was actually mentioned here because it's related
0:14:06to deep generative model.
0:14:08So I'm going to show you a little bit, well this is slide about deep
0:14:12generative model, actually I made some list over here.
0:14:15One of the
0:14:18but anyway so. This let's go over here.
0:14:21I just want to highlight couple of
0:14:25related to
0:14:27introduction of deep neural network in the field.
0:14:30Okay so one of, this is ?? John Riddle?
0:14:32actually we spent a summer in ?? in 1989,
0:14:37or 1988.
0:14:39Fifteen and some years ago. So we spent really interesting summer altogether.
0:14:46and that's kind of the model, deep generative model, the two versions we actually put
0:14:52and at the end we actually wrote a very thick report that was about eighty
0:14:56pages of report.
0:14:58So this is deep generative model and it turned out that this model
0:15:02actually both of those models are implemented with neural networks.
0:15:06And think about the neural network as simply just a function, a mapping
0:15:09so if you map the hidden representation
0:15:12from you know
0:15:13as part of deep generative model into whatever observation you have
0:15:18MFCC. Everybody used MFCC at the time.
0:15:22You actually need to have done the mapping and that was done in neural network
0:15:27in both versions
0:15:28and this is statistical version which we
0:15:31call the hidden dynamic model. It's one of the versions
0:15:34of deep generative model.
0:15:36It didn't succeed. I'll show you the reason why. Now we understand why.
0:15:40Okay, so it's interesting enough in this
0:15:43model we actually used, if you read the report, it actually turned out that model
0:15:47was here since Geoff told me that
0:15:49the video for this workshop is still around there so it's called ?? sign. I
0:15:53think I mentioned to ?? pick it out.
0:15:56It turned out that learning of this workshop, which details are in this report
0:16:00actually used back propagation to do it. Now the direction isn't from up to
0:16:03down, since your model is
0:16:05top down, the propagation must be bottom up.
0:16:08So nowadays
0:16:10when we do speech recognition the error
0:16:14function is a softmax or sometime you can use the mean square error.
0:16:18And the measure is in terms of your label.
0:16:23This is the opposite. The error is measured in terms
0:16:26of matching between how generative model can match with the observation. And then when you
0:16:31want to
0:16:31learn you go bottom up learning. Which actually turned out to be back propagation. So
0:16:35back propagation doesn't have to be done (up to bottom)
0:16:37it can be bottom up. Depending on what kind of models you have.
0:16:40But key is that this is
0:16:41a gradient descent method.
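The idea that the error is measured on the observation side, and that learning is still gradient descent, can be sketched with a toy top-down model. Everything here is an assumption made for illustration: the scalar mapping `w * tanh(hidden) + b`, the hidden values being known in advance, and the learning rate. The model in the talk is far richer, with hidden dynamics that must themselves be inferred.

```python
import math

def generate(hidden, w, b):
    """Top-down step: map a hidden value to a predicted observation."""
    return w * math.tanh(hidden) + b

def fit(hiddens, observations, steps=2000, lr=0.1):
    """Gradient descent on the mean squared mismatch between generated and observed values."""
    w, b = 0.0, 0.0
    n = len(hiddens)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for h, o in zip(hiddens, observations):
            err = generate(h, w, b) - o          # mismatch at the observation level
            grad_w += 2.0 * err * math.tanh(h) / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

The error signal starts at the bottom, where the generated observation is compared with the real one, and flows back up into the parameters of the top-down mapping.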
0:16:44So actually we got disappointing result for switchboard. You know because we tended to be
0:16:48a bit off game.
0:16:49And now we understand why. Not at that time. I'm sure some of you experienced
0:16:52it. I have a lot
0:16:53of thinking about how deep learning and this can be integrated together.
0:16:59So at the same time
0:17:02Okay so this is a fairly simple model, okay. So you have this hidden representation
0:17:07and it has
0:17:08specific constraints built into the model,
0:17:11by the way which is very hard to do when you do bottom-up neural network.
0:17:15And for generative model
0:17:16you can put them very easily down there, so for example
0:17:18articulatory trajectory has to be smooth
0:17:22and then specific form of the smoothness can be built in directly
0:17:26by simply writing the generative probabilities. Not in the deep neural network.
0:17:31So at the same time
0:17:33we actually, also this was done in ??
0:17:38and we were able to even put this nonlinear phonology in terms of
0:17:43writing the phonemes into the individual constituents at the top level and ?? also has
0:17:49very nice paper, some fifteen years ago, talking about this.
0:17:53And also the robustness can be directly integrated into
0:17:57articulatory model simply by generative model. Now for deep neural network it's very hard to
0:18:01For example you can actually
0:18:05this is not meant to be seen. Essentially this is one of the conditional likelihood
0:18:10that covers
0:18:11one of the links. So everytime you have got the link
0:18:15you have conditional dependency parent to children that have different neighbours.
0:18:22And then you can specify them in terms of
0:18:24conditional distribution. Once you do that you formed a model
0:18:27you can embed
0:18:28whatever knowledge you have, you think is good, into the system. But anyway
0:18:33but the problem is that the learning is very hard
0:18:35and that problem of the learning in the machine learning community only was solved just within last
0:18:42At that time we just didn't really know. We were so naive.
0:18:47We didn't really understand all the limitations of learning. So just to show you we
0:18:50talk, okay. One of the
0:18:51things we did was that, I actually worked on this with my colleagues Hagai Attias.
0:18:55He is actually one of the
0:18:56he is my colleague working not far away from me at that time, some ten
0:19:01years ago.
0:19:02So he was the one who invented this very initial base. Which is very well
0:19:07So the idea was as follows. You have to break up these pieces into the
0:19:11modules, right.
0:19:12For each module you have this, this is actually
0:19:17dependence of the continuous hidden representation
0:19:20and it turned out that the way to learn this,
0:19:23you know in principle, what you do is EM (Expectation maximization). It's variational
0:19:26So the idea is very crazy.
0:19:28So you said you cannot solve that rigorously and that's well known. It's loopy neural
0:19:34network. Then you just cut all important things you
0:19:37carry out. Hoping that M-Step can make it up. That's very crazy idea.
0:19:41And that's the best around at that time.
0:19:43But it turned out that you've got the auxiliary function and you form is still
0:19:48something very
0:19:49similar to our EM, you know in HMM. For the general model you don't have
0:19:55to look you can get rigorous solution.
0:19:57But now when you have deep it's very hard. You have to make up for
0:20:01it. And that ?? is just as ??bad-ass
0:20:03many people could ?? on deep neural network. This ?? deep generative model
0:20:08probably have more
0:20:09than otherwise. Although they patched themselves
0:20:12to be
0:20:13you know very rigorous. But if you really walk on that, so I can pick
0:20:17out of this, so it's
0:20:18for this approach we get surprisingly good inference results for continuous variables.
0:20:22And in one version what we did was actually we used formants
0:20:27you know as a hidden representation and it turned out it tracked. And once you
0:20:31do this you
0:20:32track the formants really precisely.
0:20:34As a byproduct of this work we created
0:20:38a database for formant tracking
0:20:42but if we actually do
0:20:45inference on the linguistic units, which is the problem
0:20:48of recognition, we didn't really make much progress on this.
0:20:51But anyway so I'm going to show you some of these preliminary results to show
0:20:56you how this
0:20:57is one way that led to the deep neural network.
0:21:00So when we actually simplify the model in order to finish the decoding we actually,
0:21:07this is actually ?? result
0:21:09and we would bring out all of analysis for different kinds of phones.
0:21:12So when we use this kind of generative model with deep structure it actually corrected
0:21:17many errors
0:21:18which are related to the short phones.
0:21:20And you understand why because you designed model to make that happen and then you
0:21:24know if
0:21:25everything is done reasonably well you actually get results. So we actually look
0:21:28at not only corrected short phones for the vowel
0:21:32but also it corrected a lot of
0:21:34consonants because they're up with each other.
0:21:36It's just because the model design whatever hidden trajectory that you get
0:21:40it's influenced, the parts of the vowel is influenced
0:21:45by the adjacent sound.
0:21:47And that's
0:21:47this is due to the coarticulation.
0:21:49This work will be very naturally built into the system
0:21:51and one of things I am very much struggling with deep neural network is that
0:21:55you can't even build this kind of
0:21:56information that easily, okay.
0:21:59This is to convince you how things can be bridged.
0:22:03It's very easy to interpret the results. So we look at the error we
0:22:07know wow these are quite a big data assumption.
0:22:11Without the have to go through for example in this these examples of these are
0:22:14the same sounds, okay.
0:22:15You just speak fast then you get something like this
0:22:17and then we actually looked at the error and we said Ohh.
0:22:20You know
0:22:22that's exactly what happened. You know mistake was made in the
0:22:27Gaussian Mixture Model because it doesn't take into account these particular dynamics. Now this one
0:22:31was able to correct that error
0:22:32And I'm going to show you in deep neural network things are reversed, so that's
0:22:37related to ??. But in the same time
0:22:39in machine learning community also the speech
0:22:42there is a very interesting model for the deep generative model developed
0:22:46and that's called the Deep Belief Network.
0:22:48so in the earlier literature before about three or four years ago
0:22:52DBN, Deep Belief Network, and DNN were mixed with each other, even by the authors
0:22:56it's just because most people don't understand what it is
0:22:59so this is very interesting paper that is starting in 2006
0:23:02many people, most people in machine learning, regard this paper to be the start of
0:23:07deep learning.
0:23:08And that's a generative model, so you could say deep
0:23:12generative model actually started the deep learning rather than deep neural network.
0:23:17But this model has some intriguing properties
0:23:21that really at the time attracted my attention here.
0:23:25It's totally not obvious, okay.
0:23:28So for those of you who know RBM and DBM you know when you are
0:23:32stacking up this undirected model
0:23:34several times you get DBN, that's
0:23:37you might think that the whole thing will be undirected,
0:23:40you know Boltzmann machine, no. It's actually directed model coming down.
0:23:44You have to read this paper to understand why.
0:23:47So why do they? I said someone was wrong. I couldn't understand what happened.
0:23:50But on the other hand it's much simpler than the model I showed you earlier
0:23:54for deep network we get the temporal dynamics.
0:23:56This one it's not temporal dynamics over here.
0:24:01the most intriguing aspect of DBN
0:24:03as described in this paper is that inference is easy.
0:24:06Normally you think inference is hard. That's the tradition.
0:24:10It's given fact if you have these multiple dependencies on the top it's very hard
0:24:15to make voice
0:24:16and there's special constraint built into this model. Namely the restriction in the connections of
0:24:22because of that it makes inference easy. It's just a special case.
0:24:25This is very intriguing, so I thought this idea may help
0:24:29the deep generative model I showed you earlier.
0:24:32So he came to reason me, you know. We discussed it.
0:24:36It took him a while to explain what this paper is.
0:24:40Most of people at Microsoft at that time couldn't understand what's going on.
0:24:45So now let's see how
0:24:46and then of course what we get together this deep generative model
0:24:50and the other deep generative model I talked about with you I actually worked on
0:24:54for almost ten
0:24:54years at Microsoft. We were working very hard on this.
0:24:57And then we came up with the conclusion that well we have to use a few
0:25:00kluges to fix the problem.
0:25:01And they don't match, okay. The reason why they don't match is whole new story
0:25:05why they don't match.
0:25:06The main reason is actually not just temporal difference, it's the way you parameterize
0:25:12the model and also the way to represent
0:25:15the information is very different
0:25:17despite the fact that they're both generative models.
0:25:19It turned out that this model is very good for speech synthesis and ?? has
0:25:22very nice paper
0:25:23using this model to do synthesis. And it's very nice to do
0:25:26image generation. I can see that very nice probably.
0:25:30But for continuous speech it is very hard to do
0:25:33and for speech for general synthesis it's good it's because if you have segment with
0:25:39context into account, like syllable in Chinese it is good, but for English it is
0:25:42not that easy to do.
0:25:44But anyway so we need to have few kluges to fix together, to merge these
0:25:48two models together.
0:25:49And that sort of led to the end.
0:25:51So the first kluge is that
0:25:54you know
0:25:55the temporal dependency is very hard. If you have temporal dependency you automatically loop and
0:26:00everybody in machine learning at that time knew, most of speech persons, so I thought
0:26:05machine learning that I show you early on actually just didn't work well, it didn't
0:26:09work out well. And most of people who were
0:26:12very much versed in machine learning who say there's no way to learn that.
0:26:15Then cut the dependency. It's way to do it, cut the dependency in the hidden
0:26:20dimension, in the hidden revision
0:26:21and lose all the powers of
0:26:23deep generative model
0:26:25and that's the Geoff Hinton's idea, well it doesn't matter, just use a big window.
0:26:30That fixes the first kluge and that actually
0:26:34is one of things that actually helped
0:26:36to solve the problem
0:26:38and the second Kluge is that you can reverse direction
0:26:41the inference in generative model is very hard to do as I showed earlier.
0:26:45Now if you reverse direction
0:26:48from top-down to bottom-up
0:26:52and then you don't have to solve that problem. And that's why it would be
0:26:56just a deep neural network, okay. Of course
0:26:58everybody said: we don't know how to train them, that was in 2009.
0:27:02Most people don't know how to ??
0:27:03and then he said that's how DBN can help.
0:27:07And then he did a fair amount of work on DBN to initialize that ??
0:27:12So this is very well-timed academic-industrial collaboration. First of all
0:27:16it's because speech recognition industry has been searching for new solutions when principled
0:27:22deep generative model could not deliver, okay. Everybody
0:27:24was very upset about this at the time.
0:27:27And at the same time academia developed deep learning tool
0:27:30DBN, DNN, all the hybrid stuff that's going on.
0:27:33And also CUDA library was released around that time. It's very recent times.
0:27:40So this is probably one of the earliest catching on
0:27:44for this GPU computing power over here.
0:27:47And then of course big training data in ASR that has been around
0:27:52and most people, if you actually do
0:27:55Gaussian Mixture Model for HMM, with a lot of data performance saturates, right.
0:28:00And then this is one of things that in the end really is powerful. You
0:28:04can increase the size and depth
0:28:07you know put in a lot of things
0:28:08to make it really powerful.
0:28:11And that's the scalability advantage that I showed you early on. That's not the case
0:28:15for any shallow model.
0:28:18Okay, so in 2009 I and three of my colleagues didn't know what's
0:28:23happening. So we actually got together to
0:28:26to do this
0:28:27to this workshop
0:28:28to show that
0:28:29this is useful thing, you know, to bring stuff.
0:28:32So it wasn't popular at all. I remember
0:28:35you know Geoff Hinton and I we actually got together to decide
0:28:40who we should invite to give us
0:28:42a speech in this workshop.
0:28:44So I remember that one invitee which shall be nameless here
0:28:47he said: Give me one week to think about it, and at the end he said:
0:28:50it's not worth my time to fly to Vancouver. That's one of them.
0:28:53The second invitee, I remember this clearly, said: This is crazy idea. So in the
0:28:57e-mail he said
0:28:58What you do is not clear enough for us.
0:29:01So we said you know
0:29:02waveform may be useful for ASR.
0:29:04And then the emails said: Oh why?
0:29:07So we said that's just like using pixel for image recognition. That was popular.
0:29:12For example convolutional network there are pixels.
0:29:15We take similar approach. Except it is waveform.
0:29:17And the answer was: No, no, no that's not same as pixel. It is more
0:29:22like using photons.
0:29:23You know making kind of joke essentially. This one didn't show up either. But anyway
0:29:30anyway so this workshop actually has
0:29:34a lot of brainstorming I had to analyze, all the errors I showed you early
0:29:39But it's really good
0:29:41workshop for about four or five years that was
0:29:44five years ago now.
0:29:45So now I move to part 2
0:29:48to discuss achievements. So actually in my original post I had whole bunch of slides
0:29:53on vision.
0:29:54So the message for the vision is that if you go to vision community
0:29:59they look at deep learning to be
0:30:01just even
0:30:02maybe thirty time
0:30:04thirty times more popular than deep learning in speech.
0:30:07So they actually, the first time they did that was actually first time they
0:30:12actually got the results.
0:30:16and no one believed it's the case. At the time I was giving a lecture
0:30:20at Microsoft about Deep Learning
0:30:22and then right before I, actually Bishop
0:30:25was doing the lecture together with me
0:30:30and then this deep learning just came out and Geoff Hinton sent e-mail to me:
0:30:34Look at the margin! How much bigger it is.
0:30:36And I showed them. People were like: I don't believe it. Maybe a special case.
0:30:40You know. And it turned out it's just much
0:30:42just as good.
0:30:43Even better than speech. I actually cut all the slides out. Maybe some time I
0:30:46will show you.
0:30:47So this is big area to go. So today I am going to focus on
0:30:51So one of things that we found during that time
0:30:55is that we have very interesting discovery that we actually used the model that I
0:30:59showed you there
0:31:00and also deep neural network here.
0:31:03And that actually is the number that we analyzed
0:31:06error pattern very carefully. So it's very good, you know for TIMIT.
0:31:10You can disable language model, right.
0:31:12Then you can understand the errors for acoustic ?? very effectively
0:31:15and I tried to do that afterwards, you know, to do other tasks
0:31:20and it's very hard once you put language model in there you just couldn't
0:31:23do any analysis. So it's very good at the time we did this analysis.
0:31:26So now the error pattern in the comparison
0:31:30is, I don't have time to go through except just to mention that.
0:31:33So DNN made many new errors on short undershoot vowels.
0:31:37So it sort of undoes what this model is able to do
0:31:40and then we thought of why would that happen and of course at the end
0:31:43we had a very big window, so if the sounds
0:31:45are very short, the information is captured over here, but your input is about eleven frames,
0:31:48you know, or you've got fifteen frames, so it
0:31:50captures a kind of noise coming from different phones, and of course errors are made over here.
0:31:54So we can understand why.
0:31:56And then we asked why this model corrects errors? It's just because
0:31:59you make
0:32:00you deliberately make a hidden representation
0:32:04to reflect
0:32:05what sound pattern looks like.
0:32:07In the hidden space. And it's nice for whom you can see
0:32:10but if you have the articulations, how do they see? So sometimes we use formants
0:32:14to illustrate what's going on there.
0:32:18Another important discovery at Microsoft is that we actually found that using spectrogram
0:32:23we produce much better
0:32:26autoencoding results in terms of speech analysis.
0:32:30Encoding results
0:32:32?? and that was very surprising at the time.
0:32:36And that really conforms to the basic deep learning theme that
0:32:39you know, the earliest features are better
0:32:42than the processed features here. So I show you, this is actually a project
0:32:48we did together in 2009.
0:32:49So we used spectrogram
0:32:51to do binary coding of
0:32:53the spectrogram.
0:32:55So I don't have time to go through that. You read the auto-encoding book if
0:33:01you can.
0:33:02In literature you can all see this.
0:33:03So the key is that
0:33:04you use the target to be the same as input and then you use small
0:33:07number of bits in the middle.
0:33:09And you want to see whether that would actually
0:33:11?? all the ?? down here. And the way to evaluate it is to look
0:33:15at, you know, what kind of errors you have.
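The auto-encoder setup described here (target equal to the input, a small number of bits in the middle, evaluation by reconstruction error) can be sketched roughly as follows. This is only an illustration of the structure, not the system from the talk: the sizes `n_in` and `n_bits`, the random untrained weights, and the helper names `encode_bits`, `decode`, and `coding_error` are all assumptions made for the example.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode_bits(x, W_enc):
    """Map an input frame to a binary code (thresholding at 0 is the same
    as thresholding a logistic unit at 0.5)."""
    return [1 if a >= 0.0 else 0 for a in matvec(W_enc, x)]

def decode(bits, W_dec):
    """Linear reconstruction of the frame from the binary code."""
    return matvec(W_dec, bits)

def coding_error(x, x_hat):
    """Mean-square reconstruction error; the training target is the input itself."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

rng = random.Random(0)
n_in, n_bits = 16, 4  # hypothetical sizes, far smaller than a real spectrogram coder
W_enc = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_bits)]  # untrained
W_dec = [[rng.gauss(0, 1) for _ in range(n_bits)] for _ in range(n_in)]  # untrained
frame = [rng.random() for _ in range(n_in)]  # stand-in for one spectrogram frame

bits = encode_bits(frame, W_enc)
recon = decode(bits, W_dec)
err = coding_error(frame, recon)
print(bits, err)
```

Training would adjust `W_enc` and `W_dec` (and, in a deep auto-encoder, several layers on each side) to reduce `err`; a baseline coder using the same number of bits can then be compared on the same error measure.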
0:33:17So the way we did is we used the vector quantizer as a baseline
0:33:21of 312 bits.
0:33:23And then reconstruction
0:33:24looks like this. So this is the original one, this is the shallow model, right.
0:33:29Now using deep auto-encoder we get much closer to this in terms of errors
0:33:34we simply have just much lower coding error
0:33:38using identical number of bits.
0:33:39So it really shows that if you build deep structure you extract this bottom-up feature.
0:33:45Both ?? you condense more
0:33:47information in terms of reconstructing the original signal.
0:33:50And then we actually found that
0:33:53for spectrogram this result is the best.
0:33:55Now for MFCC we still get some gain, but gain is not nearly as much,
0:34:00sort of indirectly
0:34:01convinces me. There's Geoff Hinton's
0:34:03original activities ?? everybody's
0:34:06to spectrogram.
0:34:07So maybe we should have done the waveform, probably not anyway.
0:34:10Okay so of course the next step is once we are all convinced that
0:34:14error analysis shows that
0:34:17deep learning can correct a lot of errors, not for all but for some
0:34:21which we understand why. You just pick up the power and also capacity they had.
0:34:27So on average it does a little bit better
0:34:29based upon
0:34:30this analysis.
0:34:33Based upon this analysis it does slightly better.
0:34:36But if you look away
0:34:38but if you look at the error pattern you really can see
0:34:41that this has a lot of power, but it also has some shortcomings as well.
0:34:45So that both have pros and cons but one's errors are very different and it
0:34:49actually gives you the hint that
0:34:51you know, it is worthwhile to pursue.
0:34:53Of course this was all very interesting
0:34:56evidence to show.
0:34:57And then to scale up to industrial scale we had to do
0:35:00a lot of things, so many of my colleagues actually were working with me
0:35:04on this. So first of all
0:35:06we need to extend the output
0:35:08from small number of phones
0:35:11at the states
0:35:12into very large
0:35:13and that actually at that time is motivated by
0:35:16how to save huge Microsoft investment in speech decoder software.
0:35:20I mean if you don't do this
0:35:22then you know if you do some other kind of output coding
0:35:27and they would also had to ?? atypical feature to do it. The one that
0:35:31would fully believed
0:35:31that it's going to work.
0:35:32But it turned out if you need to change decoder, you know, we just have
0:35:36to say wait a little bit.
0:35:41and at the same time we found that using a context-dependent model gives much higher accuracy
0:35:46than a context-independent model for large tasks, okay.
0:35:49Now for small tasks we didn't find it so much better. I think
0:35:53it's all related to
0:35:54a capacity saturation problem if you have too much
0:35:57but since a lot of data
0:36:01in the training for large tasks
0:36:03you actually keen
0:36:04to form a very large output and that turns out
0:36:07to have you know
0:36:09double benefit.
0:36:10One is that you increased accuracy and number two is that you don't have to
0:36:13change anything about decoder.
0:36:14And industry loves that.
0:36:17You have both
0:36:18that's actually ??. I can't recall why actually took off.
0:36:22And then we summarize what enabled this type of model
0:36:24and industrial knowledge about how to construct a very large units in DA
0:36:29is very important
0:36:30and that essentially comes from
0:36:32everybody's work here
0:36:34that actually used this kind of context-dependent model for Gaussian Mixture Model, you know,
0:36:39that has been around for
0:36:40almost twenty some years.
0:36:42And also
0:36:43it depends upon industrial knowledge on how to make encoding of such huge and highly
0:36:48efficient using
0:36:50our conventional
0:36:51HMM decoding technology.
0:36:53And of course how to make things practical.
0:36:57And this is also very important enabling factor. If GPU didn't come up
0:37:03roughly at that time, didn't become popular at that time,
0:37:06all these experiments would take months to do.
0:37:08Without all this belief, without all this fancy infrastructure.
0:37:14And then
0:37:15people may not have the patience to wait to see the results, you know push that
0:37:19So let me show you some very
0:37:22brief summary of the major
0:37:26result obtained in early days.
0:37:29So if we use three hours of training, this is TIMIT for example, we have
0:37:34this is number I show you, it's not much about ?? percent of gain.
0:37:38Now if you increase the data up to
0:37:41ten times more thirty some hours you get twenty percent error rate.
0:37:46Now if you do more.
0:37:48For SwitchBoard, this is the paper that my colleague published here,
0:37:52you get more data, another ten times so you get two orders of magnitude to
0:37:58and the relative gain actually
0:38:00sort of
0:38:01increase, you know, ten percent, twenty percent, thirty percent. This is actually
0:38:06so of course if you increase
0:38:08the size of training data
0:38:10the baseline will increase as well, but relative gain is even bigger.
0:38:14And if people look at this result there's nobody
0:38:17in their right mind who would say not to use that.
0:38:20And that's how
0:38:21and then of course a lot of companies
0:38:24you know
0:38:26actually still
0:38:28implement, DNN is fairly easy to implement for everybody because
0:38:33I missed one of the points over there. It actually turned out if you use
0:38:37large amount of data
0:38:38it turned out that the original
0:38:41idea of using DBN to regularize that model doesn't need to
0:38:44be there anymore. And in the beginning ?? how it happened.
0:38:49But anyway, so now let me come back to the main thing of the talk.
0:38:53How generative model
0:38:54and deep neural network may be helping each other.
0:38:57So the kluge one was that to use this to be
0:39:02at that time
0:39:03we have to keep this now for this conference we see
0:39:07?? using LSTM with neural network and that fixed this problem.
0:39:12So this problem is fixed.
0:39:14This problem is fixed automatically.
0:39:17At that time
0:39:19we thought we need to use DBN. Now with use of big data there's no
0:39:23need anymore.
0:39:24And that's very well understood now. Actually there are many ways to understand that. You
0:39:28can think about it from a
0:39:29regularization viewpoint
0:39:31and yesterday at the table with students I mentioned that and people said: What is
0:39:37And you have to understand more in terms of the optimization view point
0:39:41so actually if you stare at back-propagation formula for ten minutes you figure out why.
0:39:47Which I actually have slide there, it's very easy to understand why from many perspectives.
0:39:52With lots of data you really don't need that.
0:39:54And that's automatically fixed.
0:39:57You know kind of by industrialization we tried lots of data
0:40:00it's fixed and now this is not fixed yet. So this is actually the main
0:40:04that I'm going to use for the next twenty minutes.
0:40:07So before I do that I will actually try to summarize some of
0:40:11the major ... actually I and my colleagues wrote this book
0:40:14and in this chapter we actually grouped
0:40:16the major advancement of deep neural network into several categories
0:40:22so I'm going to go through that quickly.
0:40:24So one is the optimization,
0:40:27So I think the most important advancement
0:40:31over the previous, you know, the early success I showed you early on,
0:40:36was the development of sequence-discriminative training, and
0:40:39this contributed an additional ten percent of error rate reduction.
0:40:42Also many groups of people have done this.
0:40:45Like for us at Microsoft, you know this is our first intern coming to our
0:40:49place to do this.
0:40:50And we tried on TIMIT we didn't know all the subtleties of the importance of
0:40:56regularization and
0:40:56we got all the formula right, all of everything right
0:40:59and the result wasn't very good.
0:41:01But I think
0:41:02Interspeech accepting our paper and this we understand that this
0:41:06and then later on
0:41:09we got more and more papers; actually a lot of papers were published in Interspeech.
0:41:13That's very good.
0:41:15Okay now, the next theme is about 'Towards Raw Input', okay.
0:41:21So what I showed you early on was the speech coding and analysis part
0:41:26that we know that is good. We don't need MFCC anymore.
0:41:29So it was bye MFCC, so
0:41:31probably it will disappear
0:41:33in our community. Slowly over the next few years.
0:41:36And also we want to say bye to Fourier transforms, so I put the question
0:41:42mark here partly because
0:41:43actually, so for this Interspeech I think two days ago Herman ?? had a very
0:41:48nice paper on
0:41:49this and I encourage everybody to take a look at.
0:41:52You just put the raw information in there
0:41:55which was done actually about three years ago by Geoff Hinton's students, they truly believed
0:42:00it. I couldn't
0:42:01I tried that about 2004, that was the hidden Markov model
0:42:05And we understood all kind of problem, how to normalize users input and I say
0:42:09it's crazy
0:42:10and then when they published the result at
0:42:14ICASSP, I looked at these results and the error was terrible. I mean there's so much
0:42:17of error.
0:42:17So nobody paid attention. And this year we brought the attention to this.
0:42:21And the result is almost as good as using, you know,
0:42:25using Fourier transforms.
0:42:27So far we don't want to throw away yet,
0:42:29but maybe next year people may throw that away.
0:42:33Nice thing is .. I was very curious about this. I say
0:42:37at the terms of that to get that result they just randomize everything rather than
0:42:41using Fourier transforms
0:42:42to initialize it and that's very intriguing.
0:42:46Too many references to list I was running all the time. I had ?? list.
0:42:50But yesterday when I went through this adaptation session there's so many good papers around.
0:42:55I just don't have patience for them anymore.
0:42:57So go back to ?? adaptation papers. There are a lot of new
0:43:02advancements. So another important thing is transfer learning
0:43:05that plays a very important role in multi-lingual acoustic modelling.
0:43:10So that was tutorial that I was .. actually Tanja was giving in a workshop
0:43:17I was attending.
0:43:18I also mention that
0:43:20for generative model
0:43:22for shallow model before
0:43:24this one almost never
0:43:28of course
0:43:30actually improved things.
0:43:32But it never actually beat the baseline
0:43:36in terms of ..
0:43:39so think about cross-lingual for example, multi-lingual and cross-lingual
0:43:42and deep learning actually beat the baseline. So there's whole bunch
0:43:44papers in this area which I won't have time to go through all here.
0:43:47Another important innovation is nonlinear regularization, so for
0:43:50regularization, dropout; if you don't know dropout, it's good to know.
0:43:54And this is a special technique. Essentially you just kill units, you know,
0:43:57randomly, and you get a better result.
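The dropout idea mentioned here (randomly zeroing hidden units during training) can be sketched as below. This is a minimal illustration using the "inverted dropout" variant, where surviving units are rescaled at training time so nothing changes at test time; that variant, the function name `dropout`, and the example values are assumptions for the sketch, not details given in the talk.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob during
    training and rescale the survivors by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # no units are dropped at test time
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.0, 2.0, 4.0]        # hypothetical hidden-layer activations
dropped = dropout(hidden, 0.5, rng)  # each unit is either zeroed or doubled
print(dropped)
```

Because a different random subset of units is silenced on every training example, no single unit can rely on specific other units being present, which is the regularizing effect.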
0:44:03And in terms of output units
0:44:06a very popular unit is the rectified linear unit (ReLU),
0:44:09and now there are some very interesting,
0:44:11many interesting theoretical analyses of why this is better than this.
0:44:16At least while in my experience .. actually I programmed this, it's change of our
0:44:21to go from this to this.
0:44:23Deep learning
0:44:24really increases.
0:44:26And we understand now why it happens.
0:44:29Also (in terms of) accuracy different groups report different results.
0:44:32Some groups report they reduced the error rate, some groups .. nobody reported an increase in error
0:44:37rates for now.
0:44:38So in any case (it) speeds up
0:44:40the convergence dramatically.
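One commonly cited intuition for the faster convergence (not necessarily the analysis the speaker has in mind) is that the logistic unit's gradient shrinks toward zero for large inputs, while the rectified linear unit passes a gradient of exactly one for any positive input. A small sketch comparing the two, with function names chosen for this example:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: s * (1 - s), at most 0.25."""
    s = logistic(z)
    return s * (1.0 - s)

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

for z in (-10.0, -1.0, 1.0, 10.0):
    print(z, logistic_grad(z), relu_grad(z))
```

Switching between the two in a network implementation usually amounts to swapping this pair of functions.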
0:44:43So I'm going to show you another architecture over here which is going to link
0:44:49a generative model.
0:44:51So this is a model called Deep Stacking Network.
0:44:55But its very design is deep neural network, okay. It's information from bottom up.
0:45:00So the difference between this model and conventional deep neural network is that
0:45:04for every single layer you can actually
0:45:07integrate the input for each layer and then do some special processing here.
0:45:15Especially you can alternate
0:45:17layers into linear and nonlinear; if you do that you can dramatically increase your
0:45:23speed of convergence
0:45:26in deep learning.
0:45:27And there's some other theoretical analysis which is actually put in one of the books
0:45:31I wrote.
0:45:32So you actually can convert many complex
0:45:37non-convex problem into
0:45:41kind of ??property measure problem related to
0:45:44convex optimization so we can understand our probability ??.
0:45:46So we did that a few years ago and we wrote a paper on this.
0:45:49And this idea can also be used for this
0:45:53potential network, which I don't have the time to go through here. And the reason
0:45:56why I bring that up is
0:45:57because it's actually related to some recent work
0:46:00that I have seen
0:46:01for generative model which were taking convertion of each other, so let me compare between
0:46:07two of
0:46:08them to give you some example to show how two
0:46:11networks can help each other.
0:46:13So when we developed this deep stacking network the activation function had to be fixed.
0:46:20Either logistic or ReLU, which both work
0:46:22reasonably well,
0:46:23you know, compared
0:46:25with each other.
0:46:28Now look at this architecture.
0:46:31Almost identical architecture.
0:46:33So now
0:46:35if you change the
0:46:38activation function to be something very strange, I don't expect you to know anything about
0:46:43and this is actually work done by Mitsubishi people.
0:46:46There's a very nice paper over here in the technical ??
0:46:50I spent a lot of time talking to them and they even came to
0:46:52Microsoft, so actually I listened to some of them and their demo.
0:46:56So the activation function for this model, which is called the Deep Unfolding Model,
0:47:00is derived from the inference method in a generative model.
0:47:06Which is not fixed as in the ?? I showed you earlier. So to stop
0:47:11this model .. it looks like deep neural network, right?
0:47:14But the beginning
0:47:16the initial phase of their generative model which is specific about,
0:47:20I hope many of you know the non-negative matrix factorization. This is specific technique
0:47:26which actually is a shallow generative model.
0:47:29It actually makes a very simple assumption that
0:47:33observed noisy speech or mixed speakers' speech is the sum of two sources
0:47:40in spectral domain.
0:47:41Once they make that assumption,
0:47:43then of course they have to enforce that each,
0:47:46you know,
0:47:47each vector is non-negative because these are magnitude spectra.
0:47:52What they do for inference is an iterative technique.
0:47:58And that
0:47:59model automatically embeds the domain knowledge about how the observation
0:48:04is obtained, you know, through the mix between the two.
0:48:08And then this work essentially said how to apply that inference technique iteration by iteration. Every single
0:48:13iteration is treated as a different layer.
0:48:18After this they do the back-propagation training.
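The unfolding idea described here, where each iteration of non-negative matrix factorization inference becomes one layer, can be sketched as follows. This is a rough illustration only: it uses the standard multiplicative update for the Euclidean cost, which may differ from the cost used in the work the speaker refers to, and the matrices `W`, `V`, `H0` and the function names are hypothetical. Back-propagation training of the per-layer bases is not shown.

```python
EPS = 1e-12  # guards against division by zero

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf_layer(V, W, H):
    """One multiplicative update of the activations H for V ~ W H
    (Euclidean cost), viewed as one network layer."""
    Wt = transpose(W)
    num = matmul(Wt, V)
    den = matmul(matmul(Wt, W), H)
    return [[h * n / (d + EPS) for h, n, d in zip(hr, nr, dr)]
            for hr, nr, dr in zip(H, num, den)]

def unfolded_inference(V, layer_bases, H0):
    """Run one update per layer; each layer may carry its own (untied) bases."""
    H = H0
    for W in layer_bases:
        H = nmf_layer(V, W, H)
    return H

def squared_error(V, W, H):
    R = matmul(W, H)
    return sum((v - r) ** 2 for vr, rr in zip(V, R) for v, r in zip(vr, rr))

W = [[1.0, 0.2], [0.1, 1.0], [0.5, 0.5]]  # hypothetical bases: 3 frequency bins, 2 components
V = [[1.0, 0.3], [0.4, 1.1], [0.7, 0.6]]  # hypothetical magnitude spectra: 3 bins, 2 frames
H0 = [[0.5, 0.5], [0.5, 0.5]]             # non-negative starting activations
H5 = unfolded_inference(V, [W] * 5, H0)   # five iterations, i.e. five layers
print(squared_error(V, W, H0), squared_error(V, W, H5))
```

The update only multiplies and divides non-negative numbers, so the non-negativity constraint is built into every layer; once the per-layer bases are untied, they become ordinary trainable weights.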
0:48:21And the backward iteration is possible
0:48:25the problem is very simple, so the application here is a speech enhancement
0:48:29therefore objective function is a mean-square error, very easy. So the generative model
0:48:34actually generative model gives you
0:48:40the generative observation
0:48:42and then
0:48:43your output is clean speech.
0:48:45Okay, then with the mean-square error you actually adapt all these weights
0:48:48and the results are very impressive. So now this is why
0:48:52I showed you can design deep neural network
0:48:55if we use this
0:48:57type of
0:48:58activation function you automatically build in the constraints that you use in the generative model
0:49:03and that's
0:49:04very good example to show
0:49:06the message that I'm going to,
0:49:09actually I put in the beginning of the (presentation) it's
0:49:11hope of deep generative model. So this is
0:49:14shallow model and it's easy to do it. Now for deep generative model
0:49:18it's very hard to do.
0:49:19And one of reasons I put this as a topic today is partly because
0:49:25all this conference
0:49:27it's just three months ago
0:49:30in Beijing's ICML conference
0:49:33there's a very nice development
0:49:35of deep generative models' learning methods.
0:49:40They actually linked this
0:49:42neural network and Bayes net together
0:49:44through some transformation
0:49:46and because of that .. the main idea of .. whole bunch of papers including
0:49:51Michael Jordan,
0:49:52whole bunch, you know, a lot of very well known people
0:49:54in machine learning for deep generative model
0:49:56so the main
0:49:58point of this set of work, I just want to use one simple sentence to
0:50:03summarize them,
0:50:03is that
0:50:04when you originally tried to do E step I showed you early on
0:50:09you have to factorize them in order to get each step done
0:50:12and that was approximation
0:50:13and there was very nice ?? developed. A ?? so large it's practically useless
0:50:18in terms of inferring the top layer
0:50:24discrete event.
0:50:25Now the whole point is that now we can relax that constraint for factorization
0:50:30and like before three years ago if you do that if you use a rigorous
0:50:36you don't get any reasonable analytical solution so you cannot do EM.
0:50:42Now this
0:50:43idea is to say that while you can approximate
0:50:48that factorisation,
0:50:49you can approximate that dependency in E step learning
0:50:52not through
0:50:55factorization which is called mean field approximation,
0:50:57but use deep neural network to approximate.
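The contrast drawn here can be written out in standard variational-inference notation. The symbols below (observation $x$, hidden variables $z$, model parameters $\theta$, network weights $\phi$) are assumptions chosen for this sketch rather than notation from the talk.

```latex
\begin{align*}
\log p_\theta(x) &\ge \mathbb{E}_{q(z)}\big[\log p_\theta(x, z) - \log q(z)\big]
  && \text{(variational lower bound)} \\
q(z) &= \prod_i q_i(z_i)
  && \text{(mean-field factorization)} \\
q(z) &= q_\phi(z \mid x)
  && \text{(approximate posterior computed by a neural network with weights } \phi\text{)}
\end{align*}
```

The bound is tight when $q$ equals the true posterior $p_\theta(z \mid x)$, so a more flexible $q$ can reduce the gap that the factorized form leaves open.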
0:51:01So this is example to show that deep neural network actually help you to solve
0:51:05deep generative model problem and
0:51:07so this is the well-known Max Welling, a very good friend of mine in machine
0:51:14And he told me that the paper never show that.
0:51:17And they really developed the
0:51:20the theorem to prove that if network is large enough
0:51:24the approximation error can approach
0:51:26zero. Therefore the variational learnings
0:51:31can be eliminated and that's a very engine
0:51:33developed that really give me a little evidence to show that,
0:51:36to see that this is
0:51:38a promising approach. I think machine learning community development tool,
0:51:42our speech community developed verification
0:51:45and also methodology as well,
0:51:47but if
0:51:48you know we actually cross connect
0:51:50to each other we are going to make much more progress, and this type
0:51:55of development
0:51:56gives some
0:51:58promising direction
0:52:00towards the main message I put out at the beginning.
0:52:03Okay, so now I am gonna show you some deeper results that I want to
0:52:07show you.
0:52:09Another better architecture that we have known is what's called the recurrent network, if you
0:52:14this Beaufays' paper LSTM, look at that result. For
0:52:18voice search the error rate jumped down to about ten percent. That's very impressive result.
0:52:22Another type of architecture is to integrate the convolution
0:52:27and non-convolution together. That was ??
0:52:30in the previous result. As the author worth of any better result is in though.
0:52:33So these are the state-of-the-art for switchboard (SWBD) task.
0:52:37So now I'm going to concentrate on this type of
0:52:40recurrent network here.
0:52:43Okay, so this coming down to one of my main messages here.
0:52:47So we fixed this kluge with
0:52:51a recurrent network.
0:52:54We also fix this kluge automatically
0:53:00just using big data.
0:53:01Now how do we fix this kluge?
0:53:05So first of all I'll show you some analysis on recurrent network vs. deep generative
0:53:11so that's called hidden dynamic model I showed you early on, okay.
0:53:14And so far analysis hasn't been applied to LSTM.
0:53:17So some further analysis may
0:53:20actually automatically give rise to LSTM using some analysis on this.
0:53:24So this analysis is very preliminary
0:53:27and so if you stare at the equation
0:53:29for recurrent network it looks like this one. So essentially you have the state
0:53:33equation
0:53:34and it's recursive,
0:53:36from previous hidden layer to this.
0:53:40And then you get the output
0:53:43that produces the label.
0:53:45Now if you look at this deep generative model - hidden dynamic model
0:53:48identical equation,
0:53:50okay? Now what's the difference?
0:53:52The difference is that the input now is the label. Actually if you put the
0:53:57you cannot drive it. So you have to make some connection between labels and continuous
0:54:02and that's what in phonetics
0:54:03people call the phonology-to-phonetics interface, okay.
0:54:06So we use some very basic assumption
0:54:08that the interface is simply, that each label corresponds to target vector,
0:54:14actually the way that we implement early distribution, you can do that to account for
0:54:18differences, etcetera. Now the output
0:54:21for this recursion gives you the observation
0:54:24and that's the recurrent filter type of model.
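The correspondence being described can be written side by side. The notation here is an assumption for illustration (acoustic observation $x_t$, hidden state $h_t$, label output $y_t$, label $l_t$ with target vector $t_{l_t}$, nonlinearities $f$ and $g$), and noise terms are omitted.

```latex
\begin{align*}
\text{Recurrent network (bottom-up):}\quad
  h_t &= f\big(W_{hh}\, h_{t-1} + W_{xh}\, x_t\big), &
  y_t &= g\big(W_{hy}\, h_t\big) \\
\text{Hidden dynamic model (top-down):}\quad
  h_t &= f\big(W_{hh}\, h_{t-1} + W_{lh}\, t_{l_t}\big), &
  x_t &= g\big(W_{hx}\, h_t\big)
\end{align*}
```

Both share the same recursion on the hidden state; what differs is the direction: the recurrent network is driven by the acoustics and emits labels, while the generative model is driven by label targets and emits the acoustics.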
0:54:28And that's engineering model and there's neural network model, okay. So every time I was
0:54:32?? I called ?? on this.
0:54:34So we fully understood all the constrains for this type of model.
0:54:39Now for this model it looks the same, right?
0:54:41So if you reverse direction you convert one model to another.
0:54:44And for this model it's very easy to put a constraint, for example
0:54:49the dynamics
0:54:53matrix here that governs
0:54:56the internal dynamics in the hidden domain actually can be made sparse and then you
0:55:00can put
0:55:02a realistic constraint there; for example, in our
0:55:04earlier implementation of this we put in critically damped dynamics,
0:55:08therefore you can guarantee it doesn't oscillate. When we do articulation we need phone boundaries.
0:55:12This is the speech production mechanism
0:55:15you can put them simply to fix the sparse matrix.
0:55:17Actually one of the slides I'm gonna show you is all about this.
0:55:22In this one we cannot do it, everything has to be a structure.
0:55:25There's just no way you can say that why, you want that dynamics
0:55:29to behave in certain way.
0:55:32You just don't have any mechanism to design the structure of this and this is
0:55:36very natural, it's by physical
0:55:37properties that design this. Now because of
0:55:40this correspondence and because of the fact that now we can do
0:55:44deep inference
0:55:47if all this machine learning technology actually are fully developed
0:55:51we can very naturally bridge the two (models together).
0:55:53It turned out if you do more
0:55:55rigorous analysis
0:55:57making the inference of this to be fancier
0:56:00our hope is that this
0:56:04kind of unit would automatically emerge from this type of model so that has not
0:56:08been shown yet.
0:56:10So of course this is just, you know, very high-level view comparison between the two
0:56:15there are a lot of detail comparison you can make in order to bridge the
0:56:19so actually my colleague Dong Yu wrote this book that's just coming out very soon.
0:56:26So in one of the chapters we put all these comparisons: interpretability, parametrization, methods
0:56:32of learning and nature of representation and all the differences.
0:56:36So it gives a chance to actually understand
0:56:38how deep generative model in terms of dynamics
0:56:42and recurrent network in terms of recurrence can
0:56:44be matched with each other, so I will read that over here.
0:56:48So I have the final five, three more minutes, five more minutes. I will go
0:56:53very quickly.
0:56:54Every time I talk about it I run out of time.
0:56:59so the key concept is called embedding.
0:57:01Okay, so actually you can find the literature in nineties, eighties to have this
0:57:07basic idea around.
0:57:09For example in this special issue of
0:57:12Artificial Intelligence, very nice papers over here, I had the chance to read them all.
0:57:15And very insightful and some of the chapters over here are very good.
0:57:18So the idea is that each physical entity or linguistic
0:57:23you know
0:57:25word, phrase, but even whole article, whole paragraph
0:57:29can be embedded into
0:57:30continuous-space vector. It could be big ??, you know.
0:57:34Just to let you know it's special issue on this topic.
0:57:38And that's why it's important concept.
0:57:41The second important concept, which is much more advanced
0:57:44which is described by a few books over here. I really enjoyed reading some of
0:57:49those and I invite those
0:57:50people come to visit me.
0:57:52We have a lot to discuss on that. You can actually even embed the structure
0:57:57next structure symmetric into a vector
0:58:01where you can recover the structure completely through the vector
0:58:04operation and the concept is called tensor-product representation.
0:58:08So I don't have .. if only I had three hours I can go through
0:58:11all of this.
0:58:11But for now I'm going to elaborate about this for next two minutes.
0:58:17this is the neural network recurrent model and this is very nice, I mean this
0:58:21is fairly informational paper
0:58:22to show that embedding can be done as part of the
0:58:25as a byproduct of the recurrent neural network that
0:58:28paper was published in Interspeech several years ago.
0:58:34And then I'll talk very quickly about semantic embedding at MSR, so
0:58:39the difference between this set of work and the previous work was that
0:58:42everything is completely unsupervised
0:58:44so in the company if you have supervision you should grab it, right.
0:58:48So we actually took initiative to actually take some
0:58:51very smart
0:58:52exploitation of supervision signals
0:58:54at virtually no cost.
0:58:57So the idea here was that this is the model that we have essentially for
0:59:01each branch it's deep neural network. Now different
0:59:03branches can actually link together
0:59:05through what's called the, you know, cosine distance.
0:59:08So that
0:59:09distance can be measured
0:59:10in terms of
0:59:11a vector, in a vector space.
0:59:13And now we do MMI learning,
0:59:16so if you get hot dog in this one, if your document is talking about
0:59:20fast food or something, even if
0:59:22there's no word in common you pick up.
0:59:24And because of supervision actually link them together.
0:59:27Like if you have dog racing here
0:59:29they have the same word although they will be very far apart from each other.
0:59:33And that can be automatically done.
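The linking step described here, where each branch produces a vector and the branches are compared by cosine similarity, can be sketched as below. The three-dimensional vectors stand in for the outputs of the deep branches and are made up for the example; the smoothing factor `gamma` and the function names are likewise assumptions, not values from the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def posteriors(query_vec, doc_vecs, gamma=10.0):
    """Softmax over scaled cosine similarities: one probability per document."""
    scores = [gamma * cosine(query_vec, d) for d in doc_vecs]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = [0.9, 0.1, 0.0]   # hypothetical embedding of the query "hot dog"
docs = [[0.8, 0.2, 0.1],  # hypothetical embedding of a fast-food document
        [0.0, 0.1, 0.9]]  # hypothetical embedding of a dog-racing document
probs = posteriors(query, docs)
print(probs)
```

Supervised training would adjust the branch networks so that the probability of the document actually associated with the query goes up, which pulls related query and document vectors together even when they share no words.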
0:59:37And some people told me that a topic model can do
0:59:39similar things, so if we compare that with the topic model
0:59:42it turned out that ??
0:59:45and using this
0:59:46deep semantic model
0:59:48can do much, much better.
0:59:49So, now multi-modal. Just one more slide.
0:59:53So it turned out that not only text you can embed into it,
0:59:57image can be embedded, speech can be embedded and can do something very similar
1:00:01to the one I showed you earlier.
1:00:03And this is the paper that was in yesterday's talk about embedding.
1:00:09That's very nice, I mean it's a very similar concept.
1:00:12So I looked at this and I said wow it's just like the model that
1:00:15we did for the text.
1:00:16But it turned out that application is very different.
1:00:18So actually
1:00:20I don't have time to go through here. I encourage you to read some papers
1:00:24over here. Let's skip this.
1:00:25So this was just to show you some application for this
1:00:27semantic model. You can do all the things. From web search
1:00:30we apply them, quite nicely. For machine translation you have one entity
1:00:34to be one language
1:00:37some of the list of the paper that were published you can find some detail.
1:00:40You actually can make summary, summarization and entity ranking.
1:00:45So let's skip this. This is final slide, the real final slide.
1:00:49I don't have any summary slides, this is my summary slide.
1:00:51So I copied the main message here now. Elaborate could be more. After going through
1:00:55whole hour of presentation.
1:00:57Now in terms of application we have seen
1:01:00speech recognition.
1:01:01The green is
1:01:03neural network, the red is deep generative model. So
1:01:07I say a few words about deep generative model and dynamic model
1:01:11that's generative models side and LSTM is other side. Now speech enhancement
1:01:16I showed you these types of models
1:01:19and then
1:01:20on the generative model side I showed you this one
1:01:25and this is shallow generative model that actually can
1:01:28give rise to deep structure which is corresponding to
1:01:33stacking network I showed you early on. Now for algorithm, we have got back-propagation.
1:01:38That's the single unchallenged
1:01:40algorithm for deep neural network.
1:01:42Now for deep generative model there are two algorithms. They are both called
1:01:47So one is called Belief Propagation, for those of you who know machine learning.
1:01:51The other one is BP, same as this.
1:01:54That only came up within two years.
1:01:57Due to this new advancement
1:02:00of porting deep neural network
1:02:02into the inference step
1:02:04of this type of model, so I call BP and BP.
1:02:08And in terms of neuroscience you call this one to be wake and you call
1:02:11the other sleep.
1:02:12And in the sleep you generate things you get hallucination and then when you're awake
1:02:16you have perception.
1:02:17You get information there. I think that's all I want to say. Thank you very
1:02:29Okay. Anyone? One or two quick questions?
1:02:37Very interesting talk.
1:02:40I don't want to talk about your main point which is very interesting.
1:02:43Actually just very briefly about one of your side messages, which is about waveforms.
1:02:48So you know the ?? paper there weren't really putting in
1:02:54They are putting in the waveforms, take the absolute value, floor it, take all
1:02:58logarithm, average over, but you know so you had to do a lot of things.
1:03:03Secondly the other papers that there's been a modest
1:03:07amount of work in last few years on doing this sort of thing,
1:03:10pretty generally people do it with matched training test conditions
1:03:14if you have mismatched conditions, good luck with
1:03:16waveform. I always hate to say something is impossible but good luck.
1:03:24Thank you very much. ?? good for everything.
1:03:27And look at presentation that was very nice, thank you.
1:03:32Any other quick questions?
1:03:36If not I invite Haizhou
1:03:40to give a plaque.