0:00:15 | First of all, I would like to thank the organisers for giving me

0:00:19 | this opportunity to share with you |

0:00:22 | some of my personal views |

0:00:25 | on this very hot topic here. So, |

0:00:29 | I think the goal of this tutorial really is to |

0:00:33 | help diversify the deep learning approach, just like the theme of this conference,

0:00:40 | Interspeech, for diversifying

0:00:42 | the languages, okay.

0:00:45 | So I have a long list of people to thank.

0:00:49 | Yeah, thank you.

0:00:50 | So I have a long list of people here to thank.

0:00:53 | Especially Geoff Hinton. I worked with him for some period of time. |

0:00:58 | And Dong Yu and a whole bunch of Microsoft colleagues

0:01:02 | who

0:01:05 | contributed a lot to the material

0:01:08 | I'm going to go through.

0:01:10 | And also I would like to thank many of the colleagues sitting here who had |

0:01:14 | a lot of discussions with me. |

0:01:16 | And their opinions also shaped some of the content that I am going to go |

0:01:20 | through with you over the next hour. |

0:01:23 | Yeah, so the main message of this talk |

0:01:26 | is that deep learning is not the same as deep neural network. I think in |

0:01:30 | this community most people

0:01:31 | confuse deep learning with deep neural networks.

0:01:36 | And most ... |

0:01:38 | So deep learning is something that everybody here would know. I mean just look at |

0:01:42 | ... I think I counted close to 90 papers somewhere |

0:01:44 | related to deep learning, at least approximately. The number of papers has been

0:01:50 | exponentially

0:01:50 | increasing over the last twelve years.

0:01:53 | So a deep neural network is essentially a neural network

0:01:56 | that you can unfold in space. You form a big network.

0:02:01 | Or,

0:02:02 | and,

0:02:03 | either way, or both, you can unfold it over time. If you unfold

0:02:07 | that neural network over time, you get a recurrent network, okay.

0:02:11 | But there's another very big branch of deep learning, which I would call Deep Generative |

0:02:16 | Model. |

0:02:17 | Like a neural network, it can also be unfolded in space and in time.

0:02:22 | If it's unfolded in time, you would call it a dynamic model. |

0:02:26 | Essentially the same concept. You unfold the network. |

0:02:31 | You know,

0:02:32 | in the same direction in terms of time,

0:02:36 | but in terms of space they are unfolded in the

0:02:39 | opposite direction. So I'm going to elaborate on this part. For example,

0:02:43 | our very commonly used model,

0:02:46 | you know, the Gaussian Mixture Model hidden Markov model, really is a

0:02:53 | network unfolded in time.

0:02:57 | But if you make that unfold in space you get a deep Generative Model,

0:03:00 | which hasn't been very popular in our community.
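The unfolding idea just described can be sketched in code. This is an illustrative sketch of my own, not taken from the talk's slides: the same one-step operation gives a deep network when unfolded in space (a different weight per layer) and a recurrent network when unfolded in time (one shared weight).

```python
# Toy sketch of "unfolding": weights and inputs here are made up for illustration.
def step(w, x):
    # one affine step followed by a simple squashing nonlinearity
    return max(-1.0, min(1.0, w * x))

def unfold_in_space(weights, x):
    # depth: a different parameter at every layer
    for w in weights:
        x = step(w, x)
    return x

def unfold_in_time(w, inputs, h=0.0):
    # recurrence: the SAME parameter applied at every time step
    for u in inputs:
        h = step(w, h + u)
    return h

deep_out = unfold_in_space([0.5, 2.0, 0.8], 1.0)
rnn_out = unfold_in_time(0.5, [1.0, 1.0, 1.0])
```

The two loops have the same shape; only the parameter sharing differs, which is the sense in which both kinds of model are "the same network, unfolded".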

0:03:05 | I'm going to survey a whole bunch of work related to this area,

0:03:09 | you know, informed by my discussions with many people here.

0:03:14 | But anyway, the main message of this talk is that

0:03:20 | I hope, and I think, there's a promising direction that is already taking shape in the machine

0:03:26 | learning community.

0:03:27 | I don't know how many of you actually went to International Conference on Machine Learning |

0:03:30 | (ICML) this year, just a couple of months ago in Beijing. |

0:03:33 | But there's a huge amount of work on Deep Generative Models and some very interesting

0:03:36 | development, which I think I'd like to share with you at high level, |

0:03:41 | so you can see that all this deep learning, although it started in terms

0:03:46 | of application in our

0:03:47 | speech community, and we should be very proud of that,

0:03:51 | Hmm, now, |

0:03:52 | in the machine learning community there's a huge amount of work going on

0:03:57 | on Deep Generative Models. So I hope I can share some of the recent

0:03:59 | developments with you

0:04:02 | to reinforce the message that

0:04:05 | a good combination of the two,

0:04:08 | which have

0:04:10 | complementary strengths and weaknesses, can actually be brought together to further

0:04:15 | advance deep learning in our community here.

0:04:19 | Okay, so now. These are very big slides. I'm not going to go through all |

0:04:23 | of the details. I'm just going to highlight a few

0:04:25 | things in order to reinforce the message that

0:04:30 | generative model and |

0:04:31 | neural network models can help each other.

0:04:34 | I'm just going to highlight a few key attributes of |

0:04:38 | both approaches. They are very different approaches. |

0:04:41 | I'm going to highlight that very briefly. First of all |

0:04:45 | in terms of structure they are both graphical in nature as a network, okay. |

0:04:48 | Think about this deep generative model: typically some of these

0:04:56 | we call a Dynamic Bayesian Network. You actually have a joint probability between the label

0:05:01 | and the observation,

0:05:03 | which is not the case for a deep neural network,

0:05:05 | okay. |

0:05:06 | In the literature you see many other terms

0:05:10 | that relate to the deep generative model, like probabilistic graphical models,

0:05:14 | such as stochastic neurons;

0:05:17 | sometimes it's called a stochastic generative network, as you see in the literature. They all belong

0:05:21 | to this

0:05:22 | category. So if your mindset is over here, even though you see some neural words

0:05:28 | describing that,

0:05:29 | you know, you won't be able to read all this literature; the mindset is

0:05:32 | very different when you study these two.

0:05:34 | So the strength

0:05:35 | of the deep generative model,

0:05:39 | and this is very important to me,

0:05:42 | is how easy it is to interpret, okay.

0:05:44 | So everybody that I talked to, including at lunchtime when I talked to students,

0:05:49 | they complain. I say: have you heard about the deep neural network? And everybody says: yes,

0:05:52 | we have.

0:05:54 | To what extent have you started looking into that? And they said: we don't want

0:05:57 | to do that, because we cannot

0:05:58 | even interpret what's in the hidden layers, right.

0:06:01 | And that's true |

0:06:02 | and that actually is quite by design. I mean if you

0:06:05 | read the cognitive science literature on connectionist models,

0:06:09 | really the whole design is that you need to have a representation here to be |

0:06:13 | distributed. |

0:06:13 | So each neuron can represent different concepts

0:06:17 | and each

0:06:18 | concept can be represented by different neurons, so by its very design

0:06:21 | it's not meant to be interpretable,

0:06:23 | okay. |

0:06:24 | And that actually creates some difficulty for many people,

0:06:27 | and this model is just the opposite. It's very easy to interpret, because of the very nature

0:06:33 | of the generative story.

0:06:34 | You can tell what the process is |

0:06:36 | and then of course if you want to do |

0:06:39 | a classification or some other application in machine learning |

0:06:42 | you simply just have to ...

0:06:44 | for classification we simply use Bayes' rule to invert that. That's exactly what in

0:06:48 | our community

0:06:49 | we have been doing for thirty years with the hidden Markov model. You get the prior, you

0:06:52 | get the generative model, and

0:06:53 | you multiply them and then you decode. Except that at that time we didn't know

0:06:57 | how to make that |

0:06:58 | deep for this type of model. And there are some pieces of work that I'm

0:07:01 | going to survey. |

0:07:02 | So that's one big part of the advantage of this model. |
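The Bayes'-rule inversion being described, the standard recipe behind HMM decoding, can be sketched minimally. The numbers and three-class setup below are illustrative, not from the talk.

```python
# Invert a generative model with Bayes' rule:
# p(label | x) is proportional to p(x | label) * p(label).
def classify(likelihoods, priors):
    joint = [l * p for l, p in zip(likelihoods, priors)]  # unnormalised posterior
    z = sum(joint)
    posterior = [j / z for j in joint]                    # normalise over labels
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior

likelihoods = [0.02, 0.10, 0.05]  # p(x | label) from the generative model
priors = [0.50, 0.20, 0.30]       # p(label)
label, post = classify(likelihoods, priors)
```

The generative model only ever describes how data is produced; the multiplication by the prior and the argmax are what turn it into a classifier.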

0:07:05 | Of course everybody knows what I just mentioned there.

0:07:09 | In the deep generative model the information flow is actually from the top down.

0:07:13 | You actually have... what "top" simply means is, you know, you get a

0:07:16 | label or you get a higher-level concept,

0:07:18 | and going down to the lower level simply means you generate data to fit into that.

0:07:22 | Everybody knows that in a neural network

0:07:25 | the information flow is from bottom up, okay. So you feed in the data and

0:07:29 | you compute whatever output and then

0:07:30 | you go whichever way you want.

0:07:31 | In this case

0:07:33 | the information comes from the top down. You generate the information,

0:07:37 | and then if you want to do classification, you know, or any other machine

0:07:42 | learning application, you can do Bayesian inversion. Bayes' rule is very essential for this.

0:07:49 | But there's a whole list of those. I don't have time to go through them, but

0:07:52 | you know, those are the highlights, these

0:07:54 | we have to say. So the main strength of the deep neural network, which actually gained

0:07:59 | popularity

0:07:59 | over the previous years, is really mainly due to these strengths.

0:08:04 | It's easier to do the computation; in terms of

0:08:10 | what I wrote here, it's "regular compute", okay.

0:08:13 | So if you

0:08:14 | look into exactly what kind of compute is involved here,

0:08:17 | it's just millions and millions and millions of times computing

0:08:21 | the product of a big matrix with a vector.

0:08:23 | You do that many times. The ?? plays a very small role;

0:08:27 | it's very regular.

0:08:28 | And therefore the GPU is really

0:08:31 | ideally suited for this kind of computation |

0:08:33 | and that's not the case for this model. |
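The "regular compute" just described, the same matrix-by-vector multiply repeated layer after layer, can be sketched as follows. The weights and input are toy values of my own; real systems batch these multiplies on a GPU rather than looping in Python.

```python
# Core DNN operation: multiply weight matrix W by vector x, layer after layer.
def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, u) for u in v]

def forward(layers, x):
    # The forward pass is just this one regular pattern repeated,
    # which is why GPUs (parallel multiply-accumulate) suit it so well.
    for W in layers:
        x = relu(matvec(W, x))
    return x

layers = [[[1.0, -1.0], [0.5, 0.5]],  # toy 2x2 weight matrices
          [[2.0, 0.0], [0.0, 2.0]]]
out = forward(layers, [1.0, 2.0])
```

Every layer does exactly the same kind of arithmetic, so scaling up means only bigger matrices and more of them; there is no irregular, data-dependent control flow of the sort that makes deep generative inference hard to parallelise.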

0:08:36 | So if you compare between these two then you really will understand that if you |

0:08:41 | can pull |

0:08:42 | some of these advantages into this model |

0:08:44 | and pull some of the advantages in this column into this one,

0:08:48 | you have an integrated model. And that's kind of the message I'm going to convey, and

0:08:53 | I'm going to

0:08:54 | give you examples to show how this can be done.

0:08:57 | Okay, so in terms of interpretability it's very much related to |

0:09:04 | how to incorporate domain knowledge

0:09:06 | and constraints into the model. And for a deep neural network it's very hard.

0:09:12 | People have done that; I have seen many people at this conference and also

0:09:16 | at ??

0:09:17 | try very hard, but it's not very natural.

0:09:20 | Whereas

0:09:22 | this is very easy:

0:09:23 | I mean you can encode your domain knowledge directly into the system. For example,

0:09:29 | take distorted speech, noisy speech: you know,

0:09:32 | as a summation in the spectral domain, or a summation

0:09:35 | in the waveform domain, it is the noise

0:09:37 | plus

0:09:38 | the clean speech that you get as the observation. That's so simple: you just encode that as

0:09:43 | one layer, a summation, or

0:09:44 | you can encode it in terms of Bayesian probability very easily.

0:09:47 | This is not that easy to do there. People have tried to do it; it's just not

0:09:51 | as easy.

0:09:51 | So to encode

0:09:53 | domain knowledge and the natural constraints of the problem

0:09:57 | into

0:09:58 | your deep learning system has a great advantage.

0:10:01 | So I'm actually, I mean, this is just a random selection

0:10:03 | of things, you know. There's a very nice paper over here,

0:10:06 | Acoustic Phonetics.

0:10:08 | All this knowledge about speech production

0:10:11 | and this kind of nonlinear |

0:10:12 | phonology |

0:10:14 | and this is an example of noise robustness. If you put in the phase information

0:10:19 | of the speech and the noise, you can come up with a

0:10:22 | very nice conditional distribution. It's kind of complicated,

0:10:24 | but this one can be put directly

0:10:26 | into a generative model, and this is an example of that. Whereas in a deep neural network

0:10:31 | it's very hard to do.

0:10:33 | So the question is: do we want to throw away all this knowledge in

0:10:36 | deep learning?

0:10:37 | And my answer is of course no. Most people will say no, okay.

0:10:45 | And for people from outside the speech community, the answer was sometimes yes; I'm talking about

0:10:48 | some people in machine learning.

0:10:49 | Anyway, since this is a speech conference, I really want to emphasise that.

0:10:54 | So the real,

0:10:55 | solid, reliable knowledge that we have attained

0:10:58 | from speech science,

0:10:59 | which has been reflected in talks here,

0:11:03 | such as yesterday's talk about how sound patterns have been shaped by

0:11:09 | ?? and perception: that can really play a role in a deep generative model,

0:11:14 | but it is very hard to do that in a deep neural network.

0:11:17 | So with this main message in mind |

0:11:20 | I'm going to go through three parts of the talk as I put them in |

0:11:24 | my abstract here. |

0:11:25 | So I need to go very briefly |

0:11:27 | through all these three topics. |

0:11:30 | Okay, so the first part is to give a very brief history of how deep learning in speech

0:11:36 | recognition started.

0:11:38 | So this is a very simple list. There were so many papers around before the

0:11:43 | rise of deep learning, around

0:11:45 | 2009 and 2010. So I hope I actually have

0:11:50 | a reasonable

0:11:52 | sample of the work here.

0:11:54 | So I don't have time to go through it all, especially for those of you who were

0:11:58 | at the ?? open house.

0:12:00 | There was, in 1988, I think in 1988,

0:12:03 | ASRU, and at that time there was no U, it was just

0:12:05 | ASR. And there were some very nice papers around then, which quickly,

0:12:10 | you know, were

0:12:10 | superseded

0:12:11 | by the hidden Markov model approach.

0:12:15 | So I'm not going to go through all of these,

0:12:17 | except to point out that

0:12:20 | the neural network

0:12:22 | had been very popular for a while.

0:12:24 | But in the, you know,

0:12:26 | ten-plus years

0:12:28 | before deep learning actually took over, the neural network approach

0:12:33 | essentially didn't really make

0:12:36 | such a strong impact compared with the deep networks that people have been seeing.

0:12:41 | So I'll just give you one example to show you how unpopular

0:12:45 | the neural network was at that time. |

0:12:48 | So this is about 2008 or 2006, about nine years ago. |

0:12:53 | So this is the organization that I think

0:12:56 | is the predecessor

0:12:57 | of ?? IARPA.

0:12:58 | So they actually got several of us together, locked us up in a hotel

0:13:03 | near a Washington, DC

0:13:05 | airport somewhere.

0:13:07 | Essentially the goal was to say: well, speech

0:13:09 | recognition is stuck, so you come over here and help us brainstorm the next generation

0:13:15 | of speech recognition and understanding technology.

0:13:18 | And then we actually spent about four or five days in the hotel and at

0:13:22 | the end we wrote a very thick report,

0:13:25 | twenty-some pages of report.

0:13:26 | So there is some interesting discussion about the history, and the idea was:

0:13:31 | if the government gives you unlimited resources and gives you fifteen years, what is it you

0:13:35 | can do, right?

0:13:36 | So most of the people in our discussion, |

0:13:39 | we all focused on neural network, essentially |

0:13:41 | margin is here,

0:13:42 | Markov random field is here, conditional random field is here, and graphical model here.

0:13:48 | So,

0:13:50 | that was just a couple of years before deep learning actually came out, and at that

0:13:54 | time

0:13:54 | the neural network was actually one of the

0:13:57 | tools around

0:13:58 | that hadn't really made a big impact.

0:14:01 | So on the other hand the graphical model was actually mentioned here because it's related |

0:14:06 | to deep generative model. |

0:14:08 | So I'm going to show you a little bit. Well, this is a slide about deep

0:14:12 | generative models; actually I made a list over here.

0:14:15 | One of the ...

0:14:18 | but anyway. Let's go over here.

0:14:21 | I just want to highlight a couple of things

0:14:25 | related to the

0:14:27 | introduction of the deep neural network in the field.

0:14:30 | Okay, so one of... this is ?? John Bridle.

0:14:32 | Actually, we spent a summer at ?? in 1989,

0:14:37 | or 1988,

0:14:39 | fifteen-and-some years ago. So we spent a really interesting summer all together.

0:14:45 | So |

0:14:46 | and that's kind of the model, the deep generative model, the two versions we actually put

0:14:52 | together,

0:14:52 | and at the end we actually wrote a very thick report that was about eighty

0:14:56 | pages.

0:14:58 | So this is a deep generative model, and it turned out that

0:15:02 | both of those models were actually implemented with neural networks.

0:15:06 | Think about the neural network as simply a function, a mapping:

0:15:09 | you map the hidden representation,

0:15:12 | you know,

0:15:13 | as part of the deep generative model, into whatever observation you have,

0:15:18 | MFCCs. Everybody used MFCCs at the time.

0:15:22 | You actually need to have that mapping, and that was done with a neural network

0:15:27 | in both versions,

0:15:28 | and this is the statistical version, which we

0:15:31 | call the hidden dynamic model. It's one of the versions

0:15:34 | of the deep generative model.

0:15:36 | It didn't succeed. I'll show you the reason why; now we understand why.

0:15:40 | Okay, so interestingly enough, in this

0:15:43 | model we actually used, if you read the report, it actually turned out that the model

0:15:47 | was here. Geoff told me that

0:15:49 | the video for this workshop is still around somewhere, so it's called ?? sign. I

0:15:53 | think I mentioned to ?? to pick it out.

0:15:56 | It turned out that the learning in this workshop, the details of which are in this report,

0:16:00 | actually used backpropagation. Now the direction isn't from the top

0:16:03 | down: since your model is

0:16:05 | top-down, the propagation must be bottom-up.

0:16:08 | So nowadays,

0:16:10 | when we do speech recognition, the error

0:16:14 | function is a softmax cross-entropy, or sometimes you can use the mean square error,

0:16:18 | and the error is measured in terms of your labels.

0:16:23 | This is the opposite: the error is measured in terms

0:16:26 | of how well the generative model can match the observation. And then when you

0:16:31 | want to

0:16:31 | learn, you do bottom-up learning, which actually turned out to be backpropagation. So

0:16:35 | backpropagation doesn't have to be done top to bottom;

0:16:37 | it can be bottom up, depending on what kind of model you have.

0:16:40 | But the key is that this is

0:16:41 | a gradient descent method.
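The contrast being drawn, the same gradient-descent update serving either a label-matching or a data-matching objective, can be sketched minimally. The objective and values below are illustrative choices of mine, not from the talk.

```python
# Minimal gradient descent: the update rule is identical whether the error
# measures label mismatch (discriminative) or data mismatch (generative).
def gradient_descent(grad, theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Generative-style objective: make the model output (here just theta itself)
# match the observation x, i.e. minimise (theta - x)^2.
x = 3.0
theta = gradient_descent(lambda t: 2.0 * (t - x), theta=0.0)
```

Swapping in a cross-entropy gradient against a label instead of this data-matching gradient changes only the `grad` function, which is the sense in which both directions of learning are "just gradient descent".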

0:16:44 | So actually we got disappointing results on Switchboard, you know, because we tended to be

0:16:48 | a bit off our game.

0:16:49 | And now we understand why; not at that time. I'm sure some of you experienced

0:16:52 | it. I have done a lot

0:16:53 | of thinking about how deep learning and this model can be integrated together.

0:16:59 | So at the same time |

0:17:02 | Okay, so this is a fairly simple model, okay. So you have this hidden representation

0:17:07 | and it has

0:17:08 | specific constraints built into the model,

0:17:11 | which, by the way, is very hard to do when you do a bottom-up neural network.

0:17:15 | In a generative model

0:17:16 | you can put them in very easily, so for example

0:17:18 | the articulatory trajectory has to be smooth,

0:17:22 | and then the specific form of the smoothness can be built in directly

0:17:26 | by simply writing down the generative probabilities. Not in the deep neural network.
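One way to write such a smoothness constraint directly into the generative probabilities is a Gaussian prior on successive differences of the trajectory. This is a hypothetical sketch of mine, not the model from the slides; the trajectories and `sigma` are illustrative.

```python
import math

# Smoothness as a generative prior: penalise large jumps between adjacent
# points of a (hypothetical) articulatory trajectory.
def log_smoothness_prior(traj, sigma=0.1):
    logp = 0.0
    for a, b in zip(traj, traj[1:]):
        d = b - a  # successive difference; a smooth path keeps this small
        logp += -0.5 * (d / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return logp

smooth = [0.0, 0.05, 0.1, 0.15]  # gently varying trajectory
jumpy = [0.0, 0.5, -0.4, 0.6]    # physically implausible jumps
```

The prior assigns the smooth trajectory a higher log-probability, so during inference the model prefers articulator paths that move gradually, exactly the kind of knowledge that is awkward to force into a bottom-up network.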

0:17:31 | So at the same time,

0:17:33 | we actually, and this was also done at ??,

0:17:38 | we were able to even put in this nonlinear phonology, in terms of

0:17:43 | breaking the phonemes into the individual constituents at the top level, and ?? also has a

0:17:49 | very nice paper, some fifteen years ago, talking about this.

0:17:53 | And also the robustness can be directly integrated into the

0:17:57 | articulatory model, simply through the generative model. For a deep neural network it's very hard to

0:18:01 | do.

0:18:01 | For example, you can actually...

0:18:05 | this is not meant to be seen. Essentially this is one of the conditional likelihoods

0:18:10 | that covers

0:18:11 | one of the links. So every time you have a link,

0:18:15 | you have a conditional dependency from parent to children, which have different neighbours,

0:18:22 | and then you can specify them in terms of

0:18:24 | conditional distributions. Once you do that you have formed a model:

0:18:27 | you can embed

0:18:28 | whatever knowledge you have, whatever you think is good, into the system. But anyway,

0:18:33 | the problem is that the learning is very hard,

0:18:35 | and that learning problem was only solved in the machine learning community within the last

0:18:41 | year.

0:18:42 | At that time we just didn't really know. We were so naive. |

0:18:47 | We didn't really understand all the limitations of learning. So just to show you what we

0:18:50 | did, okay. One of the

0:18:51 | things we did was, I actually worked on this with my colleague Hagai Attias.

0:18:55 | He is actually one of the...

0:18:56 | he was my colleague, working not far away from me at that time, some ten

0:19:01 | years ago.

0:19:02 | So he was the one who invented variational Bayes, which is very well

0:19:06 | known.

0:19:07 | So the idea was as follows. You have to break up these pieces into

0:19:11 | modules, right.

0:19:12 | For each module you have this; this is actually a

0:19:14 | continuous

0:19:17 | dependency on the continuous hidden representation,

0:19:20 | and it turned out that the way to learn this,

0:19:23 | you know, in principle, is EM (Expectation Maximization). It's variational

0:19:26 | EM.

0:19:26 | So the idea is very crazy.

0:19:28 | So you say you cannot solve that rigorously, and that's well known; it's a loopy

0:19:34 | network. Then you just cut all the important dependencies you

0:19:37 | carry, hoping that the M-step can make up for it. That's a very crazy idea.

0:19:41 | And that was the best around at the time.

0:19:43 | But it turned out that you get the auxiliary function, and what you form is still

0:19:48 | something very

0:19:49 | similar to our EM, you know, in the HMM. For that model you don't have

0:19:55 | a loop, so you can get a rigorous solution.
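The variational-EM recipe being described can be written out in its standard textbook form; the notation here is mine, not the slide's. Maximising the auxiliary lower bound alternates between fitting an approximate posterior and updating the parameters:

```latex
% Variational EM: maximise a lower bound on the log-likelihood.
% q(h) is the approximate posterior over hidden variables h;
% "cutting the dependencies" = restricting q to a factorised (mean-field) family.
\log p(x \mid \theta)
  \;\ge\; \mathcal{F}(q, \theta)
  \;=\; \mathbb{E}_{q(h)}\!\left[\log p(x, h \mid \theta)\right]
       \;-\; \mathbb{E}_{q(h)}\!\left[\log q(h)\right]

% E-step:  q(h) \leftarrow \arg\max_{q \in \text{factorised}} \mathcal{F}(q, \theta)
% M-step:  \theta \leftarrow \arg\max_{\theta} \mathcal{F}(q, \theta)
```

The "crazy" part the speaker refers to is the E-step: the exact posterior is intractable in a loopy deep model, so the dependencies are cut in `q` and one hopes the M-step compensates.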

0:19:57 | But now when you have depth it's very hard. You have to make up for

0:20:01 | it. And that ?? is just as ??

0:20:03 | many people could ?? on deep neural networks. These ?? deep generative models

0:20:08 | probably have more

0:20:09 | than otherwise, although they present themselves

0:20:12 | as being,

0:20:13 | you know, very rigorous. But if you really work on that, so I can pick

0:20:17 | this out:

0:20:18 | for this approach we got surprisingly good inference results for continuous variables.

0:20:22 | And in one version, what we did was actually we used formants,

0:20:27 | you know, as the hidden representation, and it turned out it tracked. And once you

0:20:31 | do this you

0:20:32 | track the formants really precisely.

0:20:34 | As a byproduct of this work we created

0:20:38 | a database for formant tracking,

0:20:42 | but when we actually do

0:20:45 | inference on the linguistic units, which is the problem

0:20:48 | of recognition, we didn't really make much progress.

0:20:51 | But anyway so I'm going to show you some of these preliminary results to show |

0:20:56 | you how this |

0:20:57 | is one way that led to the deep neural network. |

0:21:00 | So when we actually simplified the model in order to do the decoding,

0:21:07 | this is actually the ?? result,

0:21:09 | and we brought out all the analysis for different kinds of phones.

0:21:12 | So when we used this kind of generative model with deep structure, it actually corrected

0:21:17 | many errors

0:21:18 | which are related to the short phones.

0:21:20 | And you understand why, because you designed the model to make that happen, and then,

0:21:24 | you know, if

0:21:25 | everything is done reasonably well, you actually get results. So we actually looked

0:21:28 | at not only the corrected short phones for the vowels,

0:21:32 | but it also corrected a lot of

0:21:34 | consonants, because they're coupled with each other.

0:21:36 | It's just because, by the model design, whatever hidden trajectory you get

0:21:40 | is influenced, the pattern of the vowel is influenced,

0:21:45 | by the adjacent sounds.

0:21:47 | And that's,

0:21:47 | this is due to coarticulation.

0:21:49 | This can be very naturally built into the system,

0:21:51 | and one of the things I very much struggle with in the deep neural network is that

0:21:55 | you can't even build in this kind of

0:21:56 | information that easily, okay.

0:21:59 | This is to convince you how things can be bridged.

0:22:03 | It's very easy to interpret the results. So we look at the errors and we

0:22:07 | know: wow, these come from quite a big modelling assumption.

0:22:11 | Without having to go through, for example, in this... these examples are

0:22:14 | the same sounds, okay.

0:22:15 | You just speak fast and then you get something like this,

0:22:17 | and then we actually looked at the error and we said: oh,

0:22:20 | you know,

0:22:22 | that's exactly what happened. You know, the mistake was made by the

0:22:27 | Gaussian Mixture Model because it doesn't take into account these particular dynamics. Now this one

0:22:31 | corrected that error.

0:22:32 | And I'm going to show you that in the deep neural network things are reversed, so that's

0:22:37 | related to ??. But at the same time,

0:22:39 | in the machine learning community, outside of speech,

0:22:42 | there was a very interesting deep generative model developed,

0:22:46 | and that's called the Deep Belief Network.

0:22:47 | Okay, |

0:22:48 | so in the earlier literature, before about three or four years ago,

0:22:52 | DBN, Deep Belief Network, and DNN were mixed up with each other, even by the authors.

0:22:56 | It's just because most people don't understand what it is.

0:22:59 | So this is a very interesting paper from 2006;

0:23:02 | many people, most people in machine learning, regard this paper as the start of

0:23:07 | deep learning.

0:23:08 | And it's a generative model, so you could prefer to say that the deep

0:23:12 | generative model actually started deep learning, rather than the deep neural network.

0:23:17 | But this model has some intriguing properties

0:23:21 | that at the time really attracted my attention.

0:23:25 | It's totally not obvious, okay.

0:23:28 | So for those of you who know the RBM and the DBN, you know when you are

0:23:32 | stacking up this undirected model

0:23:34 | several times you get a DBN; that's...

0:23:37 | you might think that the whole thing would be undirected,

0:23:40 | you know, a bottom-up machine. No, it's actually a directed model coming down.

0:23:44 | You have to read this paper to understand why.

0:23:47 | So why is that? I thought something was wrong. I couldn't understand what happened.

0:23:50 | But on the other hand it's much simpler than the model I showed you earlier;

0:23:54 | for that deep network we had temporal dynamics.

0:23:56 | This one has no temporal dynamics in it.

0:23:59 | So

0:24:01 | the most intriguing aspect of the DBN,

0:24:03 | as described in this paper, is that inference is easy.

0:24:06 | Normally you think inference is hard; that's the tradition.

0:24:10 | It's a given fact that if you have these multiple dependencies at the top it's very hard

0:24:15 | to do inference,

0:24:16 | but there's a special constraint built into this model, namely the restriction in the connections of the

0:24:21 | RBM,

0:24:22 | and because of that it makes inference easy. It's just a special case.
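The "easy inference" property being described comes from the RBM's bipartite restriction: with no hidden-to-hidden connections, the hidden units are conditionally independent given the visible vector, so the posterior is one matrix pass. The weights and input below are toy illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# RBM inference: p(h_j = 1 | v) = sigmoid(b_j + sum_i W[i][j] * v[i]),
# computed independently for each hidden unit j. No iteration, no loops in
# the graph: this is the "special case" that makes inference easy.
def infer_hidden(v, W, b):
    return [sigmoid(bj + sum(W[i][j] * v[i] for i in range(len(v))))
            for j, bj in enumerate(b)]

v = [1.0, 0.0, 1.0]                         # visible units
W = [[0.5, -0.2], [0.3, 0.8], [-0.1, 0.4]]  # 3 visible x 2 hidden weights
b = [0.0, 0.1]                              # hidden biases
p_h = infer_hidden(v, W, b)
```

In a general directed model with multiple parents per node ("explaining away"), this posterior would not factorise, which is exactly why inference is normally hard.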

0:24:25 | This is very intriguing, so I thought this idea might help

0:24:29 | the deep generative model I showed you earlier.

0:24:32 | So he came to visit me, you know. We discussed it.

0:24:36 | It took him a while to explain what this paper is about.

0:24:40 | Most people at Microsoft at that time couldn't understand what was going on.

0:24:45 | So now let's see how...

0:24:46 | and then of course we got together on this deep generative model,

0:24:50 | and the other deep generative model I talked about with you I actually worked on |

0:24:54 | for almost ten |

0:24:54 | years at Microsoft. We were working very hard on this. |

0:24:57 | And then we came to the conclusion that, well, we have to use a few

0:25:00 | kluges to fix the problem.

0:25:01 | And the two models don't match, okay. The reason why they don't match is a whole

0:25:05 | new story.

0:25:06 | The main reason is actually not just the temporal difference; it's that the way you parameterize

0:25:12 | the model and also the way you represent

0:25:15 | the information are very different,

0:25:17 | despite the fact that they're both generative models. |

0:25:19 | It turned out that this model is very good for speech synthesis, and ?? has a

0:25:22 | very nice paper

0:25:23 | using this model to do synthesis. And it's very nice for doing

0:25:26 | image generation; it can do that very nicely.

0:25:30 | For continuous speech it is very hard to do,

0:25:33 | and for speech synthesis in general it's good, because if you have a segment that takes the

0:25:38 | whole

0:25:39 | context into account, like a syllable in Chinese, it is good; but for English it is

0:25:42 | not that easy to do.

0:25:44 | But anyway, so we needed a few kluges to fit together, to merge these

0:25:48 | two models together.

0:25:49 | And that sort of led to the DNN.

0:25:51 | So the first kluge is that,

0:25:54 | you know,

0:25:55 | the temporal dependency is very hard. If you have temporal dependency you automatically get a loop,

0:26:00 | and then

0:26:00 | everybody in machine learning at that time knew, and most speech people too, that

0:26:05 | the

0:26:05 | model that I showed you early on actually just didn't work well, it didn't

0:26:09 | work out well. And most of the people who were

0:26:12 | very much versed in machine learning would say there's no way to learn that.

0:26:15 | Then cut the dependency. That's the way to do it: cut the dependency in the hidden

0:26:20 | dimension, in the hidden representation,

0:26:21 | and lose some of the power of the

0:26:23 | deep generative model.

0:26:25 | And that was Geoff Hinton's idea: well, it doesn't matter, just use a big window.

0:26:30 | That fixes it with a kluge, and that actually

0:26:34 | is one of the things that helped

0:26:36 | to solve the problem.
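The "big window" kluge can be sketched concretely: instead of modelling temporal dependency, stack each acoustic frame with its neighbours and feed the concatenation to the network. The frames below are one-dimensional toy values; real systems use tens of coefficients per frame and a context of many frames each side.

```python
# Replace explicit temporal dependency with a fixed context window:
# each output row is frame t concatenated with its +/- k neighbours.
def stack_context(frames, k=1):
    padded = [frames[0]] * k + frames + [frames[-1]] * k  # repeat edge frames
    return [sum((padded[i + j] for j in range(2 * k + 1)), [])
            for i in range(len(frames))]

frames = [[0.0], [1.0], [2.0]]        # three 1-dimensional "acoustic" frames
stacked = stack_context(frames, k=1)  # each row now spans three frames
```

The network then sees local temporal context without any recurrence or loop in the model, which is what made the learning tractable at the time.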

0:26:38 | And the second kluge is that you can reverse the direction,

0:26:40 | because

0:26:41 | the inference in a generative model is very hard to do, as I showed earlier.

0:26:45 | Now if you reverse the direction

0:26:48 | from top-down to bottom-up,

0:26:52 | then you don't have to solve that problem. And then it would be

0:26:56 | just a deep neural network, okay. Of course

0:26:58 | everybody said: we don't know how to train them. That was in 2009.

0:27:02 | Most people don't know how to ?? |

0:27:03 | and then he said that's how DBN can help. |

0:27:07 | And then he did a fair amount of work on DBN to initialize that ?? |

0:27:12 | approach. |

0:27:12 | So this was a very well-timed academic-industrial collaboration. First of all, |

0:27:16 | the speech recognition industry had been searching for new solutions when principled |

0:27:22 | deep generative models could not deliver, okay. Everybody |

0:27:24 | was very upset about this at the time. |

0:27:27 | And at the same time academia developed the deep learning tools: |

0:27:30 | DBN, DNN, all the hybrid stuff that is going on. |

0:27:33 | And also the CUDA library was released around that time; these are very recent times. |

0:27:40 | So this was probably one of the earliest fields to catch on |

0:27:44 | to this GPU computing power. |

0:27:47 | And then of course big training data in ASR had been around, |

0:27:52 | and most people know that if you actually train |

0:27:55 | a Gaussian mixture model HMM with a lot of data, performance saturates, right. |

0:28:00 | And then this is one of the things that in the end is really powerful: you |

0:28:04 | can increase the size and depth |

0:28:06 | and, |

0:28:07 | you know, put a lot of things in |

0:28:08 | to make it really powerful. |

0:28:11 | And that's the scalability advantage that I showed you early on. That's not the case |

0:28:15 | for any shallow model. |

0:28:18 | Okay, so in 2009 three of my colleagues and I didn't quite know what was |

0:28:23 | happening, so we actually got together |

0:28:26 | to organize |

0:28:27 | this workshop, |

0:28:28 | to show that |

0:28:29 | this is a useful thing, you know, and to bring the work out. |

0:28:32 | It wasn't popular at all. I remember, |

0:28:35 | you know, Geoff Hinton and I actually got together to decide |

0:28:40 | whom we should invite to give a |

0:28:42 | talk at this workshop. |

0:28:44 | So I remember that one invitee, who shall remain nameless here, |

0:28:47 | said: Give me one week to think about it. And in the end he said: |

0:28:50 | it's not worth my time to fly to Vancouver. That's one of them. |

0:28:53 | The second invitee, I remember this clearly, said: This is a crazy idea. In the |

0:28:57 | e-mail he said: |

0:28:58 | What you do is not clear enough for us. |

0:29:01 | So we said, you know, |

0:29:02 | the waveform may be useful for ASR. |

0:29:04 | And then the e-mail said: Oh, why? |

0:29:07 | So we said that's just like using pixels for image recognition, which was popular; |

0:29:12 | for example, convolutional networks work on pixels. |

0:29:15 | We take a similar approach, except it is the waveform. |

0:29:17 | And the answer was: No, no, no, that's not the same as pixels. It is more |

0:29:22 | like using photons. |

0:29:23 | You know, essentially making a kind of joke. This one didn't show up either. But anyway, |

0:29:28 | so |

0:29:30 | anyway, this workshop actually had |

0:29:34 | a lot of brainstorming, and we analyzed all the errors I showed you early |

0:29:38 | on. |

0:29:39 | But it was a really good |

0:29:41 | workshop; that was about |

0:29:44 | five years ago now. |

0:29:45 | So now I move to part two, |

0:29:48 | to discuss achievements. Actually, in my original plan I had a whole bunch of slides |

0:29:53 | on vision. |

0:29:54 | The message on vision is that if you go to the vision community, |

0:29:59 | they consider deep learning to be |

0:30:01 | just |

0:30:02 | maybe thirty |

0:30:04 | times more popular than deep learning is in speech. |

0:30:07 | So the first time they tried it was actually the first time they |

0:30:12 | got the results, |

0:30:16 | and no one believed it was the case. At the time I was giving a lecture |

0:30:20 | at Microsoft about deep learning, |

0:30:22 | and right before me, actually, Bishop |

0:30:25 | was giving the lecture together with me, |

0:30:30 | and then this deep learning result just came out, and Geoff Hinton sent an e-mail to me: |

0:30:34 | Look at the margin! How much bigger it is. |

0:30:36 | And I showed them. People were like: I don't believe it. Maybe it's a special case, |

0:30:40 | you know. And it turned out it's just |

0:30:42 | just as good. |

0:30:43 | Even better than in speech. I actually cut all those slides out; maybe some time I |

0:30:46 | will show you. |

0:30:47 | So that is a big area to go into. But today I am going to focus on |

0:30:50 | speech. |

0:30:51 | So one of the things we found during that time, |

0:30:55 | a very interesting discovery, came from using both the model that I |

0:30:59 | showed you there |

0:31:00 | and also the deep neural network here. |

0:31:03 | And with those numbers we analyzed the |

0:31:06 | error patterns very carefully. TIMIT is very good for that, you know: |

0:31:10 | you can disable the language model, right, |

0:31:12 | and then you can understand the acoustic errors very effectively. |

0:31:15 | I tried to do that afterwards, you know, on other tasks, |

0:31:20 | and it's very hard: once you put a language model in there you just couldn't |

0:31:23 | do any analysis. So it was very good that we did this analysis at the time. |

0:31:26 | Now, the error patterns in the comparison |

0:31:30 | I don't have time to go through, except just to mention this: |

0:31:33 | the DNN made many new errors on short, undershot vowels. |

0:31:37 | So it sort of undoes what this model is designed to do, |

0:31:40 | and then we thought about why that would happen, and of course in the end: |

0:31:43 | we had a very big window, so if the sounds |

0:31:45 | are very short, the information is captured over here, and your input is about eleven frames, |

0:31:48 | you know, or fifteen frames, so it |

0:31:50 | captures noise coming from the neighboring phones, and of course errors are made over here. |

0:31:54 | So we can understand why. |

0:31:56 | And then we asked why this model corrects those errors. It's just because |

0:31:59 | you make, |

0:32:00 | you deliberately make, a hidden representation |

0:32:04 | that reflects |

0:32:05 | what the sound pattern looks like |

0:32:07 | in the hidden space. It's nice when you can see it, |

0:32:10 | but if you have articulations, how do you see them? So sometimes we use formants |

0:32:14 | to illustrate what's going on there. |

0:32:18 | Another important discovery at Microsoft was that using the spectrogram |

0:32:23 | we produce much better |

0:32:26 | autoencoding results in terms of speech analysis, |

0:32:30 | coding results, |

0:32:32 | and that was very surprising at the time. |

0:32:36 | It really conforms to the basic deep learning theme that, |

0:32:39 | you know, the rawer features are better |

0:32:42 | than the processed features. So let me show you; this is actually a project |

0:32:48 | we did together in 2009. |

0:32:49 | So we used a deep autoencoder |

0:32:51 | to do binary coding |

0:32:53 | of the spectrogram. |

0:32:55 | I don't have time to go through it here; you can read about autoencoders if |

0:33:01 | you like. |

0:33:02 | In the literature you can all see this. |

0:33:03 | The key is that |

0:33:04 | you set the target to be the same as the input, and then you use a small |

0:33:07 | number of bits in the middle. |

0:33:09 | And you want to see whether that would actually |

0:33:11 | carry all the information down here. And the way to evaluate it is to look |

0:33:15 | at, |

0:33:15 | you know, what kind of errors you have. |

0:33:17 | So what we did is we used a vector quantizer as a baseline, |

0:33:21 | with 312 bits. |

0:33:23 | And then the reconstruction |

0:33:24 | looks like this. So this is the original one, and this is the shallow model, right. |

0:33:29 | Now, using the deep autoencoder we get much closer to the original; in terms of errors, |

0:33:34 | we simply have much lower coding error |

0:33:38 | using an identical number of bits. |

0:33:39 | So it really shows that if you build a deep structure and extract these bottom-up features, |

0:33:45 | you condense more |

0:33:47 | information in terms of reconstructing the original signal. |

0:33:50 | And then we actually found that |

0:33:53 | for the spectrogram this result is the best. |

0:33:55 | For MFCC we still get some gain, but the gain is not nearly as much, which |

0:34:00 | sort of indirectly |

0:34:01 | convinced me of Geoff Hinton's |

0:34:03 | original advocacy of moving everybody |

0:34:06 | to the spectrogram. |

0:34:07 | So maybe we should have done the waveform; probably not, anyway. |
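The bottleneck idea just described, target equals input with a narrow code layer in the middle, can be sketched minimally. This is an illustrative toy under my own assumptions (dimensions, seed, plain gradient descent), not the Microsoft binary-coding system:

```python
import numpy as np

# Toy autoencoder: the training target is the input itself, squeezed
# through a narrow 8-unit code layer (which the binary-coding setup
# would threshold to bits).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.standard_normal((200, 20))        # stand-in "spectrogram" frames
W1 = 0.1 * rng.standard_normal((20, 8))   # encoder: 20 dims -> 8-dim code
W2 = 0.1 * rng.standard_normal((8, 20))   # decoder: code -> reconstruction

def forward(X):
    H = sigmoid(X @ W1)                   # code layer
    return H, H @ W2                      # reconstruction

_, R = forward(X)
err_before = np.mean((R - X) ** 2)

lr = 0.05
for _ in range(500):                      # plain gradient descent on MSE
    H, R = forward(X)
    G = 2.0 * (R - X) / len(X)            # dMSE/dReconstruction
    GH = (G @ W2.T) * H * (1.0 - H)       # backprop through the sigmoid code
    W2 -= lr * H.T @ G
    W1 -= lr * X.T @ GH

_, R = forward(X)
err_after = np.mean((R - X) ** 2)         # reconstruction error has dropped
```

Stacking more encoder layers before the code is what makes the coder "deep"; the comparison in the talk is against a shallow vector quantizer at the same bit budget.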

0:34:10 | Okay, so of course the next step came once we were all convinced that |

0:34:14 | the error analysis shows that |

0:34:17 | deep learning can correct a lot of errors, not all, but some |

0:34:21 | for which we understand why: it just picks up the power and the capacity it has. |

0:34:27 | So on average it does a little bit better, |

0:34:29 | based upon |

0:34:30 | this analysis. |

0:34:33 | Based upon this analysis it does slightly better. |

0:34:36 | But if you look beyond the average, |

0:34:38 | if you look at the error patterns, you really can see |

0:34:41 | that this has a lot of power, but it also has some shortcomings as well. |

0:34:45 | So both have pros and cons, but their errors are very different, and that |

0:34:49 | actually gives you the hint that, |

0:34:51 | you know, it is worthwhile to pursue. |

0:34:53 | Of course this was all very interesting |

0:34:56 | evidence to show. |

0:34:57 | And then, to scale up to industrial scale, we had to do a |

0:35:00 | lot of things, and many of my colleagues were working with me |

0:35:04 | on this. So first of all, |

0:35:06 | we needed to extend the output |

0:35:08 | from a small number of phones |

0:35:11 | or states |

0:35:12 | into a very large set, |

0:35:13 | and at that time that was actually motivated by |

0:35:16 | how to save Microsoft's huge investment in its speech decoder software. |

0:35:20 | I mean, if you don't do this, |

0:35:22 | if you do some other kind of output coding, |

0:35:27 | they would also have had to change the decoder to do it, and not everyone fully believed |

0:35:31 | that it was going to work. |

0:35:32 | And it turned out that if you need to change the decoder, you know, people just |

0:35:36 | say: wait a little bit. |

0:35:38 | So, |

0:35:41 | at the same time, we found that a context-dependent model gives much higher |

0:35:46 | accuracy |

0:35:46 | than a context-independent model for large tasks, okay. |

0:35:49 | For small tasks the difference is not as big. I think |

0:35:53 | it's all related to |

0:35:54 | a capacity saturation problem if you have too many outputs, |

0:35:57 | but since there is a lot of data |

0:35:59 | in |

0:36:01 | the training for large tasks, |

0:36:03 | you are actually keen |

0:36:04 | to form a very large output layer, and that turned out |

0:36:07 | to have, you know, |

0:36:09 | a double benefit. |

0:36:10 | One is that you increase accuracy, and number two is that you don't have to |

0:36:13 | change anything about the decoder. |

0:36:14 | And industry loves that. |

0:36:17 | You have both, |

0:36:18 | and that's actually, as I recall, why it took off. |

0:36:22 | So let me summarize what enabled this type of model. |

0:36:24 | Industrial knowledge about how to construct the very large output units in the DNN |

0:36:29 | is very important, |

0:36:30 | and that essentially comes from |

0:36:32 | everybody's work here, |

0:36:34 | which used this kind of context-dependent model for the Gaussian mixture model, you know, |

0:36:39 | that has been around for |

0:36:40 | almost twenty-some years. |

0:36:42 | And also |

0:36:43 | it depends upon industrial knowledge of how to make decoding with such huge output sets highly |

0:36:48 | efficient, using |

0:36:50 | our conventional |

0:36:51 | HMM decoding technology, |

0:36:53 | and of course how to make things practical. |

0:36:57 | And this is also a very important enabling factor: if the GPU hadn't come up |

0:37:03 | roughly at that time, hadn't become popular at that time, |

0:37:06 | all these experiments would have taken months to do |

0:37:08 | without all of this, without all this fancy infrastructure, |

0:37:14 | and then |

0:37:15 | people might not have had the patience to wait to see the results and push that |

0:37:18 | forward. |

0:37:19 | So let me show you a very |

0:37:22 | brief summary of the major |

0:37:26 | results obtained in the early days. |

0:37:29 | If we use three hours of training, this is TIMIT for example, we |

0:37:34 | got |

0:37:34 | the number I showed you; it's not much of a gain. |

0:37:38 | Now if you increase the data |

0:37:41 | ten times, to thirty-some hours, you get a twenty percent gain. |

0:37:46 | Now if you do more: |

0:37:48 | for Switchboard, this is the paper that my colleagues published here, |

0:37:52 | you get more data, another ten times, so you get two orders of magnitude of |

0:37:57 | increase, |

0:37:58 | and the relative gain actually |

0:38:00 | sort of |

0:38:01 | increases, you know: ten percent, twenty percent, thirty percent. |

0:38:06 | So of course if you increase |

0:38:08 | the size of the training data, |

0:38:10 | the baseline will improve as well, but the relative gain is even bigger. |

0:38:14 | And if people look at this result, there's |

0:38:16 | nobody |

0:38:17 | in their right mind who would say not to use this. |

0:38:20 | And that's how, |

0:38:21 | then, of course, a lot of companies, |

0:38:24 | you know, |

0:38:26 | actually went on to |

0:38:28 | implement it. The DNN is fairly easy to implement for everybody because, |

0:38:33 | and I missed one of the points over there, it actually turned out that if you use a |

0:38:37 | large amount of data, |

0:38:38 | the original |

0:38:41 | idea of using the DBN to regularize the model doesn't |

0:38:44 | help anymore. And in the beginning we didn't understand how that happened. |

0:38:49 | But anyway, now let me come back to the main theme of the talk: |

0:38:53 | how the generative model |

0:38:54 | and the deep neural network may help each other. |

0:38:57 | So kluge one was to use the big window; |

0:39:02 | at that time |

0:39:03 | we had to keep it. Now, at this conference, we see |

0:39:07 | people using the LSTM recurrent neural network, and that fixed this problem. |

0:39:12 | So this problem is fixed. |

0:39:14 | This problem is fixed automatically. |

0:39:17 | At that time |

0:39:19 | we thought we needed to use the DBN. Now, with the use of big data, there's no |

0:39:23 | need anymore. |

0:39:24 | And that's very well understood now. Actually, there are many ways to understand it. You |

0:39:28 | can think about it from a |

0:39:29 | regularization viewpoint, |

0:39:31 | and yesterday at the table with students I mentioned that, and people said: What is |

0:39:36 | regularization? |

0:39:37 | Then you have to understand it more in terms of the optimization viewpoint; |

0:39:41 | actually, if you stare at the back-propagation formula for ten minutes, you figure out why. |

0:39:47 | I actually have a slide on that; it's very easy to understand why from many perspectives. |

0:39:52 | With lots of data you really don't need that. |

0:39:54 | And that's automatically fixed: |

0:39:57 | you know, kind of by industrialization, we tried lots of data and |

0:40:00 | it's fixed. Now this kluge is not fixed yet. So this is actually the main |

0:40:03 | topic |

0:40:04 | that I'm going to spend the next twenty minutes on. |

0:40:07 | So before I do that, I will actually try to summarize some of |

0:40:11 | the major advances. My colleagues and I wrote this book, |

0:40:14 | and in this chapter we actually grouped |

0:40:16 | the major advancements of the deep neural network into several categories, |

0:40:22 | so I'm going to go through them quickly. |

0:40:24 | So one is optimization |

0:40:26 | innovation. |

0:40:27 | I think the most important advancement |

0:40:31 | over the early success I showed you |

0:40:36 | was the development of sequence discriminative training, and |

0:40:39 | this contributed an additional ten percent error rate reduction. |

0:40:42 | Many groups of people have done this. |

0:40:45 | For us at Microsoft, you know, this was our first intern coming to our |

0:40:49 | place to do it. |

0:40:50 | We tried it on TIMIT; we didn't know all the subtleties of the importance of |

0:40:56 | regularization, and |

0:40:56 | we got all the formulas right, everything right, |

0:40:59 | and the result wasn't very good. |

0:41:01 | But I think, |

0:41:02 | with Interspeech accepting our paper, we came to understand this, |

0:41:06 | and then later on |

0:41:09 | we got more and more papers; actually a lot of papers on this were published at Interspeech. |

0:41:13 | That's very good. |

0:41:15 | Okay, now the next theme is 'Towards Raw Input', okay. |

0:41:21 | What I showed you early on was the speech coding and analysis part, |

0:41:26 | where we know it is good. We don't need MFCC anymore. |

0:41:29 | So it was bye-bye MFCC; |

0:41:31 | probably it will disappear |

0:41:33 | from our community, slowly, over the next few years. |

0:41:36 | And we also want to say bye to Fourier transforms, but I put a question |

0:41:42 | mark here, partly because, |

0:41:43 | actually, at this Interspeech, I think two days ago, Hermann had a very |

0:41:48 | nice paper on |

0:41:49 | this, and I encourage everybody to take a look at it. |

0:41:52 | You just put the raw waveform in there, |

0:41:55 | which was actually done about three years ago by Geoff Hinton's students; they truly believed in |

0:42:00 | it. I couldn't. |

0:42:01 | I had tried that around 2004; that was the hidden-Markov-model |

0:42:04 | era, |

0:42:05 | and we understood all kinds of problems about how to normalize the input, so I said |

0:42:09 | it's crazy. |

0:42:10 | And then when they published the result |

0:42:13 | at |

0:42:14 | ICASSP, I looked at the results and the error was terrible. I mean, there was so much |

0:42:17 | error |

0:42:17 | that nobody paid attention. And this year attention was brought back to this, |

0:42:21 | and the result is almost as good as using, you know, |

0:42:25 | using Fourier transforms. |

0:42:27 | So far we don't want to throw them away yet, |

0:42:29 | but maybe next year people may throw them away. |

0:42:33 | The nice thing is, I was very curious about this: |

0:42:37 | it turns out that to get that result they just initialized everything randomly, rather than |

0:42:41 | using Fourier transforms |

0:42:42 | to initialize it, and that's very intriguing. |

0:42:46 | There are too many references to list; I was updating the list all the time. |

0:42:50 | But yesterday, when I went through the adaptation session, there were so many good papers around |

0:42:55 | that I just don't have the patience to list them anymore. |

0:42:57 | So go back to the adaptation papers; there are a lot of new |

0:43:02 | advancements. Another important thing is transfer learning, |

0:43:05 | which plays a very important role in multilingual acoustic modelling. |

0:43:10 | That was a tutorial that, actually, Tanja was giving at a workshop |

0:43:17 | I was attending. |

0:43:18 | I also mention that |

0:43:20 | for the generative model, |

0:43:22 | for the shallow models before this, |

0:43:24 | multilingual |

0:43:26 | modelling, |

0:43:28 | of course, |

0:43:28 | actually |

0:43:30 | improved things, |

0:43:32 | but it never actually beat the baseline |

0:43:36 | in terms of, |

0:43:39 | well, think about cross-lingual, for example: multilingual and cross-lingual. |

0:43:42 | And deep learning actually beat the baseline. So there's a whole bunch of |

0:43:44 | papers in this area which I won't have time to go through here. |

0:43:47 | Another important innovation is nonlinear regularization, namely |

0:43:50 | dropout; if you don't know dropout, it's good to know about. |

0:43:54 | It is a special technique: essentially you just kill hidden units |

0:43:57 | randomly, and you get a better result. |
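As a hedged sketch (the function name, shapes, and keep probability here are mine, not from the talk), the "randomly kill hidden units" idea in its common inverted-dropout form looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(h, p=0.5, training=True):
    """Zero each unit with probability p during training; rescale the
    survivors so the expected activation matches test time."""
    if not training:
        return h                          # test time: keep every unit
    mask = rng.random(h.shape) >= p       # True = keep, with prob 1 - p
    return h * mask / (1.0 - p)

h = np.ones(1000)
out = dropout(h, p=0.5)
# Roughly half the units become 0 and the kept ones become 2.0,
# so the mean activation stays near 1.0 on average.
```

Because each training step sees a different random thinning of the network, no unit can rely on a specific co-adapted partner, which is the regularization effect being described.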

0:44:03 | And in terms of hidden units, |

0:44:05 | a |

0:44:06 | very popular unit now is the rectified linear unit, |

0:44:09 | and there are some very interesting, |

0:44:11 | many interesting, theoretical analyses of why this is better than the logistic unit. |

0:44:16 | At least in my experience, and I actually programmed this, it changed our |

0:44:20 | lives |

0:44:21 | to go from one to the other. |

0:44:23 | The learning speed |

0:44:24 | really increases, |

0:44:26 | and we understand now why that happens. |

0:44:29 | Also, in terms of accuracy, different groups report different results. |

0:44:32 | Some groups report reduced error rates; nobody has reported an increase in error |

0:44:37 | rates so far. |

0:44:38 | So in any case it speeds up |

0:44:40 | the convergence dramatically. |
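One common explanation for the faster convergence (my illustration, not a slide from the talk) is that the logistic unit's gradient vanishes for large inputs, while the ReLU's gradient stays at one on its active side:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 at z=0, vanishes for large |z|

def relu_grad(z):
    return float(z > 0)           # exactly 1 wherever the unit is active

z = 5.0
print(sigmoid_grad(z))            # ~0.0066: the backpropagated error is crushed
print(relu_grad(z))               # 1.0: the error signal passes through intact
```

Stacking many layers multiplies these factors, so saturating units can shrink the gradient exponentially with depth, while ReLU layers do not.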

0:44:43 | So I'm going to show you another architecture over here, which is going to link |

0:44:48 | to |

0:44:49 | a generative model. |

0:44:51 | This is a model called the Deep Stacking Network. |

0:44:55 | By its very design it is a deep neural network, okay: information flows bottom-up. |

0:45:00 | The difference between this model and a conventional deep neural network is that |

0:45:04 | at every single layer you can actually |

0:45:07 | bring in the original input again and then do some special processing here. |

0:45:15 | In particular, you can alternate |

0:45:17 | linear and nonlinear layers; if you do that, you can dramatically increase your |

0:45:23 | speed of convergence |

0:45:26 | in deep learning. |

0:45:27 | And there is some theoretical analysis of this, which is actually in one of the books |

0:45:31 | I wrote. |

0:45:32 | You can actually convert many complex, |

0:45:35 | back-propagation-based, |

0:45:37 | non-convex problems into |

0:45:38 | somewhat |

0:45:41 | better-behaved problems related to |

0:45:44 | convex optimization, so we can understand their properties. |

0:45:46 | We did that a few years ago and wrote a paper on it. |
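A minimal sketch of the stacking idea just described (the sizes, the random-feature hidden layer, and the toy target are my assumptions, not the published architecture): each module sees the raw input concatenated with the previous module's prediction, and only its linear output layer is fit, in closed form by least squares, which is the convex subproblem being referred to.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
y = X[:, :1] + 0.5 * X[:, 1:2]            # toy regression target

def module(inp, y, hidden=32):
    """One stacking module: a fixed nonlinear hidden layer, then a linear
    output layer solved exactly (a convex least-squares subproblem)."""
    W = rng.standard_normal((inp.shape[1], hidden))
    H = np.tanh(inp @ W)
    U, *_ = np.linalg.lstsq(H, y, rcond=None)
    return H @ U

pred = module(X, y)
for _ in range(3):                        # stack: raw input + previous output
    pred = module(np.hstack([X, pred]), y)

mse = np.mean((pred - y) ** 2)            # far below the target's variance
```

In the published version the hidden weights are also refined, but the point the speaker makes survives in the sketch: each added module re-solves an easy convex piece instead of one huge non-convex problem.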

0:45:49 | And this idea can also be used for the |

0:45:53 | tensor version of this network, which I don't have time to go through here. The reason |

0:45:56 | I bring this up is |

0:45:57 | because it's actually related to some recent work |

0:46:00 | that I have seen |

0:46:01 | on generative models, where the two are taking on properties of each other, so let me compare the |

0:46:07 | two of |

0:46:08 | them to give you an example of how |

0:46:10 | the two |

0:46:11 | networks can help each other. |

0:46:13 | So when we developed the deep stacking network, the activation function had to be fixed: |

0:46:20 | either logistic or ReLU, which both |

0:46:22 | work reasonably well, |

0:46:23 | you know, compared |

0:46:25 | with each other. |

0:46:28 | Now look at this architecture. |

0:46:31 | Almost identical architecture. |

0:46:33 | So now, |

0:46:35 | what if you change the |

0:46:38 | activation function to be something very strange? I don't expect you to know anything about |

0:46:42 | this; |

0:46:43 | this is actually work done by Mitsubishi people. |

0:46:46 | There's a very nice paper on it here in the technical program. |

0:46:50 | I spent a lot of time talking to them, and they even came to |

0:46:52 | Microsoft, so I actually listened to some of their talks and their demo. |

0:46:56 | So this model, with its activation function, is called the Deep Unfolding Model, |

0:47:00 | and it is derived from the inference method of a generative model, |

0:47:06 | so it is not fixed like the ones I showed you earlier. On top, |

0:47:11 | this model looks like a deep neural network, right? |

0:47:14 | But the beginning, |

0:47:16 | the initial phase, is their generative model, which is specific. |

0:47:20 | I hope many of you know non-negative matrix factorization. This is a specific technique |

0:47:26 | which is actually a shallow generative model. |

0:47:29 | It makes a very simple assumption: that |

0:47:32 | the |

0:47:33 | observed noisy speech, or mixed speakers' speech, is the sum of two sources |

0:47:40 | in the spectral domain. |

0:47:41 | That is the assumption they make, |

0:47:43 | and then of course they have to enforce that each, |

0:47:46 | you know, |

0:47:47 | each vector is nonnegative, because these are magnitude spectra. |

0:47:52 | The inference they do is an iterative technique. |

0:47:58 | And that |

0:47:59 | model automatically embeds the domain knowledge about how the observation |

0:48:04 | is obtained, you know, through the mix between the two sources. |

0:48:08 | And then this work essentially says how to apply that inference iteration: every single |

0:48:13 | iteration is treated as a different |

0:48:16 | layer. |

0:48:18 | After this they do back-propagation training. |
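Illustratively (this is the textbook multiplicative update for NMF, under my own made-up dictionary, sizes, and seed, not the exact Mitsubishi recipe), each inference iteration below is what the unfolding view turns into one network layer, to be fine-tuned by back-propagation afterwards:

```python
import numpy as np

rng = np.random.default_rng(0)
F, K, T = 16, 4, 8                        # freq bins, basis vectors, frames
W = rng.random((F, K)) + 0.1              # fixed nonnegative dictionary
V = W @ rng.random((K, T))                # "observed" magnitude spectrogram

H = np.ones((K, T))                       # nonnegative activations to infer
eps = 1e-9
errs = []
for _ in range(50):                       # each pass = one unfolded "layer"
    H *= (W.T @ V) / (W.T @ W @ H + eps)  # multiplicative update (Euclidean cost)
    errs.append(np.linalg.norm(V - W @ H))
# the reconstruction error is non-increasing and H stays nonnegative,
# because the update only rescales H by nonnegative ratios
```

Unfolding means fixing a small number of these iterations, treating the quantities in each one as layer parameters, and then back-propagating an enhancement loss through them, so the network inherits the generative model's nonnegativity and mixing structure.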

0:48:21 | And the backward pass is possible |

0:48:24 | because |

0:48:25 | the problem is very simple: the application here is speech enhancement, |

0:48:29 | therefore the objective function is mean-square error, very easy. So the generative model, |

0:48:34 | the generative model, actually gives you |

0:48:39 | the |

0:48:40 | generated observation, |

0:48:42 | and then |

0:48:43 | your target output is clean speech. |

0:48:45 | Okay, then you apply mean-square error and you actually adapt everything this way, |

0:48:48 | and the results are very impressive. So this shows, as |

0:48:52 | I said, that you can design a deep neural network |

0:48:55 | where, if you use this |

0:48:57 | type of |

0:48:58 | activation function, you automatically build in the constraints that you use in the generative model, |

0:49:03 | and that's a |

0:49:04 | very good example of |

0:49:06 | the message that, |

0:49:09 | actually, I put at the beginning of the presentation: the |

0:49:11 | hope of the deep generative model. Now, this is a |

0:49:14 | shallow model, and it's easy to do. For a deep generative model |

0:49:18 | it's very hard to do. |

0:49:19 | And one of the reasons I made this a topic today is partly because |

0:49:25 | at a recent conference, |

0:49:27 | just three months ago, |

0:49:30 | the ICML conference in Beijing, |

0:49:33 | there was a very nice development |

0:49:35 | in learning methods for deep generative models. |

0:49:40 | They actually linked the |

0:49:42 | neural network and the Bayes net together |

0:49:44 | through some transformation, |

0:49:46 | and because of that the main idea appears in a whole bunch of papers, including |

0:49:51 | from Michael Jordan |

0:49:52 | and, you know, a lot of very well-known people |

0:49:54 | in machine learning working on deep generative models. |

0:49:56 | So the main |

0:49:58 | point of this set of work, and I just want to use one simple sentence to |

0:50:03 | summarize it, |

0:50:03 | is this: |

0:50:04 | when you originally tried to do the E step I showed you early on, |

0:50:09 | you had to factorize the posterior in order to get each step done, |

0:50:12 | and that was an approximation, |

0:50:13 | and the approximation error could be so large that it's practically useless |

0:50:18 | for inferring the top-layer |

0:50:24 | discrete events. |

0:50:25 | The whole point is that now we can relax that factorization constraint. |

0:50:30 | Before, three years ago, if you kept the rigorous |

0:50:35 | dependency, |

0:50:36 | you didn't get any reasonable analytical solution, so you could not do EM. |

0:50:42 | Now the |

0:50:43 | idea is to say that you can approximate |

0:50:48 | that factorization, |

0:50:49 | you can approximate that dependency, in E-step learning |

0:50:52 | not through |

0:50:55 | factorization, which is called the mean-field approximation, |

0:50:57 | but by using a deep neural network to approximate it. |
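In symbols (my notation, summarizing the general idea rather than any single one of the papers mentioned): instead of the mean-field factorization of the E step, a recognition network with parameters $\phi$ produces the approximate posterior directly from the data, and both networks are trained on the evidence lower bound.

```latex
\text{Mean-field E step:}\quad
  q(h) \;\approx\; \prod_i q_i(h_i)
\\[4pt]
\text{Neural-network E step:}\quad
  q_\phi(h \mid x) \;=\; \mathcal{N}\!\big(h;\ \mu_\phi(x),\ \operatorname{diag}\sigma_\phi^2(x)\big)
\\[4pt]
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(h \mid x)}\!\left[\log p_\theta(x \mid h)\right]
  \;-\; \mathrm{KL}\!\left(q_\phi(h \mid x)\,\middle\|\,p(h)\right)
```

The point the talk is building toward is the last line: the bound no longer requires $q$ to factorize, because the recognition network carries the dependency.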

0:51:01 | So this is an example showing that the deep neural network can actually help you solve |

0:51:05 | the deep generative model problem, and |

0:51:07 | this is the well-known Max Welling, a very good friend of mine in machine |

0:51:12 | learning, |

0:51:14 | and he told me that no paper had shown that before. |

0:51:17 | They really developed |

0:51:20 | the theorem to prove that if the network is large enough, |

0:51:24 | the approximation error can approach |

0:51:26 | zero. Therefore the variational approximation error |

0:51:31 | can be eliminated, and that's a very nice engine they |

0:51:33 | developed, which really gives me some evidence to show, |

0:51:36 | to see, that this is |

0:51:38 | a promising approach. I think the machine learning community developed the tools, |

0:51:42 | our speech community developed the verification |

0:51:45 | and also the methodology as well, |

0:51:47 | but if, |

0:51:48 | you know, we actually cross-connect |

0:51:50 | with each other, we are going to make much more progress, and this type |

0:51:55 | of development |

0:51:55 | really |

0:51:56 | gives a |

0:51:58 | promising direction |

0:52:00 | towards the main message I put out at the beginning. |

0:52:03 | Okay, so now I am going to show you some further results that I want to |

0:52:07 | show you. |

0:52:09 | Another, better architecture that we know is what's called the recurrent network; if you |

0:52:14 | read |

0:52:14 | Beaufays' LSTM paper, look at that result: for |

0:52:18 | voice search the error rate jumped down to about ten percent. That's a very impressive result. |

0:52:22 | Another type of architecture integrates convolutional |

0:52:27 | and non-convolutional layers together. That was used |

0:52:30 | in the previous result, and I am not aware of any better result than this one. |

0:52:33 | So these are the state of the art for the Switchboard (SWBD) task. |

0:52:37 | So now I'm going to concentrate on this type of |

0:52:40 | recurrent network here. |

0:52:43 | Okay, so this comes down to one of my main messages here. |

0:52:47 | We fixed this kluge |

0:52:51 | with |

0:52:51 | the recurrent network. |

0:52:54 | We also fixed this kluge automatically |

0:52:58 | by |

0:53:00 | just using big data. |

0:53:01 | Now how do we fix this third kluge? |

0:53:05 | First of all, I'll show you some analysis of the recurrent network versus the deep generative |

0:53:11 | model, |

0:53:11 | that is, the hidden dynamic model I showed you early on, okay. |

0:53:14 | So far this analysis hasn't been applied to the LSTM, |

0:53:17 | so some further analysis may |

0:53:20 | actually automatically give rise to the LSTM from this kind of analysis. |

0:53:24 | So this analysis is very preliminary |

0:53:27 | and so if you stare at the equotation |

0:53:29 | for recurrent network it looks like best one. So essentially you have state of the |

0:53:33 | art equotation |

0:53:34 | and it's recursive. |

0:53:35 | Okay, |

0:53:36 | from previous hidden layer to this. |

0:53:40 | And then you get the output |

0:53:43 | that produces the label. |

0:53:45 | Now if you look at this deep generative model - hidden dynamic model |

0:53:48 | identical equotation, |

0:53:50 | okay? Now what's the differece? |

0:53:52 | The difference is that the input now is the label. Actually if you put the |

0:53:56 | label |

0:53:57 | you cannot drive it. So you have to make some connection between labels and continuous |

0:54:01 | variable |

0:54:02 | and that's what in phonetic |

0:54:03 | people call phonology to phonetic interface, okay. |

0:54:06 | So we use a very basic assumption

0:54:08 | that the interface is simply that each label corresponds to a target vector;

0:54:14 | actually, the way we implemented it earlier was as a distribution, which you can use to account for

0:54:18 | speaker

0:54:18 | differences, etcetera. Now the output

0:54:21 | of this recursion gives you the observation,

0:54:24 | and that's a recurrent-filter type of model.
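The correspondence described here can be sketched in a few lines. This is only a toy illustration, not the actual models: all dimensions, weights, and inputs below are made up, and the point is just that the two models share one recursive state equation and differ in the direction the signal flows, acoustics in and labels out (recurrent network) versus a label's target vector in and acoustics out (hidden dynamic model).

```python
import numpy as np

rng = np.random.default_rng(0)
H, X = 4, 3                          # hidden and observation dimensions (arbitrary)
W = 0.1 * rng.normal(size=(H, H))    # matrix governing the internal dynamics
U = 0.1 * rng.normal(size=(H, X))    # input projection
V = 0.1 * rng.normal(size=(X, H))    # output projection

def step(h_prev, inp):
    """The shared recursion: h_t = tanh(W h_{t-1} + U inp)."""
    return np.tanh(W @ h_prev + U @ inp)

# Recurrent-network direction: acoustic frames drive the state,
# and the output would feed a label softmax.
h = np.zeros(H)
for x_t in rng.normal(size=(5, X)):  # 5 frames of fake acoustics
    h = step(h, x_t)
label_logits = V @ h

# Hidden-dynamic-model direction: a per-label target vector (the
# "phonology-to-phonetics interface") drives the same recursion,
# and the output of the recursion is the predicted observation.
target = rng.normal(size=X)          # target vector for one label
h = np.zeros(H)
for _ in range(5):
    h = step(h, target)
predicted_obs = V @ h

print(label_logits.shape, predicted_obs.shape)
```

Reversing which side is input and which is output is exactly the "reverse the direction" conversion the talk mentions next.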

0:54:28 | So that's the engineering model and this is the neural network model, okay. So every time I was

0:54:32 | teaching

0:54:32 | ?? I called ?? on this.

0:54:34 | So we fully understood all the constraints for this type of model.

0:54:39 | Now for this model it looks the same, right?

0:54:41 | So if you reverse the direction you convert one model into the other.

0:54:44 | And for this model it's very easy to put in a constraint. For example,

0:54:49 | the

0:54:50 | dynamics

0:54:53 | matrix here that governs

0:54:56 | the internal dynamics in the hidden domain can actually be made sparse, and then you

0:55:00 | can put a

0:55:02 | realistic constraint there. For example, in our

0:55:04 | earlier implementation of this we imposed critical dynamics,

0:55:08 | so you can guarantee it doesn't oscillate. When we deal with articulation we need phone boundaries.

0:55:12 | This is the speech production mechanism;

0:55:15 | you can put them in simply by fixing the sparse matrix.

0:55:17 | Actually one of the slides I'm gonna show you is all about this. |

0:55:22 | In this one we cannot do it; everything has to be a generic structure.

0:55:25 | There's just no way you can say why you want the dynamics

0:55:29 | to behave in a certain way.

0:55:32 | You just don't have any mechanism to design the structure of this, whereas over there it's

0:55:36 | very natural: it's the physical

0:55:37 | properties that design it. Now because of

0:55:40 | this correspondence, and because of the fact that now we can do

0:55:44 | deep inference,

0:55:47 | if all this machine learning technology is fully developed

0:55:51 | we can very naturally bridge the two models together.

0:55:53 | It turns out that if you do more

0:55:55 | rigorous analysis,

0:55:56 | by

0:55:57 | making the inference here fancier,

0:56:00 | our hope is that

0:56:02 | this

0:56:03 | multiplicative

0:56:04 | kind of unit would automatically emerge from this type of model, but that has not

0:56:08 | been shown yet.

0:56:10 | So of course this is just, you know, a very high-level comparison between the two;

0:56:15 | there are a lot of detailed comparisons you can make in order to bridge the

0:56:19 | two.

0:56:19 | So actually my colleague Dong Yu wrote this book that's coming out very soon.

0:56:26 | So in one of the chapters we put all these comparisons: interpretability, parametrization, methods

0:56:32 | of learning, nature of representation, and all the differences.

0:56:36 | So it gives you a chance to actually understand

0:56:38 | how the deep generative model, in terms of dynamics,

0:56:42 | and the recurrent network, in terms of recurrence, can

0:56:44 | be matched with each other, so you can read about that over there.

0:56:48 | So I have the final, three more minutes, five more minutes. I will go

0:56:53 | very quickly.

0:56:54 | Every time I give this talk I run out of time.

0:56:57 | So |

0:56:59 | the key concept is called embedding.

0:57:01 | Okay, so actually you can find literature from the nineties and eighties that had this

0:57:07 | basic idea around.

0:57:09 | For example, in this special issue of

0:57:12 | Artificial Intelligence there are very nice papers; I had the chance to read them all.

0:57:15 | They're very insightful, and some of the chapters over here are very good.

0:57:18 | So the idea is that each physical or linguistic

0:57:23 | you know,

0:57:24 | entity,

0:57:25 | a word, a phrase, even a whole article or a whole paragraph,

0:57:29 | can be embedded into a

0:57:30 | continuous-space vector. It could be a big ??, you know.

0:57:34 | Just to let you know, there's a special issue on this topic.

0:57:38 | And that's why it's an important concept.
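To make the embedding idea concrete, here is a toy sketch; the vocabulary, the dimension, and the averaging scheme are all illustrative assumptions, not from the talk. Each word maps to a dense vector, and a larger unit such as a phrase (or a whole paragraph) can be represented in the same continuous space, for example by averaging its word vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["deep", "learning", "speech"]
emb = {w: rng.normal(size=5) for w in vocab}   # word -> 5-dim vector (made up)

# A phrase lives in the same continuous space,
# here crudely via the mean of its word vectors.
phrase = ["deep", "learning"]
phrase_vec = np.mean([emb[w] for w in phrase], axis=0)
print(phrase_vec.shape)
```

Real systems learn these vectors from data rather than drawing them at random; the lookup-then-compose pattern is the part being illustrated.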

0:57:41 | The second important concept, which is much more advanced,

0:57:44 | is described by a few books over here. I really enjoyed reading some of

0:57:49 | those, and I invite those

0:57:50 | people to come visit me.

0:57:52 | We have a lot to discuss on that. You can actually even embed a structure,

0:57:56 | a

0:57:57 | syntactic or semantic structure, into a vector

0:58:01 | where you can recover the structure completely through vector

0:58:04 | operations, and the concept is called tensor-product representation.

0:58:08 | If only I had three hours I could go through

0:58:11 | all of this.

0:58:11 | But for now I'm going to elaborate on this for the next two minutes.
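The tensor-product idea can be shown in a few lines. This toy (two symbols, orthonormal role vectors, everything invented for illustration) demonstrates binding and unbinding: each filler (symbol) is bound to its structural role (position) by an outer product, the bindings are summed into one object, and a filler is recovered exactly by multiplying with its role vector.

```python
import numpy as np

# Filler (symbol) vectors and orthonormal role (position) vectors.
fillers = {"cat": np.array([1.0, 0.0]), "sat": np.array([0.0, 1.0])}
roles = {"pos0": np.array([1.0, 0.0]), "pos1": np.array([0.0, 1.0])}

# Bind each filler to its role with an outer product and sum:
# the ordered structure ("cat", "sat") becomes one tensor.
T = np.outer(fillers["cat"], roles["pos0"]) + np.outer(fillers["sat"], roles["pos1"])

# Unbinding: because the roles are orthonormal, multiplying the
# tensor by a role vector recovers that position's filler exactly.
recovered = T @ roles["pos0"]
assert np.allclose(recovered, fillers["cat"])
```

Exact recovery depends on the role vectors being orthonormal; with merely linearly independent roles one would unbind with the dual basis instead.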

0:58:16 | So |

0:58:17 | This is the recurrent neural network model, and this is very nice. I mean this

0:58:21 | is a fairly informative paper

0:58:22 | showing that embedding can be done

0:58:25 | as a byproduct of the recurrent neural network; that

0:58:28 | paper was published at Interspeech several years ago.

0:58:34 | And then I'll talk very quickly about semantic embedding at MSR. So

0:58:39 | the difference between this set of work and the previous work is that there

0:58:42 | everything is completely unsupervised,

0:58:44 | and in a company, if you have supervision, you should grab it, right?

0:58:48 | So we actually took the initiative to make some

0:58:51 | very smart

0:58:52 | exploitation of supervision signals

0:58:54 | at virtually no cost.

0:58:57 | So the idea here was that this is the model we have: essentially,

0:59:01 | each branch is a deep neural network. Now different

0:59:03 | branches can actually be linked together

0:59:05 | through what's called, you know, the cosine distance.

0:59:08 | So that

0:59:09 | distance can be measured

0:59:10 | between

0:59:11 | vectors, in a vector space.

0:59:13 | And now we do MMI learning,

0:59:16 | so if you get "hot dog" in this one, and your document is talking about

0:59:20 | fast food or something, even if

0:59:22 | there's no word in common you pick it up,

0:59:24 | because the supervision actually links them together.

0:59:27 | Whereas if you have "dog racing" here,

0:59:29 | they share a word although they will end up very far apart from each other.

0:59:33 | And that can be done automatically.
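The two-branch architecture with a cosine link can be sketched as below. The shapes, weights, and inputs are all invented and the networks are untrained, so only the structure is the point: two DNN branches map text features into a shared vector space, and relevance is scored by cosine similarity there. In the real system the weights would be trained, for example with the MMI-style objective just mentioned, so that related query-document pairs score high even without shared words.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, HID, EMB = 10, 8, 4           # arbitrary toy sizes

def branch(x, W1, W2):
    """One DNN branch: text features -> hidden layer -> semantic vector."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Separate (untrained) weights for the query branch and the document branch.
W1q, W2q = rng.normal(size=(HID, VOCAB)), rng.normal(size=(EMB, HID))
W1d, W2d = rng.normal(size=(HID, VOCAB)), rng.normal(size=(EMB, HID))

# Fake bag-of-words features for a query and a document.
query = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1], dtype=float)
doc = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0], dtype=float)

score = cosine(branch(query, W1q, W2q), branch(doc, W1d, W2d))
print(score)                          # lies in [-1, 1]
```

Training would adjust all four weight matrices jointly so that the cosine score ranks clicked documents above random ones.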

0:59:37 | Some people told me that topic models can do

0:59:39 | similar things, so we compared this with the topic model,

0:59:42 | and it turned out that ??

0:59:45 | and using this

0:59:46 | deep semantic model

0:59:48 | we can do much, much better.

0:59:49 | So, now multi-modal. Just one more slide. |

0:59:53 | So it turns out that not only text can be embedded;

0:59:57 | images can be embedded, speech can be embedded, and you can do something very similar

1:00:01 | to what I showed you earlier.

1:00:03 | And this is the paper from yesterday's talk about embedding.

1:00:09 | That's very nice; I mean it's a very similar concept.

1:00:12 | So I looked at this and I said, wow, it's just like the model that

1:00:15 | we did for text.

1:00:16 | But it turns out that the application is very different.

1:00:18 | So actually

1:00:20 | I don't have time to go through it here. I encourage you to read some papers

1:00:24 | over here. Let's skip this.

1:00:25 | So this was just to show you some applications of this

1:00:27 | semantic model. You can do all kinds of things. We applied it to web search

1:00:30 | quite nicely. For machine translation you treat each language

1:00:34 | as one entity.

1:00:37 | In the list of published papers you can find some details.

1:00:40 | You can actually do summarization and entity ranking.

1:00:45 | So let's skip this. This is the final slide, the real final slide.

1:00:49 | I don't have any summary slides; this is my summary slide.

1:00:51 | So I copied the main message here; now it can be elaborated a bit more, after going through a

1:00:55 | whole hour of presentation.

1:00:57 | Now in terms of applications we have seen

1:01:00 | speech recognition.

1:01:01 | The green is the

1:01:03 | neural network, the red is the deep generative model. So

1:01:07 | I said a few words about the deep generative model and the dynamic model,

1:01:11 | that's the generative model side, and the LSTM is on the other side. Now for speech enhancement

1:01:16 | I showed you these types of models,

1:01:19 | and then

1:01:20 | on the generative model side I showed you this one,

1:01:25 | and this is a shallow generative model that actually can

1:01:28 | give rise to a deep structure, corresponding to the

1:01:31 | deep

1:01:33 | stacking network I showed you earlier. Now for algorithms, we have back-propagation

1:01:37 | here.

1:01:38 | That's the single unchallenged

1:01:40 | algorithm for the deep neural network.

1:01:42 | Now for the deep generative model there are two algorithms. They are both called

1:01:45 | BP.

1:01:47 | So one is called Belief Propagation, for those of you who know machine learning.

1:01:51 | The other one is BP, the same as this.

1:01:54 | That only came up within the last two years,

1:01:57 | due to this new advance

1:02:00 | of porting the deep neural network

1:02:02 | into the inference step

1:02:04 | of this type of model. So I call them BP and BP.

1:02:08 | And in terms of neuroscience you call this one wake and you call

1:02:11 | the other one sleep.

1:02:12 | And in sleep you generate things, you get hallucination, and then when you're awake

1:02:16 | you have perception.

1:02:17 | You get information there. I think that's all I want to say. Thank you very

1:02:20 | much. |

1:02:29 | Okay. Anyone, one or two quick questions?

1:02:37 | Very interesting talk. |

1:02:40 | I don't want to talk about your main point, which is very interesting,

1:02:43 | but just very briefly about one of your side messages, which is about waveforms.

1:02:48 | So you know, in the ?? paper they weren't really putting in

1:02:54 | waveforms.

1:02:54 | They were putting in the waveforms, taking the absolute value, flooring it, taking the

1:02:58 | logarithm, averaging over; you know, so you had to do a lot of things.

1:03:03 | Secondly, about the other papers: there's been a modest

1:03:07 | amount of work in the last few years on doing this sort of thing, and

1:03:10 | pretty generally people do it with matched training and test conditions.

1:03:14 | If you have mismatched conditions, good luck with the

1:03:16 | waveform. I always hate to say something is impossible, but good luck.

1:03:24 | Thank you very much. ?? good for everything.

1:03:27 | And thanks for looking at the presentation; that was very nice, thank you.

1:03:32 | Any other quick questions? |

1:03:36 | If not, I invite Haizhou

1:03:40 | to present a plaque.