0:00:15 | First of all, I would like to thank the organisers for giving me

0:00:19 | this opportunity to share with you |

0:00:22 | some of my personal views |

0:00:25 | on this very hot topic here. So, |

0:00:29 | I think the goal of this tutorial really is to |

0:00:33 | help diversify the deep learning approach, just like the theme of this conference,

0:00:40 | Interspeech, for diversifying

0:00:42 | the languages, okay.

0:00:45 | So I have a long list of people to thank.

0:00:49 | Yeah, thank you.

0:00:50 | So I have a long list of people here to thank.

0:00:53 | Especially Geoff Hinton. I worked with him for some period of time. |

0:00:58 | And Dong Yu and a whole bunch of Microsoft colleagues

0:01:02 | who

0:01:05 | contributed a lot to the material

0:01:08 | I'm going to go through.

0:01:10 | And also I would like to thank many of the colleagues sitting here who had |

0:01:14 | a lot of discussions with me. |

0:01:16 | And their opinions also shaped some of the content that I am going to go |

0:01:20 | through with you over the next hour. |

0:01:23 | Yeah, so the main message of this talk |

0:01:26 | is that deep learning is not the same as deep neural network. I think in |

0:01:30 | this community most people

0:01:31 | confuse deep learning with deep neural networks.

0:01:36 | And most ... |

0:01:38 | So deep learning is something that everybody here would know. I mean just look at |

0:01:42 | ... I think I counted close to 90 papers somewhere |

0:01:44 | related to deep learning, at least approximately. The number of papers has been

0:01:50 | exponentially

0:01:50 | increasing over the last twelve years.

0:01:53 | So a deep neural network is essentially a neural network

0:01:56 | that you can unfold in space. You form a big network.

0:02:01 | Or,

0:02:02 | and,

0:02:03 | either way, or both, you can unfold it over time. If you unfold

0:02:07 | that neural network over time, you get a recurrent network, okay.

0:02:11 | But there's another very big branch of deep learning, which I would call Deep Generative |

0:02:16 | Model. |

0:02:17 | Like a neural network, it can also be unfolded in space and in time.

0:02:22 | If it's unfolded in time, you would call it a dynamic model. |

0:02:26 | Essentially the same concept. You unfold the network. |

0:02:31 | You know,

0:02:32 | in the same direction in terms of time,

0:02:36 | but in terms of space they are unfolded in the

0:02:39 | opposite direction. So I'm going to elaborate on this part. For example,

0:02:43 | our very commonly used model,

0:02:46 | you know, the Gaussian Mixture Model hidden Markov model, really is a

0:02:53 | network unfolded in time.

0:02:57 | But if you make that unfold in space you get a deep Generative Model,

0:03:00 | which hasn't been very popular in our community.
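The unfolding idea just described can be sketched in code. This is an illustrative sketch of my own, not taken from the talk's slides: the same one-step operation gives a deep network when unfolded in space (a different weight per layer) and a recurrent network when unfolded in time (one shared weight).

```python
# Toy sketch of "unfolding": weights and inputs here are made up for illustration.
def step(w, x):
    # one affine step followed by a simple squashing nonlinearity
    return max(-1.0, min(1.0, w * x))

def unfold_in_space(weights, x):
    # depth: a different parameter at every layer
    for w in weights:
        x = step(w, x)
    return x

def unfold_in_time(w, inputs, h=0.0):
    # recurrence: the SAME parameter applied at every time step
    for u in inputs:
        h = step(w, h + u)
    return h

deep_out = unfold_in_space([0.5, 2.0, 0.8], 1.0)
rnn_out = unfold_in_time(0.5, [1.0, 1.0, 1.0])
```

The two loops have the same shape; only the parameter sharing differs, which is the sense in which both kinds of model are "the same network, unfolded".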

0:03:05 | I'm going to survey a whole bunch of work related to this area,

0:03:09 | you know, informed by my discussions with many people here.

0:03:14 | But anyway, the main message of this talk is that

0:03:20 | I hope, and I think, there's a promising direction that is already taking shape in the machine

0:03:26 | learning community.

0:03:27 | I don't know how many of you actually went to International Conference on Machine Learning |

0:03:30 | (ICML) this year, just a couple of months ago in Beijing. |

0:03:33 | But there's a huge amount of work on Deep Generative Models and some very interesting

0:03:36 | development, which I think I'd like to share with you at high level, |

0:03:41 | so you can see that all this deep learning, although it started in terms

0:03:46 | of application in our

0:03:47 | speech community, and we should be very proud of that,

0:03:51 | Hmm, now, |

0:03:52 | in the machine learning community there's a huge amount of work going on

0:03:57 | on Deep Generative Models. So I hope I can share some of the recent

0:03:59 | developments with you

0:04:02 | to reinforce the message that

0:04:05 | a good combination of the two,

0:04:08 | which have

0:04:10 | complementary strengths and weaknesses, can actually be brought together to further

0:04:15 | advance deep learning in our community here.

0:04:19 | Okay, so now. These are very big slides. I'm not going to go through all |

0:04:23 | of the details. I'm just going to highlight a few

0:04:25 | things in order to reinforce the message that

0:04:30 | generative model and |

0:04:31 | neural network models can help each other.

0:04:34 | I'm just going to highlight a few key attributes of |

0:04:38 | both approaches. They are very different approaches. |

0:04:41 | I'm going to highlight that very briefly. First of all |

0:04:45 | in terms of structure they are both graphical in nature as a network, okay. |

0:04:48 | Think about this deep generative model: typically some of these

0:04:56 | we call a Dynamic Bayesian Network. You actually have a joint probability between the label

0:05:01 | and the observation,

0:05:03 | which is not the case for a deep neural network,

0:05:05 | okay. |

0:05:06 | In the literature you see many other terms

0:05:10 | that relate to the deep generative model, like probabilistic graphical models,

0:05:14 | such as stochastic neurons;

0:05:17 | sometimes it's called a stochastic generative network, as you see in the literature. They all belong

0:05:21 | to this

0:05:22 | category. So if your mindset is over here, even though you see some neural words

0:05:28 | describing that,

0:05:29 | you know, you won't be able to read all this literature; the mindset is

0:05:32 | very different when you study these two.

0:05:34 | So the strength

0:05:35 | of the deep generative model,

0:05:39 | and this is very important to me,

0:05:42 | is how easy it is to interpret, okay.

0:05:44 | So everybody that I talked to, including at lunchtime when I talked to students,

0:05:49 | they complain. I say: have you heard about the deep neural network? And everybody says: yes,

0:05:52 | we have.

0:05:54 | To what extent have you started looking into that? And they said: we don't want

0:05:57 | to do that, because we cannot

0:05:58 | even interpret what's in the hidden layers, right.

0:06:01 | And that's true |

0:06:02 | and that actually is quite by design. I mean if you

0:06:05 | read the cognitive science literature on connectionist models,

0:06:09 | really the whole design is that you need to have a representation here to be |

0:06:13 | distributed. |

0:06:13 | So each neuron can represent different concepts

0:06:17 | and each

0:06:18 | concept can be represented by different neurons, so by its very design

0:06:21 | it's not meant to be interpretable,

0:06:23 | okay. |

0:06:24 | And that actually creates some difficulty for many people,

0:06:27 | and this model is just the opposite. It's very easy to interpret, because of the very nature

0:06:33 | of the generative story.

0:06:34 | You can tell what the process is |

0:06:36 | and then of course if you want to do |

0:06:39 | a classification or some other application in machine learning |

0:06:42 | you simply just have to ...

0:06:44 | for classification we simply use Bayes' rule to invert that. That's exactly what in

0:06:48 | our community

0:06:49 | we have been doing for thirty years with the hidden Markov model. You get the prior, you

0:06:52 | get the generative model, and

0:06:53 | you multiply them and then you decode. Except that at that time we didn't know

0:06:57 | how to make that |

0:06:58 | deep for this type of model. And there are some pieces of work that I'm

0:07:01 | going to survey. |

0:07:02 | So that's one big part of the advantage of this model. |
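The Bayes'-rule inversion being described, the standard recipe behind HMM decoding, can be sketched minimally. The numbers and three-class setup below are illustrative, not from the talk.

```python
# Invert a generative model with Bayes' rule:
# p(label | x) is proportional to p(x | label) * p(label).
def classify(likelihoods, priors):
    joint = [l * p for l, p in zip(likelihoods, priors)]  # unnormalised posterior
    z = sum(joint)
    posterior = [j / z for j in joint]                    # normalise over labels
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior

likelihoods = [0.02, 0.10, 0.05]  # p(x | label) from the generative model
priors = [0.50, 0.20, 0.30]       # p(label)
label, post = classify(likelihoods, priors)
```

The generative model only ever describes how data is produced; the multiplication by the prior and the argmax are what turn it into a classifier.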

0:07:05 | Of course everybody knows what I just mentioned there.

0:07:09 | In the deep generative model the information flow is actually from the top down.

0:07:13 | You actually have... what "top" simply means is, you know, you get a

0:07:16 | label or you get a higher-level concept,

0:07:18 | and going down to the lower level simply means you generate data to fit into that.

0:07:22 | Everybody knows that in a neural network

0:07:25 | the information flow is from bottom up, okay. So you feed in the data and

0:07:29 | you compute whatever output and then

0:07:30 | you go whichever way you want.

0:07:31 | In this case

0:07:33 | the information comes from the top down. You generate the information,

0:07:37 | and then if you want to do classification, you know, or any other machine

0:07:42 | learning application, you can do Bayesian inversion. Bayes' rule is very essential for this.

0:07:49 | But there's a whole list of those. I don't have time to go through them, but

0:07:52 | you know, those are the highlights, these

0:07:54 | we have to say. So the main strength of the deep neural network, which actually gained

0:07:59 | popularity

0:07:59 | over the previous years, is really mainly due to these strengths.

0:08:04 | It's easier to do the computation; in terms of

0:08:10 | what I wrote here, it's "regular compute", okay.

0:08:13 | So if you

0:08:14 | look into exactly what kind of compute is involved here,

0:08:17 | it's just millions and millions and millions of times computing

0:08:21 | the product of a big matrix with a vector.

0:08:23 | You do that many times. The ?? plays a very small role;

0:08:27 | it's very regular.

0:08:28 | And therefore the GPU is really

0:08:31 | ideally suited for this kind of computation |

0:08:33 | and that's not the case for this model. |
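The "regular compute" just described, the same matrix-by-vector multiply repeated layer after layer, can be sketched as follows. The weights and input are toy values of my own; real systems batch these multiplies on a GPU rather than looping in Python.

```python
# Core DNN operation: multiply weight matrix W by vector x, layer after layer.
def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, u) for u in v]

def forward(layers, x):
    # The forward pass is just this one regular pattern repeated,
    # which is why GPUs (parallel multiply-accumulate) suit it so well.
    for W in layers:
        x = relu(matvec(W, x))
    return x

layers = [[[1.0, -1.0], [0.5, 0.5]],  # toy 2x2 weight matrices
          [[2.0, 0.0], [0.0, 2.0]]]
out = forward(layers, [1.0, 2.0])
```

Every layer does exactly the same kind of arithmetic, so scaling up means only bigger matrices and more of them; there is no irregular, data-dependent control flow of the sort that makes deep generative inference hard to parallelise.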

0:08:36 | So if you compare between these two then you really will understand that if you |

0:08:41 | can pull |

0:08:42 | some of these advantages into this model |

0:08:44 | and pull some of the advantages in this column into this one,

0:08:48 | you have an integrated model. And that's kind of the message I'm going to convey, and

0:08:53 | I'm going to

0:08:54 | give you examples to show how this can be done.

0:08:57 | Okay, so in terms of interpretability it's very much related to |

0:09:04 | how to incorporate domain knowledge

0:09:06 | and constraints into the model. And for a deep neural network it's very hard.

0:09:12 | People have done that; I have seen many people at this conference and also

0:09:16 | at ??

0:09:17 | try very hard, but it's not very natural.

0:09:20 | Whereas

0:09:22 | this is very easy:

0:09:23 | I mean you can encode your domain knowledge directly into the system. For example,

0:09:29 | take distorted speech, noisy speech: you know,

0:09:32 | as a summation in the spectral domain, or a summation

0:09:35 | in the waveform domain, it is the noise

0:09:37 | plus

0:09:38 | the clean speech that you get as the observation. That's so simple: you just encode that as

0:09:43 | one layer, a summation, or

0:09:44 | you can encode it in terms of Bayesian probability very easily.

0:09:47 | This is not that easy to do there. People have tried to do it; it's just not

0:09:51 | as easy.

0:09:51 | So to encode

0:09:53 | domain knowledge and the natural constraints of the problem

0:09:57 | into

0:09:58 | your deep learning system has a great advantage.

0:10:01 | So I'm actually, I mean, this is just a random selection

0:10:03 | of things, you know. There's a very nice paper over here,

0:10:06 | Acoustic Phonetics.

0:10:08 | All this knowledge about speech production

0:10:11 | and this kind of nonlinear |

0:10:12 | phonology |

0:10:14 | and this is an example of noise robustness. If you put in the phase information

0:10:19 | of the speech and the noise, you can come up with a

0:10:22 | very nice conditional distribution. It's kind of complicated,

0:10:24 | but this one can be put directly

0:10:26 | into a generative model, and this is an example of that. Whereas in a deep neural network

0:10:31 | it's very hard to do.

0:10:33 | So the question is: do we want to throw away all this knowledge in

0:10:36 | deep learning?

0:10:37 | And my answer is of course no. Most people will say no, okay.

0:10:45 | And for people from outside the speech community, the answer was sometimes yes; I'm talking about

0:10:48 | some people in machine learning.

0:10:49 | Anyway, since this is a speech conference, I really want to emphasise that.

0:10:54 | So the real,

0:10:55 | solid, reliable knowledge that we have attained

0:10:58 | from speech science,

0:10:59 | which has been reflected in talks here,

0:11:03 | such as yesterday's talk about how sound patterns have been shaped by

0:11:09 | ?? and perception: that can really play a role in a deep generative model,

0:11:14 | but it is very hard to do that in a deep neural network.

0:11:17 | So with this main message in mind |

0:11:20 | I'm going to go through three parts of the talk as I put them in |

0:11:24 | my abstract here. |

0:11:25 | So I need to go very briefly |

0:11:27 | through all these three topics. |

0:11:30 | Okay, so the first part is to give a very brief history of how deep learning in speech

0:11:36 | recognition started.

0:11:38 | So this is a very simple list. There were so many papers around before the

0:11:43 | rise of deep learning, around

0:11:45 | 2009 and 2010. So I hope I actually have

0:11:50 | a reasonable

0:11:52 | sample of the work here.

0:11:54 | So I don't have time to go through it all, especially for those of you who were

0:11:58 | at the ?? open house.

0:12:00 | There was, in 1988, I think in 1988,

0:12:03 | ASRU, and at that time there was no U, it was just

0:12:05 | ASR. And there were some very nice papers around then, which quickly,

0:12:10 | you know, were

0:12:10 | superseded

0:12:11 | by the hidden Markov model approach.

0:12:15 | So I'm not going to go through all of these,

0:12:17 | except to point out that

0:12:20 | the neural network

0:12:22 | had been very popular for a while.

0:12:24 | But in the, you know,

0:12:26 | ten-plus years

0:12:28 | before deep learning actually took over, the neural network approach

0:12:33 | essentially didn't really make

0:12:36 | such a strong impact compared with the deep networks that people have been seeing.

0:12:41 | So I'll just give you one example to show you how unpopular

0:12:45 | the neural network was at that time. |

0:12:48 | So this is about 2008 or 2006, about nine years ago. |

0:12:53 | So this is the organization that I think

0:12:56 | is the predecessor

0:12:57 | of ?? IARPA.

0:12:58 | So they actually got several of us together, locked us up in a hotel

0:13:03 | near a Washington, DC

0:13:05 | airport somewhere.

0:13:07 | Essentially the goal was to say: well, speech

0:13:09 | recognition is stuck, so you come over here and help us brainstorm the next generation

0:13:15 | of speech recognition and understanding technology.

0:13:18 | And then we actually spent about four or five days in the hotel and at

0:13:22 | the end we wrote a very thick report,

0:13:25 | twenty-some pages of report.

0:13:26 | So there is some interesting discussion about the history, and the idea was:

0:13:31 | if the government gives you unlimited resources and gives you fifteen years, what is it you

0:13:35 | can do, right?

0:13:36 | So most of the people in our discussion, |

0:13:39 | we all focused on neural network, essentially |

0:13:41 | margin is here,

0:13:42 | Markov random field is here, conditional random field is here, and graphical model here.

0:13:48 | So,

0:13:50 | that was just a couple of years before deep learning actually came out, and at that

0:13:54 | time

0:13:54 | the neural network was actually one of the

0:13:57 | tools around

0:13:58 | that hadn't really made a big impact.

0:14:01 | So on the other hand the graphical model was actually mentioned here because it's related |

0:14:06 | to deep generative model. |

0:14:08 | So I'm going to show you a little bit. Well, this is a slide about deep

0:14:12 | generative models; actually I made a list over here.

0:14:15 | One of the ...

0:14:18 | but anyway. Let's go over here.

0:14:21 | I just want to highlight a couple of things

0:14:25 | related to the

0:14:27 | introduction of the deep neural network in the field.

0:14:30 | Okay, so one of... this is ?? John Bridle.

0:14:32 | Actually, we spent a summer at ?? in 1989,

0:14:37 | or 1988,

0:14:39 | fifteen-and-some years ago. So we spent a really interesting summer all together.

0:14:45 | So |

0:14:46 | and that's kind of the model, the deep generative model, the two versions we actually put

0:14:52 | together,

0:14:52 | and at the end we actually wrote a very thick report that was about eighty

0:14:56 | pages.

0:14:58 | So this is a deep generative model, and it turned out that

0:15:02 | both of those models were actually implemented with neural networks.

0:15:06 | Think about the neural network as simply a function, a mapping:

0:15:09 | you map the hidden representation,

0:15:12 | you know,

0:15:13 | as part of the deep generative model, into whatever observation you have,

0:15:18 | MFCCs. Everybody used MFCCs at the time.

0:15:22 | You actually need to have that mapping, and that was done with a neural network

0:15:27 | in both versions,

0:15:28 | and this is the statistical version, which we

0:15:31 | call the hidden dynamic model. It's one of the versions

0:15:34 | of the deep generative model.

0:15:36 | It didn't succeed. I'll show you the reason why; now we understand why.

0:15:40 | Okay, so interestingly enough, in this

0:15:43 | model we actually used, if you read the report, it actually turned out that the model

0:15:47 | was here. Geoff told me that

0:15:49 | the video for this workshop is still around somewhere, so it's called ?? sign. I

0:15:53 | think I mentioned to ?? to pick it out.

0:15:56 | It turned out that the learning in this workshop, the details of which are in this report,

0:16:00 | actually used backpropagation. Now the direction isn't from the top

0:16:03 | down: since your model is

0:16:05 | top-down, the propagation must be bottom-up.

0:16:08 | So nowadays,

0:16:10 | when we do speech recognition, the error

0:16:14 | function is a softmax cross-entropy, or sometimes you can use the mean square error,

0:16:18 | and the error is measured in terms of your labels.

0:16:23 | This is the opposite: the error is measured in terms

0:16:26 | of how well the generative model can match the observation. And then when you

0:16:31 | want to

0:16:31 | learn, you do bottom-up learning, which actually turned out to be backpropagation. So

0:16:35 | backpropagation doesn't have to be done top to bottom;

0:16:37 | it can be bottom up, depending on what kind of model you have.

0:16:40 | But the key is that this is

0:16:41 | a gradient descent method.
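The contrast being drawn, the same gradient-descent update serving either a label-matching or a data-matching objective, can be sketched minimally. The objective and values below are illustrative choices of mine, not from the talk.

```python
# Minimal gradient descent: the update rule is identical whether the error
# measures label mismatch (discriminative) or data mismatch (generative).
def gradient_descent(grad, theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Generative-style objective: make the model output (here just theta itself)
# match the observation x, i.e. minimise (theta - x)^2.
x = 3.0
theta = gradient_descent(lambda t: 2.0 * (t - x), theta=0.0)
```

Swapping in a cross-entropy gradient against a label instead of this data-matching gradient changes only the `grad` function, which is the sense in which both directions of learning are "just gradient descent".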

0:16:44 | So actually we got disappointing results on Switchboard, you know, because we tended to be

0:16:48 | a bit off our game.

0:16:49 | And now we understand why; not at that time. I'm sure some of you experienced

0:16:52 | it. I have done a lot

0:16:53 | of thinking about how deep learning and this model can be integrated together.

0:16:59 | So at the same time |

0:17:02 | Okay, so this is a fairly simple model, okay. So you have this hidden representation

0:17:07 | and it has

0:17:08 | specific constraints built into the model,

0:17:11 | which, by the way, is very hard to do when you do a bottom-up neural network.

0:17:15 | In a generative model

0:17:16 | you can put them in very easily, so for example

0:17:18 | the articulatory trajectory has to be smooth,

0:17:22 | and then the specific form of the smoothness can be built in directly

0:17:26 | by simply writing down the generative probabilities. Not in the deep neural network.
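One way to write such a smoothness constraint directly into the generative probabilities is a Gaussian prior on successive differences of the trajectory. This is a hypothetical sketch of mine, not the model from the slides; the trajectories and `sigma` are illustrative.

```python
import math

# Smoothness as a generative prior: penalise large jumps between adjacent
# points of a (hypothetical) articulatory trajectory.
def log_smoothness_prior(traj, sigma=0.1):
    logp = 0.0
    for a, b in zip(traj, traj[1:]):
        d = b - a  # successive difference; a smooth path keeps this small
        logp += -0.5 * (d / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return logp

smooth = [0.0, 0.05, 0.1, 0.15]  # gently varying trajectory
jumpy = [0.0, 0.5, -0.4, 0.6]    # physically implausible jumps
```

The prior assigns the smooth trajectory a higher log-probability, so during inference the model prefers articulator paths that move gradually, exactly the kind of knowledge that is awkward to force into a bottom-up network.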

0:17:31 | So at the same time,

0:17:33 | we actually, and this was also done at ??,

0:17:38 | we were able to even put in this nonlinear phonology, in terms of

0:17:43 | breaking the phonemes into the individual constituents at the top level, and ?? also has a

0:17:49 | very nice paper, some fifteen years ago, talking about this.

0:17:53 | And also the robustness can be directly integrated into the

0:17:57 | articulatory model, simply through the generative model. For a deep neural network it's very hard to

0:18:01 | do.

0:18:01 | For example, you can actually...

0:18:05 | this is not meant to be seen. Essentially this is one of the conditional likelihoods

0:18:10 | that covers

0:18:11 | one of the links. So every time you have a link,

0:18:15 | you have a conditional dependency from parent to children, which have different neighbours,

0:18:22 | and then you can specify them in terms of

0:18:24 | conditional distributions. Once you do that you have formed a model:

0:18:27 | you can embed

0:18:28 | whatever knowledge you have, whatever you think is good, into the system. But anyway,

0:18:33 | the problem is that the learning is very hard,

0:18:35 | and that learning problem was only solved in the machine learning community within the last

0:18:41 | year.

0:18:42 | At that time we just didn't really know. We were so naive. |

0:18:47 | We didn't really understand all the limitations of learning. So just to show you what we

0:18:50 | did, okay. One of the

0:18:51 | things we did was, I actually worked on this with my colleague Hagai Attias.

0:18:55 | He is actually one of the...

0:18:56 | he was my colleague, working not far away from me at that time, some ten

0:19:01 | years ago.

0:19:02 | So he was the one who invented variational Bayes, which is very well

0:19:06 | known.

0:19:07 | So the idea was as follows. You have to break up these pieces into

0:19:11 | modules, right.

0:19:12 | For each module you have this; this is actually a

0:19:14 | continuous

0:19:17 | dependency on the continuous hidden representation,

0:19:20 | and it turned out that the way to learn this,

0:19:23 | you know, in principle, is EM (Expectation Maximization). It's variational

0:19:26 | EM.

0:19:26 | So the idea is very crazy.

0:19:28 | So you say you cannot solve that rigorously, and that's well known; it's a loopy

0:19:34 | network. Then you just cut all the important dependencies you

0:19:37 | carry, hoping that the M-step can make up for it. That's a very crazy idea.

0:19:41 | And that was the best around at the time.

0:19:43 | But it turned out that you get the auxiliary function, and what you form is still

0:19:48 | something very

0:19:49 | similar to our EM, you know, in the HMM. For that model you don't have

0:19:55 | a loop, so you can get a rigorous solution.
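The variational-EM recipe being described can be written out in its standard textbook form; the notation here is mine, not the slide's. Maximising the auxiliary lower bound alternates between fitting an approximate posterior and updating the parameters:

```latex
% Variational EM: maximise a lower bound on the log-likelihood.
% q(h) is the approximate posterior over hidden variables h;
% "cutting the dependencies" = restricting q to a factorised (mean-field) family.
\log p(x \mid \theta)
  \;\ge\; \mathcal{F}(q, \theta)
  \;=\; \mathbb{E}_{q(h)}\!\left[\log p(x, h \mid \theta)\right]
       \;-\; \mathbb{E}_{q(h)}\!\left[\log q(h)\right]

% E-step:  q(h) \leftarrow \arg\max_{q \in \text{factorised}} \mathcal{F}(q, \theta)
% M-step:  \theta \leftarrow \arg\max_{\theta} \mathcal{F}(q, \theta)
```

The "crazy" part the speaker refers to is the E-step: the exact posterior is intractable in a loopy deep model, so the dependencies are cut in `q` and one hopes the M-step compensates.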

0:19:57 | But now when you have depth it's very hard. You have to make up for

0:20:01 | it. And that ?? is just as ??

0:20:03 | many people could ?? on deep neural networks. These ?? deep generative models

0:20:08 | probably have more

0:20:09 | than otherwise, although they present themselves

0:20:12 | as being,

0:20:13 | you know, very rigorous. But if you really work on that, so I can pick

0:20:17 | this out:

0:20:18 | for this approach we got surprisingly good inference results for continuous variables.

0:20:22 | And in one version, what we did was actually we used formants,

0:20:27 | you know, as the hidden representation, and it turned out it tracked. And once you

0:20:31 | do this you

0:20:32 | track the formants really precisely.

0:20:34 | As a byproduct of this work we created

0:20:38 | a database for formant tracking,

0:20:42 | but when we actually do

0:20:45 | inference on the linguistic units, which is the problem

0:20:48 | of recognition, we didn't really make much progress.

0:20:51 | But anyway so I'm going to show you some of these preliminary results to show |

0:20:56 | you how this |

0:20:57 | is one way that led to the deep neural network. |

0:21:00 | So when we actually simplified the model in order to do the decoding,

0:21:07 | this is actually the ?? result,

0:21:09 | and we brought out all the analysis for different kinds of phones.

0:21:12 | So when we used this kind of generative model with deep structure, it actually corrected

0:21:17 | many errors

0:21:18 | which are related to the short phones.

0:21:20 | And you understand why, because you designed the model to make that happen, and then,

0:21:24 | you know, if

0:21:25 | everything is done reasonably well, you actually get results. So we actually looked

0:21:28 | at not only the corrected short phones for the vowels,

0:21:32 | but it also corrected a lot of

0:21:34 | consonants, because they're coupled with each other.

0:21:36 | It's just because, by the model design, whatever hidden trajectory you get

0:21:40 | is influenced, the pattern of the vowel is influenced,

0:21:45 | by the adjacent sounds.

0:21:47 | And that's,

0:21:47 | this is due to coarticulation.

0:21:49 | This can be very naturally built into the system,

0:21:51 | and one of the things I very much struggle with in the deep neural network is that

0:21:55 | you can't even build in this kind of

0:21:56 | information that easily, okay.

0:21:59 | This is to convince you how things can be bridged.

0:22:03 | It's very easy to interpret the results. So we look at the errors and we

0:22:07 | know: wow, these come from quite a big modelling assumption.

0:22:11 | Without having to go through, for example, in this... these examples are

0:22:14 | the same sounds, okay.

0:22:15 | You just speak fast and then you get something like this,

0:22:17 | and then we actually looked at the error and we said: oh,

0:22:20 | you know,

0:22:22 | that's exactly what happened. You know, the mistake was made by the

0:22:27 | Gaussian Mixture Model because it doesn't take into account these particular dynamics. Now this one

0:22:31 | corrected that error.

0:22:32 | And I'm going to show you that in the deep neural network things are reversed, so that's

0:22:37 | related to ??. But at the same time,

0:22:39 | in the machine learning community, outside of speech,

0:22:42 | there was a very interesting deep generative model developed,

0:22:46 | and that's called the Deep Belief Network.

0:22:47 | Okay, |

0:22:48 | so in the earlier literature, before about three or four years ago,

0:22:52 | DBN, Deep Belief Network, and DNN were mixed up with each other, even by the authors.

0:22:56 | It's just because most people don't understand what it is.

0:22:59 | So this is a very interesting paper from 2006;

0:23:02 | many people, most people in machine learning, regard this paper as the start of

0:23:07 | deep learning.

0:23:08 | And it's a generative model, so you could prefer to say that the deep

0:23:12 | generative model actually started deep learning, rather than the deep neural network.

0:23:17 | But this model has some intriguing properties

0:23:21 | that at the time really attracted my attention.

0:23:25 | It's totally not obvious, okay.

0:23:28 | So for those of you who know the RBM and the DBN, you know when you are

0:23:32 | stacking up this undirected model

0:23:34 | several times you get a DBN; that's...

0:23:37 | you might think that the whole thing would be undirected,

0:23:40 | you know, a bottom-up machine. No, it's actually a directed model coming down.

0:23:44 | You have to read this paper to understand why.

0:23:47 | So why is that? I thought something was wrong. I couldn't understand what happened.

0:23:50 | But on the other hand it's much simpler than the model I showed you earlier;

0:23:54 | for that deep network we had temporal dynamics.

0:23:56 | This one has no temporal dynamics in it.

0:23:59 | So

0:24:01 | the most intriguing aspect of the DBN,

0:24:03 | as described in this paper, is that inference is easy.

0:24:06 | Normally you think inference is hard; that's the tradition.

0:24:10 | It's a given fact that if you have these multiple dependencies at the top it's very hard

0:24:15 | to do inference,

0:24:16 | but there's a special constraint built into this model, namely the restriction in the connections of the

0:24:21 | RBM,

0:24:22 | and because of that it makes inference easy. It's just a special case.
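The "easy inference" property being described comes from the RBM's bipartite restriction: with no hidden-to-hidden connections, the hidden units are conditionally independent given the visible vector, so the posterior is one matrix pass. The weights and input below are toy illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# RBM inference: p(h_j = 1 | v) = sigmoid(b_j + sum_i W[i][j] * v[i]),
# computed independently for each hidden unit j. No iteration, no loops in
# the graph: this is the "special case" that makes inference easy.
def infer_hidden(v, W, b):
    return [sigmoid(bj + sum(W[i][j] * v[i] for i in range(len(v))))
            for j, bj in enumerate(b)]

v = [1.0, 0.0, 1.0]                         # visible units
W = [[0.5, -0.2], [0.3, 0.8], [-0.1, 0.4]]  # 3 visible x 2 hidden weights
b = [0.0, 0.1]                              # hidden biases
p_h = infer_hidden(v, W, b)
```

In a general directed model with multiple parents per node ("explaining away"), this posterior would not factorise, which is exactly why inference is normally hard.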

0:24:25 | This is very intriguing, so I thought this idea might help

0:24:29 | the deep generative model I showed you earlier.

0:24:32 | So he came to visit me, you know. We discussed it.

0:24:36 | It took him a while to explain what this paper is about.

0:24:40 | Most people at Microsoft at that time couldn't understand what was going on.

0:24:45 | So now let's see how...

0:24:46 | and then of course we got together on this deep generative model,

0:24:50 | and the other deep generative model I talked about with you I actually worked on |

0:24:54 | for almost ten |

0:24:54 | years at Microsoft. We were working very hard on this. |

0:24:57 | And then we came to the conclusion that, well, we have to use a few

0:25:00 | kluges to fix the problem.

0:25:01 | And the two models don't match, okay. The reason why they don't match is a whole

0:25:05 | new story.

0:25:06 | The main reason is actually not just the temporal difference; it's that the way you parameterize

0:25:12 | the model and also the way you represent

0:25:15 | the information are very different,

0:25:17 | despite the fact that they're both generative models. |

0:25:19 | It turned out that this model is very good for speech synthesis, and ?? has a

0:25:22 | very nice paper

0:25:23 | using this model to do synthesis. And it's very nice for doing

0:25:26 | image generation; it can do that very nicely.

0:25:30 | For continuous speech it is very hard to do,

0:25:33 | and for speech synthesis in general it's good, because if you have a segment that takes the

0:25:38 | whole

0:25:39 | context into account, like a syllable in Chinese, it is good; but for English it is

0:25:42 | not that easy to do.

0:25:44 | But anyway, so we needed a few kluges to fit together, to merge these

0:25:48 | two models together.

0:25:49 | And that sort of led to the DNN.

0:25:51 | So the first kluge is that,

0:25:54 | you know,

0:25:55 | the temporal dependency is very hard. If you have temporal dependency you automatically get a loop,

0:26:00 | and then

0:26:00 | everybody in machine learning at that time knew, and most speech people too, that

0:26:05 | the

0:26:05 | model that I showed you early on actually just didn't work well, it didn't

0:26:09 | work out well. And most of the people who were

0:26:12 | very much versed in machine learning would say there's no way to learn that.

0:26:15 | Then cut the dependency. That's the way to do it: cut the dependency in the hidden

0:26:20 | dimension, in the hidden representation,

0:26:21 | and lose some of the power of the

0:26:23 | deep generative model.

0:26:25 | And that was Geoff Hinton's idea: well, it doesn't matter, just use a big window.

0:26:30 | That fixes it with a kluge, and that actually

0:26:34 | is one of the things that helped

0:26:36 | to solve the problem.
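The "big window" kluge can be sketched concretely: instead of modelling temporal dependency, stack each acoustic frame with its neighbours and feed the concatenation to the network. The frames below are one-dimensional toy values; real systems use tens of coefficients per frame and a context of many frames each side.

```python
# Replace explicit temporal dependency with a fixed context window:
# each output row is frame t concatenated with its +/- k neighbours.
def stack_context(frames, k=1):
    padded = [frames[0]] * k + frames + [frames[-1]] * k  # repeat edge frames
    return [sum((padded[i + j] for j in range(2 * k + 1)), [])
            for i in range(len(frames))]

frames = [[0.0], [1.0], [2.0]]        # three 1-dimensional "acoustic" frames
stacked = stack_context(frames, k=1)  # each row now spans three frames
```

The network then sees local temporal context without any recurrence or loop in the model, which is what made the learning tractable at the time.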

0:26:38 | And the second kluge is that you can reverse the direction,

0:26:40 | because

0:26:41 | the inference in a generative model is very hard to do, as I showed earlier.

0:26:45 | Now if you reverse the direction

0:26:48 | from top-down to bottom-up,

0:26:52 | then you don't have to solve that problem. And then it would be

0:26:56 | just a deep neural network, okay. Of course

0:26:58 | everybody said: we don't know how to train them. That was in 2009.

0:27:02 | Most people don't know how to ?? |

0:27:03 | and then he said that's how DBN can help. |

0:27:07 | And then he did a fair amount of work on DBN to initialize that ?? |

0:27:12 | approach. |

0:27:12 | So this was a very well-timed academic-industrial collaboration. First of all, |

0:27:16 | the speech recognition industry had been searching for new solutions when principled |

0:27:22 | deep generative models could not deliver, okay. Everybody |

0:27:24 | was very upset about this at the time. |

0:27:27 | And at the same time academia developed the deep learning tools: |

0:27:30 | DBN, DNN, all the hybrid stuff that is going on. |

0:27:33 | And also the CUDA library was released around that time; these are very recent times. |

0:27:40 | So this was probably one of the earliest fields to catch on |

0:27:44 | to this GPU computing power. |

0:27:47 | And then of course big training data in ASR had been around, |

0:27:52 | and most people know that if you actually train |

0:27:55 | a Gaussian mixture model HMM with a lot of data, performance saturates, right. |

0:28:00 | And then this is one of the things that in the end is really powerful: you |

0:28:04 | can increase the size and depth |

0:28:06 | and, |

0:28:07 | you know, put a lot of things in |

0:28:08 | to make it really powerful. |

0:28:11 | And that's the scalability advantage that I showed you early on. That's not the case |

0:28:15 | for any shallow model. |

0:28:18 | Okay, so in 2009 three of my colleagues and I didn't quite know what was |

0:28:23 | happening, so we actually got together |

0:28:26 | to organize |

0:28:27 | this workshop, |

0:28:28 | to show that |

0:28:29 | this is a useful thing, you know, and to bring the work out. |

0:28:32 | It wasn't popular at all. I remember, |

0:28:35 | you know, Geoff Hinton and I actually got together to decide |

0:28:40 | whom we should invite to give a |

0:28:42 | talk at this workshop. |

0:28:44 | So I remember that one invitee, who shall remain nameless here, |

0:28:47 | said: Give me one week to think about it. And in the end he said: |

0:28:50 | it's not worth my time to fly to Vancouver. That's one of them. |

0:28:53 | The second invitee, I remember this clearly, said: This is a crazy idea. In the |

0:28:57 | e-mail he said: |

0:28:58 | What you do is not clear enough for us. |

0:29:01 | So we said, you know, |

0:29:02 | the waveform may be useful for ASR. |

0:29:04 | And then the e-mail said: Oh, why? |

0:29:07 | So we said that's just like using pixels for image recognition, which was popular; |

0:29:12 | for example, convolutional networks work on pixels. |

0:29:15 | We take a similar approach, except it is the waveform. |

0:29:17 | And the answer was: No, no, no, that's not the same as pixels. It is more |

0:29:22 | like using photons. |

0:29:23 | You know, essentially making a kind of joke. This one didn't show up either. But anyway, |

0:29:28 | so |

0:29:30 | anyway, this workshop actually had |

0:29:34 | a lot of brainstorming, and we analyzed all the errors I showed you early |

0:29:38 | on. |

0:29:39 | But it was a really good |

0:29:41 | workshop; that was about |

0:29:44 | five years ago now. |

0:29:45 | So now I move to part two, |

0:29:48 | to discuss achievements. Actually, in my original plan I had a whole bunch of slides |

0:29:53 | on vision. |

0:29:54 | The message on vision is that if you go to the vision community, |

0:29:59 | they consider deep learning to be |

0:30:01 | just |

0:30:02 | maybe thirty |

0:30:04 | times more popular than deep learning is in speech. |

0:30:07 | So the first time they tried it was actually the first time they |

0:30:12 | got the results, |

0:30:16 | and no one believed it was the case. At the time I was giving a lecture |

0:30:20 | at Microsoft about deep learning, |

0:30:22 | and right before me, actually, Bishop |

0:30:25 | was giving the lecture together with me, |

0:30:30 | and then this deep learning result just came out, and Geoff Hinton sent an e-mail to me: |

0:30:34 | Look at the margin! How much bigger it is. |

0:30:36 | And I showed them. People were like: I don't believe it. Maybe it's a special case, |

0:30:40 | you know. And it turned out it's just |

0:30:42 | just as good. |

0:30:43 | Even better than in speech. I actually cut all those slides out; maybe some time I |

0:30:46 | will show you. |

0:30:47 | So that is a big area to go into. But today I am going to focus on |

0:30:50 | speech. |

0:30:51 | So one of the things we found during that time, |

0:30:55 | a very interesting discovery, came from using both the model that I |

0:30:59 | showed you there |

0:31:00 | and also the deep neural network here. |

0:31:03 | And with those numbers we analyzed the |

0:31:06 | error patterns very carefully. TIMIT is very good for that, you know: |

0:31:10 | you can disable the language model, right, |

0:31:12 | and then you can understand the acoustic errors very effectively. |

0:31:15 | I tried to do that afterwards, you know, on other tasks, |

0:31:20 | and it's very hard: once you put a language model in there you just couldn't |

0:31:23 | do any analysis. So it was very good that we did this analysis at the time. |

0:31:26 | Now, the error patterns in the comparison |

0:31:30 | I don't have time to go through, except just to mention this: |

0:31:33 | the DNN made many new errors on short, undershot vowels. |

0:31:37 | So it sort of undoes what this model is designed to do, |

0:31:40 | and then we thought about why that would happen, and of course in the end: |

0:31:43 | we had a very big window, so if the sounds |

0:31:45 | are very short, the information is captured over here, and your input is about eleven frames, |

0:31:48 | you know, or fifteen frames, so it |

0:31:50 | captures noise coming from the neighboring phones, and of course errors are made over here. |

0:31:54 | So we can understand why. |

0:31:56 | And then we asked why this model corrects those errors. It's just because |

0:31:59 | you make, |

0:32:00 | you deliberately make, a hidden representation |

0:32:04 | that reflects |

0:32:05 | what the sound pattern looks like |

0:32:07 | in the hidden space. It's nice when you can see it, |

0:32:10 | but if you have articulations, how do you see them? So sometimes we use formants |

0:32:14 | to illustrate what's going on there. |

0:32:18 | Another important discovery at Microsoft was that using the spectrogram |

0:32:23 | we produce much better |

0:32:26 | autoencoding results in terms of speech analysis, |

0:32:30 | coding results, |

0:32:32 | and that was very surprising at the time. |

0:32:36 | It really conforms to the basic deep learning theme that, |

0:32:39 | you know, the rawer features are better |

0:32:42 | than the processed features. So let me show you; this is actually a project |

0:32:48 | we did together in 2009. |

0:32:49 | So we used a deep autoencoder |

0:32:51 | to do binary coding |

0:32:53 | of the spectrogram. |

0:32:55 | I don't have time to go through it here; you can read about autoencoders if |

0:33:01 | you like. |

0:33:02 | In the literature you can all see this. |

0:33:03 | The key is that |

0:33:04 | you set the target to be the same as the input, and then you use a small |

0:33:07 | number of bits in the middle. |

0:33:09 | And you want to see whether that would actually |

0:33:11 | carry all the information down here. And the way to evaluate it is to look |

0:33:15 | at, |

0:33:15 | you know, what kind of errors you have. |

0:33:17 | So what we did is we used a vector quantizer as a baseline, |

0:33:21 | with 312 bits. |

0:33:23 | And then the reconstruction |

0:33:24 | looks like this. So this is the original one, and this is the shallow model, right. |

0:33:29 | Now, using the deep autoencoder we get much closer to the original; in terms of errors, |

0:33:34 | we simply have much lower coding error |

0:33:38 | using an identical number of bits. |

0:33:39 | So it really shows that if you build a deep structure and extract these bottom-up features, |

0:33:45 | you condense more |

0:33:47 | information in terms of reconstructing the original signal. |

0:33:50 | And then we actually found that |

0:33:53 | for the spectrogram this result is the best. |

0:33:55 | For MFCC we still get some gain, but the gain is not nearly as much, which |

0:34:00 | sort of indirectly |

0:34:01 | convinced me of Geoff Hinton's |

0:34:03 | original advocacy of moving everybody |

0:34:06 | to the spectrogram. |

0:34:07 | So maybe we should have done the waveform; probably not, anyway. |
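The bottleneck idea just described, target equals input with a narrow code layer in the middle, can be sketched minimally. This is an illustrative toy under my own assumptions (dimensions, seed, plain gradient descent), not the Microsoft binary-coding system:

```python
import numpy as np

# Toy autoencoder: the training target is the input itself, squeezed
# through a narrow 8-unit code layer (which the binary-coding setup
# would threshold to bits).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.standard_normal((200, 20))        # stand-in "spectrogram" frames
W1 = 0.1 * rng.standard_normal((20, 8))   # encoder: 20 dims -> 8-dim code
W2 = 0.1 * rng.standard_normal((8, 20))   # decoder: code -> reconstruction

def forward(X):
    H = sigmoid(X @ W1)                   # code layer
    return H, H @ W2                      # reconstruction

_, R = forward(X)
err_before = np.mean((R - X) ** 2)

lr = 0.05
for _ in range(500):                      # plain gradient descent on MSE
    H, R = forward(X)
    G = 2.0 * (R - X) / len(X)            # dMSE/dReconstruction
    GH = (G @ W2.T) * H * (1.0 - H)       # backprop through the sigmoid code
    W2 -= lr * H.T @ G
    W1 -= lr * X.T @ GH

_, R = forward(X)
err_after = np.mean((R - X) ** 2)         # reconstruction error has dropped
```

Stacking more encoder layers before the code is what makes the coder "deep"; the comparison in the talk is against a shallow vector quantizer at the same bit budget.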

0:34:10 | Okay, so of course the next step came once we were all convinced that |

0:34:14 | the error analysis shows that |

0:34:17 | deep learning can correct a lot of errors, not all, but some |

0:34:21 | for which we understand why: it just picks up the power and the capacity it has. |

0:34:27 | So on average it does a little bit better, |

0:34:29 | based upon |

0:34:30 | this analysis. |

0:34:33 | Based upon this analysis it does slightly better. |

0:34:36 | But if you look beyond the average, |

0:34:38 | if you look at the error patterns, you really can see |

0:34:41 | that this has a lot of power, but it also has some shortcomings as well. |

0:34:45 | So both have pros and cons, but their errors are very different, and that |

0:34:49 | actually gives you the hint that, |

0:34:51 | you know, it is worthwhile to pursue. |

0:34:53 | Of course this was all very interesting |

0:34:56 | evidence to show. |

0:34:57 | And then, to scale up to industrial scale, we had to do a |

0:35:00 | lot of things, and many of my colleagues were working with me |

0:35:04 | on this. So first of all, |

0:35:06 | we needed to extend the output |

0:35:08 | from a small number of phones |

0:35:11 | or states |

0:35:12 | into a very large set, |

0:35:13 | and at that time that was actually motivated by |

0:35:16 | how to save Microsoft's huge investment in its speech decoder software. |

0:35:20 | I mean, if you don't do this, |

0:35:22 | if you do some other kind of output coding, |

0:35:27 | they would also have had to change the decoder to do it, and not everyone fully believed |

0:35:31 | that it was going to work. |

0:35:32 | And it turned out that if you need to change the decoder, you know, people just |

0:35:36 | say: wait a little bit. |

0:35:38 | So, |

0:35:41 | at the same time, we found that a context-dependent model gives much higher |

0:35:46 | accuracy |

0:35:46 | than a context-independent model for large tasks, okay. |

0:35:49 | For small tasks the difference is not as big. I think |

0:35:53 | it's all related to |

0:35:54 | a capacity saturation problem if you have too many outputs, |

0:35:57 | but since there is a lot of data |

0:35:59 | in |

0:36:01 | the training for large tasks, |

0:36:03 | you are actually keen |

0:36:04 | to form a very large output layer, and that turned out |

0:36:07 | to have, you know, |

0:36:09 | a double benefit. |

0:36:10 | One is that you increase accuracy, and number two is that you don't have to |

0:36:13 | change anything about the decoder. |

0:36:14 | And industry loves that. |

0:36:17 | You have both, |

0:36:18 | and that's actually, as I recall, why it took off. |

0:36:22 | So let me summarize what enabled this type of model. |

0:36:24 | Industrial knowledge about how to construct the very large output units in the DNN |

0:36:29 | is very important, |

0:36:30 | and that essentially comes from |

0:36:32 | everybody's work here, |

0:36:34 | which used this kind of context-dependent model for the Gaussian mixture model, you know, |

0:36:39 | that has been around for |

0:36:40 | almost twenty-some years. |

0:36:42 | And also |

0:36:43 | it depends upon industrial knowledge of how to make decoding with such huge output sets highly |

0:36:48 | efficient, using |

0:36:50 | our conventional |

0:36:51 | HMM decoding technology, |

0:36:53 | and of course how to make things practical. |

0:36:57 | And this is also a very important enabling factor: if the GPU hadn't come up |

0:37:03 | roughly at that time, hadn't become popular at that time, |

0:37:06 | all these experiments would have taken months to do |

0:37:08 | without all of this, without all this fancy infrastructure, |

0:37:14 | and then |

0:37:15 | people might not have had the patience to wait to see the results and push that |

0:37:18 | forward. |

0:37:19 | So let me show you a very |

0:37:22 | brief summary of the major |

0:37:26 | results obtained in the early days. |

0:37:29 | If we use three hours of training, this is TIMIT for example, we |

0:37:34 | got |

0:37:34 | the number I showed you; it's not much of a gain. |

0:37:38 | Now if you increase the data |

0:37:41 | ten times, to thirty-some hours, you get a twenty percent gain. |

0:37:46 | Now if you do more: |

0:37:48 | for Switchboard, this is the paper that my colleagues published here, |

0:37:52 | you get more data, another ten times, so you get two orders of magnitude of |

0:37:57 | increase, |

0:37:58 | and the relative gain actually |

0:38:00 | sort of |

0:38:01 | increases, you know: ten percent, twenty percent, thirty percent. |

0:38:06 | So of course if you increase |

0:38:08 | the size of the training data, |

0:38:10 | the baseline will improve as well, but the relative gain is even bigger. |

0:38:14 | And if people look at this result, there's |

0:38:16 | nobody |

0:38:17 | in their right mind who would say not to use this. |

0:38:20 | And that's how, |

0:38:21 | then, of course, a lot of companies, |

0:38:24 | you know, |

0:38:26 | actually went on to |

0:38:28 | implement it. The DNN is fairly easy to implement for everybody because, |

0:38:33 | and I missed one of the points over there, it actually turned out that if you use a |

0:38:37 | large amount of data, |

0:38:38 | the original |

0:38:41 | idea of using the DBN to regularize the model doesn't |

0:38:44 | help anymore. And in the beginning we didn't understand how that happened. |

0:38:49 | But anyway, now let me come back to the main theme of the talk: |

0:38:53 | how the generative model |

0:38:54 | and the deep neural network may help each other. |

0:38:57 | So kluge one was to use the big window; |

0:39:02 | at that time |

0:39:03 | we had to keep it. Now, at this conference, we see |

0:39:07 | people using the LSTM recurrent neural network, and that fixed this problem. |

0:39:12 | So this problem is fixed. |

0:39:14 | This problem is fixed automatically. |

0:39:17 | At that time |

0:39:19 | we thought we needed to use the DBN. Now, with the use of big data, there's no |

0:39:23 | need anymore. |

0:39:24 | And that's very well understood now. Actually, there are many ways to understand it. You |

0:39:28 | can think about it from a |

0:39:29 | regularization viewpoint, |

0:39:31 | and yesterday at the table with students I mentioned that, and people said: What is |

0:39:36 | regularization? |

0:39:37 | Then you have to understand it more in terms of the optimization viewpoint; |

0:39:41 | actually, if you stare at the back-propagation formula for ten minutes, you figure out why. |

0:39:47 | I actually have a slide on that; it's very easy to understand why from many perspectives. |

0:39:52 | With lots of data you really don't need that. |

0:39:54 | And that's automatically fixed: |

0:39:57 | you know, kind of by industrialization, we tried lots of data and |

0:40:00 | it's fixed. Now this kluge is not fixed yet. So this is actually the main |

0:40:03 | topic |

0:40:04 | that I'm going to spend the next twenty minutes on. |

0:40:07 | So before I do that, I will actually try to summarize some of |

0:40:11 | the major advances. My colleagues and I wrote this book, |

0:40:14 | and in this chapter we actually grouped |

0:40:16 | the major advancements of the deep neural network into several categories, |

0:40:22 | so I'm going to go through them quickly. |

0:40:24 | So one is optimization |

0:40:26 | innovation. |

0:40:27 | I think the most important advancement |

0:40:31 | over the early success I showed you |

0:40:36 | was the development of sequence discriminative training, and |

0:40:39 | this contributed an additional ten percent error rate reduction. |

0:40:42 | Many groups of people have done this. |

0:40:45 | For us at Microsoft, you know, this was our first intern coming to our |

0:40:49 | place to do it. |

0:40:50 | We tried it on TIMIT; we didn't know all the subtleties of the importance of |

0:40:56 | regularization, and |

0:40:56 | we got all the formulas right, everything right, |

0:40:59 | and the result wasn't very good. |

0:41:01 | But I think, |

0:41:02 | with Interspeech accepting our paper, we came to understand this, |

0:41:06 | and then later on |

0:41:09 | we got more and more papers; actually a lot of papers on this were published at Interspeech. |

0:41:13 | That's very good. |

0:41:15 | Okay, now the next theme is 'Towards Raw Input', okay. |

0:41:21 | What I showed you early on was the speech coding and analysis part, |

0:41:26 | where we know it is good. We don't need MFCC anymore. |

0:41:29 | So it was bye-bye MFCC; |

0:41:31 | probably it will disappear |

0:41:33 | from our community, slowly, over the next few years. |

0:41:36 | And we also want to say bye to Fourier transforms, but I put a question |

0:41:42 | mark here, partly because, |

0:41:43 | actually, at this Interspeech, I think two days ago, Hermann had a very |

0:41:48 | nice paper on |

0:41:49 | this, and I encourage everybody to take a look at it. |

0:41:52 | You just put the raw waveform in there, |

0:41:55 | which was actually done about three years ago by Geoff Hinton's students; they truly believed in |

0:42:00 | it. I couldn't. |

0:42:01 | I had tried that around 2004; that was the hidden-Markov-model |

0:42:04 | era, |

0:42:05 | and we understood all kinds of problems about how to normalize the input, so I said |

0:42:09 | it's crazy. |

0:42:10 | And then when they published the result |

0:42:13 | at |

0:42:14 | ICASSP, I looked at the results and the error was terrible. I mean, there was so much |

0:42:17 | error |

0:42:17 | that nobody paid attention. And this year attention was brought back to this, |

0:42:21 | and the result is almost as good as using, you know, |

0:42:25 | using Fourier transforms. |

0:42:27 | So far we don't want to throw them away yet, |

0:42:29 | but maybe next year people may throw them away. |

0:42:33 | The nice thing is, I was very curious about this: |

0:42:37 | it turns out that to get that result they just initialized everything randomly, rather than |

0:42:41 | using Fourier transforms |

0:42:42 | to initialize it, and that's very intriguing. |

0:42:46 | There are too many references to list; I was updating the list all the time. |

0:42:50 | But yesterday, when I went through the adaptation session, there were so many good papers around |

0:42:55 | that I just don't have the patience to list them anymore. |

0:42:57 | So go back to the adaptation papers; there are a lot of new |

0:43:02 | advancements. Another important thing is transfer learning, |

0:43:05 | which plays a very important role in multilingual acoustic modelling. |

0:43:10 | That was a tutorial that, actually, Tanja was giving at a workshop |

0:43:17 | I was attending. |

0:43:18 | I also mention that |

0:43:20 | for the generative model, |

0:43:22 | for the shallow models before this, |

0:43:24 | multilingual |

0:43:26 | modelling, |

0:43:28 | of course, |

0:43:28 | actually |

0:43:30 | improved things, |

0:43:32 | but it never actually beat the baseline |

0:43:36 | in terms of, |

0:43:39 | well, think about cross-lingual, for example: multilingual and cross-lingual. |

0:43:42 | And deep learning actually beat the baseline. So there's a whole bunch of |

0:43:44 | papers in this area which I won't have time to go through here. |

0:43:47 | Another important innovation is nonlinear regularization, namely |

0:43:50 | dropout; if you don't know dropout, it's good to know about. |

0:43:54 | It is a special technique: essentially you just kill hidden units |

0:43:57 | randomly, and you get a better result. |
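As a hedged sketch (the function name, shapes, and keep probability here are mine, not from the talk), the "randomly kill hidden units" idea in its common inverted-dropout form looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(h, p=0.5, training=True):
    """Zero each unit with probability p during training; rescale the
    survivors so the expected activation matches test time."""
    if not training:
        return h                          # test time: keep every unit
    mask = rng.random(h.shape) >= p       # True = keep, with prob 1 - p
    return h * mask / (1.0 - p)

h = np.ones(1000)
out = dropout(h, p=0.5)
# Roughly half the units become 0 and the kept ones become 2.0,
# so the mean activation stays near 1.0 on average.
```

Because each training step sees a different random thinning of the network, no unit can rely on a specific co-adapted partner, which is the regularization effect being described.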

0:44:03 | And in terms of hidden units, |

0:44:05 | a |

0:44:06 | very popular unit now is the rectified linear unit, |

0:44:09 | and there are some very interesting, |

0:44:11 | many interesting, theoretical analyses of why this is better than the logistic unit. |

0:44:16 | At least in my experience, and I actually programmed this, it changed our |

0:44:20 | lives |

0:44:21 | to go from one to the other. |

0:44:23 | The learning speed |

0:44:24 | really increases, |

0:44:26 | and we understand now why that happens. |

0:44:29 | Also, in terms of accuracy, different groups report different results. |

0:44:32 | Some groups report reduced error rates; nobody has reported an increase in error |

0:44:37 | rates so far. |

0:44:38 | So in any case it speeds up |

0:44:40 | the convergence dramatically. |
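One common explanation for the faster convergence (my illustration, not a slide from the talk) is that the logistic unit's gradient vanishes for large inputs, while the ReLU's gradient stays at one on its active side:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 at z=0, vanishes for large |z|

def relu_grad(z):
    return float(z > 0)           # exactly 1 wherever the unit is active

z = 5.0
print(sigmoid_grad(z))            # ~0.0066: the backpropagated error is crushed
print(relu_grad(z))               # 1.0: the error signal passes through intact
```

Stacking many layers multiplies these factors, so saturating units can shrink the gradient exponentially with depth, while ReLU layers do not.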

0:44:43 | So I'm going to show you another architecture over here, which is going to link |

0:44:48 | to |

0:44:49 | a generative model. |

0:44:51 | This is a model called the Deep Stacking Network. |

0:44:55 | By its very design it is a deep neural network, okay: information flows bottom-up. |

0:45:00 | The difference between this model and a conventional deep neural network is that |

0:45:04 | at every single layer you can actually |

0:45:07 | bring in the original input again and then do some special processing here. |

0:45:15 | In particular, you can alternate |

0:45:17 | linear and nonlinear layers; if you do that, you can dramatically increase your |

0:45:23 | speed of convergence |

0:45:26 | in deep learning. |

0:45:27 | And there is some theoretical analysis of this, which is actually in one of the books |

0:45:31 | I wrote. |

0:45:32 | You can actually convert many complex, |

0:45:35 | back-propagation-based, |

0:45:37 | non-convex problems into |

0:45:38 | somewhat |

0:45:41 | better-behaved problems related to |

0:45:44 | convex optimization, so we can understand their properties. |

0:45:46 | We did that a few years ago and wrote a paper on it. |
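A minimal sketch of the stacking idea just described (the sizes, the random-feature hidden layer, and the toy target are my assumptions, not the published architecture): each module sees the raw input concatenated with the previous module's prediction, and only its linear output layer is fit, in closed form by least squares, which is the convex subproblem being referred to.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
y = X[:, :1] + 0.5 * X[:, 1:2]            # toy regression target

def module(inp, y, hidden=32):
    """One stacking module: a fixed nonlinear hidden layer, then a linear
    output layer solved exactly (a convex least-squares subproblem)."""
    W = rng.standard_normal((inp.shape[1], hidden))
    H = np.tanh(inp @ W)
    U, *_ = np.linalg.lstsq(H, y, rcond=None)
    return H @ U

pred = module(X, y)
for _ in range(3):                        # stack: raw input + previous output
    pred = module(np.hstack([X, pred]), y)

mse = np.mean((pred - y) ** 2)            # far below the target's variance
```

In the published version the hidden weights are also refined, but the point the speaker makes survives in the sketch: each added module re-solves an easy convex piece instead of one huge non-convex problem.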

0:45:49 | And this idea can also be used for the |

0:45:53 | tensor version of this network, which I don't have time to go through here. The reason |

0:45:56 | I bring this up is |

0:45:57 | because it's actually related to some recent work |

0:46:00 | that I have seen |

0:46:01 | on generative models, where the two are taking on properties of each other, so let me compare the |

0:46:07 | two of |

0:46:08 | them to give you an example of how |

0:46:10 | the two |

0:46:11 | networks can help each other. |

0:46:13 | So when we developed the deep stacking network, the activation function had to be fixed: |

0:46:20 | either logistic or ReLU, which both |

0:46:22 | work reasonably well, |

0:46:23 | you know, compared |

0:46:25 | with each other. |

0:46:28 | Now look at this architecture. |

0:46:31 | Almost identical architecture. |

0:46:33 | So now, |

0:46:35 | what if you change the |

0:46:38 | activation function to be something very strange? I don't expect you to know anything about |

0:46:42 | this; |

0:46:43 | this is actually work done by Mitsubishi people. |

0:46:46 | There's a very nice paper on it here in the technical program. |

0:46:50 | I spent a lot of time talking to them, and they even came to |

0:46:52 | Microsoft, so I actually listened to some of their talks and their demo. |

0:46:56 | So this model, with its activation function, is called the Deep Unfolding Model, |

0:47:00 | and it is derived from the inference method of a generative model, |

0:47:06 | so it is not fixed like the ones I showed you earlier. On top, |

0:47:11 | this model looks like a deep neural network, right? |

0:47:14 | But the beginning, |

0:47:16 | the initial phase, is their generative model, which is specific. |

0:47:20 | I hope many of you know non-negative matrix factorization. This is a specific technique |

0:47:26 | which is actually a shallow generative model. |

0:47:29 | It makes a very simple assumption: that |

0:47:32 | the |

0:47:33 | observed noisy speech, or mixed speakers' speech, is the sum of two sources |

0:47:40 | in the spectral domain. |

0:47:41 | That is the assumption they make, |

0:47:43 | and then of course they have to enforce that each, |

0:47:46 | you know, |

0:47:47 | each vector is nonnegative, because these are magnitude spectra. |

0:47:52 | The inference they do is an iterative technique. |

0:47:58 | And that |

0:47:59 | model automatically embeds the domain knowledge about how the observation |

0:48:04 | is obtained, you know, through the mix between the two sources. |

0:48:08 | And then this work essentially says how to apply that inference iteration: every single |

0:48:13 | iteration is treated as a different |

0:48:16 | layer. |

0:48:18 | After this they do back-propagation training. |
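Illustratively (this is the textbook multiplicative update for NMF, under my own made-up dictionary, sizes, and seed, not the exact Mitsubishi recipe), each inference iteration below is what the unfolding view turns into one network layer, to be fine-tuned by back-propagation afterwards:

```python
import numpy as np

rng = np.random.default_rng(0)
F, K, T = 16, 4, 8                        # freq bins, basis vectors, frames
W = rng.random((F, K)) + 0.1              # fixed nonnegative dictionary
V = W @ rng.random((K, T))                # "observed" magnitude spectrogram

H = np.ones((K, T))                       # nonnegative activations to infer
eps = 1e-9
errs = []
for _ in range(50):                       # each pass = one unfolded "layer"
    H *= (W.T @ V) / (W.T @ W @ H + eps)  # multiplicative update (Euclidean cost)
    errs.append(np.linalg.norm(V - W @ H))
# the reconstruction error is non-increasing and H stays nonnegative,
# because the update only rescales H by nonnegative ratios
```

Unfolding means fixing a small number of these iterations, treating the quantities in each one as layer parameters, and then back-propagating an enhancement loss through them, so the network inherits the generative model's nonnegativity and mixing structure.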

0:48:21 | And the backward pass is possible |

0:48:24 | because |

0:48:25 | the problem is very simple: the application here is speech enhancement, |

0:48:29 | therefore the objective function is mean-square error, very easy. So the generative model, |

0:48:34 | the generative model, actually gives you |

0:48:39 | the |

0:48:40 | generated observation, |

0:48:42 | and then |

0:48:43 | your target output is clean speech. |

0:48:45 | Okay, then you apply mean-square error and you actually adapt everything this way, |

0:48:48 | and the results are very impressive. So this shows, as |

0:48:52 | I said, that you can design a deep neural network |

0:48:55 | where, if you use this |

0:48:57 | type of |

0:48:58 | activation function, you automatically build in the constraints that you use in the generative model, |

0:49:03 | and that's a |

0:49:04 | very good example of |

0:49:06 | the message that, |

0:49:09 | actually, I put at the beginning of the presentation: the |

0:49:11 | hope of the deep generative model. Now, this is a |

0:49:14 | shallow model, and it's easy to do. For a deep generative model |

0:49:18 | it's very hard to do. |

0:49:19 | And one of the reasons I made this a topic today is partly because |

0:49:25 | at a recent conference, |

0:49:27 | just three months ago, |

0:49:30 | the ICML conference in Beijing, |

0:49:33 | there was a very nice development |

0:49:35 | in learning methods for deep generative models. |

0:49:40 | They actually linked the |

0:49:42 | neural network and the Bayes net together |

0:49:44 | through some transformation, |

0:49:46 | and because of that the main idea appears in a whole bunch of papers, including |

0:49:51 | from Michael Jordan |

0:49:52 | and, you know, a lot of very well-known people |

0:49:54 | in machine learning working on deep generative models. |

0:49:56 | So the main |

0:49:58 | point of this set of work, and I just want to use one simple sentence to |

0:50:03 | summarize it, |

0:50:03 | is this: |

0:50:04 | when you originally tried to do the E step I showed you early on, |

0:50:09 | you had to factorize the posterior in order to get each step done, |

0:50:12 | and that was an approximation, |

0:50:13 | and the approximation error could be so large that it's practically useless |

0:50:18 | for inferring the top-layer |

0:50:24 | discrete events. |

0:50:25 | The whole point is that now we can relax that factorization constraint. |

0:50:30 | Before, three years ago, if you kept the rigorous |

0:50:35 | dependency, |

0:50:36 | you didn't get any reasonable analytical solution, so you could not do EM. |

0:50:42 | Now the |

0:50:43 | idea is to say that you can approximate |

0:50:48 | that factorization, |

0:50:49 | you can approximate that dependency, in E-step learning |

0:50:52 | not through |

0:50:55 | factorization, which is called the mean-field approximation, |

0:50:57 | but by using a deep neural network to approximate it. |
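In symbols (my notation, summarizing the general idea rather than any single one of the papers mentioned): instead of the mean-field factorization of the E step, a recognition network with parameters $\phi$ produces the approximate posterior directly from the data, and both networks are trained on the evidence lower bound.

```latex
\text{Mean-field E step:}\quad
  q(h) \;\approx\; \prod_i q_i(h_i)
\\[4pt]
\text{Neural-network E step:}\quad
  q_\phi(h \mid x) \;=\; \mathcal{N}\!\big(h;\ \mu_\phi(x),\ \operatorname{diag}\sigma_\phi^2(x)\big)
\\[4pt]
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(h \mid x)}\!\left[\log p_\theta(x \mid h)\right]
  \;-\; \mathrm{KL}\!\left(q_\phi(h \mid x)\,\middle\|\,p(h)\right)
```

The point the talk is building toward is the last line: the bound no longer requires $q$ to factorize, because the recognition network carries the dependency.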

0:51:01 | So this is an example showing that the deep neural network can actually help you solve |

0:51:05 | the deep generative model problem, and |

0:51:07 | this is the well-known Max Welling, a very good friend of mine in machine |

0:51:12 | learning, |

0:51:14 | and he told me that no paper had shown that before. |

0:51:17 | They really developed |

0:51:20 | the theorem to prove that if the network is large enough, |

0:51:24 | the approximation error can approach |

0:51:26 | zero. Therefore the variational approximation error |

0:51:31 | can be eliminated, and that's a very nice engine they |

0:51:33 | developed, which really gives me some evidence to show, |

0:51:36 | to see, that this is |

0:51:38 | a promising approach. I think the machine learning community developed the tools, |

0:51:42 | our speech community developed the verification |

0:51:45 | and also the methodology as well, |

0:51:47 | but if, |

0:51:48 | you know, we actually cross-connect |

0:51:50 | with each other, we are going to make much more progress, and this type |

0:51:55 | of development |

0:51:55 | really |

0:51:56 | gives a |

0:51:58 | promising direction |

0:52:00 | towards the main message I put out at the beginning. |

0:52:03 | Okay, so now I am going to show you some further results that I want to |

0:52:07 | show you. |

0:52:09 | Another, better architecture that we know is what's called the recurrent network; if you |

0:52:14 | read |

0:52:14 | Beaufays' LSTM paper, look at that result: for |

0:52:18 | voice search the error rate jumped down to about ten percent. That's a very impressive result. |

0:52:22 | Another type of architecture integrates convolutional |

0:52:27 | and non-convolutional layers together. That was used |

0:52:30 | in the previous result, and I am not aware of any better result than this one. |

0:52:33 | So these are the state of the art for the Switchboard (SWBD) task. |

0:52:37 | So now I'm going to concentrate on this type of |

0:52:40 | recurrent network here. |

0:52:43 | Okay, so this comes down to one of my main messages here. |

0:52:47 | We fixed this kluge |

0:52:51 | with |

0:52:51 | the recurrent network. |

0:52:54 | We also fixed this kluge automatically |

0:52:58 | by |

0:53:00 | just using big data. |

0:53:01 | Now how do we fix this third kluge? |

0:53:05 | First of all, I'll show you some analysis of the recurrent network versus the deep generative |

0:53:11 | model, |

0:53:11 | that is, the hidden dynamic model I showed you early on, okay. |

0:53:14 | So far this analysis hasn't been applied to the LSTM, |

0:53:17 | so some further analysis may |

0:53:20 | actually automatically give rise to the LSTM from this kind of analysis. |

0:53:24 | So this analysis is very preliminary |

0:53:27 | and so if you stare at the equotation |

0:53:29 | for recurrent network it looks like best one. So essentially you have state of the |

0:53:33 | art equotation |

0:53:34 | and it's recursive. |

0:53:35 | Okay, |

0:53:36 | from previous hidden layer to this. |

0:53:40 | And then you get the output |

0:53:43 | that produces the label. |

0:53:45 | Now if you look at this deep generative model - hidden dynamic model |

0:53:48 | identical equotation, |

0:53:50 | okay? Now what's the differece? |

0:53:52 | The difference is that the input now is the label. Actually if you put the |

0:53:56 | label |

0:53:57 | you cannot drive it. So you have to make some connection between labels and continuous |

0:54:01 | variable |

0:54:02 | and that's what in phonetic |

0:54:03 | people call phonology to phonetic interface, okay. |

0:54:06 | So we use a very basic assumption

0:54:08 | that the interface is simply that each label corresponds to a target vector;

0:54:14 | actually, the way we implemented it earlier was as a distribution, which you can use to account for

0:54:18 | speaker

0:54:18 | differences, etcetera. Now the output

0:54:21 | of this recursion gives you the observation,

0:54:24 | and that's a recurrent-filter type of model.
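The correspondence described here can be sketched in a few lines. This is only a toy illustration, not the actual models: all dimensions, weights, and inputs below are made up, and the point is just that the two models share one recursive state equation and differ in the direction the signal flows, acoustics in and labels out (recurrent network) versus a label's target vector in and acoustics out (hidden dynamic model).

```python
import numpy as np

rng = np.random.default_rng(0)
H, X = 4, 3                          # hidden and observation dimensions (arbitrary)
W = 0.1 * rng.normal(size=(H, H))    # matrix governing the internal dynamics
U = 0.1 * rng.normal(size=(H, X))    # input projection
V = 0.1 * rng.normal(size=(X, H))    # output projection

def step(h_prev, inp):
    """The shared recursion: h_t = tanh(W h_{t-1} + U inp)."""
    return np.tanh(W @ h_prev + U @ inp)

# Recurrent-network direction: acoustic frames drive the state,
# and the output would feed a label softmax.
h = np.zeros(H)
for x_t in rng.normal(size=(5, X)):  # 5 frames of fake acoustics
    h = step(h, x_t)
label_logits = V @ h

# Hidden-dynamic-model direction: a per-label target vector (the
# "phonology-to-phonetics interface") drives the same recursion,
# and the output of the recursion is the predicted observation.
target = rng.normal(size=X)          # target vector for one label
h = np.zeros(H)
for _ in range(5):
    h = step(h, target)
predicted_obs = V @ h

print(label_logits.shape, predicted_obs.shape)
```

Reversing which side is input and which is output is exactly the "reverse the direction" conversion the talk mentions next.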

0:54:28 | So that's the engineering model and this is the neural network model, okay. So every time I was

0:54:32 | teaching

0:54:32 | ?? I called ?? on this.

0:54:34 | So we fully understood all the constraints for this type of model.

0:54:39 | Now for this model it looks the same, right?

0:54:41 | So if you reverse the direction you convert one model into the other.

0:54:44 | And for this model it's very easy to put in a constraint. For example,

0:54:49 | the

0:54:50 | dynamics

0:54:53 | matrix here that governs

0:54:56 | the internal dynamics in the hidden domain can actually be made sparse, and then you

0:55:00 | can put a

0:55:02 | realistic constraint there. For example, in our

0:55:04 | earlier implementation of this we imposed critical dynamics,

0:55:08 | so you can guarantee it doesn't oscillate. When we deal with articulation we need phone boundaries.

0:55:12 | This is the speech production mechanism;

0:55:15 | you can put them in simply by fixing the sparse matrix.

0:55:17 | Actually one of the slides I'm gonna show you is all about this. |

0:55:22 | In this one we cannot do it; everything has to be a generic structure.

0:55:25 | There's just no way you can say why you want the dynamics

0:55:29 | to behave in a certain way.

0:55:32 | You just don't have any mechanism to design the structure of this, whereas over there it's

0:55:36 | very natural: it's the physical

0:55:37 | properties that design it. Now because of

0:55:40 | this correspondence, and because of the fact that now we can do

0:55:44 | deep inference,

0:55:47 | if all this machine learning technology is fully developed

0:55:51 | we can very naturally bridge the two models together.

0:55:53 | It turns out that if you do more

0:55:55 | rigorous analysis,

0:55:56 | by

0:55:57 | making the inference here fancier,

0:56:00 | our hope is that

0:56:02 | this

0:56:03 | multiplicative

0:56:04 | kind of unit would automatically emerge from this type of model, but that has not

0:56:08 | been shown yet.

0:56:10 | So of course this is just, you know, a very high-level comparison between the two;

0:56:15 | there are a lot of detailed comparisons you can make in order to bridge the

0:56:19 | two.

0:56:19 | So actually my colleague Dong Yu wrote this book that's coming out very soon.

0:56:26 | So in one of the chapters we put all these comparisons: interpretability, parametrization, methods

0:56:32 | of learning, nature of representation, and all the differences.

0:56:36 | So it gives you a chance to actually understand

0:56:38 | how the deep generative model, in terms of dynamics,

0:56:42 | and the recurrent network, in terms of recurrence, can

0:56:44 | be matched with each other, so you can read about that over there.

0:56:48 | So I have the final, three more minutes, five more minutes. I will go

0:56:53 | very quickly.

0:56:54 | Every time I give this talk I run out of time.

0:56:57 | So |

0:56:59 | the key concept is called embedding.

0:57:01 | Okay, so actually you can find literature from the nineties and eighties that had this

0:57:07 | basic idea around.

0:57:09 | For example, in this special issue of

0:57:12 | Artificial Intelligence there are very nice papers; I had the chance to read them all.

0:57:15 | They're very insightful, and some of the chapters over here are very good.

0:57:18 | So the idea is that each physical or linguistic

0:57:23 | you know,

0:57:24 | entity,

0:57:25 | a word, a phrase, even a whole article or a whole paragraph,

0:57:29 | can be embedded into a

0:57:30 | continuous-space vector. It could be a big ??, you know.

0:57:34 | Just to let you know, there's a special issue on this topic.

0:57:38 | And that's why it's an important concept.
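To make the embedding idea concrete, here is a toy sketch; the vocabulary, the dimension, and the averaging scheme are all illustrative assumptions, not from the talk. Each word maps to a dense vector, and a larger unit such as a phrase (or a whole paragraph) can be represented in the same continuous space, for example by averaging its word vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["deep", "learning", "speech"]
emb = {w: rng.normal(size=5) for w in vocab}   # word -> 5-dim vector (made up)

# A phrase lives in the same continuous space,
# here crudely via the mean of its word vectors.
phrase = ["deep", "learning"]
phrase_vec = np.mean([emb[w] for w in phrase], axis=0)
print(phrase_vec.shape)
```

Real systems learn these vectors from data rather than drawing them at random; the lookup-then-compose pattern is the part being illustrated.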

0:57:41 | The second important concept, which is much more advanced,

0:57:44 | is described by a few books over here. I really enjoyed reading some of

0:57:49 | those, and I invite those

0:57:50 | people to come visit me.

0:57:52 | We have a lot to discuss on that. You can actually even embed a structure,

0:57:56 | a

0:57:57 | syntactic or semantic structure, into a vector

0:58:01 | where you can recover the structure completely through vector

0:58:04 | operations, and the concept is called tensor-product representation.

0:58:08 | If only I had three hours I could go through

0:58:11 | all of this.

0:58:11 | But for now I'm going to elaborate on this for the next two minutes.
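The tensor-product idea can be shown in a few lines. This toy (two symbols, orthonormal role vectors, everything invented for illustration) demonstrates binding and unbinding: each filler (symbol) is bound to its structural role (position) by an outer product, the bindings are summed into one object, and a filler is recovered exactly by multiplying with its role vector.

```python
import numpy as np

# Filler (symbol) vectors and orthonormal role (position) vectors.
fillers = {"cat": np.array([1.0, 0.0]), "sat": np.array([0.0, 1.0])}
roles = {"pos0": np.array([1.0, 0.0]), "pos1": np.array([0.0, 1.0])}

# Bind each filler to its role with an outer product and sum:
# the ordered structure ("cat", "sat") becomes one tensor.
T = np.outer(fillers["cat"], roles["pos0"]) + np.outer(fillers["sat"], roles["pos1"])

# Unbinding: because the roles are orthonormal, multiplying the
# tensor by a role vector recovers that position's filler exactly.
recovered = T @ roles["pos0"]
assert np.allclose(recovered, fillers["cat"])
```

Exact recovery depends on the role vectors being orthonormal; with merely linearly independent roles one would unbind with the dual basis instead.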

0:58:16 | So |

0:58:17 | This is the recurrent neural network model, and this is very nice. I mean this

0:58:21 | is a fairly informative paper

0:58:22 | showing that embedding can be done

0:58:25 | as a byproduct of the recurrent neural network; that

0:58:28 | paper was published at Interspeech several years ago.

0:58:34 | And then I'll talk very quickly about semantic embedding at MSR. So

0:58:39 | the difference between this set of work and the previous work is that there

0:58:42 | everything is completely unsupervised,

0:58:44 | and in a company, if you have supervision, you should grab it, right?

0:58:48 | So we actually took the initiative to make some

0:58:51 | very smart

0:58:52 | exploitation of supervision signals

0:58:54 | at virtually no cost.

0:58:57 | So the idea here was that this is the model we have: essentially,

0:59:01 | each branch is a deep neural network. Now different

0:59:03 | branches can actually be linked together

0:59:05 | through what's called, you know, the cosine distance.

0:59:08 | So that

0:59:09 | distance can be measured

0:59:10 | between

0:59:11 | vectors, in a vector space.

0:59:13 | And now we do MMI learning,

0:59:16 | so if you get "hot dog" in this one, and your document is talking about

0:59:20 | fast food or something, even if

0:59:22 | there's no word in common you pick it up,

0:59:24 | because the supervision actually links them together.

0:59:27 | Whereas if you have "dog racing" here,

0:59:29 | they share a word although they will end up very far apart from each other.

0:59:33 | And that can be done automatically.
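The two-branch architecture with a cosine link can be sketched as below. The shapes, weights, and inputs are all invented and the networks are untrained, so only the structure is the point: two DNN branches map text features into a shared vector space, and relevance is scored by cosine similarity there. In the real system the weights would be trained, for example with the MMI-style objective just mentioned, so that related query-document pairs score high even without shared words.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, HID, EMB = 10, 8, 4           # arbitrary toy sizes

def branch(x, W1, W2):
    """One DNN branch: text features -> hidden layer -> semantic vector."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Separate (untrained) weights for the query branch and the document branch.
W1q, W2q = rng.normal(size=(HID, VOCAB)), rng.normal(size=(EMB, HID))
W1d, W2d = rng.normal(size=(HID, VOCAB)), rng.normal(size=(EMB, HID))

# Fake bag-of-words features for a query and a document.
query = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1], dtype=float)
doc = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0], dtype=float)

score = cosine(branch(query, W1q, W2q), branch(doc, W1d, W2d))
print(score)                          # lies in [-1, 1]
```

Training would adjust all four weight matrices jointly so that the cosine score ranks clicked documents above random ones.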

0:59:37 | Some people told me that topic models can do

0:59:39 | similar things, so we compared this with the topic model,

0:59:42 | and it turned out that ??

0:59:45 | and using this

0:59:46 | deep semantic model

0:59:48 | we can do much, much better.

0:59:49 | So, now multi-modal. Just one more slide. |

0:59:53 | So it turns out that not only text can be embedded;

0:59:57 | images can be embedded, speech can be embedded, and you can do something very similar

1:00:01 | to what I showed you earlier.

1:00:03 | And this is the paper from yesterday's talk about embedding.

1:00:09 | That's very nice; I mean it's a very similar concept.

1:00:12 | So I looked at this and I said, wow, it's just like the model that

1:00:15 | we did for text.

1:00:16 | But it turns out that the application is very different.

1:00:18 | So actually

1:00:20 | I don't have time to go through it here. I encourage you to read some papers

1:00:24 | over here. Let's skip this.

1:00:25 | So this was just to show you some applications of this

1:00:27 | semantic model. You can do all kinds of things. We applied it to web search

1:00:30 | quite nicely. For machine translation you treat each language

1:00:34 | as one entity.

1:00:37 | In the list of published papers you can find some details.

1:00:40 | You can actually do summarization and entity ranking.

1:00:45 | So let's skip this. This is the final slide, the real final slide.

1:00:49 | I don't have any summary slides; this is my summary slide.

1:00:51 | So I copied the main message here; now it can be elaborated a bit more, after going through a

1:00:55 | whole hour of presentation.

1:00:57 | Now in terms of applications we have seen

1:01:00 | speech recognition.

1:01:01 | The green is the

1:01:03 | neural network, the red is the deep generative model. So

1:01:07 | I said a few words about the deep generative model and the dynamic model,

1:01:11 | that's the generative model side, and the LSTM is on the other side. Now for speech enhancement

1:01:16 | I showed you these types of models,

1:01:19 | and then

1:01:20 | on the generative model side I showed you this one,

1:01:25 | and this is a shallow generative model that actually can

1:01:28 | give rise to a deep structure, corresponding to the

1:01:31 | deep

1:01:33 | stacking network I showed you earlier. Now for algorithms, we have back-propagation

1:01:37 | here.

1:01:38 | That's the single unchallenged

1:01:40 | algorithm for the deep neural network.

1:01:42 | Now for the deep generative model there are two algorithms. They are both called

1:01:45 | BP.

1:01:47 | So one is called Belief Propagation, for those of you who know machine learning.

1:01:51 | The other one is BP, the same as this.

1:01:54 | That only came up within the last two years,

1:01:57 | due to this new advance

1:02:00 | of porting the deep neural network

1:02:02 | into the inference step

1:02:04 | of this type of model. So I call them BP and BP.

1:02:08 | And in terms of neuroscience you call this one wake and you call

1:02:11 | the other one sleep.

1:02:12 | And in sleep you generate things, you get hallucination, and then when you're awake

1:02:16 | you have perception.

1:02:17 | You get information there. I think that's all I want to say. Thank you very

1:02:20 | much. |

1:02:29 | Okay. Anyone, one or two quick questions?

1:02:37 | Very interesting talk. |

1:02:40 | I don't want to talk about your main point, which is very interesting,

1:02:43 | but just very briefly about one of your side messages, which is about waveforms.

1:02:48 | So you know, in the ?? paper they weren't really putting in

1:02:54 | waveforms.

1:02:54 | They were putting in the waveforms, taking the absolute value, flooring it, taking the

1:02:58 | logarithm, averaging over; you know, so you had to do a lot of things.

1:03:03 | Secondly, about the other papers: there's been a modest

1:03:07 | amount of work in the last few years on doing this sort of thing, and

1:03:10 | pretty generally people do it with matched training and test conditions.

1:03:14 | If you have mismatched conditions, good luck with the

1:03:16 | waveform. I always hate to say something is impossible, but good luck.

1:03:24 | Thank you very much. ?? good for everything.

1:03:27 | And thanks for looking at the presentation; that was very nice, thank you.

1:03:32 | Any other quick questions? |

1:03:36 | If not, I invite Haizhou

1:03:40 | to present a plaque.