Speech Transcript - Achievements and Challenges of Deep Learning - From Speech Analysis And Recognition To Language And Multimodal Processing

0:00:15	First of all, I would like to thank the organisers for giving me
0:00:19	this opportunity to share with you
0:00:22	some of my personal views
0:00:25	on this very hot topic here. So,
0:00:29	I think the goal of this tutorial really is to
0:00:33	help diversify the deep learning approach. Just like the theme of this conference,
0:00:40	Interspeech For Diversifying
0:00:42	the Language, okay.
0:00:45	So I have a long list of people to thank. Oh, so I want.
0:00:49	Yeah, thank you.
0:00:50	So I have a long ... long list of people here to thank.
0:00:53	Especially Geoff Hinton. I worked with him for some period of time.
0:00:58	And Dong Yu and a whole bunch of Microsoft colleagues.
0:01:02	Oh, who,
0:01:04	hmm,
0:01:05	contributed a lot to the material
0:01:08	I'm going to go through.
0:01:10	And also I would like to thank many of the colleagues sitting here who had
0:01:14	a lot of discussions with me.
0:01:16	And their opinions also shaped some of the content that I am going to go
0:01:20	through with you over the next hour.
0:01:23	Yeah, so the main message of this talk
0:01:26	is that deep learning is not the same as a deep neural network. I think in
0:01:30	this community most people
0:01:31	mistake deep learning for deep neural networks.
0:01:36	And most ...
0:01:38	So deep learning is something that everybody here would know. I mean just look at
0:01:42	... I think I counted close to 90 papers somewhere
0:01:44	related to deep learning, or approaching it at least. The number of papers is kind of
0:01:50	exponentially
0:01:50	increasing over the last twelve years.
0:01:53	So a deep neural network is essentially a neural network
0:01:56	that you can unfold in space. You form a big network.
0:02:01	Or,
0:02:02	and,
0:02:03	either way or both, you can unfold that over time. If you unfold
0:02:07	that neural network over time you get a recurrent network, okay.
0:02:11	But there's another very big branch of deep learning, which I would call Deep Generative
0:02:16	Model.
0:02:17	Like that type of neural network, you can also unfold it in space and in time.
0:02:22	If it's unfolded in time, you would call it a dynamic model.
0:02:26	Essentially the same concept. You unfold the network,
0:02:31	you know,
0:02:32	in the same direction in terms of time.
0:02:36	But in terms of space they are unfolded in the
0:02:39	opposite direction. So I'm going to elaborate on this part. For example,
0:02:43	our very commonly used model,
0:02:46	you know, the Gaussian mixture model, hidden Markov model, really has the
0:02:53	neural network unfolding in time.
0:02:57	But if you make that unfold in space you get a big generative model,
0:03:00	which hasn't been very popular in our community.
0:03:05	You know, I'm going to survey a whole bunch of work related to this area,
0:03:09	you know, through my discussions with many people here.
0:03:14	But anyway, the main message of this talk is eventually a
0:03:20	hope, and I think it's a promising direction that is already taking place in the machine
0:03:26	learning community.
0:03:27	I don't know how many of you actually went to the International Conference on Machine Learning
0:03:30	(ICML) this year, just a couple of months ago in Beijing.
0:03:33	But there's a huge amount of work on deep generative models and some very interesting
0:03:36	developments, which I'd like to share with you at a high level,
0:03:41	so you can see that all this deep learning, although it just started in terms
0:03:46	of application in our
0:03:47	speech community, we should be very proud of that.
0:03:51	Hmm, now,
0:03:52	in a number of machine learning communities there's a huge amount of work going on
0:03:57	on deep generative models. So I hope I can share some of the recent
0:03:59	developments with you
0:04:02	to reinforce the message that
0:04:05	a good combination of the two,
0:04:08	which have
0:04:10	complementary strengths and weaknesses, can actually come together to further
0:04:15	advance deep learning in our community here.
0:04:19	Okay, so now. These are very big slides. I'm not going to go through all
0:04:23	of the details. I'm just going to highlight a few
0:04:25	things in order to reinforce the message that
0:04:30	the generative model and
0:04:31	the neural network model can help each other.
0:04:34	I'm just going to highlight a few key attributes of
0:04:38	both approaches. They are very different approaches.
0:04:41	I'm going to highlight that very briefly. First of all,
0:04:45	in terms of structure they are both graphical in nature, as a network, okay.
0:04:48	If you think about this deep generative model, typically some of these
0:04:56	we call a dynamic Bayesian network. You actually have a joint probability between the ?? label
0:05:01	and the observation,
0:05:03	which is not the case for the deep neural network,
0:05:05	okay.
0:05:06	In the literature you see many other terms
0:05:10	that relate to the deep generative model, like probabilistic graphical model
0:05:14	or stochastic neurons;
0:05:17	sometimes it's called a stochastic generative network, as you see in the literature. They all belong
0:05:21	to this
0:05:22	category. So if your mindset is over here, even though you see some neural words
0:05:28	describing that,
0:05:29	you know, you won't be able to read all this literature, so the mindset is
0:05:32	very different when you study these two.
0:05:34	So the strength
0:05:35	of the deep generative model is,
0:05:39	this is very important to me,
0:05:42	how to interpret it, okay.
0:05:44	So everybody that I talked to, including at lunchtime when I talked to students,
0:05:49	they complain. I say: have you heard about deep neural networks? and everybody says yes,
0:05:52	we have.
0:05:54	To what extent have you started looking into that? and they said we don't want
0:05:57	to do that because we cannot
0:05:58	even interpret what's in the hidden layer, right.
0:06:01	And that's true,
0:06:02	and that actually is quite by design. I mean if you
0:06:05	read into this ?? science literature in terms of connectionist models,
0:06:09	really the whole design is that you need the representation here to be
0:06:13	distributed.
0:06:13	So each neuron can represent different concepts
0:06:17	and each
0:06:18	concept can be represented by different neurons, so by its very design
0:06:21	it's not meant to be interpretable,
0:06:23	okay.
0:06:24	And that actually creates some difficulty for many,
0:06:27	and this model is just the opposite. It's very easy to interpret because of the very nature
0:06:33	of the generative story.
0:06:34	You can tell what the process is,
0:06:36	and then of course if you want to do
0:06:39	classification or some other application in machine learning
0:06:42	you simply just have to ..
0:06:44	and of course we simply have Bayes' rule to invert that; that's exactly what in
0:06:48	our community
0:06:49	we have been doing for thirty years with the hidden Markov model. You get the prior, you
0:06:52	get the generative model,
0:06:53	you multiply them and then you do it. Except at that time we didn't know
0:06:57	how to make that
0:06:58	deep for this type of model. And there are some pieces of work that I'm
0:07:01	going to survey.
0:07:02	So that's one big part of the advantage of this model.
0:07:05	Of course everybody knows what I just mentioned there.
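The "prior times generative model" inversion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class-conditional models are 1-D Gaussians, and the two labels, their means, variances and priors are hypothetical toy values, not anything from the talk.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log density of a 1-D Gaussian; plays the role of the generative model p(x | label)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def posterior(x, priors, means, variances):
    """Bayes' rule: p(label | x) is proportional to p(label) * p(x | label)."""
    log_joint = {
        label: math.log(priors[label])
        + gaussian_log_likelihood(x, means[label], variances[label])
        for label in priors
    }
    # Normalise after subtracting the maximum, for numerical stability.
    peak = max(log_joint.values())
    unnormalised = {k: math.exp(v - peak) for k, v in log_joint.items()}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}


def classify(x, priors, means, variances):
    """Pick the label with the largest posterior probability."""
    post = posterior(x, priors, means, variances)
    return max(post, key=post.get)


# Hypothetical two-class toy problem (assumed values for illustration only).
PRIORS = {"a": 0.5, "b": 0.5}
MEANS = {"a": 0.0, "b": 4.0}
VARIANCES = {"a": 1.0, "b": 1.0}
```

Calling `classify(0.5, PRIORS, MEANS, VARIANCES)` picks the label whose generative model, weighted by its prior, best explains the observation.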
0:07:09	In a deep generative model the information flow is actually top-down.
0:07:13	You actually have .. what top simply means is that you know you get a
0:07:16	label or you get a higher level concept,
0:07:18	and the lower level, down, simply means you generate the data to fit into that.
0:07:22	Everybody knows that in a neural network
0:07:25	the information flow is bottom-up, okay. So you feed the data and
0:07:29	you compute whatever output and then
0:07:30	you go either way you want.
0:07:31	In this case
0:07:33	the information comes from the top down. You generate the information
0:07:37	and then if you want to do classification, you know, or any other machine
0:07:42	learning application, you know you can do Bayesian. Bayesian is very essential for this.
0:07:49	But there's a whole list of those. I don't have time to go through, but just
0:07:52	you know those are the highlights, these
0:07:54	we have to say. So the main strength of the deep neural network that actually gained
0:07:59	popularity
0:07:59	over the previous years, really is mainly due to these strengths.
0:08:04	It's easier to do the computation in terms of
0:08:10	so what I wrote here is regular compute, okay.
0:08:13	So if you
0:08:14	look into exactly what kind of compute is involved here
0:08:17	it's just millions and millions and millions of multiplications
0:08:21	of a big matrix by a vector.
0:08:23	You do that many times. ?? place very small model role
0:08:27	it's very regular.
0:08:28	And therefore the GPU is really
0:08:31	ideally suited for this kind of computation,
0:08:33	and that's not the case for this model.
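To make the "regular compute" point concrete, here is a minimal sketch of a feed-forward pass written as nothing but repeated matrix-by-vector products followed by an elementwise nonlinearity. The sigmoid nonlinearity and the `(weights, biases)` layer layout are assumptions for illustration; a real system would run the same pattern on a GPU with an optimised linear algebra library rather than Python lists.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def forward(layers, features):
    """Bottom-up pass: each layer is one matrix-vector product plus a nonlinearity.

    layers is a list of (weight_matrix, bias_vector) pairs (an assumed layout).
    """
    activation = features
    for weights, biases in layers:
        pre_activation = mat_vec(weights, activation)
        activation = [sigmoid(p + b) for p, b in zip(pre_activation, biases)]
    return activation
```

Every layer repeats the same operation on different numbers, which is the regularity that suits GPU hardware.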
0:08:36	So if you compare between these two then you really will understand that if you
0:08:41	can pull
0:08:42	some of these advantages into this model
0:08:44	and pull some of the advantages in this column into this one
0:08:48	you have an integrated model. And that's kind of the message I'm going to convey and
0:08:53	I'm going to
0:08:54	give you examples to show how this can be done.
0:08:57	Okay, so interpretability is very much related to
0:09:04	how to incorporate domain knowledge
0:09:06	and network constraints into the model. And for a deep neural network it's very hard.
0:09:12	People have tried that; I have seen many people at this conference and also
0:09:16	in a ??
0:09:17	try very hard; it's not very natural.
0:09:20	Whereas
0:09:22	this is very easy.
0:09:23	I mean you can encode your domain knowledge directly into the system. For example
0:09:29	like distorted speech, noisy speech, you know
0:09:32	it's a summation: in the spectral domain, or
0:09:35	in the waveform domain, it is the noise
0:09:37	plus
0:09:38	the clean speech that you get as the observation. That's so simple you just code that into
0:09:43	one layer, as a summation, or
0:09:44	you can code it in terms of Bayesian probability very easily.
0:09:47	This is not that easy to do. People tried to do that, it's not just
0:09:51	as easy.
0:09:51	So being able to encode
0:09:53	domain knowledge and the network constraints of the problem
0:09:57	into
0:09:58	your deep learning system, this has great advantage.
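The additive-noise knowledge mentioned above can be written down directly as one "layer" of a generative model. The sketch below assumes the waveform domain (observation = clean speech + noise, sample by sample) and, purely for illustration, independent Gaussian noise with a hypothetical mean and variance; the function names are made up for this example.

```python
import math


def add_noise(clean, noise):
    """The generative 'summation layer': observation = clean speech + noise."""
    return [x + n for x, n in zip(clean, noise)]


def log_likelihood(observed, clean, noise_mean, noise_var):
    """log p(observed | clean) when the noise is i.i.d. Gaussian (an assumption).

    Because observed = clean + noise, the residual observed - clean
    must be explained by the noise model.
    """
    total = 0.0
    for y, x in zip(observed, clean):
        residual = y - x - noise_mean
        total += -0.5 * (math.log(2.0 * math.pi * noise_var)
                         + residual ** 2 / noise_var)
    return total
```

A hypothesised clean signal that leaves a small residual scores a higher likelihood than one that leaves a large residual, which is how the constraint enters inference.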
0:10:01	So I'm actually, I mean this is just a random selection
0:10:03	of things you know. There's a very nice paper over here on
0:10:06	acoustic phonetics.
0:10:08	All this knowledge about speech production
0:10:11	and this kind of nonlinear
0:10:12	phonology,
0:10:14	and this is an example for noise robustness. If you put in the phase information
0:10:19	of the speech and noise you can come up with
0:10:22	a very nice conditional distribution. It's kind of complicated
0:10:24	but this one can be put directly
0:10:26	into the generative model, and this is an example of this. Whereas in deep neural networks
0:10:31	it's very hard to do.
0:10:33	So the question is: do we want to throw away all this knowledge in
0:10:36	deep learning?
0:10:37	And my answer is of course no. Most people will say no, okay.
0:10:45	And from people outside of the speech community the answer was yes. I talked to
0:10:48	some people in machine learning.
0:10:49	Anyway, since this is a speech conference I really want to emphasise that.
0:10:54	So the real
0:10:55	solid reliable knowledge that we attained
0:10:58	from speech science,
0:10:59	that has been reflected by local talks
0:11:03	such as yesterday's talk, talking about how some patterns have been shaped by
0:11:09	?? and perception, can really play a role in the deep generative model.
0:11:14	But it's very hard to do that in a deep neural network.
0:11:17	So with this main message in mind
0:11:20	I'm going to go through three parts of the talk as I put them in
0:11:24	my abstract here.
0:11:25	So I need to go very briefly
0:11:27	through all these three topics.
0:11:30	Okay, so the first part is to give a very brief history of how deep speech
0:11:36	recognition started.
0:11:38	So this is a very simple list. There are so many papers around. Before the
0:11:43	rise of deep learning around
0:11:45	2009 and 2010 there were lots of papers around. So I hope I actually have
0:11:50	a reasonable
0:11:52	sample of the work around here.
0:11:54	So I don't have time to go through, especially for those of you who are
0:11:58	in ?? open house
0:12:00	There was in 1988, I think in 1988,
0:12:03	ASRU, and at that time there was no U, it was just
0:12:05	ASR. And there were some very nice papers around here, and then quickly,
0:12:10	you know, it was
0:12:10	superseded
0:12:11	by the hidden Markov model approach.
0:12:15	So I'm not going to go through all these
0:12:17	except to point out that
0:12:20	the neural network
0:12:22	was very popular for a while.
0:12:24	But over those, you know,
0:12:26	ten-plus years
0:12:28	before deep learning actually took over, the neural network approach
0:12:33	essentially didn't really make
0:12:36	such a strong impact compared with the deep learning networks that people have been seeing.
0:12:41	So I'll just give you one example to show you how unpopular
0:12:45	the neural network was at that time.
0:12:48	So this is about 2008 or 2006, about nine years ago.
0:12:53	So this is the organization that I think
0:12:56	is the predecessor
0:12:57	of ?? IARPA.
0:12:58	So they actually got several of us together, locked us up in a hotel
0:13:03	near the Washington, DC
0:13:05	airport somewhere.
0:13:07	Essentially the goal was to say that, well, speech
0:13:09	recognition is stuck there, so you come over here and help us brainstorm the next generation
0:13:15	of speech recognition and understanding technology.
0:13:18	And then we actually spent about four or five days in the hotel and at
0:13:22	the end we wrote a very thick report,
0:13:25	twenty-some pages of report.
0:13:26	So there is some interesting discussion about history, and the idea is that
0:13:31	if the government gives you unlimited resources and gives you fifteen years, what is it you
0:13:35	can do, right?
0:13:36	So most of the people in our discussion,
0:13:39	we all focused on neural network, essentially
0:13:41	margin is here,
0:13:42	Markov random field is here, conditional random field is here and graphical model here.
0:13:48	So it
0:13:50	that was just a couple of years before deep learning actually came out. At that
0:13:54	time
0:13:54	the neural network was actually one of the
0:13:57	tools around.
0:13:58	It hadn't really made a big impact.
0:14:01	So on the other hand the graphical model was actually mentioned here because it's related
0:14:06	to deep generative model.
0:14:08	So I'm going to show you a little bit, well this is a slide about the deep
0:14:12	generative model, actually I made some list over here.
0:14:15	One of the
0:14:18	but anyway so. Let's go over here.
0:14:21	I just want to highlight a couple of things
0:14:25	related to
0:14:27	the introduction of the deep neural network in the field.
0:14:30	Okay so one of, this is ?? John Riddle?
0:14:32	actually we spent a summer in ?? in 1989,
0:14:37	or 1988.
0:14:39	Fifteen and some years ago. So we spent a really interesting summer together.
0:14:45	So
0:14:46	and that's kind of the model, the deep generative model, the two versions we actually put
0:14:52	together,
0:14:52	and at the end we actually wrote a very thick report, about eighty
0:14:56	pages of report.
0:14:58	So this is the deep generative model and it turned out that
0:15:02	actually both of those models were implemented with neural networks.
0:15:06	Think about a neural network as simply a function, a mapping,
0:15:09	so if you map the hidden representation,
0:15:12	you know,
0:15:13	as part of the deep generative model, into whatever observation you have,
0:15:18	MFCC, everybody used MFCC at the time,
0:15:22	you actually need to have that mapping, and that was done with a neural network
0:15:27	in both versions.
0:15:28	And this is the statistical version, which we
0:15:31	call the hidden dynamic model. It's one of the versions
0:15:34	of the deep generative model.
0:15:36	It didn't succeed. I'll show you the reason why. Now we understand why.
0:15:40	Okay, so it's interesting enough in this
0:15:43	model we actually used, if you read the report, it actually turned out that model
0:15:47	was here since Geoff told me that
0:15:49	the video for this workshop is still around there so it's called ?? sign. I
0:15:53	think I mentioned to ?? pick it out.
0:15:56	It turned out that the learning in this workshop, whose details are in this report,
0:16:00	actually used back propagation. Now the direction isn't from top
0:16:03	down; since your model is
0:16:05	top down, the propagation must be bottom up.
0:16:08	So nowadays
0:16:10	when we do speech recognition the error
0:16:14	function is a softmax or sometimes you can use the mean square error.
0:16:18	And it is measured in terms of your label.
0:16:23	This is the opposite. The error is measured in terms
0:16:26	of how well the generative model can match the observation. And then when you
0:16:31	want to
0:16:31	learn you go bottom up. Which actually turned out to be back propagation. So
0:16:35	back propagation doesn't have to be done top down;
0:16:37	it can be bottom up, depending on what kind of model you have.
0:16:40	But the key is that this is
0:16:41	a gradient descent method.
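The contrast between a label-side error and an observation-side error, both minimised by gradient descent, can be sketched as follows. This is a toy illustration only: the generator is assumed to be linear with a single hidden scalar per frame, the hidden values are assumed known, and the names `generate` and `fit_generator` are hypothetical, not the workshop's actual model.

```python
import math


def cross_entropy(logits, label):
    """Label-side error: negative log softmax probability of the true label."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def squared_error(observed, generated):
    """Observation-side error: how well the generated vector matches the data."""
    return sum((o - g) ** 2 for o, g in zip(observed, generated))


def generate(weights, hidden):
    """Hypothetical linear generator: hidden scalar -> observation vector."""
    return [w * hidden for w in weights]


def fit_generator(hiddens, observations, dim, steps=200, rate=0.05):
    """Gradient descent on the summed observation-side squared error."""
    weights = [0.0] * dim
    for _ in range(steps):
        grads = [0.0] * dim
        for h, obs in zip(hiddens, observations):
            gen = generate(weights, h)
            for i in range(dim):
                # d/dw_i of (obs_i - w_i * h)^2
                grads[i] += -2.0 * h * (obs[i] - gen[i])
        weights = [w - rate * g for w, g in zip(weights, grads)]
    return weights
```

In both cases the parameters move along the negative gradient; only the place where the error is measured (label versus observation) differs.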
0:16:44	So actually we got disappointing results for Switchboard. You know because we tended to be
0:16:48	a bit off game.
0:16:49	And now we understand why. Not at that time. I'm sure some of you experienced
0:16:52	it. I have done a lot
0:16:53	of thinking about how deep learning and this can be integrated together.
0:16:59	So at the same time
0:17:02	Okay so this is a fairly simple model, okay. So you have this hidden representation
0:17:07	and it has
0:17:08	specific constraints built into the model,
0:17:11	which by the way is very hard to do with a bottom-up neural network.
0:17:15	And for a generative model
0:17:16	you can put them in very easily, so for example
0:17:18	the articulatory trajectory has to be smooth,
0:17:22	and then the specific form of the smoothness can be built in directly
0:17:26	by simply writing the generative probabilities. Not in the deep neural network.
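One simple way a smoothness constraint on a hidden trajectory can be written into a generative model is a first-order, target-directed recursion. The form below and its `rate` parameter are assumptions chosen for illustration, not necessarily the exact dynamics used in the models discussed in the talk.

```python
def target_directed_trajectory(targets, rate=0.8, start=0.0):
    """Smooth hidden trajectory: z[t] = rate * z[t-1] + (1 - rate) * targets[t].

    A rate close to 1 gives a slow, smooth movement toward each target;
    short segments may end before the target is reached.
    """
    trajectory = []
    previous = start
    for target in targets:
        current = rate * previous + (1.0 - rate) * target
        trajectory.append(current)
        previous = current
    return trajectory
```

With a step-like target sequence such as `[1.0] * 5 + [0.0] * 5`, the trajectory approaches 1.0 gradually and turns back before reaching it, which mimics the undershoot seen when a short sound is followed quickly by another.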
0:17:31	So at the same time
0:17:33	we actually, also this was done in ??
0:17:38	and we were able to even put in this nonlinear phonology in terms of
0:17:43	writing the phonemes into the individual constituents at the top level, and ?? also has
0:17:49	a very nice paper, some fifteen years ago, talking about this.
0:17:53	And also the robustness can be directly integrated into
0:17:57	the articulatory model simply through the generative model. Now for a deep neural network it's very hard to
0:18:01	do.
0:18:01	For example you can actually
0:18:05	this is not meant to be read. Essentially this is one of the conditional likelihoods
0:18:10	that covers
0:18:11	one of the links. So every time you have got a link
0:18:15	you have a conditional dependency from parent to children that have different neighbours.
0:18:22	And then you can specify them in terms of
0:18:24	conditional distributions. Once you do that you have formed a model;
0:18:27	you can embed
0:18:28	whatever knowledge you have, that you think is good, into the system. But anyway
0:18:33	the problem is that the learning is very hard,
0:18:35	and that learning problem was only solved in the machine learning community within the last
0:18:41	year.
0:18:42	At that time we just didn't really know. We were so naive.
0:18:47	We didn't really understand all the limitations of learning. So just to show you we
0:18:50	talk, okay. One of the
0:18:51	things we did was that, I actually worked on this with my colleague Hagai Attias.
0:18:55	He is actually one of the
0:18:56	he was my colleague working not far away from me at that time, some ten
0:19:01	years ago.
0:19:02	So he was the one who invented this variational Bayes. Which is very well
0:19:06	known.
0:19:07	So the idea was as follows. You have to break up these pieces into
0:19:11	modules, right.
0:19:12	For each module you have this, this is actually
0:19:14	continuous
0:19:17	dependence of the continuous hidden representation,
0:19:20	and it turned out that the way to learn this,
0:19:23	you know in principle, what you do is EM (expectation maximization). It's variational
0:19:26	EM.
0:19:26	So the idea is very crazy.
0:19:28	So you said you cannot solve that rigorously, and that's well known. It's a loopy neural
0:19:34	network. Then you just cut all important things you
0:19:37	carry out, hoping that the M-step can make up for it. That's a very crazy idea.
0:19:41	And that was the best around at that time.
0:19:43	But it turned out that you've got the auxiliary function, and its form is still
0:19:48	something very
0:19:49	similar to our EM, you know, in HMM. For the general model you don't have
0:19:55	the loop; you can get a rigorous solution.
0:19:57	But now when you have deep it's very hard. You have to make up for
0:20:01	it. And that ?? is just as ??bad-ass
0:20:03	many people could ?? on deep neural network. This ?? deep generative model
0:20:08	probably have more
0:20:09	than otherwise. Although they patched themselves
0:20:12	to be
0:20:13	you know very rigorous. But if you really work on that, so I can pick
0:20:17	out of this, so it's
0:20:18	for this approach we get surprisingly good inference results for continuous variables.
0:20:22	And in one version what we did was actually we used formants,
0:20:27	you know, as the hidden representation, and it turned out it tracked. Once you
0:20:31	do this you
0:20:32	track the formants really precisely.
0:20:34	As a byproduct of this work we created
0:20:38	a database for formant tracking.
0:20:42	But if we actually do
0:20:45	inference on the linguistic units, which is the problem
0:20:48	of recognition, we didn't really make much progress on this.
0:20:51	But anyway so I'm going to show you some of these preliminary results to show
0:20:56	you how this
0:20:57	is one way that led to the deep neural network.
0:21:00	So when we actually simplified the model in order to finish the decoding we actually,
0:21:07	this is actually ?? result
0:21:09	and we carried out all of the analysis for different kinds of phones.
0:21:12	So when we use this kind of generative model with deep structure it actually corrected
0:21:17	many errors
0:21:18	which are related to the short phones.
0:21:20	And you understand why, because you designed the model to make that happen, and then you
0:21:24	know if
0:21:25	everything is done reasonably well you actually get results. So we actually saw
0:21:28	that it not only corrected short phones for the vowels
0:21:32	but it also corrected a lot of
0:21:34	consonants because they're up with each other.
0:21:36	It's just because of the model design: whatever hidden trajectory you get,
0:21:40	it's influenced, the part for the vowel is influenced
0:21:45	by the adjacent sound.
0:21:47	And that's
0:21:47	this is due to coarticulation.
0:21:49	This can be very naturally built into the system,
0:21:51	and one of the things I am very much struggling with in the deep neural network is that
0:21:55	you can't even build in this kind of
0:21:56	information that easily, okay.
0:21:59	This is to convince you how things can be bridged.
0:22:03	It's very easy to interpret the results. So we look at the error we
0:22:07	know wow these are quite a big data assumption.
0:22:11	Without the have to go through for example in this these examples of these are
0:22:14	the same sounds, okay.
0:22:15	You just speak fast then you get something like this
0:22:17	and then we actually looked at the error and we said Ohh.
0:22:20	You know
0:22:22	that's exactly what happened. You know the mistake was made in the
0:22:27	Gaussian mixture model because it doesn't take into account these particular dynamics. Now this one
0:22:31	was able to correct the error.
0:22:32	And I'm going to show you in deep neural network things are reversed, so that's
0:22:37	related to ??. But at the same time
0:22:39	in machine learning community also the speech
0:22:42	there was a very interesting deep generative model developed,
0:22:46	and that's called the Deep Belief Network.
0:22:47	Okay,
0:22:48	so in the earlier literature, before about three or four years ago,
0:22:52	DBN, Deep Belief Network, and DNN were mixed up with each other, even by the authors.
0:22:56	It's just because most people don't understand what it is.
0:22:59	So this is a very interesting paper from 2006;
0:23:02	many people, most people in machine learning, regard this paper to be the start of
0:23:07	deep learning.
0:23:08	And it's a generative model, so you may prefer to say the deep
0:23:12	generative model actually started deep learning rather than the deep neural network.
0:23:17	But this model has some intriguing properties
0:23:21	that really at the time attracted my attention here.
0:23:25	It's totally not obvious, okay.
0:23:28	So for those of you who know RBM and DBN, you know when you are
0:23:32	stacking up this undirected model
0:23:34	several times you get a DBN;
0:23:37	you might think that the whole thing will be undirected,
0:23:40	you know, a Boltzmann machine. No. It's actually a directed model coming down.
0:23:44	You have to read this paper to understand why.
0:23:47	So why do they? I said something was wrong. I couldn't understand what happened.
0:23:50	But on the other hand it's much simpler than the model I showed you earlier,
0:23:54	the deep network with the temporal dynamics.
0:23:56	This one has no temporal dynamics over here.
0:23:59	So
0:24:01	the most intriguing aspect of DBN
0:24:03	as described in this paper is that inference is easy.
0:24:06	Normally you think inference is hard. That's the tradition.
0:24:10	It's a given fact that if you have these multiple dependencies at the top it's very hard
0:24:15	to do inference,
0:24:16	but there's a special constraint built into this model, namely the restriction on the connections of the
0:24:21	RBM;
0:24:22	because of that it makes inference easy. It's just a special case.
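Why the restriction makes inference easy: in a restricted Boltzmann machine with binary units there are no hidden-to-hidden connections, so given the visible vector each hidden unit is independent and its posterior is a single logistic function. The sketch below shows that closed form; the variable names and the `weights[j][i]` layout are assumptions for illustration.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def hidden_posterior(visible, weights, hidden_bias):
    """p(h_j = 1 | v) for every hidden unit j of a binary RBM.

    weights[j][i] connects hidden unit j to visible unit i. No iteration
    or sampling is needed because the hidden units are conditionally
    independent given the visible units.
    """
    return [
        sigmoid(bias + sum(w * v for w, v in zip(row, visible)))
        for row, bias in zip(weights, hidden_bias)
    ]
```

One pass of weighted sums gives the exact posterior over the hidden layer, in contrast to the loopy models described earlier, where inference had to be approximated.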
0:24:25	This is very intriguing, so I thought this idea may help
0:24:29	the deep generative model I showed you earlier.
0:24:32	So he came to reason me, you know. We discussed it.
0:24:36	It took him a while to explain what this paper is.
0:24:40	Most people at Microsoft at that time couldn't understand what's going on.
0:24:45	So now let's see how
0:24:46	and then of course when we got together, this deep generative model
0:24:50	and the other deep generative model I talked about with you, which I actually worked on
0:24:54	for almost ten
0:24:54	years at Microsoft, we were working very hard on this.
0:24:57	And then we came to the conclusion that, well, we have to use a few
0:25:00	kluges to fix the problem.
0:25:01	And they don't match, okay. The reason why they don't match is a whole new story
0:25:05	why they don't match.
0:25:06	The main reason is actually not just the temporal difference, it's the way you parameterize
0:25:12	the model, and also the way to represent
0:25:15	the information is very different,
0:25:17	despite the fact that they're both generative models.
0:25:19	It turned out that this model is very good for speech synthesis and ?? has
0:25:22	a very nice paper
0:25:23	using this model to do synthesis. And it's very nice for
0:25:26	image generation. I can see that very nice probably.
0:25:30	But for continuous speech it is very hard to do,
0:25:33	and for speech, for general synthesis, it's good; it's because if you have a segment with the
0:25:38	whole
0:25:39	context taken into account, like a syllable in Chinese, it is good, but for English it is
0:25:42	not that easy to do.
0:25:44	But anyway so we need to have a few kluges to fix together, to merge these
0:25:48	two models together.
0:25:49	And that sort of led to the DNN.
0:25:51	So the first kluge is that
0:25:54	you know
0:25:55	the temporal dependency is very hard. If you have temporal dependency you automatically get loops, and
0:26:00	then
0:26:00	everybody in machine learning at that time knew, most of speech persons, so I thought
0:26:05	that
0:26:05	machine learning that I showed you early on actually just didn't work well, it didn't
0:26:09	work out well. And most people who were
0:26:12	very well versed in machine learning would say there's no way to learn that.
0:26:15	Then cut the dependency. That's the way to do it: cut the dependency in the hidden
0:26:20	dimension, in the hidden representation,
0:26:21	and lose all the power of
0:26:23	the deep generative model,
0:26:25	and that's Geoff Hinton's idea: well, it doesn't matter, just use a big window.
0:26:30	That's the first kluge, and that actually
0:26:34	is one of the things that actually helped
0:26:36	to solve the problem
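The "big window" kluge amounts to stacking each frame with its neighbours, so that a model with no temporal links still sees some context. A minimal sketch, assuming frames are plain lists of features and that edges are padded by repeating the first or last frame (both are illustrative choices, as is the default context width):

```python
def splice_frames(frames, context=5):
    """Stack each frame with `context` frames on either side.

    The result has the same number of frames, each (2 * context + 1)
    times longer. Edge frames are padded by repetition (an assumption).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            index = min(max(t + offset, 0), last)
            window.extend(frames[index])
        spliced.append(window)
    return spliced
```

For example, `splice_frames([[0.0], [1.0], [2.0], [3.0]], context=1)` returns `[[0.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 3.0]]`.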
0:26:38	and the second kluge is that you can reverse the direction,
0:26:40	because
0:26:41	the inference in the generative model is very hard to do, as I showed earlier.
0:26:45	Now if you reverse the direction
0:26:48	from top-down to bottom-up,
0:26:52	then you don't have to solve that problem. And that's why it would be
0:26:56	just a deep neural network, okay. Of course
0:26:58	everybody said: we don't know how to train them, that was in 2009.
0:27:02	Most people didn't know how to ??
0:27:03	and then he said that's how DBN can help.
0:27:07	And then he did a fair amount of work on DBN to initialize that ??
0:27:12	approach.
0:27:12	So this is a very well-timed academic-industrial collaboration. First of all
0:27:16	it's because the speech recognition industry had been searching for new solutions when the principled
0:27:22	deep generative model could not deliver, okay. Everybody
0:27:24	was very upset about this at the time.
0:27:27	And at the same time academia developed deep learning tools,
0:27:30	DBN, DNN, all the hybrid stuff that's going on.
0:27:33	And also the CUDA library was released around that time. It's very recent times.
0:27:40	So this is probably one of the earliest to catch on
0:27:44	to this GPU computing power over here.
0:27:47	And then of course big training data in ASR has been around,
0:27:52	and most people, if you actually do
0:27:55	Gaussian mixture model for HMM, with a lot of data the performance saturates, right.
0:28:00	And then this is one of the things that in the end really is powerful. You
0:28:04	can increase the size and depth
0:28:06	and
0:28:07	you know put in a lot of things
0:28:08	to make it really powerful.
0:28:11	And that's the scalability advantage that I showed you early on. That's not the case
0:28:15	for any shallow model.
0:28:18	Okay, so in 2009 I and three of my colleagues didn't know what's
0:28:23	happening. So we actually got together
0:28:26	to do this,
0:28:27	to do this workshop,
0:28:28	to show that
0:28:29	this is a useful thing, you know, to bring stuff.
0:28:32	So it wasn't popular at all. I remember
0:28:35	you know Geoff Hinton and I, we actually got together to decide
0:28:40	who we should invite to give
0:28:42	a talk at this workshop.
0:28:44	So I remember that one invitee, who shall be nameless here,
0:28:47	he said: Give me one week to think about it, and at the end he said:
0:28:50	it's not worth my time to fly to Vancouver. That's one of them.
0:28:53	The second invitee, I remember this clearly, said: This is a crazy idea. So in the
0:28:57	e-mail he said:
0:28:58	What you do is not clear enough for us.
0:29:01	So we said, you know,
0:29:02	the waveform may be useful for ASR.
0:29:04	And then the e-mail said: Oh why?
0:29:07	So we said that's just like using pixels for image recognition. That was popular.
0:29:12	For example convolutional networks, they use pixels.
0:29:15	We take a similar approach. Except it is the waveform.
0:29:17	And the answer was: No, no, no, that's not the same as pixels. It is more
0:29:22	like using photons.
0:29:23	You know, making kind of a joke essentially. This one didn't show up either. But anyway
0:29:28	so
0:29:30	anyway so this workshop actually had
0:29:34	a lot of brainstorming; we analyzed all the errors I showed you early
0:29:38	on.
0:29:39	But it was a really good
0:29:41	workshop; it was about four or five years ago,
0:29:44	five years ago now.
0:29:45	So now I move to part 2
0:29:48	to discuss achievements. So actually in my original post I had a whole bunch of slides
0:29:53	on vision.
0:29:54	So the message for vision is that if you go to the vision community,
0:29:59	they consider deep learning to be
0:30:01	even,
0:30:02	maybe, thirty times
0:30:04	more popular than deep learning in speech.
0:30:07	So the first time they did that was actually the first time they
0:30:12	actually got the results,
0:30:16	and no one believed it was the case. At the time I was giving a lecture
0:30:20	at Microsoft about Deep Learning
0:30:22	and then right before that, actually Bishop
0:30:25	was doing the lecture together with me,
0:30:30	and then this deep learning result just came out and Geoff Hinton sent an e-mail to me:
0:30:34	Look at the margin! How much bigger it is.
0:30:36	And I showed them. People were like: I don't believe it. Maybe it's a special case.
0:30:40	You know. And it turned out it's
0:30:42	just as good.
0:30:43	Even better than speech. I actually cut all the slides out. Maybe some time I
0:30:46	will show you.
0:30:47	So this is a big area to go into. So today I am going to focus on
0:30:50	speech.
0:30:51	So one of the things that we found during that time
0:30:55	is a very interesting discovery. We actually used the model that I
0:30:59	showed you there
0:31:00	and also the deep neural network here.
0:31:03	And for that number we analyzed
0:31:06	the error patterns very carefully. So TIMIT is very good for that, you know.
0:31:10	You can disable the language model, right.
0:31:12	Then you can understand the errors for acoustic ?? very effectively,
0:31:15	and I tried to do that afterwards, you know, on other tasks,
0:31:20	and it's very hard; once you put a language model in there you just can't
0:31:23	do any analysis. So it's very good that at the time we did this analysis.
0:31:26	So now the error pattern in the comparison
0:31:30	is something I don't have time to go through, except just to mention this.
0:31:33	So the DNN made many new errors on short, undershot vowels.
0:31:37	So it sort of undoes what this model is able to do,
0:31:40	and then we thought about why that would happen, and of course in the end
0:31:43	we had a very big window, so if the sounds
0:31:45	are very short, the information is captured over here while your input is about eleven frames,
0:31:48	or, you know, fifteen frames, so it
0:31:50	captures a kind of noise coming from different phones, and of course errors are made over here.
0:31:54	So we can understand why.
0:31:56	And then we asked why this model corrects those errors. It's just because
0:31:59	you make
0:32:00	you deliberately make a hidden representation
0:32:04	to reflect
0:32:05	what sound pattern looks like.
0:32:07	In the hidden space. And it's nice for formants, which you can see,
0:32:10	but if you have the articulations, how do you see them? So sometimes we use formants
0:32:14	to illustrate what's going on there.
0:32:18	Another important discovery at Microsoft is that we actually found that using the spectrogram
0:32:23	we produce much better
0:32:26	auto-encoding results in terms of speech analysis.
0:32:30	Encoding results
0:32:32	?? and that was very surprising at the time.
0:32:36	And that really conforms to the basic deep learning theme that,
0:32:39	you know, the earlier features are better
0:32:42	than the processed features here. So I'll show you; this is actually a project
0:32:48	we did together in 2009.
0:32:49	So we used an auto-encoder
0:32:51	to do binary coding of
0:32:53	the spectrogram.
0:32:55	So I don't have time to go through that. You can read the auto-encoding book if
0:33:01	you want.
0:33:02	In the literature you can see all this.
0:33:03	So the key is that
0:33:04	you set the target to be the same as the input and then you use a small
0:33:07	number of bits in the middle.
0:33:09	And you want to see whether that would actually
0:33:11	?? all the ?? down here. And the way to evaluate it is to look
0:33:15	at
0:33:15	you know what kind of errors you have.
0:33:17	So the way we did it is we used a vector quantizer as a baseline,
0:33:21	with 312 bits.
0:33:23	And then the reconstruction
0:33:24	looks like this. So this is the original one, this is the shallow model, right.
0:33:29	Now using the deep auto-encoder we get much closer to this. In terms of errors,
0:33:34	we simply have much lower coding error
0:33:38	using an identical number of bits.
0:33:39	So it really shows that if you build a deep structure you extract these bottom-up features.
0:33:45	Both ?? you condense more
0:33:47	information in terms of reconstructing the original signal.
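A minimal sketch of the auto-encoder idea described above: the training target is the input itself, and the signal has to pass through a small code in the middle. Everything here is an assumption made for illustration, not the system from the talk: a single hidden layer instead of a deep stack, real-valued code units instead of binary ones, toy 6-bin "frames" instead of spectrograms, and hypothetical names such as `train` and `TOY_FRAMES`.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Encode x into a short code h, then decode h back into a reconstruction y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, y

def mean_squared_error(data, params):
    """Average squared reconstruction error per frame."""
    return sum(sum((yi - xi) ** 2 for yi, xi in zip(forward(x, *params)[1], x))
               for x in data) / len(data)

def train(data, code_size, epochs=300, lr=0.1, seed=0):
    """Stochastic gradient descent where the target equals the input."""
    rng = random.Random(seed)
    dim = len(data[0])
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(code_size)]
    b1 = [0.0] * code_size
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(code_size)] for _ in range(dim)]
    b2 = [0.0] * dim
    for _ in range(epochs):
        for x in data:
            h, y = forward(x, W1, b1, W2, b2)
            dy = [2.0 * (yi - xi) for yi, xi in zip(y, x)]  # gradient at the output
            # Back-propagate through the decoder and the sigmoid code units.
            dh = [sum(W2[d][k] * dy[d] for d in range(dim)) * h[k] * (1.0 - h[k])
                  for k in range(code_size)]
            for d in range(dim):
                for k in range(code_size):
                    W2[d][k] -= lr * dy[d] * h[k]
                b2[d] -= lr * dy[d]
            for k in range(code_size):
                for i in range(dim):
                    W1[k][i] -= lr * dh[k] * x[i]
                b1[k] -= lr * dh[k]
    return W1, b1, W2, b2

# Toy "frames" with 6 bins each; purely illustrative data.
TOY_FRAMES = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1, 1], [1, 1, 1, 1, 0, 0]]

error_before = mean_squared_error(TOY_FRAMES, train(TOY_FRAMES, code_size=2, epochs=0))
error_after = mean_squared_error(TOY_FRAMES, train(TOY_FRAMES, code_size=2))
print(error_before, error_after)
```

Comparing `error_after` for different code sizes, or against another coder with the same number of code units, is the kind of "same number of bits, lower coding error" comparison the talk refers to.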
0:33:50	And then we actually found that
0:33:53	for the spectrogram this result is the best.
0:33:55	Now for MFCC we still get some gain, but the gain is not nearly as much,
0:34:00	which sort of indirectly
0:34:01	convinces me. There's Geoff Hinton's
0:34:03	original activities ?? everybody's
0:34:06	to spectrogram.
0:34:07	So maybe we should have done the waveform; probably not, anyway.
0:34:10	Okay so of course the next step is once we are all convinced that
0:34:14	error analysis shows that
0:34:17	deep learning can correct a lot of errors, not all of them but some,
0:34:21	and we understand why. You just pick up the power and also the capacity they have.
0:34:27	So on average it does a little bit better
0:34:29	based upon
0:34:30	this analysis.
0:34:38	But if you look at the error pattern you really can see
0:34:41	that this has a lot of power, but it also has some shortcomings as well.
0:34:45	So both have pros and cons, but their errors are very different, and that
0:34:49	actually gives you the hint that
0:34:51	this, you know, is worthwhile to pursue.
0:34:53	Of course this was all very interesting
0:34:56	evidence to show.
0:34:57	And then to scale up to industrial scale we had to do
0:35:00	a lot of things, so many of my colleagues actually were working with me
0:35:04	on this. So first of all
0:35:06	we needed to extend the output
0:35:08	from a small number of phones
0:35:11	or their states
0:35:12	into a very large one,
0:35:13	and that actually at that time was motivated by
0:35:16	how to save the huge Microsoft investment in speech decoder software.
0:35:20	I mean, if you don't do this,
0:35:22	then, you know, if you do some other kind of output coding,
0:35:27	they would also have had to ?? atypical feature to do it. No one
0:35:31	fully believed
0:35:31	that it's going to work.
0:35:32	So it turned out that if you need to change the decoder, you know, we just have
0:35:36	to say wait a little bit.
0:35:38	So
0:35:41	and at the same time we found that using a context-dependent model gives much higher
0:35:46	accuracy
0:35:46	than a context-independent model for large tasks, okay.
0:35:49	Now for small tasks we didn't find it so much better. I think
0:35:53	it's all related to
0:35:54	a capacity saturation problem if you have too much,
0:35:57	but since there is a lot of data
0:35:59	in
0:36:01	the training for large tasks
0:36:03	you actually can
0:36:04	form a very large output, and that turned out
0:36:07	to have, you know,
0:36:09	a double benefit.
0:36:10	One is that you increase accuracy, and number two is that you don't have to
0:36:13	change anything about the decoder.
0:36:14	And industry loves that.
0:36:17	You have both
0:36:18	that's actually ??. I can't recall why actually took off.
0:36:22	And then we summarized what enabled this type of model.
0:36:24	Industrial knowledge about how to construct a very large set of units in the DNN
0:36:29	is very important,
0:36:30	and that essentially comes from
0:36:32	everybody's work here
0:36:34	that actually used this kind of context-dependent model for Gaussian Mixture Models, you know,
0:36:39	that has been around for
0:36:40	almost twenty-some years.
0:36:42	And also
0:36:43	it depends upon industrial knowledge on how to make decoding with such a huge output highly
0:36:48	efficient using
0:36:50	our conventional
0:36:51	HMM decoding technology.
0:36:53	And of course how to make things practical.
0:36:57	And this is also a very important enabling factor. If GPUs hadn't come up
0:37:03	roughly at that time, hadn't become popular at that time,
0:37:06	all these experiments would have taken months to do.
0:37:08	Without all this belief, without all this fancy infrastructure.
0:37:14	And then
0:37:15	people may not have had the patience to wait to see the results, you know, to push that
0:37:18	forward.
0:37:19	So let me show you a very
0:37:22	brief summary of the major
0:37:26	results obtained in the early days.
0:37:29	So if we use three hours of training, this is TIMIT for example, we have
0:37:34	got,
0:37:34	this is the number I showed you, it's not much, about ?? percent of gain.
0:37:38	Now if you increase the data up to
0:37:41	ten times more, thirty-some hours, you get a twenty percent error rate reduction.
0:37:46	Now if you do more.
0:37:48	For Switchboard, this is the paper that my colleague published here,
0:37:52	you get more data, another ten times, so you get a two orders of magnitude
0:37:57	increase,
0:37:58	and the relative gain actually
0:38:00	sort of
0:38:01	increases, you know, ten percent, twenty percent, thirty percent. This is actually
0:38:06	so of course if you increase
0:38:08	the size of training data
0:38:10	the baseline will improve as well, but the relative gain is even bigger.
0:38:14	And if people look at this result there's
0:38:16	nobody
0:38:17	in their right mind who would say not to use that.
0:38:20	And that's how
0:38:21	and then of course a lot of companies
0:38:24	you know
0:38:26	actually still
0:38:28	implement it; a DNN is fairly easy to implement for everybody because
0:38:33	I missed one of the points over there. It actually turned out that if you use
0:38:37	a large amount of data,
0:38:38	the original
0:38:41	idea of using DBN to regularize the model doesn't need to
0:38:44	be there anymore. And in the beginning ?? how it happened.
0:38:49	But anyway, so now let me come back to the main thing of the talk.
0:38:53	How generative model
0:38:54	and deep neural network may be helping each other.
0:38:57	So the kluge one was that to use this to be
0:39:02	at that time
0:39:03	we had to keep this. Now at this conference we see
0:39:07	?? using LSTM neural networks, and that fixed this problem.
0:39:12	So this problem is fixed.
0:39:14	This problem is fixed automatically.
0:39:17	At that time
0:39:19	we thought we need to use DBN. Now with use of big data there's no
0:39:23	need anymore.
0:39:24	And that's very well understood now. Actually there are many ways to understand that. You
0:39:28	can think about it from the
0:39:29	regularization viewpoint,
0:39:31	and yesterday at the table with students I mentioned that and people said: What is
0:39:36	regularization?
0:39:37	And you can understand it more in terms of the optimization viewpoint,
0:39:41	so actually if you stare at the back-propagation formula for ten minutes you figure out why.
0:39:47	I actually have a slide on that; it's very easy to understand why from many perspectives.
0:39:52	With lots of data you really don't need that.
0:39:54	And that's automatically fixed.
0:39:57	You know, kind of by industrialization: we tried lots of data and
0:40:00	it's fixed. Now this one is not fixed yet. So this is actually the main
0:40:03	topic
0:40:04	that I'm going to use for the next twenty minutes.
0:40:07	So before I do that I will actually try to summarize some of
0:40:11	the major ... actually I and my colleagues wrote this book
0:40:14	and in this chapter we actually grouped
0:40:16	the major advancements of deep neural networks into several categories
0:40:22	so I'm going to go through that quickly.
0:40:24	So one is the optimization,
0:40:26	innovation.
0:40:27	So I think the most important advancement
0:40:31	over the previous, you know, the early success I showed you early on,
0:40:36	was the development of sequence discriminative training, and
0:40:39	this contributed an additional ten percent of error rate reduction.
0:40:42	Also many groups of people have done this.
0:40:45	Like for us at Microsoft, you know, this was our first intern coming to our
0:40:49	place to do this.
0:40:50	And we tried it on TIMIT; we didn't know all the subtleties of the importance of
0:40:56	regularization, and
0:40:56	we got all the formulas right, everything right,
0:40:59	and the result wasn't very good.
0:41:01	But I think
0:41:02	Interspeech accepted our paper, and now we understand this,
0:41:06	and then later on
0:41:09	we got more and more papers; actually a lot of papers were published in Interspeech.
0:41:13	That's very good.
0:41:15	Okay now, the next theme is about 'Towards Raw Input', okay.
0:41:21	So what I showed you early on was the speech coding and analysis part
0:41:26	that we know that is good. We don't need MFCC anymore.
0:41:29	So it was bye MFCC, so
0:41:31	probably it will disappear
0:41:33	in our community. Slowly over the next few years.
0:41:36	And also we want to say bye to Fourier transforms, so I put the question
0:41:42	mark here partly because
0:41:43	actually, at this Interspeech I think two days ago Herman ?? had a very
0:41:48	nice paper on
0:41:49	this and I encourage everybody to take a look at it.
0:41:52	You just put the raw information in there,
0:41:55	which was done actually about three years ago by Geoff Hinton's students; they truly believed
0:42:00	it. I couldn't.
0:42:01	I tried that about 2004; that was the hidden Markov model
0:42:04	era.
0:42:05	And we understood all kinds of problems, how to normalize users' input, and I said
0:42:09	it's crazy,
0:42:10	and then when they published the result
0:42:13	in
0:42:14	ICASSP, I looked at these results and the error was terrible. I mean there's so much
0:42:17	of error.
0:42:17	So nobody paid attention. And this year attention was brought to this.
0:42:21	And the result is almost as good as using, you know,
0:42:25	using Fourier transforms.
0:42:27	So far we don't want to throw it away yet,
0:42:29	but maybe next year people may throw that away.
0:42:33	The nice thing is .. I was very curious about this. I asked,
0:42:37	and it turns out that to get that result they just randomized everything rather than
0:42:41	using Fourier transforms
0:42:42	to initialize it, and that's very intriguing.
0:42:46	Too many references to list; I'm running out of time. I had ?? list.
0:42:50	But yesterday when I went through this adaptation session there's so many good papers around.
0:42:55	I just don't have patience for them anymore.
0:42:57	So go back to ?? adaptation papers. There are a lot of new
0:43:02	advancements. So another important thing is transfer learning,
0:43:05	and that plays a very important role in multi-lingual acoustic modelling.
0:43:10	So there was a tutorial that I was .. actually Tanja was giving at a workshop
0:43:17	I was attending.
0:43:18	I also mention that
0:43:20	for generative model
0:43:22	for shallow model before
0:43:24	this one almost never
0:43:26	multilingual
0:43:28	of course
0:43:28	modelling
0:43:30	actually improved things.
0:43:32	But it never actually beat the baseline
0:43:36	in terms of ..
0:43:39	so think about cross-lingual for example, multi-lingual and cross-lingual
0:43:42	and deep learning actually beat the baseline. So there's a whole bunch of
0:43:44	papers in this area which I won't have time to go through here.
0:43:47	Another important innovation is in regularization, so for
0:43:50	regularization there is dropout; if you don't know dropout, it's good to know.
0:43:54	And this is a special technique. Essentially you just kill units, you know,
0:43:57	randomly, and you get a better result.
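A minimal sketch of dropout as described above: during training, each unit's activation is zeroed at random. The rescaling of the surviving units ("inverted dropout") is an assumption chosen here so that nothing needs to change at test time; the talk does not say which variant was used, and the function name `dropout` is hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero each activation with probability drop_prob during training.

    Surviving activations are divided by the keep probability so that their
    expected value is unchanged (the "inverted dropout" convention).
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, drop_prob=0.5, rng=rng))                  # some units zeroed
print(dropout(hidden, drop_prob=0.5, rng=rng, training=False))  # unchanged at test time
```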
0:44:03	And in terms of the units,
0:44:05	now
0:44:06	a very popular unit is the rectified linear unit,
0:44:09	and now there are some very interesting,
0:44:11	many interesting theoretical analyses of why this is better than this.
0:44:16	At least in my experience .. actually I programmed this, it's a change of a few
0:44:20	lines
0:44:21	to go from this to this.
0:44:23	The learning speed
0:44:24	really increases.
0:44:26	And we understand now why it happens.
0:44:29	Also (in terms of) accuracy different groups report different results.
0:44:32	Some groups report they reduced the error rate, some groups .. nobody reported an increase in error
0:44:37	rates for now.
0:44:38	So in any case (it) speeds up
0:44:40	the convergence dramatically.
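A small sketch comparing the two unit types mentioned above. It illustrates one commonly cited reason why rectified linear units train faster, namely that the logistic unit's gradient is at most 0.25 and shrinks quickly when multiplied through many layers, while an active rectified linear unit passes the gradient through unchanged. This is offered as an assumption about the kind of analysis meant, not as the speaker's own argument; the helper `gradient_through_stack` is hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)          # never larger than 0.25

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 for any active unit

def gradient_through_stack(grad_fn, z, depth):
    """Product of per-layer activation gradients at a fixed pre-activation z."""
    product = 1.0
    for _ in range(depth):
        product *= grad_fn(z)
    return product

print(gradient_through_stack(logistic_grad, 0.0, depth=10))  # shrinks towards zero
print(gradient_through_stack(relu_grad, 1.0, depth=10))      # stays at 1.0
```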
0:44:43	So I'm going to show you another architecture over here which is going to link
0:44:48	to
0:44:49	a generative model.
0:44:51	So this is a model called Deep Stacking Network.
0:44:55	Its very design is a deep neural network, okay. Information flows from the bottom up.
0:45:00	So the difference between this model and conventional deep neural network is that
0:45:04	for every single layer you can actually
0:45:07	integrate the input into each layer and then do some special processing here.
0:45:15	In particular you can alternate
0:45:17	layers between linear and nonlinear; if you do that you can dramatically increase the
0:45:23	speed of convergence
0:45:26	in deep learning.
0:45:27	And there's some other theoretical analysis, which is actually put in one of the books
0:45:31	I wrote.
0:45:32	So you actually can convert many complex
0:45:35	propagation,
0:45:37	non-convex problem into
0:45:38	somewhat
0:45:41	kind of ??property measure problem related to
0:45:44	convex optimization so we can understand our probability ??.
0:45:46	So we did that a few years ago and we wrote a paper on this.
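A minimal forward-pass sketch of the stacking idea described above, under stated assumptions: each module has one nonlinear (sigmoid) layer followed by one linear layer, and every module after the first receives the raw input concatenated with the previous module's output. Layer sizes, the random weights, and names such as `build_dsn` are hypothetical; training is not shown. With the nonlinear layer held fixed, fitting the linear layer of a module is a least-squares problem, which is one way to read the remark about relating the learning to convex optimization.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def make_module(in_dim, hidden_dim, out_dim, rng):
    """One module: a nonlinear layer W followed by a linear layer U."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    U = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return W, U

def module_forward(module, v):
    W, U = module
    hidden = [sigmoid(z) for z in matvec(W, v)]  # nonlinear layer
    return matvec(U, hidden)                     # linear layer

def build_dsn(in_dim, hidden_dim, out_dim, n_modules, seed=0):
    rng = random.Random(seed)
    modules = []
    for index in range(n_modules):
        extra = 0 if index == 0 else out_dim     # later modules also see the previous output
        modules.append(make_module(in_dim + extra, hidden_dim, out_dim, rng))
    return modules

def dsn_forward(modules, x):
    y = []
    for module in modules:
        y = module_forward(module, list(x) + y)  # raw input + previous module's output
    return y

network = build_dsn(in_dim=4, hidden_dim=5, out_dim=3, n_modules=3)
print(dsn_forward(network, [0.1, 0.2, 0.3, 0.4]))
```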
0:45:49	And this idea can also be used for this
0:45:53	potential network, which I don't have the time to go through here. And the reason
0:45:56	why I bring that up is
0:45:57	because it's actually related to some recent work
0:46:00	that I have seen
0:46:01	for generative models, which are a kind of conversion of each other, so let me compare the
0:46:07	two of
0:46:08	them to give you some example to show how
0:46:10	both
0:46:11	networks can help each other.
0:46:13	So when we developed this deep stacking network the activation function had to be fixed:
0:46:20	either logistic or ReLU, which both work
0:46:22	reasonably well,
0:46:23	you know, compared
0:46:25	with each other.
0:46:28	Now look at this architecture.
0:46:31	Almost identical architecture.
0:46:33	So now
0:46:35	if you change the
0:46:38	activation function to be something very strange, I don't expect you to know anything about
0:46:42	this
0:46:43	and this is actually work done by Mitsubishi people.
0:46:46	There's a very nice paper over here in the technical ??
0:46:50	I spent a lot of time talking to them and they even came to
0:46:52	Microsoft, so actually I listened to some of them and their demo.
0:46:56	So the activation function for this model, which is called the Deep Unfolding Model,
0:47:00	is derived from the inference method in a generative model.
0:47:06	Which is not fixed as in the ?? I showed you earlier. So to stop
0:47:11	this model .. it looks like deep neural network, right?
0:47:14	But at the beginning,
0:47:16	the initial phase, theirs is a generative model, which is specifically,
0:47:20	I hope many of you know it, non-negative matrix factorization. This is a specific technique
0:47:26	which actually is a shallow generative model.
0:47:29	It actually makes a very simple assumption that
0:47:32	the
0:47:33	observed noisy speech or mixed speakers' speech is the sum of two sources
0:47:40	in the spectral domain.
0:47:41	That was the assumption they made,
0:47:43	and then of course they have to enforce that each,
0:47:46	you know,
0:47:47	each vector is non-negative because these are magnitude spectra.
0:47:52	What they do for inference becomes an iterative technique.
0:47:58	And that
0:47:59	model automatically embeds the domain knowledge about how the observation
0:48:04	is obtained, you know, through the mix between the two.
0:48:08	And then this work essentially said: take that iterative inference technique, and every single
0:48:13	iteration is treated as a different
0:48:16	layer.
0:48:18	After this they do the back propagation training.
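A minimal sketch of the unfolding idea described above, under stated assumptions. The inference step used here is the standard multiplicative update for non-negative matrix factorization with a squared-error objective and a fixed basis; the talk does not say which divergence the original work used. Each iteration is written as one "layer" with its own basis matrix, so the layers could later be untied and tuned by back-propagation (not shown). Function names and the toy basis are hypothetical.

```python
def matvec(W, h):
    """W has one row per frequency bin and one column per basis vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def transpose_matvec(W, v):
    return [sum(W[f][k] * v[f] for f in range(len(v))) for k in range(len(W[0]))]

def nmf_layer(W, v, h, eps=1e-9):
    """One multiplicative update of the activations h; keeps h non-negative."""
    numerator = transpose_matvec(W, v)
    denominator = transpose_matvec(W, matvec(W, h))
    return [hk * n / (d + eps) for hk, n, d in zip(h, numerator, denominator)]

def unfolded_inference(layer_bases, v):
    """Run one update per layer; each layer may carry its own basis matrix."""
    h = [1.0] * len(layer_bases[0][0])
    for W in layer_bases:
        h = nmf_layer(W, v, h)
    return h

def reconstruction_error(W, v, h):
    return sum((a - b) ** 2 for a, b in zip(matvec(W, h), v))

# Toy example: 3 frequency bins, 2 basis vectors (think of one per source).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [2.0, 3.0, 5.0]                      # equals W applied to [2, 3]
shallow = unfolded_inference([W], v)
deep = unfolded_inference([W] * 50, v)
print(reconstruction_error(W, v, shallow), reconstruction_error(W, v, deep))
```

With tied bases, as in this toy run, the network is just ordinary iterative NMF inference; the point of unfolding is that the per-layer bases become trainable parameters.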
0:48:21	And the back-propagation is possible
0:48:24	because
0:48:25	the problem is very simple; the application here is speech enhancement,
0:48:29	therefore the objective function is a mean-square error, very easy. So the generative model
0:48:34	actually gives you
0:48:39	the
0:48:40	generated observation,
0:48:42	and then
0:48:43	your output is clean speech.
0:48:45	Okay, then you do mean-square error and you actually adapt all these weights,
0:48:48	and the results are very impressive. So now this is why
0:48:52	I showed you this: you can design a deep neural network, and
0:48:55	if you use this
0:48:57	type of
0:48:58	activation function you automatically build in the constraints that you use in the generative model,
0:49:03	and that's a
0:49:04	very good example to show
0:49:06	the message that I'm going to,
0:49:09	actually I put in the beginning of the (presentation) it's
0:49:11	hope of deep generative model. So this is
0:49:14	shallow model and it's easy to do it. Now for deep generative model
0:49:18	it's very hard to do.
0:49:19	And one of the reasons I put this as a topic today is partly because,
0:49:25	before this conference,
0:49:27	just three months ago,
0:49:30	at the ICML conference in Beijing,
0:49:33	there was a very nice development
0:49:35	of learning methods for deep generative models.
0:49:40	They actually linked this
0:49:42	neural network and Bayes net together
0:49:44	through some transformation
0:49:46	and because of that .. the main idea of .. whole bunch of papers including
0:49:51	Michael Jordan,
0:49:52	whole bunch, you know, a lot of very well known people
0:49:54	in machine learning for deep generative model
0:49:56	so the main
0:49:58	point of this set of work, I just want to use one simple sentence to
0:50:03	summarize them,
0:50:03	is that
0:50:04	when you originally tried to do the E step I showed you early on,
0:50:09	you had to factorize them in order to get the E step done,
0:50:12	and that was an approximation,
0:50:13	and there was very nice ?? developed. A ?? so large it's practically useless
0:50:18	in terms of inferring the top layer
0:50:24	discrete event.
0:50:25	Now the whole point is that now we can relax that constraint of factorization,
0:50:30	and before, like three years ago, if you did that, if you used a rigorous
0:50:35	dependency,
0:50:36	you didn't get any reasonable analytical solution, so you could not do EM.
0:50:42	Now this
0:50:43	idea is to say that, well, you can approximate
0:50:48	that factorisation,
0:50:49	you can approximate that dependency in E-step learning
0:50:52	not through
0:50:55	factorization, which is called the mean field approximation,
0:50:57	but by using a deep neural network to approximate it.
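The contrast described above can be written down as a short worked formula. This is a sketch under stated assumptions, not notation taken from the talk: x is the observation, h the hidden variables, \theta the generative model's parameters, and \phi the weights of the approximating network (all symbols are hypothetical).

```latex
% Variational lower bound used in the approximate E step
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q(h \mid x)}\!\left[ \log p_\theta(x, h) - \log q(h \mid x) \right]

% The gap between the two sides is the approximation error of q
\log p_\theta(x)
  - \mathbb{E}_{q(h \mid x)}\!\left[ \log p_\theta(x, h) - \log q(h \mid x) \right]
  = \mathrm{KL}\!\left( q(h \mid x) \,\|\, p_\theta(h \mid x) \right)

% Mean-field approximation: q is forced to factorize over the hidden variables
q(h \mid x) = \prod_i q_i(h_i)

% Neural-network approximation: q is computed from x by a network with weights \phi
q(h \mid x) = q_\phi(h \mid x)
```

The bound becomes tight when the KL term goes to zero, which is the sense in which a sufficiently flexible q_\phi can shrink the approximation error.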
0:51:01	So this is an example to show that a deep neural network actually helps you to solve
0:51:05	the deep generative model problem, and
0:51:07	so this is the well-known Max Welling, a very good friend of mine in machine
0:51:12	learning.
0:51:14	And he told me that the paper never show that.
0:51:17	And they really developed the
0:51:20	theorem to prove that if the network is large enough
0:51:24	the approximation error can approach
0:51:26	zero. Therefore the error of variational learning
0:51:31	can be eliminated, and that's a very encouraging
0:51:33	development that really gives me some evidence
0:51:36	to see that this is
0:51:38	a promising approach. I think the machine learning community develops the tools,
0:51:42	our speech community develops the verification
0:51:45	and also methodology as well,
0:51:47	but if,
0:51:48	you know, we actually cross-connect
0:51:50	to each other we are going to make much more progress, and this type
0:51:55	of development
0:51:55	really
0:51:56	gives a
0:51:58	promising direction
0:52:00	towards the main message I put out at the beginning.
0:52:03	Okay, so now I am gonna show you some deeper results that I want to
0:52:07	show you.
0:52:09	Another better architecture that we have known is what's called the recurrent network; if you
0:52:14	read
0:52:14	this Beaufays paper on LSTM, look at that result. For
0:52:18	voice search the error rate dropped to about ten percent. That's a very impressive result.
0:52:22	Another type of architecture is to integrate the convolutional
0:52:27	and non-convolutional layers together. That was ??
0:52:30	in the previous result. As far as I know there isn't any better result than this.
0:52:33	??
0:52:33	So these are the state of the art for the Switchboard (SWBD) task.
0:52:37	So now I'm going to concentrate on this type of
0:52:40	recurrent network here.
0:52:43	Okay, so this comes down to one of my main messages here.
0:52:47	So we fixed this kluge
0:52:51	by
0:52:51	a recurrent network.
0:52:54	We also fixed this kluge automatically
0:52:58	by
0:53:00	just using big data.
0:53:01	Now how do we fix this kluge?
0:53:05	So first of all I'll show you some analysis on recurrent network vs. deep generative
0:53:11	model
0:53:11	so that's called hidden dynamic model I showed you early on, okay.
0:53:14	And so far this analysis hasn't been applied to LSTM.
0:53:17	So some further analysis may
0:53:20	actually automatically give rise to LSTM using some analysis on this.
0:53:24	So this analysis is very preliminary
0:53:27	and so if you stare at the equation
0:53:29	for the recurrent network it looks like this one. So essentially you have the state
0:53:33	equation,
0:53:34	and it's recursive.
0:53:35	Okay,
0:53:36	from previous hidden layer to this.
0:53:40	And then you get the output
0:53:43	that produces the label.
0:53:45	Now if you look at this deep generative model, the hidden dynamic model, it's the
0:53:48	identical equation,
0:53:50	okay? Now what's the difference?
0:53:52	The difference is that the input now is the label. Actually if you just put in the
0:53:56	label
0:53:57	you cannot drive it. So you have to make some connection between labels and continuous
0:54:01	variables,
0:54:02	and that's what in phonetics
0:54:03	people call the phonology-to-phonetics interface, okay.
0:54:06	So we use a very basic assumption,
0:54:08	that the interface is simply that each label corresponds to a target vector,
0:54:14	or actually, the way that we implemented it earlier, a distribution; you can do that to account for
0:54:18	speaker
0:54:18	differences, etcetera. Now the output
0:54:21	of this recursion gives you the observation,
0:54:24	and that's the recurrent filter type of model.
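The correspondence described above can be sketched as two pairs of equations. The symbols are assumptions introduced here for illustration (the slides are not part of the transcript): x_t is the acoustic observation, y_t the label output, h_t the hidden state, l_t the label at time t, and t_{l_t} the target vector associated with that label.

```latex
% Recurrent neural network: observation in, label out
h_t = f\!\left( W_{hh}\, h_{t-1} + W_{xh}\, x_t \right), \qquad
y_t = g\!\left( W_{hy}\, h_t \right)

% Hidden dynamic model: label (via its target vector) in, observation out
h_t = f\!\left( W_{hh}\, h_{t-1} + W_{lh}\, t_{l_t} \right), \qquad
x_t = g\!\left( W_{hx}\, h_t \right)
```

Both share the same recursive state equation; what changes is the direction, that is, which quantity drives the recursion and which one it produces.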
0:54:28	And that's engineering model and there's neural network model, okay. So every time I was
0:54:32	teaching
0:54:32	?? I called ?? on this.
0:54:34	So we fully understood all the constraints for this type of model.
0:54:39	Now for this model it looks the same, right?
0:54:41	So if you reverse direction you convert one model to another.
0:54:44	And for this model it's very easy to put in a constraint, for example
0:54:49	the dynamics:
0:54:50	the
0:54:53	matrix here that governs
0:54:56	the internal dynamics in the hidden domain actually can be made sparse, and then you
0:55:00	can put
0:55:02	realistic constraints there. For example, in our
0:55:04	earlier implementation of this we put in critical dynamics,
0:55:08	therefore you can guarantee it doesn't oscillate. When we do articulation we need phone boundaries.
0:55:12	This is the speech production mechanism;
0:55:15	you can put it in simply by fixing the sparse matrix.
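A minimal sketch of what a non-oscillating, target-directed dynamic constraint can look like. It assumes that "critical dynamics" refers to critically damped second-order dynamics, which is an interpretation and not something the transcript states; the scalar state, the pole value `r`, and the function name are all hypothetical.

```python
def critically_damped_trajectory(start, target, r=0.8, steps=60):
    """Second-order target-directed recursion with a repeated real pole at r.

    h[t] = 2*r*h[t-1] - r*r*h[t-2] + (1-r)**2 * target
    With 0 < r < 1 and a state that starts at rest, the trajectory moves
    towards the target without oscillating around it.
    """
    prev2 = prev1 = start  # the state starts at rest
    trajectory = []
    for _ in range(steps):
        current = 2.0 * r * prev1 - r * r * prev2 + (1.0 - r) ** 2 * target
        trajectory.append(current)
        prev2, prev1 = prev1, current
    return trajectory

path = critically_damped_trajectory(start=0.0, target=1.0)
print(path[:5], path[-1])
```

In a multi-dimensional version the coefficients 2r and r squared would sit on the diagonal of the state matrices, which is one way a constraint like this ends up as a fixed sparse matrix.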
0:55:17	Actually one of the slides I'm gonna show you is all about this.
0:55:22	In this one we cannot do it; everything has to be unstructured.
0:55:25	There's just no way you can say that you want the dynamics
0:55:29	to behave in a certain way.
0:55:32	You just don't have any mechanism to design the structure of this, whereas this is
0:55:36	very natural; it's the physical
0:55:37	properties that design this. Now because of
0:55:40	this correspondence and because of the fact that now we can do
0:55:44	deep inference
0:55:47	once all this machine learning technology is fully developed,
0:55:51	we can very naturally bridge the two (models together).
0:55:53	It may turn out that if you do more
0:55:55	rigorous analysis
0:55:56	by
0:55:57	making the inference of this fancier,
0:56:00	our hope is that
0:56:02	this
0:56:03	multiplicative
0:56:04	kind of unit would automatically emerge from this type of model, but that has not
0:56:08	been shown yet.
0:56:10	So of course this is just, you know, a very high-level comparison between the two;
0:56:15	there are a lot of detailed comparisons you can make in order to bridge the
0:56:19	two,
0:56:19	so actually my colleague Dong Yu wrote this book that's coming out very soon.
0:56:26	So in one of the chapters we put all these comparisons: interpretability, parametrization, methods
0:56:32	of learning and nature of representation and all the differences.
0:56:36	So it gives a chance to actually understand
0:56:38	how deep generative model in terms of dynamics
0:56:42	and recurrent network in terms of recurrence can
0:56:44	be matched with each other, so I will read that over here.
0:56:48	So I have the final five, three more minutes, five more minutes. I will go
0:56:53	very quickly.
0:56:54	Every time I talk about this I run out of time.
0:56:57	So
0:56:59	so the key concept is called embedding.
0:57:01	Okay, so actually you can find in the literature of the eighties and nineties that this
0:57:07	basic idea was around.
0:57:09	For example in this special issue of
0:57:12	Artificial Intelligence there are very nice papers; I had a chance to read them all.
0:57:15	They are very insightful, and some of the chapters over here are very good.
0:57:18	So the idea is that each physical entity or linguistic,
0:57:23	you know,
0:57:24	entity,
0:57:25	a word, a phrase, even a whole article or a whole paragraph,
0:57:29	can be embedded into a
0:57:30	continuous-space vector. It could be big ??, you know.
0:57:34	Just to let you know, it's a special issue on this topic.
0:57:38	And that's why it's an important concept.
0:57:41	The second important concept, which is much more advanced,
0:57:44	is described by a few books over here. I really enjoyed reading some of
0:57:49	those and I invited those
0:57:50	people to come visit me.
0:57:52	We have a lot to discuss on that. You can actually even embed a structure,
0:57:56	a
0:57:57	nested symbolic structure, into a vector
0:58:01	where you can recover the structure completely through vector
0:58:04	operations, and the concept is called tensor-product representation.
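A minimal sketch of a tensor-product representation, to make the "embed a structure and recover it exactly" claim concrete. Each filler (for example a word) is bound to a role (for example a position) by an outer product, and the bound pairs are summed; with orthonormal role vectors, multiplying by a role vector recovers its filler exactly. The tiny one-hot vectors and the names `bind` and `unbind` are assumptions for illustration.

```python
def outer(filler, role):
    return [[f * r for r in role] for f in filler]

def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def bind(pairs):
    """Sum of filler-role outer products: the tensor-product representation."""
    total = [[0.0] * len(pairs[0][1]) for _ in pairs[0][0]]
    for filler, role in pairs:
        total = add(total, outer(filler, role))
    return total

def unbind(tensor, role):
    """Recover the filler bound to an orthonormal role vector."""
    return [sum(t * r for t, r in zip(row, role)) for row in tensor]

# Hypothetical fillers (words) and roles (positions in a two-word phrase).
hot, dog = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
left, right = [1.0, 0.0], [0.0, 1.0]

phrase = bind([(hot, left), (dog, right)])
print(unbind(phrase, left), unbind(phrase, right))
```

The matrix `phrase` can be flattened into a single vector; nested structures are handled by using such representations as fillers in turn.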
0:58:08	So I don't have .. if only I had three hours I could go through
0:58:11	all of this.
0:58:11	But for now I'm going to elaborate on this for the next two minutes.
0:58:16	So
0:58:17	this is the recurrent neural network model and this is very nice, I mean this
0:58:21	is a fairly informative paper
0:58:22	to show that embedding can be done
0:58:25	as a byproduct of the recurrent neural network; that
0:58:28	paper was published in Interspeech several years ago.
0:58:34	And then I'll talk very quickly about semantic embedding at MSR, so
0:58:39	the difference between this set of work and the previous work is that in the previous work
0:58:42	everything is completely unsupervised,
0:58:44	so in a company, if you have supervision you should grab it, right.
0:58:48	So we actually took initiative to actually take some
0:58:51	very smart
0:58:52	exploitation of supervision signals
0:58:54	at virtually no cost.
0:58:57	So the idea here is that this is the model that we have; essentially
0:59:01	each branch is a deep neural network. Now different
0:59:03	branches can actually be linked together
0:59:05	through what's called the, you know, cosine distance.
0:59:08	So that
0:59:09	distance can be measured
0:59:10	in terms of
0:59:11	a vector, in a vector space.
0:59:13	And now we do MMI learning,
0:59:16	so if you get hot dog in this one, and your document is talking about
0:59:20	fast food or something, even if
0:59:22	there's no word in common you pick it up,
0:59:24	because the supervision actually links them together.
0:59:27	But if you have dog racing here,
0:59:29	they have a word in common, yet they will be very far apart from each other.
0:59:33	And that can be automatically done.
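Editorial sketch: the two-branch model described above can be illustrated as follows. This is a minimal forward-pass sketch under stated assumptions, not the speaker's implementation. It assumes bag-of-words inputs, two small tanh networks with random weights, and a softmax over scaled cosine similarities as the supervised objective. The scale `gamma`, the layer sizes, and the choice of document 0 as the supervised match are all hypothetical.

```python
import numpy as np

def branch(x, weights):
    """One branch: a small feed-forward network mapping input to an embedding."""
    h = x
    for W in weights:
        h = np.tanh(h @ W)
    return h

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def posterior(query_vec, doc_vecs, gamma=10.0):
    """Softmax over scaled cosine similarities between a query and documents."""
    logits = gamma * np.array([cosine(query_vec, d) for d in doc_vecs])
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
vocab, hidden, embed = 50, 16, 8

def init():
    # Assumed two-layer branch with small random weights.
    return [rng.normal(scale=0.1, size=(vocab, hidden)),
            rng.normal(scale=0.1, size=(hidden, embed))]

q_weights, d_weights = init(), init()
query = rng.random(vocab)      # toy bag-of-words query
docs = rng.random((4, vocab))  # toy candidate documents

q_vec = branch(query, q_weights)
d_vecs = np.stack([branch(d, d_weights) for d in docs])
probs = posterior(q_vec, d_vecs)

# Supervised loss: negative log posterior of the document that the
# supervision signal pairs with the query (assumed to be document 0 here).
loss = -np.log(probs[0])
```

Minimizing a loss of this kind pulls a query and its paired document together in the vector space even when they share no words, and pushes unpaired documents apart even when they do.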
0:59:37	And some people told me that a topic model can do
0:59:39	similar things, so we compared that with the topic model.
0:59:42	It turned out that ??
0:59:45	and using this
0:59:46	deep semantic model
0:59:48	you can do much, much better.
0:59:49	So, now multi-modal. Just one more slide.
0:59:53	So it turned out that it's not only text you can embed into it.
0:59:57	Images can be embedded, speech can be embedded, and you can do something very similar
1:00:01	to the one I showed you earlier.
1:00:03	And this is the paper that was in yesterday's talk about embedding.
1:00:09	That's very nice, I mean it's a very similar concept.
1:00:12	So I looked at this and I said, wow, it's just like the model that
1:00:15	we did for the text.
1:00:16	But it turned out that the application is very different.
1:00:18	So actually
1:00:20	I don't have time to go through this here. I encourage you to read some of the papers
1:00:24	over here. Let's skip this.
1:00:25	So this was just to show you some applications for this
1:00:27	semantic model. You can do all these things. For web search
1:00:30	we applied them, quite nicely. For machine translation you have one entity
1:00:34	to be one language
1:00:37	in the list of papers that were published you can find some detail.
1:00:40	You can actually do summarization and entity ranking.
1:00:45	So let's skip this. This is the final slide, the real final slide.
1:00:49	I don't have any summary slides, this is my summary slide.
1:00:51	So I copied the main message here, now elaborated a bit more after going through
1:00:55	the whole hour of presentation.
1:00:57	Now in terms of applications, we have seen
1:01:00	speech recognition.
1:01:01	The green is
1:01:03	the neural network, the red is the deep generative model. So
1:01:07	I said a few words about the deep generative model and the dynamic model,
1:01:11	that's the generative model side, and LSTM is the other side. Now for speech enhancement
1:01:16	I showed you these types of models,
1:01:19	and then
1:01:20	on the generative model side I showed you this one,
1:01:25	and this is a shallow generative model that actually can
1:01:28	give rise to a deep structure, which corresponds to the
1:01:31	deep
1:01:33	stacking network I showed you early on. Now for the algorithm we have back propagation
1:01:37	here.
1:01:38	That's the single unchallenged
1:01:40	algorithm for the deep neural network.
1:01:42	Now for the deep generative model there are two algorithms. They are both called
1:01:45	BP.
1:01:47	So one is called Belief Propagation, for those of you who know machine learning.
1:01:51	The other one is BP, same as this.
1:01:54	That only came up within the last two years,
1:01:57	due to this new advancement
1:02:00	of porting the deep neural network
1:02:02	into the inference step
1:02:04	of this type of model, so I call them BP and BP.
1:02:08	And in terms of neuroscience, you call this one wake and you call
1:02:11	the other sleep.
1:02:12	And in sleep you generate things, you get hallucination, and then when you're awake
1:02:16	you have perception.
1:02:17	You get information there. I think that's all I want to say. Thank you very
1:02:20	much.
1:02:29	Okay. Anyone? One or two quick questions?
1:02:37	Very interesting talk.
1:02:40	I don't want to talk about your main point which is very interesting.
1:02:43	Actually just very briefly about one of your side messages, which is about waveforms.
1:02:48	So you know the ?? paper, they weren't really putting in
1:02:54	waveforms.
1:02:54	They are putting in the waveforms, taking the absolute value, flooring it, taking the
1:02:58	logarithm, averaging over, but you know, so you had to do a lot of things.
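Editorial sketch: the chain of operations listed in the question (absolute value, floor, logarithm, average) can be written out as below. This is only an illustration of the steps as named in the question, not the pipeline of the paper being discussed, whose details are not given here. The frame length, the floor value, and averaging over non-overlapping frames are all assumptions, and `waveform_features` is a hypothetical name.

```python
import numpy as np

def waveform_features(wave, frame_len=160, floor=1e-5):
    """Absolute value -> floor -> log -> average over non-overlapping frames."""
    usable = len(wave) - len(wave) % frame_len   # drop the incomplete tail
    frames = wave[:usable].reshape(-1, frame_len)
    magnitude = np.maximum(np.abs(frames), floor)  # floor avoids log(0)
    return np.log(magnitude).mean(axis=1)          # one value per frame

# Assumed toy input: 0.1 s of a 440 Hz sine sampled at 16 kHz.
t = np.arange(1600) / 16000.0
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)
feats = waveform_features(wave)
```

Even this short chain shows the questioner's point: the network input is a processed version of the samples, not the raw waveform itself.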
1:03:03	Secondly, in the other papers, there's been a modest
1:03:07	amount of work in the last few years on doing this sort of thing.
1:03:10	Pretty generally people do it with matched training and test conditions.
1:03:14	If you have mismatched conditions, good luck with
1:03:16	waveforms. I always hate to say something is impossible, but good luck.
1:03:24	Thank you very much. ?? good for everything.
1:03:27	And look at presentation that was very nice, thank you.
1:03:32	Any other quick questions?
1:03:36	If not I invite Haizhou
1:03:40	to give a plaque.

Keynotes

Li Deng, Microsoft Research, Redmond, USA