First of all, I would like to thank the organisers for giving me this opportunity to share with you some of my personal views on this very hot topic. I think the goal of this tutorial really is to help diversify the deep learning approach, just like the theme of this conference, Interspeech, is about the diversity of languages, okay. So I have a long list of people here to thank.

Especially Geoff Hinton, whom I worked with for some period of time, and Dong Yu and a whole bunch of Microsoft colleagues who contributed a lot to the material I'm going to go through. And I would also like to thank many of the colleagues sitting here who have had a lot of discussions with me; their opinions also shaped some of the content that I am going to go through with you over the next hour.

Yeah, so the main message of this talk is that deep learning is not the same as a deep neural network. I think in this community most people equate deep learning with deep neural networks. Deep learning itself is something that everybody here knows about; I think I counted close to 90 papers at this conference related to deep learning, or at least approaching it, and the number of papers has been increasing roughly exponentially over the last twelve years.

So a deep neural network is essentially a neural network that you can unfold in space, so that you form a big network, and/or you can unfold it over time: if you unfold the neural network over time you get a recurrent network, okay.

But there's another very big branch of deep learning, which I would call the deep generative model. Like a neural network, it can also be unfolded in space and in time; if it's unfolded in time, you would call it a dynamic model. It's essentially the same concept: you unfold the network. They unfold in the same direction in terms of time, but in terms of space they are unfolded in the opposite direction, and I'm going to elaborate on this part. For example, our very commonly used model, the Gaussian mixture model hidden Markov model, is really this kind of generative model unfolded in time; but if you also unfold it in space you get a deep generative model, which hasn't been very popular in our community. I'm going to survey a whole bunch of work related to this area, informed by my discussions with many people here.

But anyway, the main message of this talk is a hope, and I think it's a promising direction that is already taking shape in the machine learning community. I don't know how many of you actually went to the International Conference on Machine Learning (ICML) this year, just a couple of months ago in Beijing, but there is a huge amount of work on deep generative models there and some very interesting developments, which I'd like to share with you at a high level. You can see that although the application of deep learning started in our speech community, which we should be very proud of, there is now a huge amount of work going on in the machine learning community on deep generative models. So I hope I can share some of these recent developments with you, to reinforce the message that a good combination of the two approaches, which have complementary strengths and weaknesses, can further advance deep learning in our community here.

Okay, so now, these are very dense slides and I'm not going to go through all the details. In order to reinforce the message that the generative model and the neural network model can help each other, I'm just going to highlight a few key attributes of both approaches; they are very different approaches, and I'll go over this very briefly. First of all, in terms of structure they are both graphical in nature, as networks, okay. If you think about the deep generative model, typically we call some of these dynamic Bayesian networks: you actually have a joint probability between the label and the observation, which is not the case for the deep neural network, okay.

In the literature you see many other terms related to the deep generative model, like probabilistic graphical models, stochastic neurons, or sometimes stochastic generative networks; they all belong to this category. So if your mindset is over here, then even when you see neural-sounding words describing these models, you won't be able to read all this literature; the mindset is very different when you study these two.

So one strength of the deep generative model, and this is very important to me, is interpretability, okay. Everybody I talk to, including the students at lunchtime, complains about this. I ask: have you heard about deep neural networks? And everybody says yes, we have. To what extent have you started looking into them? And they say: we don't want to do that, because we cannot even interpret what's in the hidden layers, right. And that's true,

and that is actually by design. If you read the cognitive science literature on connectionist models, the whole design is that the representation should be distributed: each neuron can represent different concepts, and each concept can be represented by different neurons. So by its very design it's not meant to be interpretable, okay. That actually creates difficulty for many people. This model is just the opposite: it's very easy to interpret because of the very nature of the generative story. You can tell what the process is,

and then of course if you want to do classification or some other machine learning application, you simply use Bayes rule to invert the model. That's exactly what our community has been doing for thirty years with the hidden Markov model: you get the prior, you get the generative model, you multiply them, and then you decode. Except at that time we didn't know how to make this type of model deep, and there are some pieces of work on that which I'm going to survey.
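For readers who want the inversion spelled out, this is just standard Bayes rule in generic notation, nothing specific to the speaker's system:

\[
P(\text{label} \mid \mathbf{x}) \;=\; \frac{p(\mathbf{x} \mid \text{label})\,P(\text{label})}{p(\mathbf{x})}
\qquad\Longrightarrow\qquad
\hat{w} \;=\; \arg\max_{w}\; p(\mathbf{x} \mid w)\,P(w),
\]

where \(p(\mathbf{x}\mid w)\) is the generative acoustic model (e.g. an HMM) and \(P(w)\) is the prior, e.g. a language model.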

So that's one big part of the advantage of this model. Of course everybody knows what I just mentioned there: in the deep generative model the information flow is from top to bottom. 'Top' simply means the label or a higher-level concept, and the lower level simply means the observations that are generated to fit it. And everybody knows that in a neural network the information flow is from bottom to top, okay: you feed in the data, you compute whatever output you want, and then you go wherever you want from there. In this case the information comes from top to bottom: you generate the information, and then if you want to do classification or any other machine learning application, you apply Bayes rule; Bayes rule is essential for this.

But there's a whole list of these attributes that I don't have time to go through; those are just the highlights we have to mention. So the popularity that the deep neural network gained over the previous years is really mainly due to these strengths. It's easier to do the computation; what I wrote here is 'regular computation', okay. If you look at exactly what kind of computation is involved, it's just millions and millions of multiplications of a big matrix by a vector, done many times, and it's very regular. Therefore the GPU is ideally suited for this kind of computation, and that's not the case for this model.

So if you compare the two, you will really see that if you can pull some of the advantages in this column into this model, and some of the advantages in this column into that one, you get an integrated model. That's the message I'm going to convey, and I'm going to give you examples to show how this can be done. Okay, so interpretability is very much related to how to incorporate domain knowledge and constraints of the problem into the model, and for the deep neural network that's very hard. I have seen many people at this conference and elsewhere try very hard, and it's not very natural.

For this model it is very easy: you can encode your domain knowledge directly into the system. For example, for distorted or noisy speech, in the spectral domain or the waveform domain the observation you get is simply the summation of the noise plus the clean speech. That's so simple: you just encode it as one layer, a summation, or you can express it in terms of Bayesian probabilities very easily.
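To make that concrete, the standard additive-distortion relations usually assumed in such generative models look like this (generic textbook forms, not necessarily the exact equations on the slide):

\[
y[t] = x[t] + n[t] \quad \text{(waveform or linear-spectral domain)},
\qquad
\mathbf{y} \;\approx\; \mathbf{x} + \log\!\big(1 + e^{\,\mathbf{n}-\mathbf{x}}\big) \quad \text{(log-spectral domain)},
\]

and either form can be written directly as one conditional distribution \(p(\mathbf{y} \mid \mathbf{x}, \mathbf{n})\), i.e. a single layer of the generative model.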

For the neural network this is not that easy to do; people have tried, but it's just not as easy. So being able to encode domain knowledge and the constraints of the problem into your deep learning system is a great advantage.

This is just a random selection of examples. There's a very nice paper here on acoustic phonetics, all this knowledge about speech production, this kind of nonlinear phonology, and this example on noise robustness: if you take the phase information of the speech and the noise into account, you can come up with a very nice conditional distribution. It's kind of complicated, but it can be put directly into a generative model, and this is an example of that; in a deep neural network it's very hard to do.

So the question is: do we want to throw away all this knowledge in deep learning? My answer is of course no, and most people here will say no, okay. Some people from outside the speech community would say yes; I'm talking about some people in machine learning. Anyway, since this is a speech conference I really want to emphasise this: the real, solid, reliable knowledge that we have attained from speech science, reflected in talks here such as yesterday's talk about how speech patterns are shaped by production and by perception, can really play a role in the deep generative model, but it is very hard to use in the deep neural network.

So with this main message in mind, I'm going to go through the three parts of the talk as I laid them out in my abstract, and I'll need to go through all three topics very briefly. Okay, so the first part is a very brief history of how deep learning in speech recognition started. This is a very simple list; there were so many papers around before the rise of deep learning in 2009 and 2010, and I hope I have a reasonable sample of that work here.

I don't have time to go through it all, but especially for those of you who were at the Arden House workshop, that was in 1988, I think, the predecessor of ASRU, and at that time there was no U, it was just ASR: there were some very nice neural network papers around then, and that work was quickly superseded by the hidden Markov model approach.

I'm not going to go through all of these, except to point out that neural networks were very popular for a while. But in the ten-plus years before deep learning actually took over, the neural network approach essentially didn't make as strong an impact as the deep networks people have been seeing now.

Let me give you just one example to show how unpopular neural networks were at that time. This was about 2006 or 2008, about nine years ago, and this was the organization that I think is the predecessor of IARPA. They got several of us together and locked us up in a hotel somewhere near the Washington, D.C. airport. Essentially the goal was to say: well, speech recognition is stuck, so come over here and help us brainstorm the next generation of speech recognition and understanding technology. We spent about four or five days in the hotel, and at the end we wrote a fairly thick report, twenty-some pages.

There was some interesting discussion about history, and the premise was: if the government gives you unlimited resources and fifteen years, what could you do, right? Look at what most of the people in our discussion focused on: essentially max-margin methods are here, Markov random fields here, conditional random fields here and graphical models here. That was just a couple of years before deep learning actually came out, so the neural network was just one of the tools around at that time; it hadn't really made a big impact.

On the other hand, the graphical model is worth mentioning here because it's related to the deep generative model. So I'm going to show you a little bit; this is a slide about deep generative models, and I made a list over here. But anyway, let's go over here.

I just want to highlight a couple of items related to the introduction of the deep neural network into the field. Okay, so one of these: this is with John Bridle; we actually spent a summer together at the Johns Hopkins workshop in 1998, fifteen-some years ago, and it was a really interesting summer. That's the kind of model, a deep generative model, of which we put together two versions, and at the end we wrote a very thick report, about eighty pages.

So this is a deep generative model, and it turned out that both of those versions were implemented using neural networks. Think of a neural network as simply a function, a mapping: you map the hidden representation that is part of the deep generative model into whatever observation you have, MFCCs; everybody used MFCCs at the time. You need that mapping, and it was done with a neural network in both versions. This is the statistical version, which we call the hidden dynamic model; it's one version of a deep generative model. It didn't succeed, and I'll show you the reason why; now we understand what happened.

Okay, so interestingly enough, if you read the report from this workshop, and Geoff told me that the video of the workshop is still around somewhere so you can dig it out, it turns out that the learning we used, whose details are in the report, was actually back-propagation. But the direction isn't from the top down: since the model is top-down, the error propagation must go bottom-up. Nowadays when we do speech recognition the error function is a softmax, or sometimes the mean square error, and the error is measured with respect to the labels. This is the opposite: the error is measured in terms of how well the generative model matches the observations, and when you want to learn, you propagate bottom-up, which still turns out to be back-propagation. So back-propagation doesn't have to go from top to bottom; it can go bottom-up, depending on what kind of model you have. The key is that it's a gradient descent method.

So actually we got disappointing results on Switchboard; our game was a bit off. Now we understand why, but we didn't at the time; I'm sure some of you have experienced the same thing. I have thought a lot about how deep learning and this kind of model can be integrated.

Okay, so this is a fairly simple model, okay. You have this hidden representation, and it has specific constraints built into the model, which by the way is very hard to do in a bottom-up neural network. In the generative model you can put them in very easily; for example, the articulatory trajectory has to be smooth, and the specific form of that smoothness can be built in directly, simply by writing down the generative probabilities. Not so in the deep neural network.

At the same time, and this was also done at ??, we were even able to put in nonlinear phonology, in terms of decomposing the phonemes into their individual constituents at the top level; ?? also has a very nice paper from some fifteen years ago talking about this. And noise robustness can be integrated directly into the articulatory model simply through the generative story; for a deep neural network this is very hard to do.

For example, and this slide is not meant to be read in detail, this is essentially one of the conditional likelihoods that covers one of the links. Every time you have a link, you have a conditional dependency from parent to children, each with different neighbours, and you can specify them as conditional distributions. Once you do that you have formed a model, and you can embed whatever knowledge you have, whatever you think is good, into the system. But the problem is that the learning is very hard, and that learning problem was only solved in the machine learning community within the last year. At that time we just didn't know; we were so naive.

We didn't really understand all the limitations of the learning. So just to show you what we did: one piece of this I actually worked on with my colleague Hagai Attias. He was working not far from me at that time, some ten years ago, and he is one of the people who invented variational Bayes, which is very well known.

So the idea was as follows. You have to break these pieces up into modules, right. For each module you have this continuous dependency on the continuous hidden representation, and it turns out that the way to learn this, in principle, is EM (expectation-maximization), specifically variational EM. The idea is a bit crazy: you know you cannot solve the E-step rigorously, and that's well known, it's a loopy network; so you just cut the dependencies when you carry out the E-step, hoping that the M-step can make up for it. A crazy idea, but that was the best that was around at the time. Still, you get an auxiliary function, and what you form is very similar to the EM we know from the HMM; for the shallow model there are no loops, so you can get a rigorous solution.

But when the model is deep it's very hard, and you have to patch things up. Those patches are arguably just as ad hoc as the things many people criticize in deep neural networks; these deep generative models probably have just as many of them, even though they present themselves as very rigorous. But if you really work on it, and let me pick this out: with this approach we got surprisingly good inference results for the continuous variables.

In one version what we did was to use formants as the hidden representation, and it turned out the model tracked them: once you do this, you track the formants really precisely. As a byproduct of this work we created a database for formant tracking. But for inferring the linguistic units, which is the actual problem of recognition, we didn't make much progress.

But anyway, I'm going to show you some of these preliminary results, to show how this is one path that led to the deep neural network. So we simplified the model in order to make the decoding feasible; this is the ?? result, and we carried out a full error analysis for different kinds of phones.

When we used this kind of generative model with deep structure, it actually corrected many errors related to the short phones. And you can understand why: you designed the model to make that happen, and if everything is done reasonably well you actually get the results. We saw that it not only corrected short vowels but also corrected a lot of consonants, because the two are coupled with each other: by the model's design, whatever hidden trajectory you get is influenced, the vowel portion is influenced, by the adjacent sounds, and that is due to coarticulation.

This can be built into the system very naturally, and one of the things I struggle with in the deep neural network is that you can't build in this kind of information nearly as easily, okay. This is to convince you how the two things can be bridged. It's very easy to interpret the results: we look at the errors and we can see which modelling assumptions are being violated, without having to dig through everything. For example, these examples are the same sounds, okay; you just speak fast and you get something like this. Then we looked at the errors and said: oh, that's exactly what happened. The mistake was made by the Gaussian mixture model because it doesn't take these particular dynamics into account, whereas this model corrected the error. And I'm going to show you that in the deep neural network things are reversed; that's related to ??. But at the same time,

in the machine learning community, outside of speech, a very interesting deep generative model was developed, and that's called the deep belief network. Okay, so in the earlier literature, until about three or four years ago, DBN (deep belief network) and DNN were mixed up with each other, even by the authors, just because most people didn't understand what a DBN is. This very interesting paper from 2006 is regarded by most people in machine learning as the start of deep learning, and it is a generative model, so you could say that the deep generative model actually started deep learning, rather than the deep neural network. But this model has some intriguing properties that really attracted my attention at the time; they are totally not obvious, okay.

For those of you who know the RBM and the DBN: when you stack up this undirected model several times you get a DBN, and you might think that the whole thing would be undirected, a bottom-up machine; no, it's actually a directed model coming down. You have to read the paper to understand why. At first I thought something was wrong; I couldn't understand what was happening. On the other hand it's much simpler than the model I showed you earlier: the deep network there had temporal dynamics, and this one has no temporal dynamics.

So the most intriguing aspect of the DBN, as described in this paper, is that inference is easy. Normally you think inference is hard; that's the tradition, and it's a given that if you have these multiple dependencies from the top it's very hard to do inference. But there's a special constraint built into this model, namely the restriction on the connections of the RBM, and because of that the inference becomes easy; it's a special case. This is very intriguing, so I thought this idea might help the deep generative model I showed you earlier. So Geoff came to Redmond and we discussed it. It took him a while to explain what this paper really does; most people at Microsoft at that time couldn't understand what was going on.

So now let's see: of course we tried to put together this deep generative model and the other deep generative model I told you about, which I had worked on for almost ten years at Microsoft; we had been working very hard on it. And we came to the conclusion that we would have to use a few kluges to fix the problems, because the two don't match, okay. Why they don't match is a whole other story; the main reason is actually not just the temporal difference, it's that the way you parameterize the models and the way they represent information are very different, despite the fact that they're both generative models.

It turned out that this model is very good for speech synthesis, and ?? has a very nice paper using it for synthesis, and it's very good for image generation as well; I can see that working very nicely. But for continuous speech recognition it is very hard to use. For synthesis it's good because you can take a segment with its whole context into account, like a syllable in Chinese; for English it is not that easy to do. But anyway, we needed a few kluges to merge these two models together, and that is sort of what it led to in the end.

So the first kluge: the temporal dependency is very hard to handle. If you have temporal dependency in the hidden variables you automatically get loops, and everybody in machine learning at that time knew this, though most speech people didn't. The learning I showed you earlier just didn't work out well, and most of the people who were well versed in machine learning said there was no way to learn that. So: cut the dependency. That's the way to do it, cut the dependency in the hidden dimension, in the hidden representation, and lose all that power of the deep generative model. That was Geoff Hinton's attitude: well, it doesn't matter, just use a big window. That kluge is one of the things that actually helped to solve the problem.

And the second kluge is that you can reverse the direction, because inference in the generative model is very hard to do, as I showed earlier. If you reverse the direction from top-down to bottom-up, then you don't have to solve that problem, and it becomes just a deep neural network, okay. Of course everybody said: we don't know how to train them; that was in 2009, and most people didn't know how to do it. Then Geoff said that's where the DBN can help, and he had done a fair amount of work on using the DBN to initialize that training.

So this was a very well-timed academic-industrial collaboration. First of all, the speech recognition industry had been searching for new solutions when the principled deep generative models could not deliver, okay; everybody was very upset about this at the time. At the same time academia had developed the deep learning tools: DBN, DNN, all the hybrid stuff that was going on. The CUDA library was also released around that time, and this was probably one of the earliest applications to catch on to GPU computing power. And then of course there was big training data in ASR, which had been around for a while; most people know that if you train a Gaussian mixture model HMM with a lot of data the performance saturates, right. And this is one of the things that in the end is really powerful: you can increase the size and the depth and put a lot of things in to make the model really powerful. That's the scalability advantage I showed you earlier, which is not the case for any shallow model.

Okay, so in 2009, three of my colleagues and I got together to organize this workshop, to show that this is a useful thing and to bring people together. It wasn't popular at all. I remember Geoff Hinton and I got together to decide who we should invite to give talks at the workshop.

I remember one invitee, who shall remain nameless here, said: give me one week to think about it; and in the end he said it was not worth his time to fly to Vancouver. That's one of them. The second invitee, I remember this clearly, said this was a crazy idea. In the e-mail he said that what we were proposing was not clear to him. So we explained: you know, the waveform may be useful for ASR. And the e-mail came back: oh, why? So we said: that's just like using pixels for image recognition, which was popular; convolutional networks, for example, take in pixels, and we take a similar approach, except with the waveform. And the answer was: no, no, no, that's not the same as pixels, it's more like using photons. He was essentially making a joke. This one didn't show up either.

But anyway, this workshop had a lot of brainstorming, including the error analysis I showed you earlier. It was a really good workshop; that was about four or five years ago, five years ago now.

So now I move to part two, to discuss the achievements. In my original slide deck I actually had a whole bunch of slides on vision. The message on vision is that if you go to the vision community, deep learning there is maybe thirty times more popular than deep learning in speech. The first time they got the results, no one believed it was the case. At the time I was giving a lecture at Microsoft about deep learning, actually together with Bishop, and right before it this deep learning result came out and Geoff Hinton sent me an e-mail: look at the margin, how much bigger it is! I showed people and they said: I don't believe it, maybe it's a special case. And it turned out to be just as good, even better than in speech. I actually cut all those slides out; maybe some other time I will show you. It's a big area, but today I am going to focus on speech.

One of the things we found during that time, a very interesting discovery, came from running both the model I showed you there and the deep neural network here, and analyzing the error patterns very carefully. TIMIT is very good for that: you can disable the language model, right, and then you can understand the acoustic errors very effectively. I tried to do the same afterwards on other tasks, and it's very hard: once you put the language model in, you just can't do this kind of analysis. So it was very good that we did this analysis at the time.

Now, the error patterns in the comparison: I don't have time to go through them except to mention this. The DNN made many new errors on short, undershot vowels; it sort of undoes what the other model is designed to do. We thought about why that would happen, and of course it's the very big window: if the sound is very short, its information is captured here, but your input is about eleven frames, maybe fifteen, so it captures a kind of noise coming from the neighbouring phones, and of course errors are made there. So we can understand why. And then we asked why the other model corrects those errors. It's just because you deliberately make the hidden representation reflect what the sound pattern looks like in the hidden space. That's nice for formants, which you can see; but if you use articulations, how do you see them? So sometimes we use formants to illustrate what's going on there.

Another important discovery at Microsoft was that using the spectrogram we produced much better auto-encoding results for speech analysis, and that was very surprising at the time. It really conforms to the basic deep learning theme that the raw, early-stage features are better than the processed features. So let me show you; this is actually a project we did together in 2009. We used a deep auto-encoder to do binary coding of the spectrogram. I don't have time to go through it; you can read up on auto-encoders, it's all in the literature.

The key is that you set the target to be the same as the input, and you use a small number of bits in the middle. You want to see whether that bottleneck can still capture all the information, and the way to evaluate it is to look at what kind of errors you get. The way we did it was to use a vector quantizer with 312 bits as the baseline, and the reconstructions look like this: this is the original, and this is the shallow model, right. Using the deep auto-encoder we get much closer to the original; we simply have much lower coding error using an identical number of bits. So it really shows that if you build a deep structure and extract features bottom-up, you condense more information for reconstructing the original signal.
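To make the setup concrete, here is a minimal structural sketch of such a bottleneck auto-encoder in plain numpy; the layer sizes are assumptions for illustration and the weights are random and untrained (the actual system was pretrained and fine-tuned, which is beyond this sketch):

```python
# Minimal structural sketch of a spectrogram auto-encoder with a binary
# bottleneck (assumed dimensions, random untrained weights).
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b, activation=True):
    """One fully connected layer; logistic nonlinearity unless disabled."""
    z = x @ w + b
    return 1.0 / (1.0 + np.exp(-z)) if activation else z

# Encoder: 256-bin spectrogram frame -> 312-bit code (matching the
# vector-quantizer baseline bit budget mentioned in the talk).
dims = [256, 512, 312]
enc = [(rng.normal(0, 0.1, (m, n)), np.zeros(n)) for m, n in zip(dims[:-1], dims[1:])]
dec = [(rng.normal(0, 0.1, (n, m)), np.zeros(m)) for m, n in zip(dims[:-1], dims[1:])][::-1]

x = rng.random(256)                        # one (fake) spectrogram frame
h = x
for w, b in enc:
    h = layer(h, w, b)
code = (h > 0.5).astype(float)             # binarized bottleneck: 312 bits

y = code
for i, (w, b) in enumerate(dec):
    y = layer(y, w, b, activation=(i < len(dec) - 1))  # linear output layer

print("reconstruction MSE:", np.mean((x - y) ** 2))
```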

And we found that for the spectrogram this result is the best. For MFCCs we still get some gain, but not nearly as much, which sort of indirectly convinced me of Geoff Hinton's original urging that everybody should move to the spectrogram. So maybe we should even have done the waveform; probably not, anyway.

Okay, so of course the next step: once we were all convinced by the error analysis that deep learning can correct a lot of errors, not all but some, and we understood why, you can see the power and the capacity it has. Based on this analysis, on average it does slightly better. But if you look at the error patterns you can really see that it has a lot of power and also some shortcomings. Both approaches have pros and cons, yet their errors are very different, and that gives you the hint that this is worthwhile to pursue.

Of course this was all very interesting evidence. Then, to scale up to industrial scale we had to do a lot of things, and many of my colleagues were working with me on this. First of all, we needed to extend the output layer from a small number of phones or phone states to a very large number of context-dependent states, and at the time that was actually motivated by how to preserve Microsoft's huge investment in speech decoder software.

I mean, if you don't do this and you use some other kind of output coding, you would have had to re-engineer the decoder to handle it, and few people fully believed this was going to work; anything that required changing the decoder, we just had to say: wait a little bit. At the same time we found that the context-dependent model gives much higher accuracy than the context-independent model for large tasks, okay.

For small tasks we didn't find it so much better; I think that's related to a capacity saturation problem if you have too many outputs. But since there is a lot of data in the training for large tasks, you are actually keen to have a very large output layer, and that turned out to have a double benefit: one, you increase the accuracy, and two, you don't have to change anything about the decoder, and industry loves that. You get both, and I think that is really why it took off.

Then we summarized what enabled this type of model. Industrial knowledge about how to construct the very large set of output units in the DNN is very important, and that essentially comes from everybody's work here on context-dependent modelling for the Gaussian mixture model, which has been around for twenty-some years. It also depends on industrial knowledge of how to make decoding with such a huge output highly efficient using our conventional HMM decoding technology, and of course on how to make things practical. The GPU is also a very important enabling factor: if GPUs hadn't come up and become popular at roughly that time, all these experiments would have taken months to do. Without that belief and without this fancy infrastructure, people might not have had the patience to wait for the results and push this forward.

Let me show you a very brief summary of the major results obtained in the early days. If we use three hours of training data, this is TIMIT for example, we get the number I showed you, a modest few percent of gain. If you increase the data by ten times, to thirty-some hours, you get around twenty percent relative error reduction. If you do more, for Switchboard, and this is the paper my colleagues published here, you get another ten times more data, two orders of magnitude in total, and the relative gain actually grows: ten percent, twenty percent, thirty percent. Of course, as you increase the size of the training data the baseline improves as well, but the relative gain gets even bigger.

When people look at this result, nobody in their right mind would say not to use it. And then of course a lot of companies went on to implement it; the DNN is fairly easy for everybody to implement, and I missed one of the points over there: it turned out that if you use a large amount of data, the original idea of using the DBN to regularize the model is not needed anymore, though in the beginning we didn't understand why.

But anyway, now let me come back to the main theme of the talk: how the generative model and the deep neural network may help each other. The first kluge was the big window; at that time we had to keep it, but at this conference we see people using LSTM recurrent neural networks, and that fixes this problem, so this one is fixed. Another kluge: at that time we thought we needed to use the DBN; now, with big data, there's no need anymore.

That's very well understood now; there are many ways to understand it. You can think of it from a regularization viewpoint, and yesterday at the table with students I mentioned that and they asked: what is regularization? So you may prefer to understand it from the optimization viewpoint: if you stare at the back-propagation formula for ten minutes you can figure out why. I actually have a slide on that; it's very easy to understand from many perspectives. With lots of data you really don't need the DBN, so that kluge got fixed automatically, kind of as a by-product of industrialization, because we tried lots of data. Now, this last kluge is not fixed yet, and that is actually the main topic I'm going to spend the next twenty minutes on.

Before I do that, let me summarize some of the major advances. My colleagues and I wrote this book, and in this chapter we grouped the major advancements of the deep neural network into several categories, so I'm going to go through them quickly. One category is optimization innovations. I think the most important advancement beyond the early success I showed you was the development of sequence discriminative training, which contributed an additional ten percent or so of error rate reduction; many groups have done this.

For us at Microsoft, this was our first intern coming to our lab to do it. We tried it on TIMIT; we didn't know all the subtleties about the importance of regularization, and although we got all the formulas right, everything right, the result wasn't very good. But Interspeech accepted our paper, and by then we understood this better, and later on more and more papers, actually a lot of papers, on this were published at Interspeech. That's very good.

Okay, the next theme is 'towards raw input', okay. What I showed you earlier was the speech coding and analysis part, where we know the spectrogram is good; we don't need MFCCs anymore. So it's bye-bye MFCC; it will probably disappear from our community slowly over the next few years. We also want to say bye to the Fourier transform, though I put a question mark here, partly because at this Interspeech, I think two days ago, Hermann Ney's group had a very nice paper on this, and I encourage everybody to take a look at it.

You just put the raw waveform in there, which was actually done about three years ago by Geoff Hinton's students; they truly believed in it. I couldn't: I had tried that around 2004, in the hidden Markov model era, and we understood all the problems with normalizing that kind of input, so I said it was crazy. Then when they published the result at ICASSP, I looked at it and the error rate was terrible; there was so much error that nobody paid attention. And this year attention has come back to this.

And the result is now almost as good as using the Fourier transform. So far we don't want to throw it away yet, but maybe next year people will. The nice thing, and I was very curious about this, is that to get that result they just initialized everything randomly, rather than using the Fourier transform to initialize it, and that's very intriguing.

There are too many references to list; I kept running out of room. And yesterday when I went to the adaptation session there were so many good papers that I can't keep the list up to date anymore, so do go back to those adaptation papers; there are a lot of new advancements. Another important topic is transfer learning, which plays a very important role in multi-lingual acoustic modelling; there was a tutorial on that, which Tanja actually gave at a workshop I was attending.

I'll also mention that for the generative models, the shallow models we had before, multilingual modelling of course improved things, but it almost never actually beat the baseline. Think about multi-lingual and cross-lingual settings, for example: deep learning actually beats the baseline there, and there's a whole bunch of papers in this area which I won't have time to go through here.

Another set of important innovations concerns nonlinearities and regularization. For regularization there is dropout; if you don't know about dropout, it's good to learn. It's a simple technique: essentially you just kill units at random during training, and you get a better result. In terms of the hidden units, the very popular unit now is the rectified linear unit, and there are many interesting theoretical analyses of why this is better than the sigmoid. At least in my experience, and I actually programmed this, going from one to the other made a huge difference: the learning really speeds up, and we now understand why. In terms of accuracy, different groups report different results: some groups report reduced error rates, and nobody has reported increased error rates so far. In any case it speeds up convergence dramatically.

Now I'm going to show you another architecture, which will link to a generative model. This is a model called the deep stacking network. Its very design is that of a deep neural network, okay; the information flows bottom-up. The difference between this model and the conventional deep neural network is that every single module takes the original input in again and then does some special processing; in particular you can alternate linear and nonlinear layers, and if you do that you can dramatically increase the speed of convergence in deep learning. There's also some theoretical analysis, which is in one of the books I wrote: you can convert part of this complex, non-convex problem into something related to convex optimization, so you can understand its properties much better. We did that a few years ago and wrote a paper on it.
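As a rough illustration of that alternating linear/nonlinear structure, here is a minimal numpy sketch of a stack of such modules; the sizes, the ridge penalty, and the random hidden weights are assumptions for illustration, but the closed-form fit of the linear output layer is the convex subproblem referred to above:

```python
# Minimal sketch of a deep-stacking-network-style module stack
# (assumed sizes; only the linear output weights are fit, in closed form).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))                 # 1000 frames, 40-dim features
T = np.eye(10)[rng.integers(0, 10, 1000)]       # one-hot targets, 10 classes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

module_input = X
for module in range(3):                         # three stacked modules
    W = rng.normal(0, 0.1, (module_input.shape[1], 200))   # random hidden layer
    H = sigmoid(module_input @ W)                           # nonlinear layer
    # Linear output layer: closed-form ridge regression (convex subproblem).
    U = np.linalg.solve(H.T @ H + 1e-3 * np.eye(200), H.T @ T)
    Y = H @ U                                               # module prediction
    # Next module sees the raw input concatenated with this module's output.
    module_input = np.concatenate([X, Y], axis=1)

print("final training accuracy:", np.mean(Y.argmax(1) == T.argmax(1)))
```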

This idea can also be used for a related network, which I don't have time to go through here. The reason I bring it up is that it's related to some recent work I have seen on generative models, where the two are essentially mirror images of each other. So let me compare the two of them, to give you an example of how the two kinds of networks can help each other.

When we developed this deep stacking network, the activation function had to be fixed: either logistic or ReLU, which both work reasonably well compared with each other. Now look at this architecture: it's almost identical. But the activation function is changed to something that looks very strange; I don't expect you to know anything about this, and this is actually work done by the Mitsubishi people. There's a very nice paper on it here in the technical program; I spent a lot of time talking to them and they even came to Microsoft, so I listened to some of their talks and their demo.

The activation function in this model, which is called the deep unfolding model, is derived from the inference method of a generative model; it is not fixed as in the networks I showed you earlier. To start with, the model looks like a deep neural network, right? But the starting point is their generative model, which is specific: I hope many of you know non-negative matrix factorization, a specific technique which is actually a shallow generative model. It makes a very simple assumption: that the observed noisy speech, or the mixed speech of two speakers, is the sum of two sources in the spectral domain. Once they make that assumption, they of course have to enforce that each vector is non-negative, because these are magnitude spectra, and the inference becomes an iterative technique. That model automatically embeds the domain knowledge about how the observation is obtained, through the mixing of the two sources. Then this work essentially said: take that iterative inference and treat every single iteration as a separate layer; after that, do back-propagation training.

The backward pass is possible because the problem is very simple: the application here is speech enhancement, so the objective function is a mean-square error, which is easy. The generative model gives you the generated observation, your target output is the clean speech, and then you do mean-square error training and adapt all these unfolded layers.
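For readers unfamiliar with it, here is a minimal numpy sketch of the idea of unfolding NMF multiplicative updates into layers; the sizes and bases are placeholders, and the discriminative back-propagation step described above is only indicated in the comments, not implemented:

```python
# Minimal sketch of "deep unfolding": each NMF multiplicative update is
# treated as one network layer (assumed sizes, untrained bases; the real
# system back-propagates a clean-speech MSE through these unfolded layers
# to refine per-layer copies of the bases).
import numpy as np

rng = np.random.default_rng(0)
F, T, K = 257, 100, 20                      # freq bins, frames, bases per source
W_speech = np.abs(rng.normal(size=(F, K)))
W_noise  = np.abs(rng.normal(size=(F, K)))
W = np.hstack([W_speech, W_noise])          # generative assumption: V ~ W @ H
V = np.abs(rng.normal(size=(F, T)))         # magnitude spectrogram of the mixture

H = np.abs(rng.normal(size=(2 * K, T)))     # non-negative activations
eps = 1e-8
for it in range(10):                        # each NMF iteration == one "layer"
    H *= (W.T @ V) / (W.T @ (W @ H) + eps)  # multiplicative update keeps H >= 0

# Wiener-style mask from the speech part of the reconstruction.
S_hat = W_speech @ H[:K]
N_hat = W_noise @ H[K:]
enhanced = V * S_hat / (S_hat + N_hat + eps)
print(enhanced.shape)
```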

And the results are very impressive. So this is why I showed it: you can design a deep neural network in which, if you use this type of activation function, you automatically build in the constraints that you used in the generative model. That's a very good example of the message I put at the beginning of the presentation, the hope for the deep generative model. Here the generative model is shallow, so it's easy to do; for a deep generative model it's very hard.

One of the reasons I put this in as a topic today is that just three months ago, at the ICML conference in Beijing, there was a very nice development in learning methods for deep generative models. They actually linked the neural network and the Bayes net together through a transformation. There is a whole bunch of papers, including from Michael Jordan and a lot of other very well known people in machine learning, on deep generative models.

The main point of this set of work, and I just want to use one simple sentence to summarize it, is this: when you originally tried to do the E-step I showed you earlier, you had to factorize the posterior in order to get the step done, and that was an approximation whose error was so large that it was practically useless for inferring the top-layer discrete events. The whole point now is that we can relax that factorization constraint. Before, say three years ago, if you kept the rigorous dependency you didn't get any reasonable analytical solution, so you could not do EM. The new idea is that you can approximate that dependency in the E-step not through factorization, which is called the mean-field approximation, but by using a deep neural network to approximate the posterior.
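In the notation this line of work commonly uses (my rendering, not the speaker's slide), the factorized mean-field posterior is replaced by a neural-network 'recognition' distribution \(q_\phi(\mathbf{h}\mid\mathbf{x})\), and learning maximizes the variational bound

\[
\log p_\theta(\mathbf{x}) \;\ge\; \mathbb{E}_{q_\phi(\mathbf{h}\mid\mathbf{x})}\big[\log p_\theta(\mathbf{x}\mid\mathbf{h})\big] \;-\; \mathrm{KL}\big(q_\phi(\mathbf{h}\mid\mathbf{x}) \,\|\, p_\theta(\mathbf{h})\big),
\]

jointly over the generative parameters \(\theta\) and the inference-network parameters \(\phi\), instead of using the mean-field factorization \(q(\mathbf{h})=\prod_i q_i(h_i)\).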

So this is an example showing that the deep neural network actually helps you solve the deep generative model's problem. This is the well-known Max Welling, a very good friend of mine in machine learning, who told me about this work. They really developed a theorem to prove that if the network is large enough, the approximation error can approach zero, so the looseness of the variational learning can be eliminated; that is a very powerful piece of machinery, and it gives me some evidence that this is a promising approach. The machine learning community develops the tools, and our speech community develops the verification and the methodology as well; if we actually cross-connect with each other we are going to make much more progress, and this type of development really points in a promising direction towards the main message I put out at the beginning.

Okay, so now I'm going to show you some further results. Another, better architecture that we now know about is the recurrent network; if you read Beaufays' LSTM paper, look at that result: for voice search the error rate dropped to about ten percent, a very impressive result.

Another type of architecture integrates convolutional and non-convolutional layers together; that was ?? in the previous result, and I'm not sure whether any better result has appeared since.

These are the state of the art for the Switchboard (SWBD) task. Now I'm going to concentrate on this type of recurrent network.

Okay, so this comes down to one of my main messages here. We fixed this kluge with the recurrent network. We also fixed that kluge automatically just by using big data. Now, how do we fix this last kluge?

First of all I'll show you some analysis of the recurrent network versus the deep generative model, the hidden dynamic model I showed you earlier, okay. So far this analysis hasn't been applied to the LSTM; some further analysis might actually show the LSTM emerging automatically from it. This analysis is very preliminary.

So this analysis is very preliminary

and so if you stare at the equotation

for recurrent network it looks like best one. So essentially you have state of the

art equotation

and it's recursive.

Okay,

from previous hidden layer to this.

And then you get the output

that produces the label.

Now if you look at this deep generative model - hidden dynamic model

identical equotation,

okay? Now what's the differece?

The difference is that the input is now the label. Of course a discrete label by itself cannot drive the dynamics, so you have to make some connection between the labels and the continuous variables, and that's what phoneticians call the interface between phonology and phonetics, okay. We use a very basic assumption: the interface is simply that each label corresponds to a target vector, and in fact we implemented it as a target distribution, so you can account for speaker differences, etcetera. The output of this recursion then gives you the observation, and that's a recurrent, filtering type of model; it's an engineering model, as opposed to the neural network model, okay. I used to teach this type of model, so we fully understood all the constraints for it.

Now, this model looks the same, right? If you reverse the direction you convert one model into the other.
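Written side by side in simplified notation (mine, not the slide's), the correspondence being described is roughly the following.

Recurrent network (bottom-up, discriminative):
\[
\mathbf{h}_t = \sigma\big(\mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{W}_{hh}\,\mathbf{h}_{t-1}\big),
\qquad
\mathbf{y}_t = g\big(\mathbf{W}_{hy}\,\mathbf{h}_t\big),
\]
with acoustic features \(\mathbf{x}_t\) as input and label posteriors \(\mathbf{y}_t\) as output.

Hidden dynamic model (top-down, generative):
\[
\mathbf{h}_t = \sigma\big(\mathbf{A}\,\mathbf{h}_{t-1} + \mathbf{B}\,\mathbf{t}(l_t)\big),
\qquad
\mathbf{x}_t = g\big(\mathbf{C}\,\mathbf{h}_t\big) + \boldsymbol{\epsilon}_t,
\]
with the label sequence \(l_t\), through its target vector \(\mathbf{t}(l_t)\), driving the hidden dynamics and the observations \(\mathbf{x}_t\) generated at the bottom. Constraints such as a sparse or critically damped \(\mathbf{A}\) can be imposed directly, which is the point made next.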

And in this model it's very easy to impose constraints. For example, the matrix here that governs the internal dynamics in the hidden domain can be made sparse, and you can put realistic constraints on it: in our earlier implementation we used critically damped dynamics, so you can guarantee the hidden trajectory doesn't oscillate within phone boundaries, which is how speech production behaves, and you encode that simply by fixing the sparse matrix. Actually, one of the slides I was going to show is all about this. In the recurrent network you cannot do that; there is just no way to say that you want the dynamics to behave in a certain way, you don't have any mechanism to design that structure. Here it is very natural: the physical properties design it for you. Now, because of

this correspondence, and because we can now do deep inference, once all this machine learning technology is fully developed we can very naturally bridge the two models together. It may turn out that if you do a more rigorous analysis, making the inference for this model fancier, a multiplicative kind of unit, as in the LSTM, would automatically emerge from this type of model; that is our hope, but it has not been shown yet.

Of course this is just a very high-level comparison between the two; there are a lot of detailed comparisons you can make in order to bridge them. My colleague Dong Yu and I wrote this book, which is coming out very soon, and in one of the chapters we lay out all these comparisons: interpretability, parametrization, methods of learning, nature of the representation, and all the other differences. It gives you a chance to really understand how the deep generative model, in terms of its dynamics, and the recurrent network, in terms of its recurrence, can be matched with each other; you can read about that there.

So I have three, maybe five more minutes, and I will go very quickly; every time I talk about this I run out of time. The key concept is called embedding. You can actually find this basic idea in the literature of the eighties and nineties; for example, in this special issue of Artificial Intelligence there are very nice papers. I had the chance to read them all, they are very insightful, and some of the chapters here are very good.

The idea is that each physical or linguistic entity, a word, a phrase, even a whole paragraph or article, can be embedded into a continuous-space vector; it can be a big vector, you know. Just to let you know, there was a whole special issue on this topic; that's how important a concept it is. The second important concept, which is much more advanced, is described in a few books over here; I really enjoyed reading some of them, and I have invited those people to come visit me, as we have a lot to discuss. You can actually embed even structure, nested symbolic structure, into a vector, in such a way that you can recover the structure completely through vector operations; the concept is called the tensor-product representation. If only I had three hours I could go through all of this, but for now I'm just going to elaborate on this for the next two minutes.

So this is the recurrent neural network model, and this is a very nice, fairly informative paper showing that embedding can be obtained as a byproduct of the recurrent neural network; that paper was published at Interspeech several years ago. Now let me talk very quickly about semantic embedding at MSR. The difference between this set of work and the previous work is that there everything was completely unsupervised; in a company, if you have supervision available you should grab it, right. So we took the initiative to exploit, in a very smart way, supervision signals that come at virtually no cost.

The idea here is that in this model each branch is essentially a deep neural network, and the different branches are linked together through the cosine distance, so relatedness can be measured between vectors in a vector space. Then we do MMI-style learning: if you have 'hot dog' on this side and your document is talking about fast food or something, even if there's no word in common you pick it up, because the supervision links them together; whereas if you have 'dog racing' here, it shares a word but the two end up very far apart from each other. And that all happens automatically.
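Here is a minimal numpy sketch of that two-branch setup; the sizes and the random, untrained weights are assumptions for illustration, the relatedness is the cosine similarity between the two semantic vectors, and training would maximize the softmax probability of the clicked document against sampled negatives (the cheap supervision mentioned above):

```python
# Minimal sketch of a two-branch deep semantic model (assumed sizes,
# random untrained weights).
import numpy as np

rng = np.random.default_rng(0)

def branch(x, weights):
    h = x
    for W in weights:
        h = np.tanh(h @ W)      # small feed-forward branch
    return h

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

d_in, d_sem = 500, 128                              # e.g. sparse text features
Wq = [rng.normal(0, 0.1, (d_in, 300)), rng.normal(0, 0.1, (300, d_sem))]
Wd = [rng.normal(0, 0.1, (d_in, 300)), rng.normal(0, 0.1, (300, d_sem))]

query = rng.random(d_in)
docs = rng.random((4, d_in))                        # 1 clicked + 3 negatives

q_vec = branch(query, Wq)
scores = np.array([cosine(q_vec, branch(d, Wd)) for d in docs])
p = np.exp(scores) / np.exp(scores).sum()           # softmax over candidates
print("P(clicked doc | query):", p[0])              # training maximizes this
```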

Some people told me that a topic model can do similar things, so we compared against the topic model, and it turned out that this deep semantic model does much, much better. Now, multi-modal; just one more slide. It turns out that not only text can be embedded: images can be embedded, speech can be embedded, and you can do something very similar to what I showed you earlier. This is the paper from yesterday's talk about embedding; that's very nice, it's a very similar concept. I looked at it and said: wow, it's just like the model we did for text, though it turns out the application is very different. I don't have time to go through it here; I encourage you to read some of the papers listed here. Let's skip this.

This was just to show you some applications of this semantic model; you can do all sorts of things with it. We applied it to web search quite nicely. For machine translation you take one entity to be one language and the other entity to be the other language; in the list of published papers you can find the details. You can also do summarization and entity ranking. Let's skip this. This is the final slide, the real final slide; I don't have separate summary slides, this is my summary slide. I have copied the main message here, which I can now elaborate a bit more after going through the whole hour of presentation.

In terms of applications, we have seen speech recognition: the green is the neural network side, the red is the deep generative model side. I said a few words about the deep generative model and the dynamic model on the generative side, and the LSTM on the other side. For speech enhancement I showed you these types of models, and on the generative side I showed you this one, a shallow generative model that can give rise to a deep structure corresponding to the deep stacking network I showed you earlier. Now, for algorithms, we have back-propagation here.

That's the single, unchallenged algorithm for the deep neural network. For the deep generative model there are two algorithms, both called BP: one is belief propagation, for those of you who know machine learning, and the other is BP, the same back-propagation as over here, which only came up within the last two years, due to this new advance of porting the deep neural network into the inference step of this type of model. So I call them BP and BP. In neuroscience terms you would call this one 'wake' and the other one 'sleep': in sleep you generate things, you get hallucinations, and when you're awake you have perception, taking information in. I think that's all I want to say. Thank you very much.

Okay. Any one or two quick questions?

Very interesting talk. I don't want to talk about your main point, which is very interesting, but just very briefly about one of your side messages, the one about waveforms. You know, in the ?? paper they weren't really putting in raw waveforms: they put in the waveform, take the absolute value, floor it, take the logarithm, average it; you had to do a lot of things.

Secondly, about the other papers: there's been a modest amount of work in the last few years on doing this sort of thing, and pretty generally people do it with matched training and test conditions. If you have mismatched conditions, good luck with the waveform. I always hate to say something is impossible, but good luck.

Thank you very much; indeed, nothing is good for everything. And thank you for the kind words about the presentation.

Any other quick questions? If not, I invite Haizhou to present a plaque.