Good morning, everybody.

I'm very happy to see you all this morning.

Professor Li Deng will give the keynote this morning.

It's not so easy to introduce him, because he is very well known in the community.

He is a Fellow of several societies: ISCA, the IEEE, and the Acoustical Society of America.

He has published several hundred papers over the years and given many talks.

Li Deng did his PhD at the University of Wisconsin and started his career at the University of Waterloo.

He will talk to us today about two topics that are very important to all of us.

One is how to move beyond the GMM, which is not so bad, because I started my career with GMMs and I need some new ideas to do something else.

The second topic will deal with the dynamics of speech; we all know that dynamics are very important.

I will not take any more of his talk time; I prefer to listen to him.

Thank you, Li.

Thank you; thanks to the organizers and to Haizhou for inviting me to come here and give this talk.

This is the first time I've attended Odyssey, but I've read a lot about what this community has been doing.

As Jean said in the introduction, not only speech recognition but also speaker recognition rests on a few fundamental tools, and they have two in common: the GMM and the MFCC.

Over the last year, I've learned a lot of other things from this community.

The main message of this talk is that both of these components can potentially be replaced, with much better results.

I will touch a little bit on MFCCs; I don't like MFCCs, and I think Hynek hates MFCCs too.

Only recently, since we started doing deep learning, has there been evidence that all of these components can be replaced, certainly in speech recognition; people have seen that it is coming.

Hopefully, after this talk, you will think about whether these components can be replaced in speaker recognition as well, to get better performance.

The outline has three parts.

In the first part, I will give a quick tutorial. I have accumulated several hours of tutorial material over the last few months, so it is a little challenging to compress it down to this short tutorial.

Rather than going through all the technical details, I've decided to just tell the story.

I also noticed that in the session after this talk there are a few papers related to this: restricted Boltzmann machines, deep belief networks, and deep neural networks in connection with HMMs.

By the end of this talk, you may be convinced that these components can be replaced as well; we can consider that for the future, with much better speech recognition performance than we have now.

I will also cover the deep convex network, or deep stacking network.

Over the last 20 years, people have been working on segment models and hidden dynamic models, and 12 years ago I even had a project at Johns Hopkins University working on this.

The results were not very promising.

Now we are beginning to understand why the great ideas we proposed there did not work well at that time.

It is only after doing this deep learning work that we realized how we can put them together, and that is the final part.

Now, the first part.

How many people here have attended one of my tutorials over the last year?

OK, it's a small number of people.

This you have to know: deep learning, sometimes called hierarchical learning in the literature, essentially refers to a class of machine learning techniques largely developed since 2006.

This is the key paper, the one that introduced a fast learning algorithm for the deep belief network.

In the beginning, this was mainly applied to image recognition, information retrieval, and other applications, and Microsoft was actually the first to collaborate with University of Toronto researchers to bring it to speech recognition.

We showed very quickly that it does very well not only for small vocabulary; for large vocabulary it does even better.

This really happened: in the past, methods that worked well on small recognition tasks would sometimes fail on larger ones, but here, the bigger the task, the better the success. I will try to analyze for you why that happens.

We will hear about Boltzmann machines in the following talks; I think Patrick has two papers on that. There is also the restricted Boltzmann machine.

This is a little bit confusing: if you read the literature, the deep neural network and the deep belief network, which are defined here, are totally different concepts; one is a component of the other. For the sake of convenience, authors often get confused and call the deep neural network a DBN.

And DBN is also used for the dynamic Bayesian network, which is even more confusing.

For people who attended my tutorials, I gave a quiz on this, so they know all of it.

Last week we had a paper accepted for publication, one I wrote together with Geoffrey Hinton and, in all, ten authors working in this area.

We tried to clarify all this, so that we have unified terminology; when you read the literature, you know how to map one term onto another.

There is also the deep auto-encoder, which I don't have time to cover here, and I will say something about new developments, which to me are more interesting because they address limitations of some of the others.

This is a hot topic; here I list all the recent workshops and special issues.

At Interspeech 2012 you will see tens of papers in this area, most of them in speech recognition; in fact, one area was formed with two full sessions on this topic, just for recognition, and there are more elsewhere, plus a special issue.

The PAMI special issue is mainly about the machine learning aspects and computer vision applications; I tried to put a few speech papers there as well.

There is also a DARPA program from 2009; I think it stopped last year.

And I think in December there is another workshop related to this topic. It is very popular.

I think that is because people see the good results coming, and one message of this talk is to convince you that this is a good technology, so that you will seriously consider adopting some of its essence.

Let me tell some stories about this.

This was the first time deep learning showed promise in speech recognition, and activity has grown rapidly since then; that was around two and a half or three and a half years ago, at NIPS. NIPS is a machine learning meeting held every year.

About a year before that, I talked with Geoffrey Hinton, a professor at Toronto. He showed me the Science paper; he actually had a poster there.

The paper was well written, and the results were really promising in terms of information retrieval, document retrieval.

I looked at it, and after that we started talking: maybe we could work on speech. He had worked on speech a long time ago.

So we decided to organize this workshop; we had actually worked together before. My colleague Dong Yu, myself, and Geoffrey got a proposal accepted, presenting all the deep learning and the preliminary work.

At that time most people worked on TIMIT, a small experiment, and it turned out that this workshop generated a lot of excitement.

So we gave a tutorial, 90 minutes, about 45 minutes each: I talked about speech and Geoffrey talked about deep learning, to get people interested in this.

The custom at NIPS is as follows: at the end of the final workshop day, each organizer presents a summary of their workshop.

The instruction is that it should be a short presentation, it should be funny, and it should not be too serious; every organizer is instructed to prepare a few slides summarizing the workshop in a way that conveys its impression to the people attending.

This is the slide we prepared: a speechless summary presentation of the workshop on speech.

We didn't really want to talk too much; we just went up there and showed that slide. No speech, just animations.

So it says: we met this year. These are supposed to be the industry people, and these are supposed to be the academic people, so they are smart and deeper.

The industry people ask: can you understand human speech? And the academics say: we can recognize phonemes.

And they say: that's a nice first step, and what else do you want? And they say they want to recognize speech in noisy environments, and then he says: maybe we can work together.

So we have all the concepts in there; that was the whole presentation.

We decided to do small vocabulary first, and then quickly, I think in December of 2010, we moved to very large vocabulary.

To our surprise, the bigger the vocabulary, the better the success; very unusual.

I analyzed the errors in detail; you know, we had been working on this for twenty-some years before.

One thing that surprised me, and convinced me to work in this area myself, was that every error pattern I saw from this recognizer was very different from the HMM's.

It is absolutely better, and the errors are very different; that told me it was worth doing.

Anyway, let me talk about the DBN.

One concept is the deep belief network, which Hinton published in 2006; there are two papers.

They have nothing to do with speech; it's called the deep belief network, and it's pretty hard to read if you have not been in the field for a while.

And this is the other DBN, the dynamic Bayesian network.

A few months ago, Geoffrey sent me an email saying: look at this acronym, DBN versus DBN. He suggested that before you give any talk, you check which one is meant; in speech recognition, people mostly mean the dynamic Bayesian network.

Anyway, let me give a little bit of technical content; time is running out quickly.

The first concept is the restricted Boltzmann machine. Actually, I have 20 slides on this, so I will just take one of them.

Think of this as the visible layer. It can include the label: the label can be one of the visible units when we do discriminative learning, and the rest is the observation.

Think of MFCCs as the observation, and of a label: a senone or some other speech label. We put them together as the visible layer, and we have a hidden layer here.

The difference between the Boltzmann machine and the neural network is that the standard neural network is one-directional, from the bottom up, while the Boltzmann machine is bidirectional: you can go up and down. In the restricted version, the connections between units within the same layer are cut off; if you don't do that, it is very hard to learn.

So in deep learning, one starts with the restricted Boltzmann machine.

With bidirectional connections, if you work through the math in detail and write down the energy function, you can write down the conditional probabilities of the hidden units given the visible ones, and the other way around.

If you choose the energy function right, the conditional probability of the visible units given the hidden ones is Gaussian, which is something people like; conditioning this way, you can interpret the whole thing as a Gaussian mixture model.

So you may think it is just a Gaussian mixture model and the two are interchangeable. The difference is that here you get an almost exponentially large number of mixture components rather than a finite one.

In speaker recognition, it's about 400 or 1000 mixture components, whatever; here, if you have 100 hidden units, you get an almost unlimited number of components, but their parameters are tied together.

Geoffrey has done very detailed mathematics showing that this is a very powerful way of building a Gaussian model: you actually get a product of experts rather than a mixture of experts.

To me, that is one of the key insights we got from him.
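
To make this concrete, here is a minimal sketch of a Gaussian-Bernoulli RBM trained with one-step contrastive divergence (CD-1, the shortcut mentioned later in the talk). The unit-variance Gaussian visible units and the single reconstruction step are simplifying assumptions of mine, not the exact setup used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianBernoulliRBM:
    """Gaussian visible units (e.g. an MFCC frame), binary hidden units."""

    def __init__(self, n_visible, n_hidden, lr=1e-3):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        # p(h_j = 1 | v) is a logistic function of the input from v
        return sigmoid(v @ self.W + self.b_h)

    def visible_mean(self, h):
        # p(v | h) is Gaussian (unit variance here); its mean is linear in h
        return h @ self.W.T + self.b_v

    def cd1_update(self, v0):
        """One contrastive-divergence step (CD-1) on a minibatch v0."""
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_mean(h0_sample)      # single reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # positive minus negative statistics, averaged over the batch
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
```

With H binary hidden units, summing them out gives a mixture of 2^H tied Gaussians, which is where the "exponentially many components" claim comes from.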

That is the RBM. Think of this as the RBM: the visible layer is the observation, we add the hidden layer, and putting them together we have it.

It is very hard to do speech recognition with it directly. It is a generative model; you can do speech recognition with it, but if you do, the result is not very good: tackling a discrimination task with a generative model is limiting, because you don't directly optimize for what you want.

However, you can use it as a building block to build the DBN, the deep belief network.

The way it is done in Toronto is to treat the RBM as a building block. You can learn it; I will skip the learning itself, since it would take a whole hour to cover, but assume you know how to do it. After you learn it, you can treat the hidden layer as a feature extractor for what you put in here.

Then you stack up. Deep learning researchers argue that each layer's output becomes the feature for the next, and you can go further; I think of it as a brain-like architecture. Think of the visual cortex, with its six layers.

You build up: whatever is learned at one level becomes the hidden feature. Hopefully, if you learn it right, you extract the important information from the data you have, and then you compute features of features, stacking up.
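
A sketch of the stacking just described, reusing the GaussianBernoulliRBM class from the earlier snippet: each RBM is trained on the hidden activations of the one below. In the standard recipe the upper layers would be binary-binary RBMs; reusing the Gaussian-visible class everywhere is a simplification of mine.

```python
def train_dbn_stack(data, layer_sizes, epochs=10, batch=128):
    """Greedy layer-wise pre-training: features of features."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = GaussianBernoulliRBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            for i in range(0, len(x), batch):
                rbm.cd1_update(x[i:i + batch])
        rbms.append(rbm)
        # hidden activations become the training data for the next layer
        x = rbm.hidden_probs(x)
    return rbms
```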

Why does stacking work? There is an interesting theoretical result showing that if you unroll a single layer of RBM as a belief network, it is equivalent to one of infinite depth.

This is related to the learning: learning actually goes up and down, and every time you go up and down, it is as if you added one more layer. The restriction is that all the weights have to be tied, which is not very powerful.

But we can untie the weights by learning each layer separately, and when we do, we get a very powerful model.

Anyway, the reason this one goes down while that one goes up and down... I don't have time to go into it here, but believe me.

If you stack one more layer on top, you can show mathematically that this is equivalent to having just one RBM at the top and then a belief network going down.

This is actually called a Bayesian network; a belief network is essentially the same thing as a Bayesian network.

But if you look at this, it is very difficult to learn. For the downward connections there is something in machine learning called the explaining-away effect, so inference becomes very hard, while generation is easy.

The next invention in this whole theory was simply to reverse the order, and you can turn it into a neural network. It turns out there is no theory saying this should work well, but in practice it works really well.

Actually, I am looking into some theory for this.

So this is the full picture of the DBN: it consists of bidirectional connections at the top and then connections in a single downward direction below.

You can use this as a generative model and do recognition with it; unfortunately, the result is not good.

There were a lot of steps before people reached the current state, and I am going to show you all the steps here.

Number one: the RBM is useful; it gives you feature extraction. You stack RBMs a few layers up and you get a DBN, and then at the end you need to do some discriminative learning.

So, let's see; generally, the generative capacity is just very good.

The first time I saw the generative capability, from Geoffrey, I was amazed. This is the example he gave me.

You train on these digits; the database is called MNIST, an image database everybody uses, like our TIDIGITS in speech.

You put them in and learn according to this standard technique.

Now, to synthesize the digit one, you clamp the label unit for one, set all the others to zero, and run; you actually get something really nice. The same if you clamp zero.

This is different from the traditional generative process, because it is a stochastic process rather than memorization: some of the generated digits are corrupted, but most of the time you get realistic ones.

In one of the tutorials where I showed this result, there were speech synthesis people in the audience. They said: that is great, I will do speech synthesis this way now.

With traditional synthesis you get one fixed output for a sentence, not like a human; when humans produce speech, it comes out differently every time. They immediately went back to write a draft proposal and asked me to help them.

This is very good: with the stochastic component there, the result looks like what humans do.

Now, we want to use it for recognition; this is the architecture. I was amazed; I had a lot of discussion with Patrick yesterday about this.

With this generative model, you put the image here, it moves up, and this becomes the feature.

To recognize, you turn on the label units one by one, run for a long time until convergence, and look at the probability for each label; then you turn on the other units, run again, and see which label scores highest.

I suggest you don't do that; it wastes your time. Number one, it takes a long time to do recognition; number two, we don't know how to generalize it to sequences.

And Geoffrey said the result is not very good, so we did not do it. We abandoned the idea of doing everything generatively.

And that's how the deep neural network was born.

All you do is treat all the connections as going one way, bottom up. That is why, at the end, my conclusion is that the theory of deep learning is very weak.

Ideally the DBN points down; it is a generative model. In practice, you say that is not good, so you just forget about it, eliminate the downward direction, and make all the weights point up.

The easiest modification is just that: forget the generative story and make everything go up. People didn't like it at first; I thought it was horrible, crazy to do, because it just breaks the theory behind the DBN.

In the end, what we do is really the same as what the multilayer perceptron has always done, except that it has very many layers.

Now, if you do that in the typical way, randomly initializing all the weights, you run into the standard argument from twenty-some years ago: the mathematics shows that the deeper you go, the smaller the gradients become at the lower levels, because the label sits at the top level.

You do back-propagation, taking the derivative of the error from the top down, and the gradient becomes very small; you know the sigmoid derivative, sigmoid times (1 minus sigmoid).

So the lower you go, the more likely the gradient term vanishes.
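
In symbols, this is the standard vanishing-gradient argument (my notation, not the slide's):

```latex
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4},
\qquad
\frac{\partial E}{\partial z_1}
  = \frac{\partial E}{\partial z_L}\,\prod_{l=1}^{L-1} w_{l+1}\,\sigma'(z_l)
```

With L layers, the error signal is damped by up to (1/4)^(L-1) on its way down, so the lowest layers barely move under plain back-propagation from random initialization.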

People would not even run back-propagation on deep networks; it seemed impossible to learn, so they gave up.

Now, one of the very interesting ideas that came out of deep learning is this: rather than using random numbers, it can be interesting to plug in the DBN weights, because random initialization is the thing that does not work.

Look at the argument for why it is good. What we do is train this DBN over here; for the DBN weights, you just use generative training.

Once you have trained it, you fix these weights and copy all of them into the deep neural network to initialize it; after that you do back-propagation.

Again, the gradients down here are very small, but that's OK: you already have the DBN weights there. Strictly, each layer came from an RBM, so it should be called RBM rather than DBN, but it is not too bad.

So you see exactly how to train this: using random initialization is not good, but if you use the DBN's weights over here, it is not bad, and then over here you modify them.

You just run recognition on MNIST, and the error goes down to 1.2%. That was all Geoffrey Hinton's idea, and he published a paper about it; at the time, it seemed very good.

But I am going to tell you: that MNIST result was 1.2% error, and with a few more generations of networks, as I will show you, we were able to get 0.7%.

The same philosophy carries over to speech recognition.

I will go quickly. In speech, all of you think about how to do sequence modeling; here it is very simple.

Now we have the deep neural network. We normalize its outputs using a softmax, to make them, similar to the talk yesterday, a kind of calibration; we get posterior probabilities, divide them by the priors to get generative (scaled likelihood) scores, and just use the HMM on top.

That is why it is called the DNN-HMM.
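
A minimal sketch of that conversion, assuming `logits` holds the DNN's pre-softmax outputs for one frame and `state_priors` the senone priors counted from the training alignment (both names are mine):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_log_likelihoods(logits, state_priors):
    """Turn DNN posteriors into HMM emission scores.

    By Bayes' rule, p(x | s) is proportional to p(s | x) / p(s), and the
    HMM decoder only needs the emission score up to a constant in x.
    """
    log_post = np.log(softmax(logits) + 1e-10)        # log p(s | x)
    return log_post - np.log(state_priors)            # log p(x | s) + const
```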

The first experiment we did was on TIMIT, with just phonemes, which is easy: each phone has one of three states. We got a very good result, which I can show you.

Then you move to large vocabulary. One thing we do in our company, at Microsoft, is use what are called senones: rather than a whole phone, we cut it into context-dependent units. That is our infrastructure, so we don't change any of it.

Rather than 40 phones, what happens if we use 9000 senones as the output? A long time ago people could not do that; 9000 outputs here seemed crazy. With thousands of units per layer, you quickly have something like 15 million weights, which is very hard to train.

Now we have very big machines: GPU machines, parallel computing.

So we can make this output layer very large, and the input very large as well; we use a big window.

Big output, big input, very deep: there are three components.

Why a big input, a long window? That could not be done in the HMM. Do you know why?

I had a discussion with some experts; it could not be done for speaker recognition with the UBM either. For speech recognition, the reason is that you have to use diagonal covariances in the HMM's Gaussians; if you make the input too big, estimating the Gaussian covariance matrix runs into a sparseness problem.

In the end, we make it as simple as possible: we just take the whole long window, feed the whole thing in, and we get millions of parameters.

Typically the hidden layer size is around 2000. With 2000 units in each layer, each pair of adjacent layers has 2000 x 2000 = 4 million weights, then another 4 million, another 4 million, and you just use a GPU to train the model as a whole.

It is not too bad. We use a window of about 11 frames; nowadays it is even extended to 30 frames, something we never imagined doing in the HMM world.

We don't even normalize this; we just use the raw values. In the beginning, I still used MFCCs with deltas and delta-deltas, multiplied by 11 or 15 frames, whatever.

So we have a big input, which is still small compared with the hidden layer size, and we train the whole thing, and everything works really well.
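
A minimal sketch of that long-window input: each frame is concatenated with its neighbors before going into the network. The edge padding at utterance boundaries is my assumption.

```python
import numpy as np

def stack_context(feats, left=5, right=5):
    """Concatenate each frame with its neighbors (5 + 1 + 5 = 11 frames).

    feats: (T, D) array, e.g. T frames of 39-dim MFCC + delta + delta-delta.
    Returns a (T, (left + 1 + right) * D) array fed straight into the DNN.
    """
    T = feats.shape[0]
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + T] for i in range(left + 1 + right)], axis=1)

# 39-dim features with an 11-frame window give a 429-dim input vector
x = stack_context(np.random.randn(100, 39))
assert x.shape == (100, 429)
```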

And we don't need to worry about correlation modeling, because the correlation is automatically captured by the weights.

The reason I bring this up is just to show you that this is not a minor point: we went through the history and the literature, and we never saw anyone feed speech in as such a long window until this first appeared.

Now let me give you a picture here: the GMM everybody knows, the GMM-HMM. The whole point is to show you that, if you look at the HMM in the same architectural terms, the GMM is very shallow: all you do is, for each state, output one score from the GMM.

Over here, you see many layers, building features up layer by layer; this shows deep versus shallow.

Here is the result. We wrote the paper together; it will appear in November, and it summarizes the research of four groups over the last three years, since 2009: the University of Toronto, Google, IBM, and Microsoft Research, which was the first to do serious work on this for speech recognition.

On Google data and IBM data, they all confirm the same kind of effectiveness.

Here is the TIMIT result. It is very nice; everybody thinks TIMIT is very small, but if you don't start with it, you get scared away.

I will come back to this in the second part of the talk: the monophone hidden trajectory model I built many years ago. Getting this number took two years; I wrote the training algorithm, and my colleagues very kindly wrote the decoder for me. It is a very good number for TIMIT, and the decoding is very hard.

The first time we tried this DBN, the deep neural network, I wrote a paper with ... where we do MMI training; you can back-propagate through the MMI objective for the whole sequence.

We got 22%, which is almost 3% better.

Then we looked at the errors, and between this system and that one they are very different: for very short segments it is not really good, but on the long side it is much better. I had never seen that before.

Compare this with the HMM work that has been done over twenty-some years: the error started around 27%, about 4% higher, and over 10 or 15 years it dropped about 3%.

These two systems are very similar in terms of error rate, yet the errors themselves are very different.

The first large experiment was voice search. At that time, voice search was a very important task, and now voice search is everywhere: Siri has it, Windows Phone has it, even Android phones; a very important topic.

We had the data, and we worked on this very large vocabulary task in the summer of 2010.

We were the first in our group to push on it, because it is so different from TIMIT.

We actually didn't change the hyperparameters at all: all the parameters, the learning rates, came straight from our previous TIMIT work.

And we got down to here; that is the paper we wrote, which just appeared this year.

This is the result that we got.

If you want to see exactly how this is done, most of what this paper provides are recipes telling you how to train the system. But you need a GPU implementation: without a GPU, the experiments take three months for large vocabulary, while with a GPU it is really quick.

Most of it is standard: you do this, then this. We tried to provide as much theory as possible, so if you want to apply this somewhere, take a look at the paper.

This shows, for the first time, the effect of increasing the depth of the DNN for large vocabulary.

Our system's accuracy goes up like this; the baseline, an HMM with discriminative MPE training, is around 65. This here is just the neural network: a single hidden layer already does better than all of that, and as you increase the depth, it keeps improving.

Beyond that point you see some overfitting; the data was not plentiful, as we had only 24 hours of labeled data at the time. So we said: let's do more, and we tried 48 hours, and the error drops a lot.

So the more data you have, the better you get.

Some of my colleagues asked why we didn't use Switchboard; I said it was too big for me, and we didn't do it at first.

Then we actually did Switchboard, and we got a huge gain, even more gain than I showed you here, just because of more data.

Voice search is a fairly typical problem but not really spontaneous speech; Switchboard is spontaneous, so this works for spontaneous speech as well.

With limited data, it seems, we climb quite steeply here, and then you get one or two orders of magnitude more data, you have many more GPUs to run on and much better software, and everything runs well.

It turns out the same kind of recipes apply; we published them over here.

Let me show you some of the results.

This is the table in our recent paper with the Toronto group. The standard GMM-based HMM, with 300 hours of data, has an error rate of about 23-some percent.

We very carefully tuned the hyperparameters, including the number of layers, and we got from here to here; that actually attracted a lot of attention.

Then we got 2000 hours, and the result from that is even better; at the time, that was the Microsoft result.

Then a recent paper published the following.

Of course, when you do that, people argue that you have 29 million parameters; people in the speech community always nitpick: obviously, with more parameters, of course you are going to win.

So what if you use the same number of parameters? We said: fine, we'll do that.

We used sparseness to cut away the small weights, so that the number of non-zero parameters is 15 million.

With this smaller number of parameters, we get an even better result.
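
The talk only says the small weights were cut; global magnitude thresholding is one common way to do that, sketched here under that assumption.

```python
import numpy as np

def prune_by_magnitude(W, keep_fraction):
    """Zero out the smallest-magnitude weights, keeping `keep_fraction`."""
    flat = np.abs(W).ravel()
    k = int(len(flat) * (1.0 - keep_fraction))
    if k == 0:
        return W
    threshold = np.partition(flat, k)[k]    # k-th smallest magnitude
    return np.where(np.abs(W) >= threshold, W, 0.0)

# keeping the fraction 15/29 mirrors the 29M -> 15M non-zeros above
W_sparse = prune_by_magnitude(np.random.randn(2000, 2000), 15 / 29)
```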

It's amazing; the capacity of the deep network is just tremendous. You cut away half the parameters, and typically you would expect the result to be similar, right? Getting rid of the low-magnitude weights actually gives a slight gain, though that may not carry over once we get more data anyway.

So this is maybe within the statistical variation, but still:

with fewer parameters than the GMM-HMM trained with discriminative training, we get about a 30-something percent error reduction, more than on TIMIT and also more than on voice search.

Then another paper: IBM came along, and then Google came along; they saw the better results and wanted to do it as well.

You can see this is Google's result, on about 5000 hours; amazing, right? They just have better infrastructure, with all the distributed mapping, so they managed to do it on 5000 or 6000 hours, and this number just came out.

Actually, that number will be in the Interspeech papers, if you go to look at them.

One thing Google does is not report the baseline result here; they just give a number, so you have to ask what number they have.

With more of their own data they have a number, but with the same amount of data they don't have a number; either way, they just didn't bother to run it. They all believe more data is better.

So with a lot more data they got this, and we, with about this much data, got around 12%; it is better when you have more data. They should put a number here; anyway.

We're not nitpicking on this.

This is the number I showed, Microsoft's result, going from here to here, and from here to here, on two different test sets.

All of these numbers, from all the groups, are in the review paper, you should know; it is very important.

Now, this is the IBM result... ah, sorry, this is the voice search result I showed you earlier: about 20%; not bad, given that there are only 20-some hours of data.

It turns out the more data you have, the more error reduction you get; for TIMIT, we got only about 3-4% absolute, about ten-something percent relative.

And this broadcast news result is from IBM; I heard that at Interspeech they have a much better result than this, so if you're interested, look at it.

My understanding, from what I heard, is that their new result is comparable to this; some people say even better.

If you want to know exactly what IBM is doing: they have even better infrastructure in terms of distributed learning, compared with most other places.

Anyway, this kind of error reduction has been unheard of in the history of this area, about 25 years; the first time we got these results, we were just stunned.

This is also Google's result, on YouTube speech, which is much more difficult: spontaneous, with all the noise. They also managed to get gains here.

This time they were pretty honest and put the baseline here with the same amount of data; with 14 hours they got more.

In our case, with 2000 hours, we actually got more gain, rapid gain.

So the more data you have, the better. Of course, to get this, you have to tune the depth: the more data you have, the deeper you can go, the bigger you may want the layers to be, and the more gain you get.

And this is the point I want to make: you get all of this without having to change major things in the system architecture.

OK, one thing that my colleague Dong Yu and I found recently is this.

In most of the earlier work, in the old days at IBM and Google and in our own early work, we used the DBN to initialize the model off-line.

We asked: can we get rid of that? That training is very tricky; not many people know how to do it. There are particular recipes, you have to watch the learning patterns, and it's not an obvious thing, because of the learning procedure.

There is a keyword in this learning called contrastive divergence; you might hear that term in the later part of today's talks.

In theory, contrastive divergence essentially says you should iterate: you should run Monte Carlo simulation, Gibbs sampling, for infinitely many steps. In practice that takes too long, so you cut it to one step, and to justify that you have to appeal to variational arguments, a variational bound.

It's a bit tricky; that's why it's better to get rid of it.

Our colleagues actually filed a patent on this just a few months ago, and there is also a paper from my colleague that uses discriminative pre-training for the Switchboard task.

They show that you can do comparably to RBM learning. So I would say now that, for large vocabulary, we don't even have to learn much about the DBN.

The theory so far is not clear on exactly what kind of power the generative pre-training gives you. My sense is that if you have a lot of unlabeled data, in the future it might help, but we also did some preliminary experiments suggesting that may not be the case anymore.

So it's not clear how to settle that; I think at this point we really need better theory, and until we have it, all these issues cannot be settled.

The idea of discriminative pre-training is that you just train a standard multilayer perceptron; a shallow one is easy to train, though the result is not very good.

Then each time you fix what you have, add a new layer, and train again, keeping the lower layers from the previous, shallower network.

That's the spirit; it's very similar to layer-by-layer learning. But now, every time we add a new layer, we inject discriminative label information, and that's very important: if you do that, nothing goes wrong.

Whereas if you just use random numbers all the way up, nothing is going to work.

Well, there are some exceptions, but I'm not going to say much about them.

Once you do this layer by layer, the spirit is still similar to the DBN, right, layer by layer, but you inject discriminative learning; I believe it's a very natural thing to do.
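
Here is a rough sketch of that greedy recipe: train a shallow net with labels, freeze it, add a layer, and retrain the new layer with a fresh softmax. The ReLU units and the frozen lower layers are simplifications of mine; in practice the whole stack is fine-tuned with back-propagation at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def add_layer(H_in, y, n_hidden, n_cls, lr=0.1, steps=200):
    """Train one new hidden layer plus a fresh softmax, using the labels."""
    Wh = 0.01 * rng.standard_normal((H_in.shape[1], n_hidden))
    Wo = 0.01 * rng.standard_normal((n_hidden, n_cls))
    n = len(y)
    for _ in range(steps):
        H = relu(H_in @ Wh)
        p = softmax(H @ Wo)
        p[np.arange(n), y] -= 1.0      # cross-entropy gradient w.r.t. logits
        dH = (p @ Wo.T) * (H > 0)      # back-prop through the ReLU
        Wo -= lr * H.T @ p / n
        Wh -= lr * H_in.T @ dH / n
    return Wh

def discriminative_pretrain(X, y, layer_sizes, n_cls):
    """Greedily grow the net; label information is injected at every stage."""
    weights, H = [], X
    for n_hidden in layer_sizes:
        Wh = add_layer(H, y, n_hidden, n_cls)
        weights.append(Wh)
        H = relu(H @ Wh)               # frozen features for the next stage
    return weights                     # then fine-tune the whole stack
```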

We talked about this, right? The generative learning in the DBN goes layer by layer, and you have to be very careful not to overdo it. If you instead inject some discriminative information, improvement is bound to happen: you get new information there, not just what is in the data itself.

It turns out that when we do this, in some experiments we even get slightly better results than with DBN training.

So it's not clear that generative learning is going to play as important a role as some people have claimed.

OK, I'm done with the deep neural network. Let me spend a few minutes telling you a bit more about a different kind of architecture called the deep convex network, which to me is more interesting; I spend most of my time on this.

We have a few papers published on it. This here was actually done on MNIST: when we use this architecture, we get a much better result than the DBN, so we're very excited about it.

The point is that we had to simplify the network so that the learning, it turns out, becomes convex optimization.

I do not have time to go through all of it.

It is amenable to parallel implementation, which is almost impossible for the deep neural network. The reason, for those of you who have worked on neural networks, is that the discriminative learning phase, called the fine-tuning phase, typically uses stochastic gradient descent, which you cannot distribute; so that part cannot be parallelized.

I really want to try this architecture on your recognition tasks; we have had lots of discussion, so maybe one year from now.

If it works well for your discriminative learning tasks, I'm glad; the discrimination task is now being defined, which I have had discussions about, and that gives me the opportunity to try this.

I'd love to try it and report the results; even if they are negative, I'm happy to share them with you.

OK, so this is a good architecture. Another architecture we tried splits each hidden layer into two parts and takes the cross-product, which overcomes one of the DBN's weaknesses, namely not being able to model correlation in the input.

People had tried a few tricks to model more than the means, the correlations; it did not work well, almost impossible.

This one is very easy to implement, most of the learning is convex optimization, and it often gets very good results compared with the others.

There is another architecture, the tensor version: the same kind of correlation modeling can also be carried over into the deep neural network.

My colleagues and I actually submitted a paper on it to Interspeech; if you're interested in this one, go take a look at it.

The whole point is that, rather than stacking using input-output concatenation, you can do the same thing for each hidden layer of the neural network.

In this paper we evaluated it on Switchboard, and we get an additional 5% relative gain over the best we had so far. So this is good stuff.

The learning becomes trickier, because you have to think about how to do the back-propagation; it adds some additional nuisance in terms of efficient computation, but the result is good.

Now I'm going to the second part; I'm going to skip most of it.

I actually wrote a book on this: the dynamic Bayesian network as a deep model. The reason it's deep is that there are many layers: you have the target, the articulation, the environment, all together.

We tried that, and the implementation is very hard, so I will go quickly and jump to the bottom line.

This is one of the papers I wrote together with one of the experts, a colleague who actually invented variational Bayes; I worked with him to implement variational Bayes for this kind of dynamic Bayesian network.

The results were very good, and the journal paper we published is wonderful: you can synthesize, you can track all these formants very precisely, and handle some articulatory problems; it's amazing. But once you do recognition, the result is not very good.

I'm going to tell you why, if we have time.

Of course, one of the problems, and by 2006 we had realized this, is that this kind of learning is very tricky: essentially you make approximations without knowing what you are approximating away. That's one of the problems of deep Bayesian models.

You can get some insights working with all the experts in the field, but at the end, the bottom line is that we really don't know how to interpret the approximation; you don't know how much you lose, right?

We actually have a simplified version that I spent a long time working on, and that gives me this result; that's the paper.

This is about 2-3 percent better than the best context-dependent HMM. I was happy at the time; we stopped there.

Then, once we did the deep network, it was so much better than this. In other words, the DBN-related approach, at least on the TIMIT task, does so much better than the dynamic Bayesian network kind of work, and we're happy about that.

Now, of course, I won't go through all of it. This is the history of dynamic models, with a whole bunch of work going on there.

The key is how to embed such dynamic properties into the DBN framework. Embedding the deep network wholesale into the dynamic Bayesian network is not going to work, due to technical reasons, but the other way around has hope; that's one of the themes.

That is what part three will tell you about, though I'm running out of time.

First of all, some of the lessons. This is the deep belief network or deep neural network, and this, marked with the asterisk, refers to the dynamic Bayesian network.

All these hidden dynamic models are special cases of the dynamic Bayesian network, as you can see from what I showed you earlier.

There are a few key differences that we learned. One is that the DBN uses a distributed representation.

In our current HMM/GMM systems, we have the concept that this particular model is related to /a/ and that particular model is related to /e/, right? You have this concept, and of course training mixes them somewhat, but you still have the concept.

Whereas in the neural network, no: each weight codes information about all the classes. I think that's a very powerful concept: what you learn gets distributed.

It's like the nervous system, right? You don't say that a particular neuron contains only visual information; it can code auditory information as well. So this has a better neural basis than conventional techniques.

Also, when we did the earlier models, we got one single thing wrong: at the time, we all insisted on parsimonious model representations. That's just wrong.

Five or ten years ago that was maybe OK; now, in the current age, just use massive numbers of parameters, as long as you know how to learn them and how to regularize them well.

It turns out the DBN has a mechanism that automatically regularizes things well. That is not proven yet, I don't have the theory to prove it, but intuitively you can understand why, every time you stack up, you don't overfit, right?

If you keep going deeper, you don't overfit, because whatever information you apply, the new parameters sort of take into account the features from the lower parameters; they don't count as raw model parameters anymore, so automatically you have a mechanism against overfitting.

In the dynamic Bayesian network you don't have that property; you need to stop. So this is a very strong point.

Another key difference is something I talked about earlier: product versus mixture.

In a mixture you sum up probability distributions, and in a product you multiply them; when you take the product, you exponentially expand the power of the representation.
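
In symbols (my notation): a mixture adds component densities, while a product multiplies them and renormalizes; with H binary hidden units, an RBM acts like a mixture of 2^H tied components, which is where the exponential expansion comes from.

```latex
p_{\text{mixture}}(x) = \sum_{k=1}^{K} \pi_k \, p_k(x)
\qquad\text{vs.}\qquad
p_{\text{product}}(x) = \frac{1}{Z} \prod_{k=1}^{K} p_k(x)
```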

So those are the key differences between these two types of model.

Another important thing is that in this learning we combine generative and discriminative. Given the final results we got, we still think the discriminative part is more important than the generative, but at least in the initialization we use the generative model, the DBN, to initialize the whole system, and discriminative learning to adjust the parameters. The generative models we built earlier were purely generative.

Finally, longer windows versus shorter windows. Compared with the earlier case, I am still not very happy about the long window.

Every time you model dynamics, and I've talked about three methods of building dynamics into the model, they all use a very short history, not a long history; no line of research has really exploited long-span dynamics.

There were so many limitations that you had to use a short window; with a long window, nothing worked, and we tried all of these.

The deep recurrent network is something many people are working on now. In our lab this summer, almost all the projects relate to it; maybe not all, but at least a very large percentage.

It has worked well for both acoustic models and language models; I would say the recurrent network has been working well for acoustic modeling, and in language modeling there are a lot of good recurrent network projects.

The weakness of this approach is that it captures only a generic temporal dependency: you have no idea what the dependency is, there is no constraint on how one frame follows another, and this kind of temporal modeling does not buy very much.

The dynamics in the dynamic Bayesian network are much better: in terms of interpretation, generative capability, and the physical speech production mechanism, they are just better. The key is how to combine the two.

We don't like this long window, and we have shown that it does not capture the essence of speech production dynamics.

There is a huge amount of information redundancy: think of a long window here that you shift by ten milliseconds each time, so 90% of the information overlaps.

Some people may argue that it doesn't matter, and they did experiments to show that modeling it doesn't help at all.

On the importance of optimization techniques: there is the Hessian-free method. I am not sure about language modeling, you may not need it there, but in acoustic modeling this is a very popular technique.

Another point is that the recursive neural network for parsing in NLP has been very successful.

I think last year at ICML they presented results for the recursive neural net, which is not quite the same as this but uses structure for parsing, and they got state-of-the-art parsing results.

The conclusion of this slide is that this is an active and exciting research area to work on.

The summary is as follows. I provided historical accounts of two fairly separate lines of research: one based on the deep belief network, the other based on the dynamic Bayesian network in speech.

I hope I have shown you that speech research motivates the use of deep architectures, based on speech production and perception mechanisms.

The HMM is a shallow architecture that uses the GMM to link linguistic units to observations.

I have shown you, though I didn't have time to go into detail, that this kind of dynamic model has had less success than expected.

Now we are beginning to understand the limitation, and I have shown some potential ways of overcoming it within the neural network framework.

One thing we now understand is why the models developed in the past were not able to take advantage of the dynamics the way a deep network can: it's because we didn't have distributed representations, massive numbers of parameters, fast parallel computing, or products of experts.

All these things favor the deep network, while the dynamics favor the other side; how to merge them together is, I think, a very promising direction to work on. You could actually make the deep network more scientific in terms of speech perception and recognition.

As for the outlook and future directions: so far we have the DBN-DNN to replace the GMM-HMM.

I would expect that within three to five years you may no longer see the GMM, especially in recognition, at least in industry. If I am wrong, then shoot me.

The dynamic properties modeled by the dynamic Bayesian network for speech have the potential to replace the HMM.

And for deep recurrent neural networks, I have tried to argue that there is a need to go beyond unconstrained temporal dependency while making them easier to learn.

Adaptive learning is so far not very successful; we tried a few projects, and it is hard to do.

Scalable learning is hard, at least for industry; academics don't need to worry about it: as long as NIST defines small tasks, you will be very happy working on those. But for industry this is a big issue; we are reinventing our infrastructure at industrial scale. I don't think we have time to go through all the applications.

Spoken language understanding has been one of the successful applications, as I've shown you; also information retrieval, language modeling, NLP, and image recognition; but speaker recognition, not yet.

The final bottom line here is that deep learning is so far weak in theory; I hope I have convinced you of that with all the critiques.

In Bengio's case, he randomizes everything first, and if you do that, of course, it is bad.

The key is that if a purely generative setup is what works best for you, then the generative model may be useful in that case; but the key to this learning is that if you put a little bit of discrimination in, it is probably better.

So probably the best is to use this structure here together with that one, and we know how to train that now. I think both width and depth are important.

We tried that; we didn't fix the architecture, we just used the algorithm to cut weights all the way. We didn't lose anything; in fact, from the results I showed you, it still gains a little bit.

Cross-validation? There is no other way; there is no theory on how to do that.

But in particular cases, for some of the networks I've shown you, I do have theory for that; I can control it. For some networks you can do the theory, which means you can automatically determine the structure from data; but for this deep belief network, the theory is weak.

He is also doing deep graphical models. Two years ago, he gave a talk on how to learn the topology of a deep neural network, in terms of width and depth, using the Indian Buffet Process.

In the end, everything has to be done by Monte Carlo simulation, and for a five-by-five network, he said, the simulation takes several days.

I think that approach is not scalable unless people improve that aspect; that also motivates more academic machine learning research, to make it scale.

I think the idea is good, but the technique is too slow to do anything with at this point.

For the deep neural network, stochastic gradient descent still does best; it is good enough.

But my understanding, and we are actually playing around with this, is that when you add recurrence or some more complex architecture, stochastic gradient isn't strong enough.

There is a very nice paper from Hinton's group, by one of his PhD students, who used Hessian-free optimization to do deep network learning.

They showed the result in just one single figure, which is very hard to interpret; the paper is in ICML 2010. It does better than using the DBN to initialize the neural network.

To me, it is very significant. We are still borrowing from this; for more complex networks, a more complex second-order method will probably be necessary.

The other advantage of Hessian-free, being second order, is that it can be parallelized with big-batch training rather than minibatch training, and that makes a big difference.

We tried that one; it doesn't work well for the DBN when we have a lot of data. Probably the best for the DBN-style network is still stochastic gradient.

If you are using the other networks, the later networks we talked about, they are naturally suited to batch training.

In the more modern versions of the network, batch training is desirable; they are designed for those architectures, for parallelization.