Speech Transcript - Closing discussion "What's wrong with ASR and what can we do about it"

so i mean i don't have this is all one solely basically let's go back

to the first day all the way and what is it that you

didn't ask whatever it is that you didn't say

you sort of each that maybe you said

now is a good opportunity because it happens very often that there are questions

but of comes out of don't know how fast but then you come home and

say oh i wish i had said this so alex definitely has one

thing to say

sorry for saying something but

well i'll close the circle

basically at the beginning of the meeting we

learned a lot

well at least i learned a lot about what happens in people's brains and so forth

that's

i think that is some existence proof that we can understand language

well i thought that's my

but it here the and you're basically tell us all that stuff but have read

out from between right after that talk i'm right now is basically you know probably

the wrong idea

are you buying well whatever you why are we do we need to learn something

from those are

and how can we do that

that there are so real sounds so you have two choices a one choice you

have it is about the models used all or something probably existence proof

but

the other is to think about computation

and build a model that way

models don't have to be the say so there's certainly used to pass

and i think learning for hence is a very bright

i pay some attention to that

we can learn a lot from the auditory system

to the extent that we understand it and we understand part of it

so i think there's two avenues of information to mine one is

physiology the other is computation

but still there's also the model suggests also so what kind of the evidence we

can take advantage of

so i have a couple of cents for this question too so you know regarding monday

and tuesday the low resource day and we had some zero resource talks and that

sort of work and it turns out when you actually start removing supervision

from the system

the things that actually allow you to discover units of speech automatically are not

the same features that we use for supervised processing and not the same models that

we use for supervised processing

and so somehow

it is the case that i think while a lot of people might not be

interested in sort of that extreme of research because it might not always be practical

from the point of view of systems that you can sell and so forth i think that

style of work where you're forced to sort of connect yourself to something like for

example what was being talked about with human language acquisition and make something

consistent

between those things can send you to new classes of models and new representations that

you're forced into that i think could eventually be fed back into

the supervised case

i'm glad that you also go back to the early days like monday and tuesday

when we were more full of optimism and not to thursday where we all

i like of the coach our

i'd like to remind you that indeed i think that the community is diving into

new types of models

for better or for worse because of course always when you start some new paradigm everybody

jumps on it and then quickly you may also get discouraged but additionally these nonlinear systems

these neural networks or something they are very good in being able to

construct all kinds of

architectures highly parallel architectures

we have to think about the new a select up think the models the maximum

likelihood is gone right and this ordinance along i think there is a plenty of

work to do there if i may speak for myself i'm a big believer

in highly parallel systems

there are many views of speech being provided

and then the big issue is how do you pick up the most appropriate view

which might be appropriate for the situation so adaptation not by adapting

the parameters of the model but by adapting like picking up the right processing stream very

much along the lines i was quite impressed

by what chris was telling us that when he added a lot of noise

of course only a few neurons were good but the ones which were good were still

very good so essentially i'm speaking for myself now my view is

that systems should be highly parallel

trained on whatever data are available but not like one global model but

many parallel models

and as much as possible different and independent models and then the big issue is that

you pick up a good one so this is one direction i'm thinking about

i don't know what other people think about it
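
The idea of running many parallel processing streams and then picking a good one can be sketched as follows. This is only an illustration under stated assumptions: each stream is assumed to output per-frame class posteriors, and low average entropy is used as a stand-in for "this stream is reliable here". The entropy criterion, the function names, and the stream names are hypothetical and are not taken from the discussion.

```python
import math


def mean_entropy(posteriors):
    """Average per-frame entropy (in nats) of a list of probability vectors."""
    total = 0.0
    for frame in posteriors:
        # Skip zero probabilities: their contribution to the entropy is zero.
        total += -sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriors)


def pick_stream(streams):
    """Return the name of the stream whose posteriors look most confident.

    `streams` maps a (hypothetical) stream name to its per-frame posteriors.
    Lowest mean entropy is an assumed confidence measure, not a proven one.
    """
    return min(streams, key=lambda name: mean_entropy(streams[name]))


# Hypothetical two-class posteriors from three parallel streams.
streams = {
    "full_band": [[0.5, 0.5], [0.6, 0.4]],
    "low_band": [[0.9, 0.1], [0.95, 0.05]],
    "high_band": [[0.7, 0.3], [0.55, 0.45]],
}
best = pick_stream(streams)
print(best)
```

The selection step is deliberately separate from the models themselves, so a different reliability measure can be swapped in without retraining any stream.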

but i think that there is a whole new a whole

new area of research and a whole possibility for new paradigms is coming

i mean that's what we see over the past few years with the

reinvention or rediscovery of alternatives to gmm models

i didn't mean to speak i mean i just wanted to give you some

space for thinking what you want to say what you want to ask

so i would just like to

ask a question about the possible eventual death of the field in the future

so it happens i mean i'm not old enough to see this but for example

for

for coding it happened that after strong technology transfer and standards were established

the research field

i mean it didn't die

but it shrank terribly right

and this will happen one day with automatic speech recognition

we will have some standard methods and then

there won't be that many things to research this is going to happen some

are applied

and i was wondering how much time do we have

because

we are already seeing a very strong technology transfer and there's a lot of

investment by

all the major

technology players in the market

so are we close to really solving it i don't mean solving semantic context

that's not recognition

but are we close to

setting some standards

and then it is done

because what i we got the research on

how close are we

ten years twenty years because it's about my career right maybe it's side effect

i've life yes for the

i think i people that

that's good

this is average spectral for your funding sources

it's a can be all close hope that there is going to do and we

will that

stick i think i tell my students i still tell them that if they are

getting into speech recognition they are safe for life that was my experience

somehow i think

comparing speech coding to

speech recognition just doesn't fly at all

i mean speech coding

unless you're going to try for the

utopia of three hundred bits per second which then requires synthesis coding

there's just no comparison

very straightforward and eventually yes

standards were set

the same

could be said

about coding of pictures

very trivial to code pictures

we have mpeg three mpeg four

it's all done

picture understanding which is very much like

is the thing

sort of book

i do think

that to

the field is very far from that

but i think the field

will kill itself

if it assumes that it has the solutions

and then continues

to plough through just working the solutions that we have right now

all done so one other thing that i would probably

like to see happen is

rather than sitting around and talking about what's wrong with the field

to possibly construct certain experiments

that could point

to what's going on

just

for example when steve was talking before

i was thinking

so you have a mismatch in acoustics and you have a mismatch in language

try to fix one without the other

and see

what is the result where it falls
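
One way to read the proposed experiment is as a two-by-two design: acoustics matched or mismatched, language matched or mismatched, with a word error rate measured in each of the four cells. The sketch below is a hedged illustration of how the results could be tabulated. The function name is hypothetical and the four numbers are invented placeholders in percent, not results from any system.

```python
def mismatch_gains(wer):
    """Summarise a 2x2 mismatch experiment.

    `wer` maps (acoustics_matched, language_matched) -> word error rate in
    percent. The gains are measured against the cell where both mismatch.
    """
    baseline = wer[(False, False)]
    gain_acoustic = baseline - wer[(True, False)]
    gain_language = baseline - wer[(False, True)]
    gain_both = baseline - wer[(True, True)]
    return {
        "acoustics_only": gain_acoustic,
        "language_only": gain_language,
        "both": gain_both,
        # Positive interaction: fixing both helps more than the sum of parts.
        "interaction": gain_both - (gain_acoustic + gain_language),
    }


# Invented placeholder numbers, for illustration only.
hypothetical_wer = {
    (False, False): 40,
    (True, False): 25,
    (False, True): 32,
    (True, True): 15,
}
gains = mismatch_gains(hypothetical_wer)
print(gains)
```

If the interaction term is near zero, the two mismatches can be studied separately; a large interaction suggests that fixing one without the other tells only part of the story.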

so i think it's wonderful i want to remind people john pierce was advising us

to design clear experiments with definite answers

so that the science of speech can grow steadily step by step

rather than the rapture for computers and unproven theories

i have

maybe a couple of observations

we talk about neural nets

right now as an improvement and i'm sure it's obviously an improvement

it actually goes in the opposite direction

from what we're all advising ourselves to do that is it does nothing about any independence

assumption it's just building a better gmm which is the place where you said that

wasn't a problem

it's not modeling dependence

except to the extent that we model longer feature sequences which we tried to do

with the gmms also

in terms of

where we will you know when will we solve it obviously not in five years but

that doesn't mean never

so it would be nice if we could come up with the right model obviously

that would be the best answer

i'm not sure that

speech coding and image coding i don't believe they were solved by coming up with

the right answer i think they were solved by coming up with good enough

answers that

wouldn't have been practical

twenty five years ago because the computing was not enough to

implement those solutions but they are now

and so those

fairly simple fairly brute force

expensive methods now are practical and work just well enough

so i think speech recognition could go the same way it doesn't you know it

could if someone is very smart and picks the right answer that's great

but if you

look at how much we've improved over say the last twenty five to fifty years

there's been a big improvement

say in twenty five years

and if you imagine the improvement from twenty five years ago to now

maybe two more times

and so this you know grows exponentially so fifty years from now

i think we could say with almost absolute certainty

speech recognition will be completely solved to all intents and purposes that is it'll work

for all the things you want to do it'll work very well it'll be fast

it'll be cheap there will be no more research in it

because you will have

computers with

i don't know what the right term is but ten to the ninth

memory and computation or you know ten to the fifteenth computation and you'll have modeled

all those differences

by brute force it still would never work to train on one thing

and then test on another but you won't have to you will have trained on everything

you know you will have trained on samples of everything so that it just works

the doom and gloom doesn't have to work that way it would just be nicer

to find a more elegant solution sooner

bcmvn this is also positive value there is a just for fast

i don't know nine is probably this probably few more data people in this room

this is actually a good point there are ten to the ninth or so neurons in auditory

cortex so that must be ten to the ninth

ten to the ninth ways of solving the problem and maybe it is the

right way to go

i think there is another aspect that's missing which is a

looking at is speech recognition this is a little

no acoustic signal and you're model

model for

i think we need to bring in the context and

we are moving towards that

future where there is information about the context about your personality

the personalisation all these things should be

incorporated into whatever model

and that will reduce some of these ambiguities that you have if you are just looking at

the acoustics

that's another you know feature you know it

actually i would also like to continue on what she was telling us that there

is not one solution to speech recognition there are many right i mean there are

some just like there are many cars and many bicycles and many whatever i

mean there are some solutions we need a solution to a problem

and of course what we keep thinking about all the time is that we will

so you can find peace i think it's okay to find many other so many

smaller solutions there is no question in my mind that recognition made enormous progress i mean

actually even i use it here and there i mean for example google voice

and this is already quite something so google voice is a good

example since we have it over here i mean where the solution came to

the point where it's becoming useful just like a car is useful though we

all agree that this is not an ideal way of

moving people from one place to another it works to some extent so maybe

we should also think not only about the solution but about many

solutions

i wasn't those say that

and this relates to

about data

one thing we see anything this is that

given our models language acoustic models

young a particular size

with a C V

and

and in that sense what you say about what was also somewhat

you were kind of suggesting ensembles of classifiers and rocky suggesting personalisation their

estimate well because

we also know that if i build the model just for you

an acoustic model just for you a language model just for you it really works

well

and

maybe it is not the most elegant solution but

given enough data and enough context

and enough computational resources that works really well

and i think we are going to see a lot of work in that direction the

price we will have to pay is that

you have to let whoever's building the recognizer for you whether it is

nuance or microsoft whatever

you have to let them access your data

and without that you will have to live with a speaker independent and

context independent system which might be good but not as good as it can be

or you may also provide the means for the user to modify the technology

in such a way that it works best for that given user and a given

task right you don't have to rely necessarily on the big brother whatever

for me thanks but if you provide technology

which is adaptable just like actually most of the technology which we are

using think about the car i mean you know you can drive it fast you

can drive it slow you can drive it crazy you can drive it safely and

it's a little bit up to you technology basically was provided in such a way

that the user can adapt

it to his needs i think so this is

one way the other way is that we are trying to build one big

huge model which will encompass everything i'm more like

a believer in many parallel models very much along the lines of human perception in general

because wherever you're looking in sensory perception you typically always find many channels each

of them looking at the problem in a different way

and of course what we have available to us is to pick up the best

way at any given time and this is something which we have to and perhaps

you know but i don't want to push this particular direction which i'm thinking about i'd

like to

my belief is that just building one solution for everything is maybe also not

the best way of

quite

so i just wanted to say that

that the world is a dramatically different place

now than it was in nineteen so

and that

that the constraints

that drove

the current sort of formalism they don't exist anymore and i think chip you're

in shell but says that and i agree that you know if somebody didn't know

anything about the way we do this and they started

afresh

and thought about it in the current context it would be remarkable

if that person came up with the formalism that we do have now

and

i think that

we should spend more time i don't know that we should but i certainly will thinking

you know about how to do this in a different way given what we have

and what we know about the brain i mean it's remarkable how much

more we know about humans

just comment concerning the speaker-dependent stuff that you put gets it seems year

but it's not really solving the problem i mean you can make a really very good

speaker dependent model but then the person i don't know switches the microphone and you

are again lost or he has a cold or i don't know uses some obscure digital coding which

is completely clear for human beings but because of some strange digital artifacts your

whole algorithm breaks again

so this is i think this is somehow for the people each i'm i mean

to help get business in the i completely speaker-dependent environment

and i assume that for the people which are in the i don't know in

the environment which is completely speaker independent it must be kind of the power of

these you know because you have a huge amount of the data which a speaker

dependent so

but it's not really solving the problem it is making the problem easier in terms

of error rate and everything obviously because you can train to the speaker but

it's not really the solution

that you're looking for

these are just comments and then also somehow my

intuition or feeling is that the

i just know that if i understand what the people are talking about

it is easier for me to perform speech recognition

so it has to do something with semantics and it has to do

something with semantics and with intelligence and

i don't know and so on but this is just the kind

of intuition

i have a comment about the semantics

my perception is that

in many groups

i mean many companies not so low resource

they tend to treat the recognition as a black box

and semantic models are built on top of it

maybe they do a little bit of accounting for it or maybe let's do phonetic matches

just in case the recognizer makes a mistake

and i

and that's okay to get something up and running but i think that's a

stupid mistake

that the semantics and the recognition should be closer together

i have to say it's difficult to convince some of the people doing

semantics that don't have any speech background

that things should be done differently but i believe

there should be influence

back and forth

it was mentioned that

someone starting fresh

wouldn't start with the approach we do

and it's probably really true

one of you hear it

the someone E mailed out so gone into that once is

now we apply all the in that station the speaker adaptation or all the compensation

development features now neural networks someone have that right

it's just not gonna work right out by

and you can i

compensate for thousands of hours that on in its current a broken

the renaissance of neural networks so morgan

using neural networks in their hybrid formalism because nobody

you know

was that interested because of all the other things that were working so well and

why would anyone in their right mind want it right

but then all of a sudden we're back to you know we're back in this

zone where people are doing it so all i'm saying is that the lesson

i take from that is

you know if you can get something

that makes sense and that is demonstrated really good

on a small problem

well then maybe that would be pretty compelling

i mean i agree with you though it's a it's the success is pretty are

you know if i have it is something that i am i gonna say what

we think about this for forty years know exactly

we all know thirty six

and maybe one thing that we should not do is designing experiments

where we say

i will show you on the state-of-the-art systems that my method works a little bit

better

because that is not really very scientific is it i

mean a scientific experiment is that you isolate one problem and you sort of

try to change the conditions and see whether things go up or things go down and in

a well designed experiment if you get worse and you predicted it would be worse

given your hypothesis i think you are winning right we almost never

report results like that because our belief is that the only way to convince our

peers that what you are doing is useful is that

you get as low a word error rate as possible on the state-of-the-art systems with the

commonly accepted task whatever it is at the moment

so designing good experiments again going back it seems seriously to john pierce

design clear definite experiments so that science can grow step by step by step

i think that we have to learn how to do that and since you mentioned

neural networks i want to share with you my personal experience

it's different houses here is going to be and he may not even remember

but a long time ago when he was a postdoc at icsi he ran an experiment

where he had a context independent hmm model context independent phonemes and the neural

net model and the neural net model was doing twice as good as the hmm and

that convinced me i mean you know that we stuck to neural nets

throughout the dark ages on you of neural nets N I partially because we invent

have a so but in hmms an lvcsr as but as a partially because i

truly believe that because that was an experiment which was very convincing to me if

i have a simple a gmm model

without any context-dependency to try easy to of course building to do system and context

the i mean context independent hmm model which was the only way which we between

you have to be noted at a time

and the neural net is doing twice as good as the hmm why wouldn't i

stick to this neural net model i'm glad that we did

i don't know steve if you remember this experiment i say good but i think

it actually got a piece even in transactions eventually right

you know one other way you can get yourself out of a local

optimum is change the evaluation criteria right and i think that's i

mean in part what mary's done with the babel program you know having keyword search as

the task and atwv well it tracks word error rate it's not always perfect and i

think another thing that

people it seems to me really ought to be reporting when you

report a word error rate is not just the mean word error rate but the

variance across the utterances because you can have a five percent word error rate but

if a quarter of your utterances are essentially you know eighty percent word error rate

which can happen then you know that's a good way to start figuring out how

to get your

technology a little more reliable
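
The reporting practice suggested here, giving the spread across utterances next to the overall word error rate, can be sketched as below. This is a minimal illustration: the function names, the 0.5 threshold for counting an utterance as "bad", and the toy sentences are assumptions made for the example, and references are assumed to be non-empty.

```python
from statistics import mean, pvariance


def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer_report(pairs, bad_threshold=0.5):
    """Corpus WER plus per-utterance mean, variance, and share of bad utterances.

    `pairs` is a list of (reference, hypothesis) strings; references must be
    non-empty. `bad_threshold` is an arbitrary illustrative cut-off.
    """
    per_utt = []
    total_err = total_ref = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        e = word_errors(r, h)
        per_utt.append(e / len(r))
        total_err += e
        total_ref += len(r)
    return {
        "corpus_wer": total_err / total_ref,
        "mean_utt_wer": mean(per_utt),
        "var_utt_wer": pvariance(per_utt),
        "frac_bad": sum(w >= bad_threshold for w in per_utt) / len(per_utt),
    }


# Toy data: three clean utterances and one that goes badly wrong.
pairs = [
    ("turn the lights on", "turn the lights on"),
    ("call my office", "call my office"),
    ("play some music", "play some music"),
    ("what is the weather today", "what was whether to day"),
]
report = wer_report(pairs)
print(report)
```

In this toy set the average looks moderate, while one utterance in four is almost entirely wrong, which is the kind of failure a single mean hides.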

i was hoping you would have a comment

i feel

i feel obligated to

talk about ancient history since i'm getting a little older now

i remember when hmms started and we were certainly not the first to use them

we were sort of in the middle of that

of that previous

a revolution

the big criticism there were two big criticisms of hmms

relative to the previous method the previous method was just write the rules because we

all know about speech and say how it works and those systems which i wrote

systems like that back in the early seventies because i was a late adopter of

hmms

those systems were very simple easy to understand extremely fast

needed no training data

that sounds nice right

and they could do very well on simple problems without training data and

the government argued and other people argued and sometimes we argued hmms

were too complicated required too much storage too much training too much memory and would

never be practical

well obviously things changed and it wasn't only computing power that was a big factor

but it was also learning how to make it more efficient and we did a

combination of all of those not being

so rigid just to say we have to do it with zero data and

just what i learned in my acoustic phonetics class

we could use data

more data always helped

learning to do speaker adaptation rather than speaker dependent models

okay neural nets

neural nets worked on simple problems but not on more complicated problems

and what was needed i'd say the reason it works now is because we can

now do you know two three years ago the things that were working were

requiring two months of computation which is just you know unacceptable completely unacceptable some bold

people did that that's great and then they figured out how to get better computers

all of this argues that each revolution which happens at a twenty five year

cycle

is the realisation that all of the intelligent things that we thought we knew

ken stevens would tell us what happens with formant frequencies and i learned all those

things all of those were not the way to go the real understanding was not

the way to go which bothered us because we'd like to think about

we like to think about you know the phonemes and things like that

but we know that phonemes are abstractions

we know that formants are an oversimplification

everything that we learn is an oversimplification and computers are just simply more powerful than

we are

than anything we can write not more powerful than the brain but

more powerful than anything that we can write in a program

so i think

that would argue against

i'm not saying that you shouldn't keep trying to find the

right answer but i think history has told us that the right answer is to think

about more efficient ways

you know computing will increase it has increased by a factor of a thousand in the

last twenty five years both in terms of memory and storage and it will increase by a

factor of a thousand every twenty five years forever

and that's a big number in fifty years
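
The arithmetic behind "a factor of a thousand every twenty five years" can be checked with a short sketch. It simply takes the speaker's figures as given and assumes smooth exponential growth; the constant and function names are illustrative, and nothing here is a forecast.

```python
import math

FACTOR_PER_PERIOD = 1000  # growth factor quoted in the discussion
PERIOD_YEARS = 25         # period quoted in the discussion


def growth(years):
    """Total growth factor after `years`, assuming smooth exponential growth."""
    return FACTOR_PER_PERIOD ** (years / PERIOD_YEARS)


def doubling_time_years():
    """Doubling time implied by the quoted factor and period."""
    return PERIOD_YEARS * math.log(2) / math.log(FACTOR_PER_PERIOD)


print(growth(50))             # two periods, so the factor is squared
print(doubling_time_years())  # implied doubling time in years
```

Under these assumptions fifty years gives a factor of a million, and the implied doubling time is roughly two and a half years.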

but at the same time we can think about algorithms that are a thousand times

more efficient

that has happened and it will happen

google you know collects lots of data other people can collect lots of

data i think it will happen that we will have corpora that include the speech

of millions of people from

hundreds of languages in hundreds of environments

and if you just imagine that it was let's just posit that it was

simple and easy to collect millions of hours from all these environments and memorise all

of it and learn what to do with it and compute it store it all

in something that fits in your you know in the chip that's embedded in your

in your hand or something or in your head

well and it just works you don't know why or how it works but it

works

so i

while i have the same desire to understand

intellectually what's going on i would bet almost anything that that will be the solution

that eventually works

so i'd like to make the other side

and the other side is if you look at the history of science

what's happened is

our truly

stupendous advances have come from understanding where our

current models don't work

it's not

that we shouldn't try to push models

but the thing that you're describing

is engineering

i'm a fan of engineering but true understanding comes from looking at the places where our

current models fail

and all of the things that we've been doing for the past twenty years are

data

for the next

and we should be paying attention to where we fail

and that's where we're gonna find the success

so a

one the to it at a little bit

it seems like this i think that i like which we always think

the old story is if you take

an infinite number of monkeys and give them

an infinite number of typewriters eventually they will write shakespeare

and i think that's what you're suggesting

you have a few problems number one

moore's law it

pretty much came to an end

and that industry is facing the same problem unless there is a dramatic

technological shift

a you're not going to get

the kind of doubling that we've seen every eighteen months

in the future

basically quantum mechanics eventually gets in your way

the lines are so narrow now that there are not too many atoms

to allow for them to continue to be

somebody else said something about

well what would happen if people started

doing this research all over again would they find the same solution

i'm reading now a marvellous book called design in nature which tries to explain evolution

not just of humans but rivers and everything else in terms of

physical laws

i highly suggest reading it it's very entertaining a but basically

and then going back to the coding i think when the coding was done

it really was fundamental in the sense that we understood

that pitch and spectrum were the essence so for example the coding that works on

your cell phone which is really meant to code speech if there is music in the

background it totally destroys it because it really is adapted to the speech signal

so it wasn't just a random brute force process it really depended on first lpc then

coding the residual and all of that and that's why we have

such good coders and i think

the theory behind that was of course much more trivial than it is in

language

so i do think that

we need to continue the work that we're doing but on the other hand

look for some paradigm shifts that would be more than just increasing

the stochastic ability by introducing neural nets and

from where i sit a thousand miles away neural nets essentially are a generalization

of hmm they're both stochastic models it's just that in hmm you have essentially a

single layer

so i think the point about how much data and we need to solve the

problem by brute force comes down also to the question of

artificial intelligence right

so continuing with these doomsday scenarios one even scarier is that one day we're

going to get to the singularity right

and so when this happens at that moment

we're going to lose control of abstraction right machines are going to be better than

us at creating their own abstractions so all this prior knowledge we want to

put into our models

is going to be our way of seeing things but machines are going to have

their way of seeing things

and when is it is discussions about saying

when we have to look at the problem and things like humans and

i think well

it is already happening that machines create their own abstractions and they are

not intuitive to us but since they are going to do better than

us in the long term we might be better off just thinking you

know how the machine thinks and not how i like to think

on this

how i can express the problem okay you at generative model that see it is

to me

maybe it should be intuitive to the machine

or to the harder right and deep neural networks

to some extent

okay

doing this we are very far away from that singularity right but when we will

reach that so maybe we'll be better off thinking

and i

that they are really always looking in the light and basically after fifty years of

artificial intelligence we have essentially developed

tremendous methods for optimization and classification there is very little on inference and logic

so i'm very glad the field is alive and well as i can see

from this discussion it really reminds me of one

of the first asru workshops and i will also remember that even

in my introduction

where people were discussing fighting and always with the desire to move the field further

and i'm very happy that i think that we succeeded to a large extent in

this asru too so let's just keep it going i think otherwise i will

i will pass the microphone to one zap who has a

a sound

since to say about is it is it the time for post the room or

basically i just have one comment on this discussion i think

what we were discussing with the data the models the adequacy of models

i think well it turned a little bit speech centric

so a little bit too selfish i find so i think we forgot about the

users of our technologies because i have the impression

that well rarely people would just ultimately use the output of asr and

say this is the output and here it finishes most of the time it is

just some intermediate product that would be further used by someone so actually

i like the way that the better what so speaking about that the well for

you the wer is not the ultimate metric but it is the click through

rate for call center traffic it might be the customer satisfaction so

they have measures for it

for a government agency it might be the number of caught

bad guys

and so on and so on so i think actually there is still quite some

work to do in propagating these target metrics

back to our field i don't know if there was sufficient work

on this maybe they are not that interested

in atwv or wer and stuff like this they just need to get

their work done

okay so we cook is sorry i didn't i didn't mean that the

find technical common and in the i did so no

no comments on this

one

lost

4th Day